kvm.vger.kernel.org archive mirror
* R/W HG memory mappings with kvm?
@ 2009-07-05 22:41 Stephen Donnelly
  2009-07-06  7:38 ` Avi Kivity
  0 siblings, 1 reply; 32+ messages in thread
From: Stephen Donnelly @ 2009-07-05 22:41 UTC (permalink / raw)
  To: kvm

I am looking at how to do memory mapped IO between host and guests
under kvm. I expect to use the PCI emulation layer to present a PCI
device to the guest.

I see virtio_pci uses cpu_physical_memory_map() which provides either
read or write mappings and notes "Use only for reads OR writes - not
for read-modify-write operations."

Is there an alternative method that allows large (several MB)
persistent host-guest (HG) memory mappings that are r/w? I would only be using this
under kvm, not kqemu or plain qemu.

Also it appears that PCI IO memory (cpu_register_io_memory) is
provided via access functions, like the pci config space? Does this
cause a page fault/vm_exit on each read or write, or is it more
efficient than that?

Thanks,
Stephen.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: R/W HG memory mappings with kvm?
  2009-07-05 22:41 R/W HG memory mappings with kvm? Stephen Donnelly
@ 2009-07-06  7:38 ` Avi Kivity
  2009-07-07 22:23   ` Stephen Donnelly
  0 siblings, 1 reply; 32+ messages in thread
From: Avi Kivity @ 2009-07-06  7:38 UTC (permalink / raw)
  To: Stephen Donnelly; +Cc: kvm, Cam Macdonell

On 07/06/2009 01:41 AM, Stephen Donnelly wrote:
> I am looking at how to do memory mapped IO between host and guests
> under kvm. I expect to use the PCI emulation layer to present a PCI
> device to the guest.
>
> I see virtio_pci uses cpu_physical_memory_map() which provides either
> read or write mappings and notes "Use only for reads OR writes - not
> for read-modify-write operations."
>    

Right, these are for unidirectional transient DMA.

> Is there an alternative method that allows large (Several MB)
> persistent hg memory mappings that are r/w? I would only be using this
> under kvm, not kqemu or plain qemu.
>    

All of guest memory is permanently mapped in the host.  You can use 
accessors like cpu_physical_memory_rw() or cpu_physical_memory_map() to 
access it.  What exactly do you need that is not provided by these 
accessors?
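
For reference, the transient usage pattern looks roughly like this (a
sketch only, not code from the tree; the guest physical address and
length are made up):

    /* Sketch: touch a guest buffer at guest-physical address 'gpa'. */
    static void example_touch_guest(target_phys_addr_t gpa)
    {
        target_phys_addr_t len = 4096;
        uint8_t reply = 0;
        void *host = cpu_physical_memory_map(gpa, &len, 0); /* is_write = 0 */

        if (host) {
            /* ... read the request out of 'host' ... */
            cpu_physical_memory_unmap(host, len, 0, len);
        }

        /* The copying accessor is often simpler for small structures. */
        cpu_physical_memory_rw(gpa, &reply, sizeof(reply), 1); /* write */
    }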

> Also it appears that PCI IO memory (cpu_register_io_memory) is
> provided via access functions, like the pci config space?

It can also use ordinary RAM (for example, vga maps its framebuffer as a 
PCI BAR).

> Does this
> cause a page fault/vm_exit on each read or write, or is it more
> efficient than that?
>    

It depends on how you configure it.  Look at the vga code (hw/vga.c, 
hw/cirrus_vga.c).  Also Cam (copied) wrote a PCI card that provides 
shared memory across guests, you may want to look at that.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: R/W HG memory mappings with kvm?
  2009-07-06  7:38 ` Avi Kivity
@ 2009-07-07 22:23   ` Stephen Donnelly
  2009-07-08  4:36     ` Avi Kivity
  0 siblings, 1 reply; 32+ messages in thread
From: Stephen Donnelly @ 2009-07-07 22:23 UTC (permalink / raw)
  To: Avi Kivity, kvm

On Mon, Jul 6, 2009 at 7:38 PM, Avi Kivity<avi@redhat.com> wrote:

>> I see virtio_pci uses cpu_physical_memory_map() which provides either
>> read or write mappings and notes "Use only for reads OR writes - not
>> for read-modify-write operations."
>
> Right, these are for unidirectional transient DMA.

Okay, as I thought. I would rather have 'relatively' persistent
mappings, multi-use, and preferably bi-directional.

>> Is there an alternative method that allows large (Several MB)
>> persistent hg memory mappings that are r/w? I would only be using this
>> under kvm, not kqemu or plain qemu.
>
> All of guest memory is permanently mapped in the host.  You can use
> accessors like cpu_physical_memory_rw() or cpu_physical_memory_map() to
> access it.  What exactly do you need that is not provided by these
> accessors?

I have an existing software system that provides high speed
communication between processes on a single host using shared memory.
I would like to extend the system to provide communication between
processes on the host and guest. Unfortunately the transport is
optimised for speed and is not highly abstracted, so I cannot easily
substitute a virtio-ring, for example.

The system uses two memory spaces, one is a control area which is
register-like and contains R/W values at various offsets. The second
area is for data transport and is divided into rings. Each ring is
unidirectional so I could map these separately with
cpu_physical_memory_map(), but there seems to be no simple solution
for the control area. Target ring performance is perhaps 1-2
gigabytes/second with rings approx 32-512MB in size.
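
To make the layout concrete, a purely hypothetical sketch of the two
regions (none of these names exist in the real code, and the sizes are
illustrative):

    /* Control area: register-like, read/write from both sides. */
    struct control_area {
        uint32_t ring_count;        /* number of rings in the data area */
        uint32_t ring_size;         /* bytes per ring, approx 32-512MB  */
        uint32_t head[MAX_RINGS];   /* producer index, one per ring     */
        uint32_t tail[MAX_RINGS];   /* consumer index, one per ring     */
    };

    /* Data area: ring_count unidirectional rings of ring_size bytes
     * each, carried in a second, much larger mapping. */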

>> Also it appears that PCI IO memory (cpu_register_io_memory) is
>> provided via access functions, like the pci config space?
>
> It can also use ordinary RAM (for example, vga maps its framebuffer as a PCI
> BAR).

So host memory is exported as a PCI BAR to the guest via
cpu_register_physical_memory(). It looks like the code has to
explicitly manage marking pages dirty and synchronising at appropriate
times. Is the coherency problem bidirectional, i.e. do writes from either
host or guest to the shared memory need to mark pages dirty and
ensure sync is called before the other side reads those areas?

>> Does this
>> cause a page fault/vm_exit on each read or write, or is it more
>> efficient than that?
>
> It depends on how you configure it.  Look at the vga code (hw/vga.c,
> hw/cirrus_vga.c).  Also Cam (copied) wrote a PCI card that provides shared
> memory across guests, you may want to look at that.

I will look into the vga code and see if I get inspired. The 'copied'
driver sounds interesting, the code is not in kvm git?

Thanks for the response!

Stephen.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: R/W HG memory mappings with kvm?
  2009-07-07 22:23   ` Stephen Donnelly
@ 2009-07-08  4:36     ` Avi Kivity
  2009-07-08 21:33       ` Stephen Donnelly
  2009-07-08 21:45       ` Cam Macdonell
  0 siblings, 2 replies; 32+ messages in thread
From: Avi Kivity @ 2009-07-08  4:36 UTC (permalink / raw)
  To: Stephen Donnelly; +Cc: kvm

On 07/08/2009 01:23 AM, Stephen Donnelly wrote:
>
>>> Also it appears that PCI IO memory (cpu_register_io_memory) is
>>> provided via access functions, like the pci config space?
>>>        
>> It can also use ordinary RAM (for example, vga maps its framebuffer as a PCI
>> BAR).
>>      
>
> So host memory is exported as a PCI_BAR to the guest via
> cpu_register_physical_memory(). It looks like the code has to
> explicitly manage marking pages dirty and synchronising at appropriate
> times. Is the coherency problem bidirectional, e.g. writes from either
> host or guest to the shared memory need to mark pages dirty, and
> ensure sync is called before the other side reads those areas?
>    

Shared memory is fully coherent.  You can use the ordinary x86 bus lock 
operations for concurrent read-modify-write access, and the memory 
barrier instructions to prevent reordering.  Just like ordinary shared 
memory.
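
As a sketch of what that looks like in practice, using the gcc builtins
rather than hand-written asm (the structure and field names here are
hypothetical):

    #include <stdint.h>

    /* Hypothetical control block living in the shared region. */
    struct shared_ctrl {
        volatile uint32_t next_ticket;
        volatile uint32_t producer_idx;
        volatile uint32_t slot[64];
    };

    /* Publish a value: lock-prefixed RMW, then a full barrier so the
     * other side never sees the index ahead of the data. */
    static void publish(struct shared_ctrl *c, uint32_t value)
    {
        uint32_t t = __sync_fetch_and_add(&c->next_ticket, 1); /* lock xadd */
        c->slot[t % 64] = value;
        __sync_synchronize();                                  /* mfence */
        c->producer_idx = t + 1;
    }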

>>> Does this
>>> cause a page fault/vm_exit on each read or write, or is it more
>>> efficient than that?
>>>        
>> It depends on how you configure it.  Look at the vga code (hw/vga.c,
>> hw/cirrus_vga.c).  Also Cam (copied) wrote a PCI card that provides shared
>> memory across guests, you may want to look at that.
>>      
>
> I will look into the vga code and see if I get inspired. The 'copied'
> driver sounds interesting, the code is not in kvm git?
>    

(copied) means Cam was copied (cc'ed) on the email, not the name of the 
driver.  It hasn't been merged but copies (of the driver, not Cam) are 
floating around on the Internet.

The relevant parts of cirrus_vga.c are:

static void cirrus_pci_lfb_map(PCIDevice *d, int region_num,
                                uint32_t addr, uint32_t size, int type)
{

...

     /* XXX: add byte swapping apertures */
     cpu_register_physical_memory(addr, s->vga.vram_size,
                                  s->cirrus_linear_io_addr);

This function is called whenever the guest updates the BAR.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: R/W HG memory mappings with kvm?
  2009-07-08  4:36     ` Avi Kivity
@ 2009-07-08 21:33       ` Stephen Donnelly
  2009-07-09  8:10         ` Avi Kivity
  2009-07-08 21:45       ` Cam Macdonell
  1 sibling, 1 reply; 32+ messages in thread
From: Stephen Donnelly @ 2009-07-08 21:33 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm

> Shared memory is fully coherent.  You can use the ordinary x86 bus lock
> operations for concurrent read-modify-write access, and the memory barrier
> instructions to prevent reordering.  Just like ordinary shared memory.

Okay, I think I was confused by the 'dirty' code. Is that just to do
with migration?

> (copied) means Cam was copied (cc'ed) on the email, not the name of the
> driver.  It hasn't been merged but copies (of the driver, not Cam) are
> floating around on the Internet.

Thanks, I'll ask him for a pointer.

> The relevant parts of cirrus_vga.c are:
>
> static void cirrus_pci_lfb_map(PCIDevice *d, int region_num,
>                               uint32_t addr, uint32_t size, int type)
> {
>
> ...
>
>    /* XXX: add byte swapping apertures */
>    cpu_register_physical_memory(addr, s->vga.vram_size,
>                                 s->cirrus_linear_io_addr);
>
> This function is called whenever the guest updates the BAR.

So guest accesses to the LFB PCI BAR trigger the cirrus_linear
functions, which set dirty on writes and allow 'side effect' handling
for reads if required? In my case there should be no side effects, so
it could be quite simple. I wonder about the cost of the callbacks on
each access though; am I still missing something?

Thank you for your patience, I really appreciate the assistance and
look forward to using kvm more widely.

Stephen.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: R/W HG memory mappings with kvm?
  2009-07-08  4:36     ` Avi Kivity
  2009-07-08 21:33       ` Stephen Donnelly
@ 2009-07-08 21:45       ` Cam Macdonell
  2009-07-08 22:01         ` Stephen Donnelly
  1 sibling, 1 reply; 32+ messages in thread
From: Cam Macdonell @ 2009-07-08 21:45 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Stephen Donnelly, kvm

Avi Kivity wrote:
> On 07/08/2009 01:23 AM, Stephen Donnelly wrote:
>>
>>>> Also it appears that PCI IO memory (cpu_register_io_memory) is
>>>> provided via access functions, like the pci config space?
>>>>        
>>> It can also use ordinary RAM (for example, vga maps its framebuffer 
>>> as a PCI
>>> BAR).
>>>      
>>
>> So host memory is exported as a PCI_BAR to the guest via
>> cpu_register_physical_memory(). It looks like the code has to
>> explicitly manage marking pages dirty and synchronising at appropriate
>> times. Is the coherency problem bidirectional, e.g. writes from either
>> host or guest to the shared memory need to mark pages dirty, and
>> ensure sync is called before the other side reads those areas?
>>    
> 
> Shared memory is fully coherent.  You can use the ordinary x86 bus lock 
> operations for concurrent read-modify-write access, and the memory 
> barrier instructions to prevent reordering.  Just like ordinary shared 
> memory.
> 
>>>> Does this
>>>> cause a page fault/vm_exit on each read or write, or is it more
>>>> efficient than that?
>>>>        
>>> It depends on how you configure it.  Look at the vga code (hw/vga.c,
>>> hw/cirrus_vga.c).  Also Cam (copied) wrote a PCI card that provides 
>>> shared
>>> memory across guests, you may want to look at that.
>>>      
>>
>> I will look into the vga code and see if I get inspired. The 'copied'
>> driver sounds interesting, the code is not in kvm git?
>>    
> 
> (copied) means Cam was copied (cc'ed) on the email, not the name of the 
> driver.  It hasn't been merged but copies (of the driver, not Cam) are 
> floating around on the Internet.

Hi Stephen,

Here is the latest patch that supports interrupts.  I am currently 
working on a broadcast mechanism that should be ready fairly soon.

http://patchwork.kernel.org/patch/22368/

I have some test scripts that can demonstrate how to use the memory 
between guest/host and guest/guest.  Let me know if you would like me to 
send them to you.

Cheers,
Cam

> 
> The relevant parts of cirrus_vga.c are:
> 
> static void cirrus_pci_lfb_map(PCIDevice *d, int region_num,
>                                uint32_t addr, uint32_t size, int type)
> {
> 
> ...
> 
>     /* XXX: add byte swapping apertures */
>     cpu_register_physical_memory(addr, s->vga.vram_size,
>                                  s->cirrus_linear_io_addr);
> 
> This function is called whenever the guest updates the BAR.
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: R/W HG memory mappings with kvm?
  2009-07-08 21:45       ` Cam Macdonell
@ 2009-07-08 22:01         ` Stephen Donnelly
  2009-07-09  6:01           ` Cam Macdonell
  0 siblings, 1 reply; 32+ messages in thread
From: Stephen Donnelly @ 2009-07-08 22:01 UTC (permalink / raw)
  To: Cam Macdonell; +Cc: Avi Kivity, kvm

On Thu, Jul 9, 2009 at 9:45 AM, Cam Macdonell<cam@cs.ualberta.ca> wrote:
> Hi Stephen,
>
> Here is the latest patch that supports interrupts.  I am currently working
> on a broadcast mechanism that should be ready fairly soon.
>
> http://patchwork.kernel.org/patch/22368/
>
> I have some test scripts that can demonstrate how to use the memory between
> guest/host and guest/guest.  Let me know if you would like me to send them
> to you.

Hi Cam,

Thanks for the pointer. That makes perfect sense, I'm familiar with
PCI drivers so that's fine.

Is there a corresponding qemu patch for the backend to the guest pci
driver? I'm curious how the buffer memory is allocated and how BAR
accesses are handled from the host side.

Regards,
Stephen.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: R/W HG memory mappings with kvm?
  2009-07-08 22:01         ` Stephen Donnelly
@ 2009-07-09  6:01           ` Cam Macdonell
  2009-07-09 22:38             ` Stephen Donnelly
       [not found]             ` <5f370d430907262256rd7f9fdalfbbec1f9492ce86@mail.gmail.com>
  0 siblings, 2 replies; 32+ messages in thread
From: Cam Macdonell @ 2009-07-09  6:01 UTC (permalink / raw)
  To: Stephen Donnelly; +Cc: Avi Kivity, kvm


On 8-Jul-09, at 4:01 PM, Stephen Donnelly wrote:

> On Thu, Jul 9, 2009 at 9:45 AM, Cam Macdonell<cam@cs.ualberta.ca>  
> wrote:
>> Hi Stephen,
>>
>> Here is the latest patch that supports interrupts.  I am currently  
>> working
>> on a broadcast mechanism that should be ready fairly soon.
>>
>> http://patchwork.kernel.org/patch/22368/
>>
>> I have some test scripts that can demonstrate how to use the memory  
>> between
>> guest/host and guest/guest.  Let me know if you would like me to  
>> send them
>> to you.
>
> Hi Cam,
>
> Thanks for the pointer. That makes perfect sense, I'm familiar with
> PCI drivers so that's fine.
>
> Is there a corresponding qemu patch for the backend to the guest pci
> driver?

Oops, right.  For some reason I can't find my driver patch in patchwork.

http://kerneltrap.org/mailarchive/linux-kvm/2009/5/7/5665734

> I'm curious how the buffer memory is allocated and how BAR
> accesses are handled from the host side.

The memory for the device is allocated as a POSIX shared memory object
and then mmapped onto the allocated BAR region in Qemu's memory.
That's actually one spot that needs a bit of fixing, by passing the
already allocated memory object to qemu instead of mmapping onto it.
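
On the host side that means any other process can attach to the same
object. A sketch, assuming the object was created under the name passed
to -ivshmem (it shows up as /dev/shm/<name>; link with -lrt on older
glibc):

    #include <stddef.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Attach to the shared memory object that qemu created.  Note the
     * patch prepends a '/' to the name given on the command line. */
    static void *map_ivshmem(const char *name, size_t size)
    {
        int fd = shm_open(name, O_RDWR, 0);
        void *p;

        if (fd < 0)
            return NULL;
        p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);
        return (p == MAP_FAILED) ? NULL : p;
    }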

Cam



-----------------------------------------------
A. Cameron Macdonell
Ph.D. Student
Department of Computing Science
University of Alberta
cam@cs.ualberta.ca




^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: R/W HG memory mappings with kvm?
  2009-07-08 21:33       ` Stephen Donnelly
@ 2009-07-09  8:10         ` Avi Kivity
  0 siblings, 0 replies; 32+ messages in thread
From: Avi Kivity @ 2009-07-09  8:10 UTC (permalink / raw)
  To: Stephen Donnelly; +Cc: kvm

On 07/09/2009 12:33 AM, Stephen Donnelly wrote:
>> Shared memory is fully coherent.  You can use the ordinary x86 bus lock
>> operations for concurrent read-modify-write access, and the memory barrier
>> instructions to prevent reordering.  Just like ordinary shared memory.
>>      
>
> Okay, I think I was confused by the 'dirty' code. Is that just to do
> with migration?
>    

Migration and reducing vga updates.

>> static void cirrus_pci_lfb_map(PCIDevice *d, int region_num,
>>                                uint32_t addr, uint32_t size, int type)
>> {
>>
>> ...
>>
>>     /* XXX: add byte swapping apertures */
>>     cpu_register_physical_memory(addr, s->vga.vram_size,
>>                                  s->cirrus_linear_io_addr);
>>
>> This function is called whenever the guest updates the BAR.
>>      
>
> So guest accesses to the LFB PCI_BAR trigger the cirrus_linear
> functions, which set dirty on writes and allow 'side effect' handling
> for reads if required? In my case there should be no side effects, so
> it could be quite simple. I wonder about the cost of the callbacks on
> each access though, am I still missing something?
>    

Sorry, I quoted the wrong part.  vga is complicated because some modes 
do need callbacks on reads and writes, and others can use normal RAM 
(with dirty tracking).

The real direct mapping code is:

     static void map_linear_vram(CirrusVGAState *s)
     {
         vga_dirty_log_stop(&s->vga);
         if (!s->vga.map_addr && s->vga.lfb_addr && s->vga.lfb_end) {
             s->vga.map_addr = s->vga.lfb_addr;
             s->vga.map_end = s->vga.lfb_end;
             cpu_register_physical_memory(s->vga.map_addr,
                                          s->vga.map_end - s->vga.map_addr,
                                          s->vga.vram_offset);
         }

s->vga.vram_offset is a ram_addr_t describing the vga framebuffer.
You're much better off reading Cam's code as it's much simpler and
closer to what you want to do (possibly you can use it as is).
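
For contrast, the callback-trapped variant (condensed from the ivshmem
patch elsewhere in this thread; the handler names are placeholders)
looks like:

    /* Every guest load/store in this region exits to qemu and calls back. */
    static CPUReadMemoryFunc  *my_mmio_read[3]  = { rd_b, rd_w, rd_l };
    static CPUWriteMemoryFunc *my_mmio_write[3] = { wr_b, wr_w, wr_l };

    io_index = cpu_register_io_memory(my_mmio_read, my_mmio_write, opaque);

    /* ...and in the BAR map callback: */
    cpu_register_physical_memory(addr, 0x100, io_index);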

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: R/W HG memory mappings with kvm?
  2009-07-09  6:01           ` Cam Macdonell
@ 2009-07-09 22:38             ` Stephen Donnelly
  2009-07-10 17:03               ` Cam Macdonell
       [not found]             ` <5f370d430907262256rd7f9fdalfbbec1f9492ce86@mail.gmail.com>
  1 sibling, 1 reply; 32+ messages in thread
From: Stephen Donnelly @ 2009-07-09 22:38 UTC (permalink / raw)
  To: Cam Macdonell; +Cc: Avi Kivity, kvm

On Thu, Jul 9, 2009 at 6:01 PM, Cam Macdonell<cam@cs.ualberta.ca> wrote:

>> Is there a corresponding qemu patch for the backend to the guest pci
>> driver?
>
> Oops right.   For some reason I can't my driver patch in patchwork.
>
> http://kerneltrap.org/mailarchive/linux-kvm/2009/5/7/5665734

Thanks for the link, I have read through the thread now. It seems very
relevant to what I am doing. Have you found a link to your qemu-kvm
backend patches? Or are you running your own git tree? I don't really
know where to look.

>> I'm curious how the buffer memory is allocated and how BAR
>> accesses are handled from the host side.
>
> The memory for the device allocated as a POSIX shared memory object and then
> mmapped on to the allocated BAR region in Qemu's allocated memory.  That's
> actually one spot that needs a bit of fixing by passing the already
> allocated memory object to qemu instead of mmapping on to it.

Right, I would be passing the memory in pre-allocated as well, but it
should be a relatively simple change.

Regards,
Stephen.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: R/W HG memory mappings with kvm?
  2009-07-09 22:38             ` Stephen Donnelly
@ 2009-07-10 17:03               ` Cam Macdonell
  2009-07-12 21:28                 ` Stephen Donnelly
  0 siblings, 1 reply; 32+ messages in thread
From: Cam Macdonell @ 2009-07-10 17:03 UTC (permalink / raw)
  To: Stephen Donnelly; +Cc: kvm

Stephen Donnelly wrote:
> On Thu, Jul 9, 2009 at 6:01 PM, Cam Macdonell<cam@cs.ualberta.ca> wrote:
> 
>>> Is there a corresponding qemu patch for the backend to the guest pci
>>> driver?
>> Oops right.   For some reason I can't my driver patch in patchwork.
>>
>> http://kerneltrap.org/mailarchive/linux-kvm/2009/5/7/5665734
> 
> Thanks for the link, I have read through the thread now. It seems very
> relevant to what I am doing. Have you found a link to your qemu-kvm
> backend patches? Or are you running your own git tree? I don't really
> know where to look.

Oops, I realize now that I passed the driver patch both times.  Here is 
the old patch.

http://patchwork.kernel.org/patch/22363/

What are you compiling against?  The git tree or a particular version?
The above patch won't compile against the latest git tree due to changes
to how BARs are set up in Qemu.  I can send you a patch for the latest
tree if you need it.

Cam

> 
>>> I'm curious how the buffer memory is allocated and how BAR
>>> accesses are handled from the host side.
>> The memory for the device allocated as a POSIX shared memory object and then
>> mmapped on to the allocated BAR region in Qemu's allocated memory.  That's
>> actually one spot that needs a bit of fixing by passing the already
>> allocated memory object to qemu instead of mmapping on to it.
> 
> Right, I would be passing the memory in pre-allocated as well, but
> should be a relatively simple change.
> 
> Regards,
> Stephen.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: R/W HG memory mappings with kvm?
  2009-07-10 17:03               ` Cam Macdonell
@ 2009-07-12 21:28                 ` Stephen Donnelly
  2009-07-14 22:25                   ` [PATCH] Support shared memory PCI device Cam Macdonell
  0 siblings, 1 reply; 32+ messages in thread
From: Stephen Donnelly @ 2009-07-12 21:28 UTC (permalink / raw)
  To: Cam Macdonell; +Cc: kvm

On Sat, Jul 11, 2009 at 5:03 AM, Cam Macdonell<cam@cs.ualberta.ca> wrote:
> Oops, I realize now that I passed the driver patch both times.  Here is the
> old patch.
>
> http://patchwork.kernel.org/patch/22363/
>
> What are you compiling against?  the git tree or a particular version? The
> above patch won't compile against the latest git tree due to changes to how
> BARs are setup in Qemu.  I can send you a patch for the latest tree if you
> need it.

Thanks Cam, I will take a look at this code.

At the moment I have cloned the tree so am intending to work at the
tip. If you have a patch for the latest tree that would be great.

Regards,
Stephen.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH] Support shared memory PCI device
  2009-07-12 21:28                 ` Stephen Donnelly
@ 2009-07-14 22:25                   ` Cam Macdonell
  0 siblings, 0 replies; 32+ messages in thread
From: Cam Macdonell @ 2009-07-14 22:25 UTC (permalink / raw)
  To: Stephen Donnelly; +Cc: kvm, Cam Macdonell

This patch is an updated version of a previous one (http://patchwork.kernel.org/patch/22363/) that supports adding a shared memory PCI device.  To be clear, there is no new functionality in this patch, just a fix for changes to the master branch.

The device's memory is mappable into user-level to provide zero-copy messaging between guests and between guest and host.  Please see the previous patch for a more detailed description.
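
A usage sketch (the object name, size and socket path below are only
examples):

    qemu-system-x86_64 -m 512 -hda guest.img \
        -ivshmem ivshmem,16,unix:/tmp/ivshmem_socket

This creates (or opens) /dev/shm/ivshmem, sized 16 MB, exposes it to the
guest as a PCI BAR, and uses the unix domain socket for interrupts.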

---
 Makefile.target |    3 +
 hw/ivshmem.c    |  421 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 hw/pc.c         |    6 +
 hw/pc.h         |    3 +
 qemu-options.hx |   14 ++
 sysemu.h        |    8 +
 vl.c            |   14 ++
 7 files changed, 469 insertions(+), 0 deletions(-)
 create mode 100644 hw/ivshmem.c

diff --git a/Makefile.target b/Makefile.target
index 660a855..323a935 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -611,6 +611,9 @@ obj-y += pcnet.o
 obj-y += rtl8139.o
 obj-y += e1000.o
 
+# Inter-VM PCI shared memory
+obj-y += ivshmem.o
+
 # Generic watchdog support and some watchdog devices
 obj-y += wdt_ib700.o wdt_i6300esb.o
 
diff --git a/hw/ivshmem.c b/hw/ivshmem.c
new file mode 100644
index 0000000..8b66d0f
--- /dev/null
+++ b/hw/ivshmem.c
@@ -0,0 +1,421 @@
+/*
+ * Inter-VM Shared Memory PCI device.
+ *
+ * Author:
+ *      Cam Macdonell <cam@cs.ualberta.ca>
+ *
+ * Based On: cirrus_vga.c and rtl8139.c
+ *
+ * This code is licensed under the GNU GPL v2.
+ */
+
+#include "hw.h"
+#include "console.h"
+#include "pc.h"
+#include "pci.h"
+#include "sysemu.h"
+
+#include "qemu-common.h"
+#include <sys/mman.h>
+
+#define PCI_COMMAND_IOACCESS                0x0001
+#define PCI_COMMAND_MEMACCESS               0x0002
+#define PCI_COMMAND_BUSMASTER               0x0004
+
+//#define DEBUG_IVSHMEM
+
+#ifdef DEBUG_IVSHMEM
+#define IVSHMEM_DPRINTF(fmt, args...)        \
+    do {printf("IVSHMEM: " fmt, ##args); } while (0)
+#else
+#define IVSHMEM_DPRINTF(fmt, args...)
+#endif
+
+typedef struct IVShmemState {
+    uint16_t intrmask;
+    uint16_t intrstatus;
+    uint16_t doorbell;
+    uint8_t *ivshmem_ptr;
+    unsigned long ivshmem_offset;
+    unsigned int ivshmem_size;
+    unsigned long bios_offset;
+    unsigned int bios_size;
+    target_phys_addr_t base_ctrl;
+    int it_shift;
+    PCIDevice *pci_dev;
+    CharDriverState * chr;
+    unsigned long map_addr;
+    unsigned long map_end;
+    int ivshmem_mmio_io_addr;
+} IVShmemState;
+
+typedef struct PCI_IVShmemState {
+    PCIDevice dev;
+    IVShmemState ivshmem_state;
+} PCI_IVShmemState;
+
+typedef struct IVShmemDesc {
+    char name[1024];
+    char * chrdev;
+    int size;
+} IVShmemDesc;
+
+
+/* registers for the Inter-VM shared memory device */
+enum ivshmem_registers {
+    IntrMask = 0,
+    IntrStatus = 16,
+    Doorbell = 32
+};
+
+static int num_ivshmem_devices = 0;
+static IVShmemDesc ivshmem_desc;
+
+static void ivshmem_map(PCIDevice *pci_dev, int region_num,
+                    uint32_t addr, uint32_t size, int type)
+{
+    PCI_IVShmemState *d = (PCI_IVShmemState *)pci_dev;
+    IVShmemState *s = &d->ivshmem_state;
+
+    IVSHMEM_DPRINTF("addr = %u size = %u\n", addr, size);
+    cpu_register_physical_memory(addr, s->ivshmem_size, s->ivshmem_offset);
+
+}
+
+void ivshmem_init(const char * optarg) {
+
+    char * temp;
+    char * ivshmem_sz;
+    int size;
+
+    num_ivshmem_devices++;
+
+    /* currently we only support 1 device */
+    if (num_ivshmem_devices > MAX_IVSHMEM_DEVICES) {
+        return;
+    }
+
+    temp = strdup(optarg);
+    snprintf(ivshmem_desc.name, 1024, "/%s", strsep(&temp,","));
+    ivshmem_sz=strsep(&temp,",");
+    if (ivshmem_sz != NULL){
+        size = atol(ivshmem_sz);
+    } else {
+        size = -1;
+    }
+
+    ivshmem_desc.chrdev = strsep(&temp,"\0");
+
+    if ( size == -1) {
+        ivshmem_desc.size = TARGET_PAGE_SIZE;
+    } else {
+        ivshmem_desc.size = size*1024*1024;
+    }
+    IVSHMEM_DPRINTF("optarg is %s, name is %s, size is %d, chrdev is %s\n",
+                                        optarg, ivshmem_desc.name,
+                                        ivshmem_desc.size, ivshmem_desc.chrdev);
+}
+
+int ivshmem_get_size(void) {
+    return ivshmem_desc.size;
+}
+
+/* accessing registers - based on rtl8139 */
+static void ivshmem_update_irq(IVShmemState *s)
+{
+    int isr;
+    isr = (s->intrstatus & s->intrmask) & 0xffff;
+
+    /* don't print ISR resets */
+    if (isr) {
+        IVSHMEM_DPRINTF("Set IRQ to %d (%04x %04x)\n",
+           isr ? 1 : 0, s->intrstatus, s->intrmask);
+    }
+
+    qemu_set_irq(s->pci_dev->irq[0], (isr != 0));
+}
+
+static void ivshmem_mmio_map(PCIDevice *pci_dev, int region_num,
+                       uint32_t addr, uint32_t size, int type)
+{
+    PCI_IVShmemState *d = (PCI_IVShmemState *)pci_dev;
+    IVShmemState *s = &d->ivshmem_state;
+
+    cpu_register_physical_memory(addr + 0, 0x100, s->ivshmem_mmio_io_addr);
+}
+
+static void ivshmem_IntrMask_write(IVShmemState *s, uint32_t val)
+{
+    IVSHMEM_DPRINTF("IntrMask write(w) val = 0x%04x\n", val);
+
+    s->intrmask = val;
+
+    ivshmem_update_irq(s);
+}
+
+static uint32_t ivshmem_IntrMask_read(IVShmemState *s)
+{
+    uint32_t ret = s->intrmask;
+
+    IVSHMEM_DPRINTF("intrmask read(w) val = 0x%04x\n", ret);
+
+    return ret;
+}
+
+static void ivshmem_IntrStatus_write(IVShmemState *s, uint32_t val)
+{
+    IVSHMEM_DPRINTF("IntrStatus write(w) val = 0x%04x\n", val);
+
+    s->intrstatus = val;
+
+    ivshmem_update_irq(s);
+    return;
+}
+
+static uint32_t ivshmem_IntrStatus_read(IVShmemState *s)
+{
+    uint32_t ret = s->intrstatus;
+
+    /* reading ISR clears all interrupts */
+    s->intrstatus = 0;
+
+    ivshmem_update_irq(s);
+
+    return ret;
+}
+
+static void ivshmem_io_writew(void *opaque, uint8_t addr, uint32_t val)
+{
+    IVShmemState *s = opaque;
+
+    IVSHMEM_DPRINTF("writing 0x%x to 0x%lx\n", addr, (unsigned long) opaque);
+
+    addr &= 0xfe;
+
+    switch (addr)
+    {
+        case IntrMask:
+            ivshmem_IntrMask_write(s, val);
+            break;
+
+        case IntrStatus:
+            ivshmem_IntrStatus_write(s, val);
+            break;
+
+        default:
+            IVSHMEM_DPRINTF("why are we writing 0x%x\n", addr);
+    }
+}
+
+static void ivshmem_io_writel(void *opaque, uint8_t addr, uint32_t val)
+{
+    IVSHMEM_DPRINTF("We shouldn't be writing longs\n");
+}
+
+static void ivshmem_io_writeb(void *opaque, uint8_t addr, uint32_t val)
+{
+    IVShmemState *s = opaque;
+    uint8_t writebyte = val & 0xff; //write the lower 8-bits of 'val'
+
+    switch (addr)
+    {   // in future, we will probably want to support more types of doorbells
+        case Doorbell:
+            // wake up the other side
+            qemu_chr_write(s->chr, &writebyte, 1);
+            IVSHMEM_DPRINTF("Writing to the other side 0x%x\n", writebyte);
+            break;
+        default:
+            IVSHMEM_DPRINTF("Unhandled write (0x%x)\n", addr);
+    }
+}
+
+static uint32_t ivshmem_io_readw(void *opaque, uint8_t addr)
+{
+
+    IVShmemState *s = opaque;
+    uint32_t ret;
+
+    switch (addr)
+    {
+        case IntrMask:
+            ret = ivshmem_IntrMask_read(s);
+            break;
+        case IntrStatus:
+            ret = ivshmem_IntrStatus_read(s);
+            break;
+        default:
+            IVSHMEM_DPRINTF("why are we reading 0x%x\n", addr);
+            ret = 0;
+    }
+
+    return ret;
+}
+
+static uint32_t ivshmem_io_readl(void *opaque, uint8_t addr)
+{
+    IVSHMEM_DPRINTF("We shouldn't be reading longs\n");
+    return 0;
+}
+
+static uint32_t ivshmem_io_readb(void *opaque, uint8_t addr)
+{
+    IVSHMEM_DPRINTF("We shouldn't be reading bytes\n");
+
+    return 0;
+}
+
+static void ivshmem_mmio_writeb(void *opaque,
+                                target_phys_addr_t addr, uint32_t val)
+{
+    ivshmem_io_writeb(opaque, addr & 0xFF, val);
+}
+
+static void ivshmem_mmio_writew(void *opaque,
+                                target_phys_addr_t addr, uint32_t val)
+{
+    ivshmem_io_writew(opaque, addr & 0xFF, val);
+}
+
+static void ivshmem_mmio_writel(void *opaque,
+                                target_phys_addr_t addr, uint32_t val)
+{
+    ivshmem_io_writel(opaque, addr & 0xFF, val);
+}
+
+static uint32_t ivshmem_mmio_readb(void *opaque, target_phys_addr_t addr)
+{
+    return ivshmem_io_readb(opaque, addr & 0xFF);
+}
+
+static uint32_t ivshmem_mmio_readw(void *opaque, target_phys_addr_t addr)
+{
+    uint32_t val = ivshmem_io_readw(opaque, addr & 0xFF);
+    return val;
+}
+
+static uint32_t ivshmem_mmio_readl(void *opaque, target_phys_addr_t addr)
+{
+    uint32_t val = ivshmem_io_readl(opaque, addr & 0xFF);
+    return val;
+}
+
+static CPUReadMemoryFunc *ivshmem_mmio_read[3] = {
+    ivshmem_mmio_readb,
+    ivshmem_mmio_readw,
+    ivshmem_mmio_readl,
+};
+
+static CPUWriteMemoryFunc *ivshmem_mmio_write[3] = {
+    ivshmem_mmio_writeb,
+    ivshmem_mmio_writew,
+    ivshmem_mmio_writel,
+};
+
+static int ivshmem_can_receive(void * opaque)
+{
+    return 1;
+}
+
+static void ivshmem_receive(void *opaque, const uint8_t *buf, int size)
+{
+    IVShmemState *s = opaque;
+
+    ivshmem_IntrStatus_write(s, *buf);
+
+    IVSHMEM_DPRINTF("ivshmem_receive 0x%02x\n", *buf);
+}
+
+static void ivshmem_event(void *opaque, int event)
+{
+    IVShmemState *s = opaque;
+    IVSHMEM_DPRINTF("ivshmem_event %d\n", event);
+}
+
+int pci_ivshmem_init(PCIBus *bus)
+{
+    PCI_IVShmemState *d;
+    IVShmemState *s;
+    uint8_t *pci_conf;
+    int ivshmem_fd;
+
+    IVSHMEM_DPRINTF("shared file is %s\n", ivshmem_desc.name);
+    d = (PCI_IVShmemState *)pci_register_device(bus, "kvm_ivshmem",
+                                           sizeof(PCI_IVShmemState),
+                                           -1, NULL, NULL);
+    if (!d) {
+        return -1;
+    }
+
+    s = &d->ivshmem_state;
+
+    /* allocate shared memory RAM */
+    s->ivshmem_offset = qemu_ram_alloc(ivshmem_desc.size);
+    IVSHMEM_DPRINTF("size is = %d\n", ivshmem_desc.size);
+    IVSHMEM_DPRINTF("ivshmem ram offset = %ld\n", s->ivshmem_offset);
+
+    s->ivshmem_ptr = qemu_get_ram_ptr(s->ivshmem_offset);
+
+    s->pci_dev = &d->dev;
+    s->ivshmem_size = ivshmem_desc.size;
+
+    pci_conf = d->dev.config;
+    pci_conf[0x00] = 0xf4; // Qumranet vendor ID 0x5002
+    pci_conf[0x01] = 0x1a;
+    pci_conf[0x02] = 0x10;
+    pci_conf[0x03] = 0x11;
+    pci_conf[0x04] = PCI_COMMAND_IOACCESS | PCI_COMMAND_MEMACCESS;
+    pci_conf[0x0a] = 0x00; // RAM controller
+    pci_conf[0x0b] = 0x05;
+    pci_conf[0x0e] = 0x00; // header_type
+
+    pci_conf[PCI_INTERRUPT_PIN] = 1; // we are going to support interrupts
+
+    /* XXX: ivshmem_desc.size must be a power of two */
+
+    s->ivshmem_mmio_io_addr = cpu_register_io_memory(ivshmem_mmio_read,
+                                    ivshmem_mmio_write, s);
+
+    /* region for registers*/
+    pci_register_bar(&d->dev, 0, 0x100,
+                           PCI_ADDRESS_SPACE_MEM, ivshmem_mmio_map);
+
+    /* region for shared memory */
+    pci_register_bar(&d->dev, 1, ivshmem_desc.size,
+                           PCI_ADDRESS_SPACE_MEM, ivshmem_map);
+
+    /* open shared memory file  */
+    if ((ivshmem_fd = shm_open(ivshmem_desc.name, O_CREAT|O_RDWR, S_IRWXU)) < 0)
+    {
+        fprintf(stderr, "kvm_ivshmem: could not open shared file\n");
+        exit(-1);
+    }
+
+    ftruncate(ivshmem_fd, ivshmem_desc.size);
+
+    /* mmap onto PCI device's memory */
+    if (mmap(s->ivshmem_ptr, ivshmem_desc.size, PROT_READ|PROT_WRITE,
+                        MAP_SHARED|MAP_FIXED, ivshmem_fd, 0) == MAP_FAILED)
+    {
+        fprintf(stderr, "kvm_ivshmem: could not mmap shared file\n");
+        exit(-1);
+    }
+
+    IVSHMEM_DPRINTF("shared object mapped to 0x%p\n", s->ivshmem_ptr);
+
+    /* setup character device channel */
+
+    if (ivshmem_desc.chrdev != NULL) {
+        char label[32];
+        snprintf(label, 32, "ivshmem_chardev");
+        s->chr = qemu_chr_open(label, ivshmem_desc.chrdev, NULL);
+        if (s->chr == NULL) {
+            fprintf(stderr, "No server listening on %s\n", ivshmem_desc.chrdev);
+            exit(-1);
+        }
+        qemu_chr_add_handlers(s->chr, ivshmem_can_receive, ivshmem_receive,
+                          ivshmem_event, s);
+    }
+
+    return 0;
+}
+
diff --git a/hw/pc.c b/hw/pc.c
index cf84416..f20dc83 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -91,6 +91,8 @@ static void option_rom_setup_reset(target_phys_addr_t addr, unsigned size)
     qemu_register_reset(option_rom_reset, rrd);
 }
 
+extern int ivshmem_enabled;
+
 static void ioport80_write(void *opaque, uint32_t addr, uint32_t data)
 {
 }
@@ -1289,6 +1291,10 @@ static void pc_init1(ram_addr_t ram_size,
         }
     }
 
+    if (pci_enabled && ivshmem_enabled) {
+        pci_ivshmem_init(pci_bus);
+    }
+
     rtc_state = rtc_init(0x70, i8259[8], 2000);
 
     qemu_register_boot_set(pc_boot_set, rtc_state);
diff --git a/hw/pc.h b/hw/pc.h
index 10bf002..5d19a0f 100644
--- a/hw/pc.h
+++ b/hw/pc.h
@@ -185,4 +185,7 @@ void extboot_init(BlockDriverState *bs, int cmd);
 
 int cpu_is_bsp(CPUState *env);
 
+/* ivshmem.c */
+int pci_ivshmem_init(PCIBus *bus);
+
 #endif
diff --git a/qemu-options.hx b/qemu-options.hx
index 7e98053..2411372 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -1309,6 +1309,20 @@ The default device is @code{vc} in graphical mode and @code{stdio} in
 non graphical mode.
 ETEXI
 
+DEF("ivshmem", HAS_ARG, QEMU_OPTION_ivshmem, \
+    "-ivshmem name,size[,unix:path][,server]  creates or opens a shared file 'name' of size \
+    'size' (in MB) and exposes it as a PCI device in the guest\n")
+STEXI
+@item -ivshmem @var{file},@var{size}
+Creates a POSIX shared file named @var{file} of size @var{size} and creates a
+PCI device of the same size that maps the shared file into the device for guests
+to access.  The created file on the host is located in /dev/shm/
+
+@item unix:@var{path}[,server]
+A unix domain socket is used to send and receive interrupts between VMs.  The unix domain socket
+@var{path} is used for connections.
+ETEXI
+
 DEF("pidfile", HAS_ARG, QEMU_OPTION_pidfile, \
     "-pidfile file   write PID to 'file'\n")
 STEXI
diff --git a/sysemu.h b/sysemu.h
index 5582633..24abda1 100644
--- a/sysemu.h
+++ b/sysemu.h
@@ -239,6 +239,14 @@ extern CharDriverState *parallel_hds[MAX_PARALLEL_PORTS];
 
 extern CharDriverState *virtcon_hds[MAX_VIRTIO_CONSOLES];
 
+/* inter-VM shared memory devices */
+
+#define MAX_IVSHMEM_DEVICES 1
+
+extern CharDriverState * ivshmem_chardev;
+void ivshmem_init(const char * optarg);
+int ivshmem_get_size(void);
+
 #define TFR(expr) do { if ((expr) != -1) break; } while (errno == EINTR)
 
 #ifdef NEED_CPU_H
diff --git a/vl.c b/vl.c
index 5d86e69..553cf5c 100644
--- a/vl.c
+++ b/vl.c
@@ -217,6 +217,7 @@ static int rtc_date_offset = -1; /* -1 means no change */
 int cirrus_vga_enabled = 1;
 int std_vga_enabled = 0;
 int vmsvga_enabled = 0;
+int ivshmem_enabled = 0;
 int xenfb_enabled = 0;
 #ifdef TARGET_SPARC
 int graphic_width = 1024;
@@ -235,6 +236,8 @@ int no_quit = 0;
 CharDriverState *serial_hds[MAX_SERIAL_PORTS];
 CharDriverState *parallel_hds[MAX_PARALLEL_PORTS];
 CharDriverState *virtcon_hds[MAX_VIRTIO_CONSOLES];
+CharDriverState *ivshmem_chardev;
+const char * ivshmem_device;
 #ifdef TARGET_I386
 int win2k_install_hack = 0;
 int rtc_td_hack = 0;
@@ -5139,6 +5142,8 @@ int main(int argc, char **argv, char **envp)
     cyls = heads = secs = 0;
     translation = BIOS_ATA_TRANSLATION_AUTO;
     monitor_device = "vc:80Cx24C";
+    ivshmem_device = NULL;
+    ivshmem_chardev = NULL;
 
     serial_devices[0] = "vc:80Cx24C";
     for(i = 1; i < MAX_SERIAL_PORTS; i++)
@@ -5592,6 +5597,10 @@ int main(int argc, char **argv, char **envp)
                 parallel_devices[parallel_device_index] = optarg;
                 parallel_device_index++;
                 break;
+            case QEMU_OPTION_ivshmem:
+                ivshmem_device = optarg;
+                ivshmem_enabled = 1;
+                break;
 	    case QEMU_OPTION_loadvm:
 		loadvm = optarg;
 		break;
@@ -6049,6 +6058,11 @@ int main(int argc, char **argv, char **envp)
 	    }
     }
 
+    if (ivshmem_enabled) {
+        ivshmem_init(ivshmem_device);
+        ram_size += ivshmem_get_size();
+    }
+
 #ifdef CONFIG_KQEMU
     /* FIXME: This is a nasty hack because kqemu can't cope with dynamic
        guest ram allocation.  It needs to go away.  */
-- 
1.6.3.2.198.g6096d


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: R/W HG memory mappings with kvm?
       [not found]             ` <5f370d430907262256rd7f9fdalfbbec1f9492ce86@mail.gmail.com>
@ 2009-07-27 14:48               ` Cam Macdonell
  2009-07-27 21:32                 ` Stephen Donnelly
  0 siblings, 1 reply; 32+ messages in thread
From: Cam Macdonell @ 2009-07-27 14:48 UTC (permalink / raw)
  To: Stephen Donnelly; +Cc: Avi Kivity, kvm@vger.kernel.org list

Stephen Donnelly wrote:
> Hi Cam,

Hi Steve,

Sorry I haven't answered your email from last Thursday.  I'll answer it 
shortly.

> 
> On Thu, Jul 9, 2009 at 6:01 PM, Cam Macdonell<cam@cs.ualberta.ca> wrote:
> 
>> The memory for the device allocated as a POSIX shared memory object and then
>> mmapped on to the allocated BAR region in Qemu's allocated memory.  That's
>> actually one spot that needs a bit of fixing by passing the already
>> allocated memory object to qemu instead of mmapping on to it.
> 
> If you work out how to use pre-existing host memory rather than
> allocating it inside qemu I would be interested.

How is the host memory pre-existing?

> 
> I would like to have qemu mmap memory from a host char driver, and
> then in turn register that mapping as a PCI BAR for the guest device.
> (I know this sounds like pci pass-through, but it isn't.)

In my setup, qemu just calls mmap on the shared memory object that was
opened.  So I *think* that switching the shm_open(...) to
open("/dev/chardev") might be all that's necessary, as long as your char
device handles mmapping.
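
Roughly, against my patch (the device path here is just a placeholder
for whatever your driver registers):

    /* instead of:
     *   ivshmem_fd = shm_open(ivshmem_desc.name, O_CREAT|O_RDWR, S_IRWXU);
     */
    ivshmem_fd = open("/dev/yourchardev", O_RDWR); /* driver must support mmap */

    /* the existing mmap onto the BAR backing memory stays the same: */
    mmap(s->ivshmem_ptr, ivshmem_desc.size, PROT_READ|PROT_WRITE,
         MAP_SHARED|MAP_FIXED, ivshmem_fd, 0);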

> What I don't understand is how to turn the host address returned from
> mmap into a ram_addr_t to pass to pci_register_bar.

Memory must be allocated using the qemu RAM functions.  Look at
qemu_ram_alloc() and qemu_get_ram_ptr(), which are a two-step process
that allocates the memory.  Then notice that the ivshmem_ptr is mmapped
onto the memory that is returned from qemu_get_ram_ptr().

pci_register_bar calls a function (the last parameter passed to it) that
in turn calls cpu_register_physical_memory, which registers the allocated
memory (accessed as s->ivshmem_ptr) as the BAR.
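
Condensed, the flow in the patch is roughly:

    /* at device init time */
    s->ivshmem_offset = qemu_ram_alloc(size);                /* ram_addr_t   */
    s->ivshmem_ptr    = qemu_get_ram_ptr(s->ivshmem_offset); /* host pointer */
    mmap(s->ivshmem_ptr, size, PROT_READ|PROT_WRITE,
         MAP_SHARED|MAP_FIXED, shm_fd, 0);        /* back it with the object */

    pci_register_bar(&d->dev, 1, size, PCI_ADDRESS_SPACE_MEM, ivshmem_map);

    /* called when the guest programs the BAR */
    static void ivshmem_map(PCIDevice *pci_dev, int region_num,
                            uint32_t addr, uint32_t size, int type)
    {
        cpu_register_physical_memory(addr, s->ivshmem_size, s->ivshmem_offset);
    }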

Let me know if you have any more questions,
Cam


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: R/W HG memory mappings with kvm?
  2009-07-27 14:48               ` R/W HG memory mappings with kvm? Cam Macdonell
@ 2009-07-27 21:32                 ` Stephen Donnelly
  2009-07-28  8:54                   ` Avi Kivity
  0 siblings, 1 reply; 32+ messages in thread
From: Stephen Donnelly @ 2009-07-27 21:32 UTC (permalink / raw)
  To: Cam Macdonell; +Cc: Avi Kivity, kvm@vger.kernel.org list

Hi Cam,

> Sorry I haven't answered your email from last Thursday.  I'll answer it
> shortly.

Thanks, I'm still chipping away at it slowly.

>> On Thu, Jul 9, 2009 at 6:01 PM, Cam Macdonell<cam@cs.ualberta.ca> wrote:
>>
>>> The memory for the device allocated as a POSIX shared memory object and
>>> then
>>> mmapped on to the allocated BAR region in Qemu's allocated memory.
>>>  That's
>>> actually one spot that needs a bit of fixing by passing the already
>>> allocated memory object to qemu instead of mmapping on to it.
>>
>> If you work out how to use pre-existing host memory rather than
>> allocating it inside qemu I would be interested.
>
> How is the host memory pre-existing?

It comes from outside qemu; it is mapped in via mmap.

>> I would like to have qemu mmap memory from a host char driver, and
>> then in turn register that mapping as a PCI BAR for the guest device.
>> (I know this sounds like pci pass-through, but it isn't.)
>
> In my setup, qemu just calls mmap on the shared memory object that was
> opened.  So I *think* that switching the shm_open(...) to
> open("/dev/chardev"), might be all that's necessary as long as your char
> device handles mmapping.

It does, but it maps memory into the user program rather than out.

>> What I don't understand is how to turn the host address returned from
>> mmap into a ram_addr_t to pass to pci_register_bar.
>
> Memory must be allocated using the qemu RAM functions.

That seems to be the problem. The memory cannot be allocated by
qemu_ram_alloc, because it is coming from the mmap call. The memory is
already allocated outside the qemu process. mmap can indicate where in
the qemu process address space the local mapping should be, but
mapping it 'on top' of memory allocated with qemu_ram_alloc doesn't
seem to work (I get a BUG in gfn_to_pfn).

>  Look at
> qemu_ram_alloc() and qemu_get_ram_ptr() which are a two step process that
> allocate the memory.  Then notice that the ivshmem_ptr is mmapped on to the
> memory that is returned from the qemu_get_ram_ptr.
>
> pci_register_bar calls a function (the last parameter passed to it) that in
> turn calls cpu_register_physical_memory which registers the allocated memory
> (accessed a s->ivshmem_ptr) as the BAR.

Right, that seems to make sense for your application where you
allocate the memory in qemu and then share it externally via shm.

Have you thought about how to use a shm file that has already been
allocated by another application? I think you mentioned this as a
feature you were going to look at in one of your list posts.

Regards,
Stephen.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: R/W HG memory mappings with kvm?
  2009-07-27 21:32                 ` Stephen Donnelly
@ 2009-07-28  8:54                   ` Avi Kivity
  2009-07-28 23:06                     ` Stephen Donnelly
  2009-07-29 23:52                     ` Cam Macdonell
  0 siblings, 2 replies; 32+ messages in thread
From: Avi Kivity @ 2009-07-28  8:54 UTC (permalink / raw)
  To: Stephen Donnelly; +Cc: Cam Macdonell, kvm@vger.kernel.org list

On 07/28/2009 12:32 AM, Stephen Donnelly wrote:
>>> What I don't understand is how to turn the host address returned from
>>> mmap into a ram_addr_t to pass to pci_register_bar.
>>>        
>> Memory must be allocated using the qemu RAM functions.
>>      
>
> That seems to be the problem. The memory cannot be allocated by
> qemu_ram_alloc, because it is coming from the mmap call. The memory is
> already allocated outside the qemu process. mmap can indicate where in
> the qemu process address space the local mapping should be, but
> mapping it 'on top' of memory allocated with qemu_ram_alloc doesn't
> seem to work (I get a BUG in gfn_to_pfn).
>    

You need a variant of qemu_ram_alloc() that accepts an fd and offset and
mmaps that.  A less intrusive, but uglier, alternative is to call
qemu_ram_alloc() and then mmap(MAP_FIXED) on top of that.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: R/W HG memory mappings with kvm?
  2009-07-28  8:54                   ` Avi Kivity
@ 2009-07-28 23:06                     ` Stephen Donnelly
  2009-08-13  4:07                       ` Stephen Donnelly
  2009-07-29 23:52                     ` Cam Macdonell
  1 sibling, 1 reply; 32+ messages in thread
From: Stephen Donnelly @ 2009-07-28 23:06 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Cam Macdonell, kvm@vger.kernel.org list

On Tue, Jul 28, 2009 at 8:54 PM, Avi Kivity<avi@redhat.com> wrote:
> On 07/28/2009 12:32 AM, Stephen Donnelly wrote:
>>>>
>>>> What I don't understand is how to turn the host address returned from
>>>> mmap into a ram_addr_t to pass to pci_register_bar.
>>>
>>> Memory must be allocated using the qemu RAM functions.
>>
>> That seems to be the problem. The memory cannot be allocated by
>> qemu_ram_alloc, because it is coming from the mmap call. The memory is
>> already allocated outside the qemu process. mmap can indicate where in
>> the qemu process address space the local mapping should be, but
>> mapping it 'on top' of memory allocated with qemu_ram_alloc doesn't
>> seem to work (I get a BUG in gfn_to_pfn).
>
> You need a variant of qemu_ram_alloc() that accepts an fd and offset and
> mmaps that.

Okay, it sounds like a function to do this is not currently available.
That confirms my understanding at least. I will take a look but I
don't think I understand the memory management well enough to write
this myself.

> A less intrusive, but uglier, alternative is to call
> qemu_ram_alloc() and them mmap(MAP_FIXED) on top of that.

I did try this, but ended up with a BUG on the host in
/var/lib/dkms/kvm/84/build/x86/kvm_main.c:1266 gfn_to_pfn() on the
line "BUG_ON(!kvm_is_mmio_pfn(pfn));" when the guest accesses the bar.

[1847926.363458] ------------[ cut here ]------------
[1847926.363464] kernel BUG at /var/lib/dkms/kvm/84/build/x86/kvm_main.c:1266!
[1847926.363466] invalid opcode: 0000 [#1] SMP
[1847926.363470] last sysfs file:
/sys/devices/pci0000:00/0000:00:1c.5/0000:02:00.0/net/eth0/statistics/collisions
[1847926.363473] Dumping ftrace buffer:
[1847926.363476]    (ftrace buffer empty)
[1847926.363478] Modules linked in: softcard_driver(P) nls_iso8859_1
vfat fat usb_storage tun nls_utf8 nls_cp437 cifs nfs lockd nfs_acl
sunrpc binfmt_misc ppdev bnep ipt_MASQUERADE iptable_nat nf_nat
nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT
xt_tcpudp iptable_filter ip_tables x_tables bridge stp kvm_intel kvm
video output input_polldev dm_crypt sbp2 lp parport snd_usb_audio
snd_pcm_oss snd_hda_intel snd_mixer_oss snd_pcm snd_seq_dummy
snd_usb_lib snd_seq_oss snd_seq_midi snd_seq_midi_event uvcvideo
compat_ioctl32 snd_rawmidi snd_seq iTCO_wdt videodev snd_timer
snd_seq_device iTCO_vendor_support ftdi_sio usbhid v4l1_compat
snd_hwdep intel_agp nvidia(P) usbserial snd soundcore snd_page_alloc
agpgart pcspkr ohci1394 ieee1394 atl1 mii floppy fbcon tileblit font
bitblit softcursor [last unloaded: softcard_driver]
[1847926.363539]
[1847926.363542] Pid: 31516, comm: qemu-system-x86 Tainted: P
 (2.6.28-13-generic #44-Ubuntu) P5K
[1847926.363544] EIP: 0060:[<f7f5961f>] EFLAGS: 00010246 CPU: 1
[1847926.363556] EIP is at gfn_to_pfn+0xff/0x110 [kvm]
[1847926.363558] EAX: 00000000 EBX: 00000000 ECX: f40d30c8 EDX: 00000000
[1847926.363560] ESI: d0baa000 EDI: 00000001 EBP: f2cddbbc ESP: f2cddbac
[1847926.363562]  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
[1847926.363564] Process qemu-system-x86 (pid: 31516, ti=f2cdc000
task=f163d7f0 task.ti=f2cdc000)
[1847926.363566] Stack:
[1847926.363567]  f2cddbb0 f2cddbc8 00000000 000f2010 f2cddc7c
f7f65f00 00000004 f2cddbd4
[1847926.363573]  f7f5829f 00000004 f2cddbf4 f7f582ec 00000df4
00000004 d0baa000 f185a370
[1847926.363579]  df402c00 0001f719 f2cddc4c f7f66858 f2cddc40
00000004 0001f95f 00000000
[1847926.363585] Call Trace:
[1847926.363588]  [<f7f65f00>] ? kvm_mmu_pte_write+0x160/0x9a0 [kvm]
[1847926.363598]  [<f7f5829f>] ? kvm_read_guest_page+0x2f/0x40 [kvm]
[1847926.363607]  [<f7f582ec>] ? kvm_read_guest+0x3c/0x70 [kvm]
[1847926.363616]  [<f7f66858>] ? paging32_walk_addr+0x118/0x2d0 [kvm]
[1847926.363625]  [<f7f59360>] ? mark_page_dirty+0x10/0x70 [kvm]
[1847926.363634]  [<f7f59412>] ? kvm_write_guest_page+0x52/0x60 [kvm]
[1847926.363643]  [<f7f5becf>] ? emulator_write_phys+0x4f/0x70 [kvm]
[1847926.363652]  [<f7f5dcc8>] ?
emulator_write_emulated_onepage+0x58/0x130 [kvm]
[1847926.363661]  [<f7f5ddf9>] ? emulator_write_emulated+0x59/0x70 [kvm]
[1847926.363674]  [<f7f69d84>] ? x86_emulate_insn+0x414/0x2650 [kvm]
[1847926.363684]  [<c011f714>] ? handle_vm86_fault+0x4c4/0x740
[1847926.363690]  [<c011f714>] ? handle_vm86_fault+0x4c4/0x740
[1847926.363699]  [<f7f681e6>] ? do_insn_fetch+0x76/0xd0 [kvm]
[1847926.363712]  [<c011f716>] ? handle_vm86_fault+0x4c6/0x740
[1847926.363715]  [<c011f716>] ? handle_vm86_fault+0x4c6/0x740
[1847926.363719]  [<f7f6909a>] ? x86_decode_insn+0x54a/0xe20 [kvm]
[1847926.363732]  [<f7f5ecfc>] ? emulate_instruction+0x12c/0x2a0 [kvm]
[1847926.363741]  [<f7f65988>] ? kvm_mmu_page_fault+0x58/0xa0 [kvm]
[1847926.363750]  [<f7e8797a>] ? handle_exception+0x35a/0x400 [kvm_intel]
[1847926.363755]  [<f7e83e97>] ? handle_interrupt_window+0x27/0xc0 [kvm_intel]
[1847926.363760]  [<c011f714>] ? handle_vm86_fault+0x4c4/0x740
[1847926.363763]  [<f7e864e9>] ? kvm_handle_exit+0xd9/0x270 [kvm_intel]
[1847926.363768]  [<f7e87c87>] ? vmx_vcpu_run+0x137/0xa4a [kvm_intel]
[1847926.363772]  [<f7f6d767>] ? kvm_apic_has_interrupt+0x37/0xb0 [kvm]
[1847926.363781]  [<f7f6c0b7>] ? kvm_cpu_has_interrupt+0x27/0x40 [kvm]
[1847926.363790]  [<f7f61306>] ? kvm_arch_vcpu_ioctl_run+0x626/0xb20 [kvm]
[1847926.363799]  [<c015da68>] ? futex_wait+0x358/0x440
[1847926.363804]  [<f7f576e5>] ? kvm_vcpu_ioctl+0x395/0x490 [kvm]
[1847926.363812]  [<c04fec68>] ? _spin_lock+0x8/0x10
[1847926.363815]  [<c015d508>] ? futex_wake+0xc8/0xf0
[1847926.363819]  [<f7f57350>] ? kvm_vcpu_ioctl+0x0/0x490 [kvm]
[1847926.363827]  [<c01ca1d8>] ? vfs_ioctl+0x28/0x90
[1847926.363831]  [<c01ca6be>] ? do_vfs_ioctl+0x5e/0x200
[1847926.363834]  [<c01ca8c3>] ? sys_ioctl+0x63/0x70
[1847926.363836]  [<c0103f6b>] ? sysenter_do_call+0x12/0x2f
[1847926.363840] Code: 29 d3 c1 eb 0c 03 58 44 64 a1 00 e0 7a c0 8b 80
cc 01 00 00 83 c0 34 e8 b0 9b 1f c8 89 d8 e8 89 fc ff ff 85 c0 0f 85
50 ff ff ff <0f> 0b eb fe 8d b6 00 00 00 00 8d bc 27 00 00 00 00 55 89
e5 e8
[1847926.363873] EIP: [<f7f5961f>] gfn_to_pfn+0xff/0x110 [kvm] SS:ESP
0068:f2cddbac
[1847926.363885] ---[ end trace 314ce851a956cf3c ]---

Pseudo code in my pci init function is:

{
    offset = qemu_ram_alloc(64*1024);
    ptr = qemu_get_ram_ptr(offset);

    fd = open(charfile, O_RDWR);

    mmap(ptr, 64*1024, PROT_READ | PROT_WRITE, MAP_SHARED|MAP_FIXED, fd, 0);

    pci_register_bar((PCIDevice *)d, 0, 64*1024, PCI_ADDRESS_SPACE_MEM, mmio_map);
}

mmio_map() {
    cpu_register_physical_memory(addr + 0, 64*1024, offset);
}

Regards,
Stephen.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: R/W HG memory mappings with kvm?
  2009-07-28  8:54                   ` Avi Kivity
  2009-07-28 23:06                     ` Stephen Donnelly
@ 2009-07-29 23:52                     ` Cam Macdonell
  2009-07-30  9:31                       ` Avi Kivity
  1 sibling, 1 reply; 32+ messages in thread
From: Cam Macdonell @ 2009-07-29 23:52 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Stephen Donnelly, kvm@vger.kernel.org list

Avi Kivity wrote:
> On 07/28/2009 12:32 AM, Stephen Donnelly wrote:
>>>> What I don't understand is how to turn the host address returned from
>>>> mmap into a ram_addr_t to pass to pci_register_bar.
>>>>        
>>> Memory must be allocated using the qemu RAM functions.
>>>      
>>
>> That seems to be the problem. The memory cannot be allocated by
>> qemu_ram_alloc, because it is coming from the mmap call. The memory is
>> already allocated outside the qemu process. mmap can indicate where in
>> the qemu process address space the local mapping should be, but
>> mapping it 'on top' of memory allocated with qemu_ram_alloc doesn't
>> seem to work (I get a BUG in gfn_to_pfn).
>>    
> 
> You need a variant of qemu_ram_alloc() that accepts an fd and offset and 
> mmaps that.  A less intrusive, but uglier, alternative is to call 
> qemu_ram_alloc() and them mmap(MAP_FIXED) on top of that.

Hi Avi,

I noticed that the region of memory being allocated for shared memory 
using qemu_ram_alloc gets added to the total RAM of the system 
(according to /proc/meminfo).  I'm wondering if this is normal/OK since 
memory for the shared memory device (and similarly VGA RAM) is not 
intended to be used as regular RAM.

Should memory of devices be reported as part of MemTotal or is something 
wrong in my use of qemu_ram_alloc()?

Thanks,
Cam

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: R/W HG memory mappings with kvm?
  2009-07-29 23:52                     ` Cam Macdonell
@ 2009-07-30  9:31                       ` Avi Kivity
  0 siblings, 0 replies; 32+ messages in thread
From: Avi Kivity @ 2009-07-30  9:31 UTC (permalink / raw)
  To: Cam Macdonell; +Cc: Stephen Donnelly, kvm@vger.kernel.org list

On 07/30/2009 02:52 AM, Cam Macdonell wrote:
>> You need a variant of qemu_ram_alloc() that accepts an fd and offset 
>> and mmaps that.  A less intrusive, but uglier, alternative is to call 
>> qemu_ram_alloc() and them mmap(MAP_FIXED) on top of that.
>
>
> Hi Avi,
>
> I noticed that the region of memory being allocated for shared memory 
> using qemu_ram_alloc gets added to the total RAM of the system 
> (according to /proc/meminfo).  I'm wondering if this is normal/OK 
> since memory for the shared memory device (and similarly VGA RAM) is 
> not intended to be used as regular RAM.

qemu_ram_alloc() and the guest's /proc/meminfo are totally disconnected.
I don't understand how that happened.

>
> Should memory of devices be reported as part of MemTotal or is 
> something wrong in my use of qemu_ram_alloc()?

You can call qemu_ram_alloc() all you like.  Guest memory is determined 
by the e820 map, which is in turn determined by the -m parameter.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: R/W HG memory mappings with kvm?
  2009-07-28 23:06                     ` Stephen Donnelly
@ 2009-08-13  4:07                       ` Stephen Donnelly
  2009-08-19 12:14                         ` Avi Kivity
  0 siblings, 1 reply; 32+ messages in thread
From: Stephen Donnelly @ 2009-08-13  4:07 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Cam Macdonell, kvm@vger.kernel.org list

On Wed, Jul 29, 2009 at 11:06 AM, Stephen Donnelly<sfdonnelly@gmail.com> wrote:
> On Tue, Jul 28, 2009 at 8:54 PM, Avi Kivity<avi@redhat.com> wrote:
>> On 07/28/2009 12:32 AM, Stephen Donnelly wrote:

>> You need a variant of qemu_ram_alloc() that accepts an fd and offset and
>> mmaps that.

I had a go at this, creating qemu_ram_mmap() using qemu_ram_alloc() as
a template, but I'm still seeing the same BUG.

>> A less intrusive, but uglier, alternative is to call
>> qemu_ram_alloc() and them mmap(MAP_FIXED) on top of that.
>
> I did try this, but ended up with a BUG on the host in
> /var/lib/dkms/kvm/84/build/x86/kvm_main.c:1266 gfn_to_pfn() on the
> line "BUG_ON(!kvm_is_mmio_pfn(pfn));" when the guest accesses the bar.

It looks to me from the call trace like the guest is writing to the
memory, and gfn_to_pfn(), called from mmu_guess_page_from_pte_write(),
gets confused because of the mapping.

Inside gfn_to_pfn:

addr = gfn_to_hva(kvm, gfn); correctly returns the host virtual
address of the external memory mapping.

npages = get_user_pages_fast(addr, 1, 1, page); returns -EFAULT,
presumably because (vma->vm_flags & (VM_IO | VM_PFNMAP)).

It then takes the unlikely branch and checks the vma, but I don't
understand what it is doing here: pfn = ((addr - vma->vm_start) >>
PAGE_SHIFT) + vma->vm_pgoff;

In my case addr == vma->vm_start and vma->vm_pgoff == 0, so pfn == 0.
BUG_ON(!kvm_is_mmio_pfn(pfn)) then triggers.
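
For reference, here is roughly what that slow path looks like in the kvm
of this era (reconstructed from memory, so names and details may differ
slightly between versions):

	npages = get_user_pages_fast(addr, 1, 1, page);

	if (unlikely(npages != 1)) {
		struct vm_area_struct *vma;

		down_read(&current->mm->mmap_sem);
		vma = find_vma(current->mm, addr);
		if (vma == NULL || addr < vma->vm_start ||
		    !(vma->vm_flags & VM_PFNMAP)) {
			up_read(&current->mm->mmap_sem);
			get_page(bad_page);
			return page_to_pfn(bad_page);
		}
		/* pfnmap rule: pfn = first mapped pfn + offset into the vma */
		pfn = ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
		up_read(&current->mm->mmap_sem);
		BUG_ON(!kvm_is_mmio_pfn(pfn));
	} else
		pfn = page_to_pfn(page[0]);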

Instrumenting inside gfn_to_pfn I see:
gfn_to_pfn: gfn f2010 gpte f2010000 hva 7f3eac2b0000 pfn 0 npages -14
gfn_to_pfn: vma ffff88022142af18 start 7f3eac2b0000 pgoff 0

Any suggestions what should be happening here?

[ 1826.807846] ------------[ cut here ]------------
[ 1826.807907] kernel BUG at
/build/buildd/linux-2.6.28/arch/x86/kvm/../../../virt/kvm/kvm_main.c:1001!
[ 1826.807985] invalid opcode: 0000 [#1] SMP
[ 1826.808102] last sysfs file: /sys/module/nf_nat/initstate
[ 1826.808159] Dumping ftrace buffer:
[ 1826.808213]    (ftrace buffer empty)
[ 1826.808266] CPU 3
[ 1826.808347] Modules linked in: tun softcard_driver(P)
ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ip
v4 xt_state nf_conntrack ipt_REJECT xt_tcpudp iptable_filter ip_tables
x_tables kvm_intel kvm input_polldev video output
 bridge stp lp parport iTCO_wdt iTCO_vendor_support psmouse pcspkr
serio_raw joydev i5000_edac edac_core shpchp e1000 us
bhid usb_storage e1000e floppy raid10 raid456 async_xor async_memcpy
async_tx xor raid1 raid0 multipath linear fbcon til
eblit font bitblit softcursor
[ 1826.810269] Pid: 9353, comm: qemu-system-x86 Tainted: P
2.6.28-13-server #45-Ubuntu
[ 1826.810344] RIP: 0010:[<ffffffffa01da853>]  [<ffffffffa01da853>]
gfn_to_pfn+0x153/0x160 [kvm]
[ 1826.810463] RSP: 0018:ffff88022d857958  EFLAGS: 00010246
[ 1826.810518] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88022d4d32a0
[ 1826.810577] RDX: 0000000000000000 RSI: 0000000000000282 RDI: 0000000000000000
[ 1826.810636] RBP: ffff88022d857978 R08: 0000000000000001 R09: ffff88022d857958
[ 1826.810694] R10: 0000000000000003 R11: 0000000000000001 R12: 00000000000f2010
[ 1826.810753] R13: ffff880212cb0000 R14: ffff880212cb0000 R15: ffff880212cb0000
[ 1826.810812] FS:  00007f5253bfd950(0000) GS:ffff88022f1fa380(0000)
knlGS:0000000000000000
[ 1826.810887] CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
[ 1826.810943] CR2: 00000000b7eb2044 CR3: 0000000212cac000 CR4: 00000000000026a0
[ 1826.811002] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1826.811061] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1826.811120] Process qemu-system-x86 (pid: 9353, threadinfo
ffff88022d856000, task ffff88022e0cd980)
[ 1826.811196] Stack:
[ 1826.811246]  ffff88022d857968 0000000000000004 0000000000000004
0000000000000000
[ 1826.811401]  ffff88022d8579b8 ffffffffa01e7ccf ffff88022d8579b8
00000000f2010073
[ 1826.811634]  0000000000000004 ffff880212cb15b0 000000001f402b00
ffff880212cb0000
[ 1826.811913] Call Trace:
[ 1826.811964]  [<ffffffffa01e7ccf>]
mmu_guess_page_from_pte_write+0xaf/0x190 [kvm]
[ 1826.812076]  [<ffffffffa01e820f>] kvm_mmu_pte_write+0x3f/0x4f0 [kvm]
[ 1826.812172]  [<ffffffffa01da9f1>] ? mark_page_dirty+0x11/0x90 [kvm]
[ 1826.812268]  [<ffffffffa01dabe8>] ? kvm_write_guest+0x48/0x90 [kvm]
[ 1826.812364]  [<ffffffffa01de427>] emulator_write_phys+0x47/0x70 [kvm]
[ 1826.812460]  [<ffffffffa01e0e26>]
emulator_write_emulated_onepage+0x66/0x120 [kvm]
[ 1826.812571]  [<ffffffffa01e0f50>] emulator_write_emulated+0x70/0x90 [kvm]
[ 1826.812668]  [<ffffffffa01eb36f>] x86_emulate_insn+0x4ef/0x32e0 [kvm]
[ 1826.812764]  [<ffffffffa01e950e>] ? do_insn_fetch+0x8e/0x100 [kvm]
[ 1826.812860]  [<ffffffffa01e9454>] ? seg_override_base+0x24/0x50 [kvm]
[ 1826.812955]  [<ffffffffa01eacb0>] ? x86_decode_insn+0x7a0/0x970 [kvm]
[ 1826.813051]  [<ffffffffa01e221f>] emulate_instruction+0x15f/0x2f0 [kvm]
[ 1826.813148]  [<ffffffffa01e7bd5>] kvm_mmu_page_fault+0x65/0xb0 [kvm]
[ 1826.813243]  [<ffffffffa020ac5f>] handle_exception+0x2ef/0x360 [kvm_intel]
[ 1826.813338]  [<ffffffffa01eb0a3>] ? x86_emulate_insn+0x223/0x32e0 [kvm]
[ 1826.813434]  [<ffffffffa0209c25>] kvm_handle_exit+0xb5/0x1d0 [kvm_intel]
[ 1826.813526]  [<ffffffff80699643>] ? __down_read+0xc3/0xce
[ 1826.813618]  [<ffffffffa01dd958>] vcpu_enter_guest+0x1f8/0x400 [kvm]
[ 1826.813714]  [<ffffffffa01dfc29>] __vcpu_run+0x69/0x2d0 [kvm]
[ 1826.813751]  [<ffffffffa01e38ea>] kvm_arch_vcpu_ioctl_run+0x8a/0x1f0 [kvm]
[ 1826.813751]  [<ffffffffa01d8582>] kvm_vcpu_ioctl+0x2e2/0x5a0 [kvm]
[ 1826.813751]  [<ffffffff802f6091>] vfs_ioctl+0x31/0xa0
[ 1826.813751]  [<ffffffff802f6445>] do_vfs_ioctl+0x75/0x230
[ 1826.813751]  [<ffffffff802e8216>] ? generic_file_llseek+0x56/0x70
[ 1826.813751]  [<ffffffff802f6699>] sys_ioctl+0x99/0xa0
[ 1826.813751]  [<ffffffff802e70d2>] ? sys_lseek+0x52/0x90
[ 1826.813751]  [<ffffffff8021253a>] system_call_fastpath+0x16/0x1b
[ 1826.813751] Code: 00 00 65 48 8b 04 25 00 00 00 00 48 8b b8 38 02
00 00 48 83 c7 60 e8 dd 23 09 e0 48 89 df e8 45 fe ff ff 85 c0 0f 85
08 ff ff ff <0f> 0b eb fe 66 0f 1f 84 00 00 00 00 00 55 65 8b 14 25 24
00 00
[ 1826.813751] RIP  [<ffffffffa01da853>] gfn_to_pfn+0x153/0x160 [kvm]
[ 1826.813751]  RSP <ffff88022d857958>
[ 1826.816899] ---[ end trace 2437a1197b66fb45 ]---

Stephen.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: R/W HG memory mappings with kvm?
  2009-08-13  4:07                       ` Stephen Donnelly
@ 2009-08-19 12:14                         ` Avi Kivity
  2009-08-23 21:59                           ` Stephen Donnelly
  0 siblings, 1 reply; 32+ messages in thread
From: Avi Kivity @ 2009-08-19 12:14 UTC (permalink / raw)
  To: Stephen Donnelly; +Cc: Cam Macdonell, kvm@vger.kernel.org list

On 08/13/2009 07:07 AM, Stephen Donnelly wrote:
>>> A less intrusive, but uglier, alternative is to call
>>> qemu_ram_alloc() and them mmap(MAP_FIXED) on top of that.
>>>        
>> I did try this, but ended up with a BUG on the host in
>> /var/lib/dkms/kvm/84/build/x86/kvm_main.c:1266 gfn_to_pfn() on the
>> line "BUG_ON(!kvm_is_mmio_pfn(pfn));" when the guest accesses the bar.
>>      
> It looks to me from the call trace like the guest is writing to the
> memory, gfn_to_pfn() from mmu_guess_page_from_pte_write() gets
> confused because of the mapping.
>
> Inside gfn_to_pfn:
>
> addr = gfn_to_hva(kvm, gfn); correctly returns the host virtual
> address of the external memory mapping.
>
> npages = get_user_pages_fast(addr, 1, 1, page); returns -EFAULT,
> presumably because (vma->vm_flags&  (VM_IO | VM_PFNMAP)).
>
> It takes then unlikely branch, and checks the vma, but I don't
> understand what it is doing here: pfn = ((addr - vma->vm_start)>>
> PAGE_SHIFT) + vma->vm_pgoff;
>    

It's calculating the pfn according to pfnmap rules.

> In my case addr == vma->vm_start, and vma->vm_pgoff == 0, so pfn ==0.
>    

How did you set up that vma?  It should point to the first pfn of your 
special memory area.

> BUG_ON(!kvm_is_mmio_pfn(pfn)) then triggers.
>    

That's correct behaviour.  We expect a page that is not controlled by 
the kernel here.

> Instrumenting inside gfn_to_pfn I see:
> gfn_to_pfn: gfn f2010 gpte f2010000 hva 7f3eac2b0000 pfn 0 npages -14
> gfn_to_pfn: vma ffff88022142af18 start 7f3eac2b0000 pgoff 0
>
> Any suggestions what should be happening here?
>    

Well, we need to understand how that vma came into being and why pgoff == 0.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: R/W HG memory mappings with kvm?
  2009-08-19 12:14                         ` Avi Kivity
@ 2009-08-23 21:59                           ` Stephen Donnelly
  2009-08-24  4:55                             ` Avi Kivity
  0 siblings, 1 reply; 32+ messages in thread
From: Stephen Donnelly @ 2009-08-23 21:59 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Cam Macdonell, kvm@vger.kernel.org list

On Thu, Aug 20, 2009 at 12:14 AM, Avi Kivity<avi@redhat.com> wrote:
> On 08/13/2009 07:07 AM, Stephen Donnelly wrote:
>>
>> npages = get_user_pages_fast(addr, 1, 1, page); returns -EFAULT,
>> presumably because (vma->vm_flags&  (VM_IO | VM_PFNMAP)).
>>
>> It takes then unlikely branch, and checks the vma, but I don't
>> understand what it is doing here: pfn = ((addr - vma->vm_start)>>
>> PAGE_SHIFT) + vma->vm_pgoff;
>
> It's calculating the pfn according to pfnmap rules.

From what I understand this will only work when remapping 'main
memory', i.e. where the pgoff is equal to the physical page offset.
VMAs that remap IO memory will usually set pgoff to 0 for the start of
the mapping.

>> In my case addr == vma->vm_start, and vma->vm_pgoff == 0, so pfn ==0.
>
> How did you set up that vma?  It should point to the first pfn of your
> special memory area.

The vma was created with a remap_pfn_range call from another driver.
Because this call sets VM_PFNMAP and VM_IO, any get_user_pages(_fast)
calls will fail.

In this case the host driver was actually just remapping host memory,
so I replaced the remap_pfn_range call with a nopage/fault vm_op. This
allows the get_user_pages_fast call to succeed, and the mapping now
works as expected. This is sufficient for my work at the moment.
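
For what it's worth, the change amounts to something like the sketch
below (the names are hypothetical, and it assumes the buffer was
allocated with vmalloc(); a page- or kmalloc-backed buffer would use
virt_to_page() instead):

	extern char *my_buf;	/* hypothetical: the vmalloc()ed host buffer */

	/* rough sketch: back the mapping with struct pages via .fault so
	 * that get_user_pages_fast() works, instead of remap_pfn_range() */
	static int foo_vm_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
	{
		struct page *page;

		page = vmalloc_to_page(my_buf + (vmf->pgoff << PAGE_SHIFT));
		if (!page)
			return VM_FAULT_SIGBUS;
		get_page(page);
		vmf->page = page;
		return 0;
	}

	static struct vm_operations_struct foo_vm_ops = {
		.fault = foo_vm_fault,
	};

	static int foo_mmap(struct file *file, struct vm_area_struct *vma)
	{
		vma->vm_ops = &foo_vm_ops;
		return 0;
	}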

I'm still not sure how genuine IO memory (mapped from a driver to
userspace with remap_pfn_range or io_remap_page_range) could be mapped
into kvm though.

Regards,
Stephen.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: R/W HG memory mappings with kvm?
  2009-08-23 21:59                           ` Stephen Donnelly
@ 2009-08-24  4:55                             ` Avi Kivity
  2009-08-26 10:22                               ` Avi Kivity
  2009-08-27  2:34                               ` Stephen Donnelly
  0 siblings, 2 replies; 32+ messages in thread
From: Avi Kivity @ 2009-08-24  4:55 UTC (permalink / raw)
  To: Stephen Donnelly; +Cc: Cam Macdonell, kvm@vger.kernel.org list

On 08/24/2009 12:59 AM, Stephen Donnelly wrote:
> On Thu, Aug 20, 2009 at 12:14 AM, Avi Kivity<avi@redhat.com>  wrote:
>    
>> On 08/13/2009 07:07 AM, Stephen Donnelly wrote:
>>      
>>> npages = get_user_pages_fast(addr, 1, 1, page); returns -EFAULT,
>>> presumably because (vma->vm_flags&    (VM_IO | VM_PFNMAP)).
>>>
>>> It takes then unlikely branch, and checks the vma, but I don't
>>> understand what it is doing here: pfn = ((addr - vma->vm_start)>>
>>> PAGE_SHIFT) + vma->vm_pgoff;
>>>        
>> It's calculating the pfn according to pfnmap rules.
>>      
>  From what I understand this will only work when remapping 'main
> memory', e.g. where the pgoff is equal to the physical page offset?
> VMAs that remap IO memory will usually set pgoff to 0 for the start of
> the mapping.
>    

If so, how do they calculate the pfn when mapping pages?  kvm needs to 
be able to do the same thing.

>>> In my case addr == vma->vm_start, and vma->vm_pgoff == 0, so pfn ==0.
>>>        
>> How did you set up that vma?  It should point to the first pfn of your
>> special memory area.
>>      
> The vma was created with a remap_pfn_range call from another driver.
> Because this call sets VM_PFNMAP and VM_IO any get_user_pages(_fast)
> calls will fail.
>
> In this case the host driver was actually just remapping host memory,
> so I replaced the remap_pfn_range call with a nopage/fault vm_op. This
> allows the get_user_pages_fast call to succeed, and the mapping now
> works as expected. This is sufficient for my work at the moment.
>
>    

Well, if the fix is correct we need it too.

> I'm still not sure how genuine IO memory (mapped from a driver to
> userspace with remap_pfn_range or io_remap_page_range) could be mapped
> into kvm though.
>    

If it can be mapped to userspace, it can be mapped to kvm.  We just need 
to synchronize the rules.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: R/W HG memory mappings with kvm?
  2009-08-24  4:55                             ` Avi Kivity
@ 2009-08-26 10:22                               ` Avi Kivity
  2009-08-27  2:39                                 ` Stephen Donnelly
  2009-08-27  2:34                               ` Stephen Donnelly
  1 sibling, 1 reply; 32+ messages in thread
From: Avi Kivity @ 2009-08-26 10:22 UTC (permalink / raw)
  To: Stephen Donnelly
  Cc: Cam Macdonell, kvm@vger.kernel.org list, Marcelo Tosatti, Chris Wright

On 08/24/2009 07:55 AM, Avi Kivity wrote:
> On 08/24/2009 12:59 AM, Stephen Donnelly wrote:
>> On Thu, Aug 20, 2009 at 12:14 AM, Avi Kivity<avi@redhat.com>  wrote:
>>> On 08/13/2009 07:07 AM, Stephen Donnelly wrote:
>>>> npages = get_user_pages_fast(addr, 1, 1, page); returns -EFAULT,
>>>> presumably because (vma->vm_flags&    (VM_IO | VM_PFNMAP)).
>>>>
>>>> It takes then unlikely branch, and checks the vma, but I don't
>>>> understand what it is doing here: pfn = ((addr - vma->vm_start)>>
>>>> PAGE_SHIFT) + vma->vm_pgoff;
>>> It's calculating the pfn according to pfnmap rules.
>>  From what I understand this will only work when remapping 'main
>> memory', e.g. where the pgoff is equal to the physical page offset?
>> VMAs that remap IO memory will usually set pgoff to 0 for the start of
>> the mapping.
>
> If so, how do they calculate the pfn when mapping pages?  kvm needs to 
> be able to do the same thing.

Maybe the simplest thing is to call vma->vm_ops->fault here.  
Marcelo/Chris?  Context is improving gfn_to_pfn() on the mmio path.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: R/W HG memory mappings with kvm?
  2009-08-24  4:55                             ` Avi Kivity
  2009-08-26 10:22                               ` Avi Kivity
@ 2009-08-27  2:34                               ` Stephen Donnelly
  2009-08-27  4:08                                 ` Avi Kivity
  1 sibling, 1 reply; 32+ messages in thread
From: Stephen Donnelly @ 2009-08-27  2:34 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Cam Macdonell, kvm@vger.kernel.org list

On Mon, Aug 24, 2009 at 4:55 PM, Avi Kivity<avi@redhat.com> wrote:
> On 08/24/2009 12:59 AM, Stephen Donnelly wrote:
>>
>> On Thu, Aug 20, 2009 at 12:14 AM, Avi Kivity<avi@redhat.com>  wrote:
>>> On 08/13/2009 07:07 AM, Stephen Donnelly wrote:
>>>>
>>>> npages = get_user_pages_fast(addr, 1, 1, page); returns -EFAULT,
>>>> presumably because (vma->vm_flags&    (VM_IO | VM_PFNMAP)).
>>>>
>>>> It takes then unlikely branch, and checks the vma, but I don't
>>>> understand what it is doing here: pfn = ((addr - vma->vm_start)>>
>>>> PAGE_SHIFT) + vma->vm_pgoff;
>>>
>>> It's calculating the pfn according to pfnmap rules.
>>
>>  From what I understand this will only work when remapping 'main
>> memory', e.g. where the pgoff is equal to the physical page offset?
>> VMAs that remap IO memory will usually set pgoff to 0 for the start of
>> the mapping.
>
> If so, how do they calculate the pfn when mapping pages?  kvm needs to be
> able to do the same thing.

If the vma->vm_file is /dev/mem, then the vm_pgoff will map to physical
addresses directly (at least on x86), and the calculation works. If
the vma is remapping IO memory from a driver, then vma->vm_file will
point to the device node for that driver. Perhaps we can do a check
for this at least?

>>>> In my case addr == vma->vm_start, and vma->vm_pgoff == 0, so pfn ==0.
>>>
>>> How did you set up that vma?  It should point to the first pfn of your
>>> special memory area.
>>
>> The vma was created with a remap_pfn_range call from another driver.
>> Because this call sets VM_PFNMAP and VM_IO any get_user_pages(_fast)
>> calls will fail.
>>
>> In this case the host driver was actually just remapping host memory,
>> so I replaced the remap_pfn_range call with a nopage/fault vm_op. This
>> allows the get_user_pages_fast call to succeed, and the mapping now
>> works as expected. This is sufficient for my work at the moment.
>
> Well if the fix is correct we need it too.

The change is to the external (host) driver. If I submit my device for
inclusion upstream, the changes for that driver will be needed as well,
but they would not be part of the qemu-kvm tree.

>> I'm still not sure how genuine IO memory (mapped from a driver to
>> userspace with remap_pfn_range or io_remap_page_range) could be mapped
>> into kvm though.
>
> If it can be mapped to userspace, it can be mapped to kvm.  We just need to
> synchronize the rules.

We can definitely map it into userspace. The problem seems to be how
the kvm kernel module translates the guest pfn back to a host physical
address.

Is there a kernel equivalent of mmap?

Stephen.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: R/W HG memory mappings with kvm?
  2009-08-26 10:22                               ` Avi Kivity
@ 2009-08-27  2:39                                 ` Stephen Donnelly
  0 siblings, 0 replies; 32+ messages in thread
From: Stephen Donnelly @ 2009-08-27  2:39 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Cam Macdonell, kvm@vger.kernel.org list, Marcelo Tosatti, Chris Wright

On Wed, Aug 26, 2009 at 10:22 PM, Avi Kivity<avi@redhat.com> wrote:
> On 08/24/2009 07:55 AM, Avi Kivity wrote:
>>
>> On 08/24/2009 12:59 AM, Stephen Donnelly wrote:
>>>
>>> On Thu, Aug 20, 2009 at 12:14 AM, Avi Kivity<avi@redhat.com>  wrote:
>>>>
>>>> On 08/13/2009 07:07 AM, Stephen Donnelly wrote:
>>>>>
>>>>> npages = get_user_pages_fast(addr, 1, 1, page); returns -EFAULT,
>>>>> presumably because (vma->vm_flags&    (VM_IO | VM_PFNMAP)).
>>>>>
>>>>> It takes then unlikely branch, and checks the vma, but I don't
>>>>> understand what it is doing here: pfn = ((addr - vma->vm_start)>>
>>>>> PAGE_SHIFT) + vma->vm_pgoff;
>>>>
>>>> It's calculating the pfn according to pfnmap rules.
>>>
>>>  From what I understand this will only work when remapping 'main
>>> memory', e.g. where the pgoff is equal to the physical page offset?
>>> VMAs that remap IO memory will usually set pgoff to 0 for the start of
>>> the mapping.
>>
>> If so, how do they calculate the pfn when mapping pages?  kvm needs to be
>> able to do the same thing.
>
> Maybe the simplest thing is to call vma->vm_ops->fault here.  Marcelo/Chris?
>  Context is improving gfn_to_pfn() on the mmio path.

If the mapping is made using remap_pfn_range (or io_remap_pfn_range)
then there are no vm_ops attached by default:

gfn_to_pfn: vma 0xffff88022c50d498 start 0x7f4b0de9f000 pgoff 0x0
flags 0x844fb vm_ops 0x0000000000000000 fault 0x0000000000000000 file
0xffff88022e408000 major 250 minor 32

From linux/mm.h:

#define VM_PFNMAP	0x00000400	/* Page-ranges managed without "struct page", just pure PFN */

Stephen.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: R/W HG memory mappings with kvm?
  2009-08-27  2:34                               ` Stephen Donnelly
@ 2009-08-27  4:08                                 ` Avi Kivity
  2009-08-30 22:33                                   ` Stephen Donnelly
  0 siblings, 1 reply; 32+ messages in thread
From: Avi Kivity @ 2009-08-27  4:08 UTC (permalink / raw)
  To: Stephen Donnelly; +Cc: Cam Macdonell, kvm@vger.kernel.org list

On 08/27/2009 05:34 AM, Stephen Donnelly wrote:
> On Mon, Aug 24, 2009 at 4:55 PM, Avi Kivity<avi@redhat.com>  wrote:
>    
>> On 08/24/2009 12:59 AM, Stephen Donnelly wrote:
>>      
>>> On Thu, Aug 20, 2009 at 12:14 AM, Avi Kivity<avi@redhat.com>    wrote:
>>>        
>>>> On 08/13/2009 07:07 AM, Stephen Donnelly wrote:
>>>>          
>>>>> npages = get_user_pages_fast(addr, 1, 1, page); returns -EFAULT,
>>>>> presumably because (vma->vm_flags&      (VM_IO | VM_PFNMAP)).
>>>>>
>>>>> It takes then unlikely branch, and checks the vma, but I don't
>>>>> understand what it is doing here: pfn = ((addr - vma->vm_start)>>
>>>>> PAGE_SHIFT) + vma->vm_pgoff;
>>>>>            
>>>> It's calculating the pfn according to pfnmap rules.
>>>>          
>>>   From what I understand this will only work when remapping 'main
>>> memory', e.g. where the pgoff is equal to the physical page offset?
>>> VMAs that remap IO memory will usually set pgoff to 0 for the start of
>>> the mapping.
>>>        
>> If so, how do they calculate the pfn when mapping pages?  kvm needs to be
>> able to do the same thing.
>>      
> If the vma->vm_file is /dev/mem, then the pg_off will map to physical
> addresses directly (at least on x86), and the calculation works. If
> the vma is remapping io memory from a driver, then vma->vm_file will
> point to the device node for that driver. Perhaps we can do a check
> for this at least?
>    

We can't duplicate mm/ in kvm.  However, mm/memory.c says:


  * The way we recognize COWed pages within VM_PFNMAP mappings is through the
  * rules set up by "remap_pfn_range()": the vma will have the VM_PFNMAP bit
  * set, and the vm_pgoff will point to the first PFN mapped: thus every special
  * mapping will always honor the rule
  *
  *      pfn_of_page == vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT)
  *
  * And for normal mappings this is false.

So it seems the kvm calculation is right and you should set vm_pgoff in 
your driver.
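
Something along these lines in the external driver's mmap() would do it
(a sketch only; my_io_base stands in for your device's physical base
address, and it deliberately ignores the file offset the caller passed):

	static int foo_mmap(struct file *file, struct vm_area_struct *vma)
	{
		unsigned long pfn = my_io_base >> PAGE_SHIFT;

		/* record the first mapped pfn so the pfnmap rule above holds */
		vma->vm_pgoff = pfn;
		return io_remap_pfn_range(vma, vma->vm_start, pfn,
					  vma->vm_end - vma->vm_start,
					  vma->vm_page_prot);
	}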

>
>
>>> I'm still not sure how genuine IO memory (mapped from a driver to
>>> userspace with remap_pfn_range or io_remap_page_range) could be mapped
>>> into kvm though.
>>>        
>> If it can be mapped to userspace, it can be mapped to kvm.  We just need to
>> synchronize the rules.
>>      
> We can definitely map it into userspace. The problem seems to be how
> the kvm kernel module translates the guest pfn back to a host physical
> address.
>
> Is there a kernel equivalent of mmap?
>    

do_mmap(), but don't use it.  Use mmap() from userspace like everyone else.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: R/W HG memory mappings with kvm?
  2009-08-27  4:08                                 ` Avi Kivity
@ 2009-08-30 22:33                                   ` Stephen Donnelly
  2009-08-31  8:44                                     ` Avi Kivity
  0 siblings, 1 reply; 32+ messages in thread
From: Stephen Donnelly @ 2009-08-30 22:33 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Cam Macdonell, kvm@vger.kernel.org list

On Thu, Aug 27, 2009 at 4:08 PM, Avi Kivity<avi@redhat.com> wrote:
> On 08/27/2009 05:34 AM, Stephen Donnelly wrote:
>>
>> On Mon, Aug 24, 2009 at 4:55 PM, Avi Kivity<avi@redhat.com>  wrote:
>>
>>>
>>> On 08/24/2009 12:59 AM, Stephen Donnelly wrote:
>>>
>>>>
>>>> On Thu, Aug 20, 2009 at 12:14 AM, Avi Kivity<avi@redhat.com>    wrote:
>>>>
>>>>>
>>>>> On 08/13/2009 07:07 AM, Stephen Donnelly wrote:
>>>>>
>>>>>>
>>>>>> npages = get_user_pages_fast(addr, 1, 1, page); returns -EFAULT,
>>>>>> presumably because (vma->vm_flags&      (VM_IO | VM_PFNMAP)).
>>>>>>
>>>>>> It takes then unlikely branch, and checks the vma, but I don't
>>>>>> understand what it is doing here: pfn = ((addr - vma->vm_start)>>
>>>>>> PAGE_SHIFT) + vma->vm_pgoff;
>>>>>>
>>>>>
>>>>> It's calculating the pfn according to pfnmap rules.
>>>>>
>>>>
>>>>  From what I understand this will only work when remapping 'main
>>>> memory', e.g. where the pgoff is equal to the physical page offset?
>>>> VMAs that remap IO memory will usually set pgoff to 0 for the start of
>>>> the mapping.
>>>>
>>>
>>> If so, how do they calculate the pfn when mapping pages?  kvm needs to be
>>> able to do the same thing.
>>>
>>
>> If the vma->vm_file is /dev/mem, then the pg_off will map to physical
>> addresses directly (at least on x86), and the calculation works. If
>> the vma is remapping io memory from a driver, then vma->vm_file will
>> point to the device node for that driver. Perhaps we can do a check
>> for this at least?
>>
>
> We can't duplicate mm/ in kvm.  However, mm/memory.c says:
>
>
>  * The way we recognize COWed pages within VM_PFNMAP mappings is through the
>  * rules set up by "remap_pfn_range()": the vma will have the VM_PFNMAP bit
>  * set, and the vm_pgoff will point to the first PFN mapped: thus every
> special
>  * mapping will always honor the rule
>  *
>  *      pfn_of_page == vma->vm_pgoff + ((addr - vma->vm_start) >>
> PAGE_SHIFT)
>  *
>  * And for normal mappings this is false.
>
> So it seems the kvm calculation is right and you should set vm_pgoff in your
> driver.

That may be true for COW pages, which are main memory, but I don't
think it is true for device drivers.

In a device driver the mmap function receives the vma from the OS. The
vm_pgoff field contains the offset into the file. For drivers this is
used to determine where to start the mapping relative to the IO base
address.

If the driver is mapping IO memory to user space it calls
io_remap_pfn_range with the pfn for the IO memory. The remap_pfn_range
call sets the VM_IO and VM_PFNMAP bits in vm_flags. It does not alter
the vm_pgoff value.

A simple example is hpet_mmap() in drivers/char/hpet.c, or
mbcs_gscr_mmap() in drivers/char/mbcs.c.
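
The typical pattern looks roughly like this (paraphrased, not copied
from either driver; dev_phys_base is a placeholder for the device's
physical base). Note that vm_pgoff is used as an offset into the device
and is never overwritten with the first mapped pfn:

	static int foo_mmap(struct file *file, struct vm_area_struct *vma)
	{
		unsigned long pfn = (dev_phys_base >> PAGE_SHIFT) + vma->vm_pgoff;

		vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
		return io_remap_pfn_range(vma, vma->vm_start, pfn,
					  vma->vm_end - vma->vm_start,
					  vma->vm_page_prot);
	}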

>>>> I'm still not sure how genuine IO memory (mapped from a driver to
>>>> userspace with remap_pfn_range or io_remap_page_range) could be mapped
>>>> into kvm though.
>>>>
>>>
>>> If it can be mapped to userspace, it can be mapped to kvm.  We just need
>>> to
>>> synchronize the rules.
>>>
>>
>> We can definitely map it into userspace. The problem seems to be how
>> the kvm kernel module translates the guest pfn back to a host physical
>> address.
>>
>> Is there a kernel equivalent of mmap?
>
> do_mmap(), but don't use it.  Use mmap() from userspace like everyone else.

Of course you are right; the address gfn_to_pfn looks up is a user-space
address. There is already a mapping of the memory into the process (from
qemu_ram_mmap); the question is how to look it up.

Regards,
Stephen.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: R/W HG memory mappings with kvm?
  2009-08-30 22:33                                   ` Stephen Donnelly
@ 2009-08-31  8:44                                     ` Avi Kivity
  2009-08-31 21:13                                       ` Stephen Donnelly
  0 siblings, 1 reply; 32+ messages in thread
From: Avi Kivity @ 2009-08-31  8:44 UTC (permalink / raw)
  To: Stephen Donnelly; +Cc: Cam Macdonell, kvm@vger.kernel.org list

On 08/31/2009 01:33 AM, Stephen Donnelly wrote:
>
>> We can't duplicate mm/ in kvm.  However, mm/memory.c says:
>>
>>
>>   * The way we recognize COWed pages within VM_PFNMAP mappings is through the
>>   * rules set up by "remap_pfn_range()": the vma will have the VM_PFNMAP bit
>>   * set, and the vm_pgoff will point to the first PFN mapped: thus every
>> special
>>   * mapping will always honor the rule
>>   *
>>   *      pfn_of_page == vma->vm_pgoff + ((addr - vma->vm_start)>>
>> PAGE_SHIFT)
>>   *
>>   * And for normal mappings this is false.
>>
>> So it seems the kvm calculation is right and you should set vm_pgoff in your
>> driver.
>>      
> That may be true for COW pages, which are main memory, but I don't
> think it is true for device drivers.
>    

No, COW pages have no linear pfn mapping.  It's only true for 
remap_pfn_range().

> In a device driver the mmap function receives the vma from the OS. The
> vm_pgoff field contains the offset area in the file. For drivers this
> is used to determine where to start the map compared to the io base
> address.
>
> If the driver is mapping io memory to user space it calls
> io_remap_pfn_range with the pfn for the io memory. The remap_pfn_range
> call sets the VM_IO and VM_PFNMAP bits in vm_flags. It does not alter
> the vm_pgoff value.
>
> A simple example is hpet_mmap() in drivers/char/hpet.c, or
> mbcs_gscr_mmap() in drivers/char/mbcs.c.
>    

io_remap_pfn_range() is remap_pfn_range(), which has this:

         if (addr == vma->vm_start && end == vma->vm_end) {
                 vma->vm_pgoff = pfn;
                 vma->vm_flags |= VM_PFN_AT_MMAP;
         }

So remap_pfn_range() will alter the pgoff.

>> do_mmap(), but don't use it.  Use mmap() from userspace like everyone else.
>>      
> Of course you are right, gfn_to_pfn is in user space. There is already
> a mapping of the memory to the process (from qemu_ram_mmap), the
> question is how to look it up.
>    

I'm totally confused now.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: R/W HG memory mappings with kvm?
  2009-08-31  8:44                                     ` Avi Kivity
@ 2009-08-31 21:13                                       ` Stephen Donnelly
  2009-09-09 12:50                                         ` Avi Kivity
  0 siblings, 1 reply; 32+ messages in thread
From: Stephen Donnelly @ 2009-08-31 21:13 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Cam Macdonell, kvm@vger.kernel.org list

On Mon, Aug 31, 2009 at 8:44 PM, Avi Kivity<avi@redhat.com> wrote:
> On 08/31/2009 01:33 AM, Stephen Donnelly wrote:
>>
>>> We can't duplicate mm/ in kvm.  However, mm/memory.c says:
>>>
>>>  * The way we recognize COWed pages within VM_PFNMAP mappings is through
>>> the
>>>  * rules set up by "remap_pfn_range()": the vma will have the VM_PFNMAP
>>> bit
>>>  * set, and the vm_pgoff will point to the first PFN mapped: thus every
>>> special
>>>  * mapping will always honor the rule
>>>  *
>>>  *      pfn_of_page == vma->vm_pgoff + ((addr - vma->vm_start)>>
>>> PAGE_SHIFT)
>>>  *
>>>  * And for normal mappings this is false.
>>>
>>> So it seems the kvm calculation is right and you should set vm_pgoff in
>>> your
>>> driver.
>>>
>>
>> That may be true for COW pages, which are main memory, but I don't
>> think it is true for device drivers.
>>
>
> No, COW pages have no linear pfn mapping.  It's only true for
> remap_pfn_range).
>
>> In a device driver the mmap function receives the vma from the OS. The
>> vm_pgoff field contains the offset area in the file. For drivers this
>> is used to determine where to start the map compared to the io base
>> address.
>>
>> If the driver is mapping io memory to user space it calls
>> io_remap_pfn_range with the pfn for the io memory. The remap_pfn_range
>> call sets the VM_IO and VM_PFNMAP bits in vm_flags. It does not alter
>> the vm_pgoff value.
>>
>> A simple example is hpet_mmap() in drivers/char/hpet.c, or
>> mbcs_gscr_mmap() in drivers/char/mbcs.c.
>>
>
> io_remap_pfn_range() is remap_pfn_range(), which has this:
>
>        if (addr == vma->vm_start && end == vma->vm_end) {
>                vma->vm_pgoff = pfn;
>                vma->vm_flags |= VM_PFN_AT_MMAP;
>        }
>
> So remap_pfn_range() will alter the pgoff.

Aha! We are looking at different kernels. I should have mentioned I
was looking at 2.6.28. In mm/memory.c, remap_pfn_range() has:

	 * There's a horrible special case to handle copy-on-write
	 * behaviour that some programs depend on. We mark the "original"
	 * un-COW'ed pages by matching them up with "vma->vm_pgoff".
	 */
	if (is_cow_mapping(vma->vm_flags)) {
		if (addr != vma->vm_start || end != vma->vm_end)
			return -EINVAL;
		vma->vm_pgoff = pfn;
	}

The macro is:

static inline int is_cow_mapping(unsigned int flags)
{
	return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}

Because my vma is marked shared, this clause does not operate and
vm_pgoff is not modified (it is still 0).

> I'm totally confused now.

Sorry about that. The issue is the BUG in gfn_to_pfn, where the pfn is
not calculated correctly after looking up the vma.

I still don't see how to get the physical address from the vma, since
vm_pgoff is zero, and the vm_ops are not filled. The vma does not seem
to store the physical base address.

Regards,
Stephen.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: R/W HG memory mappings with kvm?
  2009-08-31 21:13                                       ` Stephen Donnelly
@ 2009-09-09 12:50                                         ` Avi Kivity
  0 siblings, 0 replies; 32+ messages in thread
From: Avi Kivity @ 2009-09-09 12:50 UTC (permalink / raw)
  To: Stephen Donnelly; +Cc: Cam Macdonell, kvm@vger.kernel.org list

On 09/01/2009 12:13 AM, Stephen Donnelly wrote:
>
>> I'm totally confused now.
>>      
> Sorry about that. The issue is the BUG in gfn_to_pgn where the pfn is
> not calculated correctly after looking up the vma.
>
> I still don't see how to get the physical address from the vma, since
> vm_pgoff is zero, and the vm_ops are not filled. The vma does not seem
> to store the physical base address.
>    

So it seems the only place the pfns are stored is in the ptes 
themselves.  Is there an API to recover the ptes from a virtual 
address?  We could use that instead.
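
(One candidate, assuming a recent enough kernel, would be follow_pfn(),
added around 2.6.31, which walks the page tables to resolve a user
virtual address within a VM_IO/VM_PFNMAP vma to a pfn:

	int follow_pfn(struct vm_area_struct *vma, unsigned long address,
		       unsigned long *pfn);

Roughly, the VM_PFNMAP branch of gfn_to_pfn() could call that instead of
doing the vm_pgoff arithmetic -- treat this as a sketch of the idea, not
a tested patch.)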

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: R/W HG memory mappings with kvm?
@ 2009-09-28 18:27 Tsuyoshi Ozawa
  0 siblings, 0 replies; 32+ messages in thread
From: Tsuyoshi Ozawa @ 2009-09-28 18:27 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm

Hello,

>>  Sorry about that. The issue is the BUG in gfn_to_pgn where the pfn is
>>  not calculated correctly after looking up the vma.

>>  I still don't see how to get the physical address from the vma, since
>>  vm_pgoff is zero, and the vm_ops are not filled. The vma does not seem
>>  to store the physical base address.

> So it seems the only place the pfns are stored are in the ptes themselves. Is there an API to recover the ptes from a virtual address? We could use that instead.

I'm also trying to share H/G memory with another solution -
by overwriting the shadow page table.

It seems that gfn_to_pfn is the key function which associates
guest memory with host memory, so I changed gfn_to_pfn
as follows:

pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn)
{
    ...
    } else {
        if (shared_gfn && shared_gfn == gfn)
            return shared_pfn;  /* return the pfn we want to share */
        else
            pfn = page_to_pfn(page[0]);
    }
    ...
}

Here, shared_gfn is registered by walking the soft MMU with the gva,
and shared_pfn is the host-side page frame number I want to share.
By rewriting the above, kvm is fooled into building a new shadow
page table with the new mapping after zapping all pages.

But I failed to share the memory. Am I misunderstanding something?

Regards,
Tsuyoshi Ozawa

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2009-09-28 18:27 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-07-05 22:41 R/W HG memory mappings with kvm? Stephen Donnelly
2009-07-06  7:38 ` Avi Kivity
2009-07-07 22:23   ` Stephen Donnelly
2009-07-08  4:36     ` Avi Kivity
2009-07-08 21:33       ` Stephen Donnelly
2009-07-09  8:10         ` Avi Kivity
2009-07-08 21:45       ` Cam Macdonell
2009-07-08 22:01         ` Stephen Donnelly
2009-07-09  6:01           ` Cam Macdonell
2009-07-09 22:38             ` Stephen Donnelly
2009-07-10 17:03               ` Cam Macdonell
2009-07-12 21:28                 ` Stephen Donnelly
2009-07-14 22:25                   ` [PATCH] Support shared memory device PCI device Cam Macdonell
     [not found]             ` <5f370d430907262256rd7f9fdalfbbec1f9492ce86@mail.gmail.com>
2009-07-27 14:48               ` R/W HG memory mappings with kvm? Cam Macdonell
2009-07-27 21:32                 ` Stephen Donnelly
2009-07-28  8:54                   ` Avi Kivity
2009-07-28 23:06                     ` Stephen Donnelly
2009-08-13  4:07                       ` Stephen Donnelly
2009-08-19 12:14                         ` Avi Kivity
2009-08-23 21:59                           ` Stephen Donnelly
2009-08-24  4:55                             ` Avi Kivity
2009-08-26 10:22                               ` Avi Kivity
2009-08-27  2:39                                 ` Stephen Donnelly
2009-08-27  2:34                               ` Stephen Donnelly
2009-08-27  4:08                                 ` Avi Kivity
2009-08-30 22:33                                   ` Stephen Donnelly
2009-08-31  8:44                                     ` Avi Kivity
2009-08-31 21:13                                       ` Stephen Donnelly
2009-09-09 12:50                                         ` Avi Kivity
2009-07-29 23:52                     ` Cam Macdonell
2009-07-30  9:31                       ` Avi Kivity
2009-09-28 18:27 Tsuyoshi Ozawa

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).