* [Qemu-devel] [RFC] Memory API
@ 2011-05-18 13:12 Avi Kivity
  2011-05-18 14:05 ` Jan Kiszka
                   ` (5 more replies)
  0 siblings, 6 replies; 187+ messages in thread
From: Avi Kivity @ 2011-05-18 13:12 UTC (permalink / raw)
  To: qemu-devel

The current memory APIs (cpu_register_io_memory, 
cpu_register_physical_memory) suffer from a number of drawbacks:

- lack of centralized bookkeeping
    - a cpu_register_physical_memory() that overlaps an existing region 
will overwrite the preexisting region; but a following 
cpu_unregister_physical_memory() will not restore things to status quo ante.
    - coalescing and dirty logging are all cleared on unregistering, so 
the client has to re-issue them when re-registering
- lots of opaques
- no nesting
    - if a region has multiple subregions that need different handling 
(different callbacks, RAM interspersed with MMIO) then client code has 
to deal with that manually
    - we can't interpose code between a device and global memory handling

To fix that, I propose a new API to replace the existing one:


#ifndef MEMORY_H
#define MEMORY_H

typedef struct MemoryRegionOps MemoryRegionOps;
typedef struct MemoryRegion MemoryRegion;

typedef uint32_t (*MemoryReadFunc)(MemoryRegion *mr, target_phys_addr_t 
addr);
typedef void (*MemoryWriteFunc)(MemoryRegion *mr, target_phys_addr_t addr,
                                 uint32_t data);

struct MemoryRegionOps {
     MemoryReadFunc readb, readw, readl;
     MemoryWriteFunc writeb, writew, writel;
};

struct MemoryRegion {
     const MemoryRegionOps *ops;
     target_phys_addr_t size;
     target_phys_addr_t addr;
};

void memory_region_init(MemoryRegion *mr,
                         target_phys_addr_t size);
void memory_region_init_io(MemoryRegion *mr,
                            const MemoryRegionOps *ops,
                            target_phys_addr_t size);
void memory_region_init_ram(MemoryRegion *mr,
                             target_phys_addr_t size);
void memory_region_init_ram_ptr(MemoryRegion *mr,
                                 target_phys_addr_t size,
                                 void *ptr);
void memory_region_destroy(MemoryRegion *mr);
void memory_region_set_offset(MemoryRegion *mr, target_phys_addr_t offset);
void memory_region_set_log(MemoryRegion *mr, bool log);
void memory_region_clear_coalescing(MemoryRegion *mr);
void memory_region_add_coalescing(MemoryRegion *mr,
                                   target_phys_addr_t offset,
                                   target_phys_addr_t size);

void memory_region_add_subregion(MemoryRegion *mr,
                                  target_phys_addr_t offset,
                                  MemoryRegion *subregion);
void memory_region_del_subregion(MemoryRegion *mr,
                                  target_phys_addr_t offset,
                                  MemoryRegion *subregion);

void cpu_register_memory_region(MemoryRegion *mr, target_phys_addr_t addr);
void cpu_unregister_memory_region(MemoryRegion *mr);

#endif

The API is nested: you can define, say, a PCI BAR containing RAM and 
MMIO, and give it to the PCI subsystem.  PCI can then enable/disable the 
BAR and move it to different addresses without calling any callbacks; 
the client code can enable or disable logging or coalescing without 
caring if the BAR is mapped or not.  For example:

   MemoryRegion mr, mr_mmio, mr_ram;

   memory_region_init(&mr, 0x100000);
   memory_region_init_io(&mr_mmio, &mmio_ops, 0x1000);
   memory_region_init_ram(&mr_ram, 0x100000);
   memory_region_add_subregion(&mr, 0, &mr_ram);
   memory_region_add_subregion(&mr, 0x10000, &mr_mmio);
   memory_region_add_coalescing(&mr_ram, 0, 0x100000);
   pci_register_bar(&pci_dev, 0, &mr);

at this point the PCI subsystem knows everything about the BAR and can 
enable or disable it, or move it around, without further help from the 
device code.  On the other hand, the device code can change logging or 
coalescing, or even change the structure of the region, without caring 
about whether the region is currently registered or not.
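
To illustrate, the PCI layer's remap path could then be as simple as the
following sketch (pci_move_bar() and its arguments are made up here, not
part of the proposal):

    /* hypothetical helper inside the PCI layer; assumes mr is
       currently registered */
    static void pci_move_bar(MemoryRegion *mr, target_phys_addr_t new_addr)
    {
        cpu_unregister_memory_region(mr);
        /* coalescing and dirty logging state stay with the region */
        cpu_register_memory_region(mr, new_addr);
    }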

If we can agree on the API, then I think the way forward is to implement 
it in terms of the old API, change over all devices, then fold the old 
API into the new one.

-- 
error compiling committee.c: too many arguments to function


* Re: [Qemu-devel] [RFC] Memory API
  2011-05-18 13:12 [Qemu-devel] [RFC] Memory API Avi Kivity
@ 2011-05-18 14:05 ` Jan Kiszka
  2011-05-18 14:36   ` Avi Kivity
  2011-05-18 15:14   ` Anthony Liguori
  2011-05-18 15:08 ` Anthony Liguori
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 187+ messages in thread
From: Jan Kiszka @ 2011-05-18 14:05 UTC (permalink / raw)
  To: Avi Kivity; +Cc: qemu-devel

On 2011-05-18 15:12, Avi Kivity wrote:
> The current memory APIs (cpu_register_io_memory,
> cpu_register_physical_memory) suffer from a number of drawbacks:
> 
> - lack of centralized bookkeeping
>    - a cpu_register_physical_memory() that overlaps an existing region
> will overwrite the preexisting region; but a following
> cpu_unregister_physical_memory() will not restore things to status quo
> ante.

Restoring is not the problem. The problem is that the current API
deletes or truncates regions implicitly by overwriting. That makes
tracking the layout hard, and it is also error-prone as the device that
accidentally overlaps with some other device won't receive a
notification of this potential conflict.

Such implicit truncation or deletion must be avoided in a new API,
forcing the users to explicitly reference an existing region when
dropping or modifying it. But your API goes in the right direction.

>    - coalescing and dirty logging are all cleared on unregistering, so
> the client has to re-issue them when re-registering
> - lots of opaques
> - no nesting
>    - if a region has multiple subregions that need different handling
> (different callbacks, RAM interspersed with MMIO) then client code has
> to deal with that manually
>    - we can't interpose code between a device and global memory handling

I would add another drawback:

 - Inability to identify the origin of region accesses and handle them
   differently based on the source.

   That is at least problematic for the x86 APIC which is CPU local. Our
   current way to deal with it is, well, very creative and falls to
   dust if a guest actually tries to remap the APIC.

However, I'm unsure if that can easily be addressed. As long as only x86
is affected, it's tricky to ask for a big infrastructure to handle this
special case. Maybe there are some other use cases, I don't know.

> 
> To fix that, I propose a new API to replace the existing one:
> 
> 
> #ifndef MEMORY_H
> #define MEMORY_H
> 
> typedef struct MemoryRegionOps MemoryRegionOps;
> typedef struct MemoryRegion MemoryRegion;
> 
> typedef uint32_t (*MemoryReadFunc)(MemoryRegion *mr, target_phys_addr_t
> addr);
> typedef void (*MemoryWriteFunc)(MemoryRegion *mr, target_phys_addr_t addr,
>                                 uint32_t data);
> 
> struct MemoryRegionOps {
>     MemoryReadFunc readb, readw, readl;
>     MemoryWriteFunc writeb, writew, writel;
> };
> 
> struct MemoryRegion {
>     const MemoryRegionOps *ops;
>     target_phys_addr_t size;
>     target_phys_addr_t addr;
> };
> 
> void memory_region_init(MemoryRegion *mr,
>                         target_phys_addr_t size);

What use case would this abstract region cover?

> void memory_region_init_io(MemoryRegion *mr,
>                            const MemoryRegionOps *ops,
>                            target_phys_addr_t size);
> void memory_region_init_ram(MemoryRegion *mr,
>                             target_phys_addr_t size);
> void memory_region_init_ram_ptr(MemoryRegion *mr,
>                                 target_phys_addr_t size,
>                                 void *ptr);
> void memory_region_destroy(MemoryRegion *mr);
> void memory_region_set_offset(MemoryRegion *mr, target_phys_addr_t offset);
> void memory_region_set_log(MemoryRegion *mr, bool log);
> void memory_region_clear_coalescing(MemoryRegion *mr);
> void memory_region_add_coalescing(MemoryRegion *mr,
>                                   target_phys_addr_t offset,
>                                   target_phys_addr_t size);
> 
> void memory_region_add_subregion(MemoryRegion *mr,
>                                  target_phys_addr_t offset,
>                                  MemoryRegion *subregion);
> void memory_region_del_subregion(MemoryRegion *mr,
>                                  target_phys_addr_t offset,
>                                  MemoryRegion *subregion);
> 
> void cpu_register_memory_region(MemoryRegion *mr, target_phys_addr_t addr);

This could create overlaps. I would suggest rejecting them, so we need a
return code.

> void cpu_unregister_memory_region(MemoryRegion *mr);
> 
> #endif
> 
> The API is nested: you can define, say, a PCI BAR containing RAM and
> MMIO, and give it to the PCI subsystem.  PCI can then enable/disable the
> BAR and move it to different addresses without calling any callbacks;
> the client code can enable or disable logging or coalescing without
> caring if the BAR is mapped or not.  For example:

Interesting feature.

> 
>   MemoryRegion mr, mr_mmio, mr_ram;
> 
>   memory_region_init(&mr, 0x100000);
>   memory_region_init_io(&mr_mmio, &mmio_ops, 0x1000);
>   memory_region_init_ram(&mr_ram, 0x100000);
>   memory_region_add_subregion(&mr, 0, &mr_ram);
>   memory_region_add_subregion(&mr, 0x10000, &mr_mmio);
>   memory_region_add_coalescing(&mr_ram, 0, 0x100000);
>   pci_register_bar(&pci_dev, 0, &mr);
> 
> at this point the PCI subsystem knows everything about the BAR and can
> enable or disable it, or move it around, without further help from the
> device code.  On the other hand, the device code can change logging or
> coalescing, or even change the structure of the region, without caring
> about whether the region is currently registered or not.
> 
> If we can agree on the API, then I think the way forward is to implement
> it in terms of the old API, change over all devices, then fold the old
> API into the new one.

There are more aspects that should be clarified before moving forward:
 - How to maintain memory regions internally?
 - Get rid of wasteful PhysPageDesc at this chance?
 - How to hook into the region maintenance (CPUPhysMemoryClient,
   listening vs. filtering or modifying changes)? How to simplify
   memory clients this way?

BTW, any old API should be removed ASAP once the new one has demonstrated
its feasibility. IMHO, we can't afford to carry yet another set of legacy
interfaces around.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


* Re: [Qemu-devel] [RFC] Memory API
  2011-05-18 14:05 ` Jan Kiszka
@ 2011-05-18 14:36   ` Avi Kivity
  2011-05-18 15:11     ` Jan Kiszka
  2011-05-18 15:14   ` Anthony Liguori
  1 sibling, 1 reply; 187+ messages in thread
From: Avi Kivity @ 2011-05-18 14:36 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: qemu-devel

On 05/18/2011 05:05 PM, Jan Kiszka wrote:
> On 2011-05-18 15:12, Avi Kivity wrote:
> >  The current memory APIs (cpu_register_io_memory,
> >  cpu_register_physical_memory) suffer from a number of drawbacks:
> >
> >  - lack of centralized bookkeeping
> >     - a cpu_register_physical_memory() that overlaps an existing region
> >  will overwrite the preexisting region; but a following
> >  cpu_unregister_physical_memory() will not restore things to status quo
> >  ante.
>
> Restoring is not the problem. The problem is that the current API
> deletes or truncates regions implicitly by overwriting. That makes
> tracking the layout hard, and it is also error-prone as the device that
> accidentally overlaps with some other device won't receive a
> notification of this potential conflict.
>
> Such implicit truncation or deletion must be avoided in a new API,
> forcing the users to explicitly reference an existing region when
> dropping or modifying it. But your API goes in the right direction.

It is avoided.  The unregistering/deleting APIs do not take a range, 
just an object.  This implies that the range is stored in the object.

The initial implementation will probably have the same problem as it 
will still be backed by the phys_desc array, but we'll have all of the 
information in MemoryRegions (we can have a single MemoryRegion that is 
all of memory) so we can later avoid it with better information.

>
> >     - coalescing and dirty logging are all cleared on unregistering, so
> >  the client has to re-issue them when re-registering
> >  - lots of opaques
> >  - no nesting
> >     - if a region has multiple subregions that need different handling
> >  (different callbacks, RAM interspersed with MMIO) then client code has
> >  to deal with that manually
> >     - we can't interpose code between a device and global memory handling
>
> I would add another drawback:
>
>   - Inability to identify the origin of region accesses and handle them
>     differently based on the source.
>
>     That is at least problematic for the x86 APIC which is CPU local. Our
>     current way to deal with it is, well, very creative and falls to
>     dust if a guest actually tries to remap the APIC.
>
> However, I'm unsure if that can easily be addressed. As long as only x86
> is affected, it's tricky to ask for a big infrastructure to handle this
> special case. Maybe there are some other use cases, I don't know.

We could implement it with a per-cpu MemoryRegion, with each cpu's 
MemoryRegion populated by a different APIC sub-region.
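
A sketch of that composition (cpu_root, apic_mr, global_memory and the
constants are all invented for illustration; how the APIC page wins over
the overlapping part of global memory is the overlap question discussed
later in the thread):

    MemoryRegion cpu_root[MAX_CPUS];   /* one root region per cpu */
    MemoryRegion apic_mr[MAX_CPUS];    /* 4K APIC page per cpu */

    memory_region_init(&cpu_root[i], ADDR_SPACE_SIZE);
    memory_region_add_subregion(&cpu_root[i], 0, &global_memory);
    memory_region_init_io(&apic_mr[i], &apic_ops, 0x1000);
    memory_region_add_subregion(&cpu_root[i], APIC_DEFAULT_BASE,
                                &apic_mr[i]);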

>
> >
> >  To fix that, I propose a new API to replace the existing one:
> >
> >
> >  #ifndef MEMORY_H
> >  #define MEMORY_H
> >
> >  typedef struct MemoryRegionOps MemoryRegionOps;
> >  typedef struct MemoryRegion MemoryRegion;
> >
> >  typedef uint32_t (*MemoryReadFunc)(MemoryRegion *mr, target_phys_addr_t
> >  addr);
> >  typedef void (*MemoryWriteFunc)(MemoryRegion *mr, target_phys_addr_t addr,
> >                                  uint32_t data);
> >
> >  struct MemoryRegionOps {
> >      MemoryReadFunc readb, readw, readl;
> >      MemoryWriteFunc writeb, writew, writel;
> >  };
> >
> >  struct MemoryRegion {
> >      const MemoryRegionOps *ops;
> >      target_phys_addr_t size;
> >      target_phys_addr_t addr;
> >  };
> >
> >  void memory_region_init(MemoryRegion *mr,
> >                          target_phys_addr_t size);
>
> What use case would this abstract region cover?

An empty container, fill it with memory_region_add_subregion().

>
> >  void memory_region_init_io(MemoryRegion *mr,
> >                             const MemoryRegionOps *ops,
> >                             target_phys_addr_t size);
> >  void memory_region_init_ram(MemoryRegion *mr,
> >                              target_phys_addr_t size);
> >  void memory_region_init_ram_ptr(MemoryRegion *mr,
> >                                  target_phys_addr_t size,
> >                                  void *ptr);
> >  void memory_region_destroy(MemoryRegion *mr);
> >  void memory_region_set_offset(MemoryRegion *mr, target_phys_addr_t offset);
> >  void memory_region_set_log(MemoryRegion *mr, bool log);
> >  void memory_region_clear_coalescing(MemoryRegion *mr);
> >  void memory_region_add_coalescing(MemoryRegion *mr,
> >                                    target_phys_addr_t offset,
> >                                    target_phys_addr_t size);
> >
> >  void memory_region_add_subregion(MemoryRegion *mr,
> >                                   target_phys_addr_t offset,
> >                                   MemoryRegion *subregion);
> >  void memory_region_del_subregion(MemoryRegion *mr,
> >                                   target_phys_addr_t offset,
> >                                   MemoryRegion *subregion);
> >
> >  void cpu_register_memory_region(MemoryRegion *mr, target_phys_addr_t addr);
>
> This could create overlaps. I would suggest to reject them, so we need a
> return code.

There is nothing we can do with a return code.  You can't fail an MMIO 
access that causes an overlapping physical memory map.


>
> >  void cpu_unregister_memory_region(MemoryRegion *mr);

Instead, we need cpu_unregister_memory_region() to restore any 
previously hidden ranges.
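
As an illustration of the intended restore semantics (a sketch; how ties
between overlapping regions are resolved is exactly what is under
debate):

    cpu_register_memory_region(&rom, 0xe0000);   /* rom visible */
    cpu_register_memory_region(&ram, 0xe0000);   /* ram now hides rom */
    cpu_unregister_memory_region(&ram);          /* rom visible again */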

> >
> >  #endif
> >
> >  The API is nested: you can define, say, a PCI BAR containing RAM and
> >  MMIO, and give it to the PCI subsystem.  PCI can then enable/disable the
> >  BAR and move it to different addresses without calling any callbacks;
> >  the client code can enable or disable logging or coalescing without
> >  caring if the BAR is mapped or not.  For example:
>
> Interesting feature.
>
> >
> >    MemoryRegion mr, mr_mmio, mr_ram;
> >
> >    memory_region_init(&mr, 0x100000);
> >    memory_region_init_io(&mr_mmio, &mmio_ops, 0x1000);
> >    memory_region_init_ram(&mr_ram, 0x100000);
> >    memory_region_add_subregion(&mr, 0, &mr_ram);
> >    memory_region_add_subregion(&mr, 0x10000, &mr_mmio);
> >    memory_region_add_coalescing(&mr_ram, 0, 0x100000);
> >    pci_register_bar(&pci_dev, 0, &mr);
> >
> >  at this point the PCI subsystem knows everything about the BAR and can
> >  enable or disable it, or move it around, without further help from the
> >  device code.  On the other hand, the device code can change logging or
> >  coalescing, or even change the structure of the region, without caring
> >  about whether the region is currently registered or not.
> >
> >  If we can agree on the API, then I think the way forward is to implement
> >  it in terms of the old API, change over all devices, then fold the old
> >  API into the new one.
>
> There are more aspects that should be clarified before moving forward:
>   - How to maintain memory regions internally?

Not sure what you mean by the question, but my plan was to have the 
client responsible for allocating the objects (and later use 
container_of() in the callbacks - note there are no void *s any longer).
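
A sketch of that pattern (MyDevice and its fields are invented for
illustration):

    typedef struct MyDevice {
        MemoryRegion mmio;      /* embedded, allocated with the device */
        uint32_t reg;
    } MyDevice;

    static uint32_t my_readl(MemoryRegion *mr, target_phys_addr_t addr)
    {
        /* recover the device state from the region, no opaque needed */
        MyDevice *d = container_of(mr, MyDevice, mmio);
        return d->reg;
    }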

>   - Get rid of wasteful PhysPageDesc at this chance?

That's the intent, but not at this chance, rather later on.  But I want 
the API to be compatible with the goal so we don't have to touch all 
devices again.

>   - How to hook into the region maintenance (CPUPhysMemoryClient,
>     listening vs. filtering or modifying changes)? How to simplify
>     memory clients this way?

I'd leave things as is, at least for the beginning.  CPUPhysMemoryClient 
is global in nature, whereas MemoryRegion is local (offsets are relative 
to the containing region).

>
> BTW, any old API should be removed ASAP once the new one has demonstrated
> its feasibility. IMHO, we can't afford to carry yet another set of legacy
> interfaces around.

That's the plan.  New API implemented on top of old API, convert all 
devices, fold old API into new API.

-- 
error compiling committee.c: too many arguments to function


* Re: [Qemu-devel] [RFC] Memory API
  2011-05-18 13:12 [Qemu-devel] [RFC] Memory API Avi Kivity
  2011-05-18 14:05 ` Jan Kiszka
@ 2011-05-18 15:08 ` Anthony Liguori
  2011-05-18 15:37   ` Avi Kivity
  2011-05-18 15:47   ` Stefan Weil
  2011-05-18 15:58 ` Avi Kivity
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 187+ messages in thread
From: Anthony Liguori @ 2011-05-18 15:08 UTC (permalink / raw)
  To: Avi Kivity; +Cc: qemu-devel

On 05/18/2011 08:12 AM, Avi Kivity wrote:
> The current memory APIs (cpu_register_io_memory,
> cpu_register_physical_memory) suffer from a number of drawbacks:
>
> - lack of centralized bookkeeping
> - a cpu_register_physical_memory() that overlaps an existing region will
> overwrite the preexisting region; but a following
> cpu_unregister_physical_memory() will not restore things to status quo
> ante.
> - coalescing and dirty logging are all cleared on unregistering, so the
> client has to re-issue them when re-registering
> - lots of opaques
> - no nesting
> - if a region has multiple subregions that need different handling
> (different callbacks, RAM interspersed with MMIO) then client code has
> to deal with that manually
> - we can't interpose code between a device and global memory handling
>
> To fix that, I propose a new API to replace the existing one:
>
>
> #ifndef MEMORY_H
> #define MEMORY_H
>
> typedef struct MemoryRegionOps MemoryRegionOps;
> typedef struct MemoryRegion MemoryRegion;
>
> typedef uint32_t (*MemoryReadFunc)(MemoryRegion *mr, target_phys_addr_t
> addr);
> typedef void (*MemoryWriteFunc)(MemoryRegion *mr, target_phys_addr_t addr,
> uint32_t data);


The API should be 64-bit and needs to have a size associated with it.

> struct MemoryRegionOps {
> MemoryReadFunc readb, readw, readl;
> MemoryWriteFunc writeb, writew, writel;
> };

I'm not a fan of having per-access type function pointers.

> struct MemoryRegion {
> const MemoryRegionOps *ops;
> target_phys_addr_t size;
> target_phys_addr_t addr;
> };

Should include a flags argument for future expansion.

>
> void memory_region_init(MemoryRegion *mr,
> target_phys_addr_t size);

What is this used for?  It's not clear to me.

> void memory_region_init_io(MemoryRegion *mr,
> const MemoryRegionOps *ops,
> target_phys_addr_t size);

How does one test a MemoryRegion to determine its type?

> void memory_region_init_ram(MemoryRegion *mr,
> target_phys_addr_t size);
> void memory_region_init_ram_ptr(MemoryRegion *mr,
> target_phys_addr_t size,
> void *ptr);
> void memory_region_destroy(MemoryRegion *mr);
> void memory_region_set_offset(MemoryRegion *mr, target_phys_addr_t offset);

What's "offset" mean?

> void memory_region_set_log(MemoryRegion *mr, bool log);
> void memory_region_clear_coalescing(MemoryRegion *mr);
> void memory_region_add_coalescing(MemoryRegion *mr,
> target_phys_addr_t offset,
> target_phys_addr_t size);

I don't think it's worthwhile to try to fit coalescing into this API. 
It's a KVM-specific hack.  I think it's fine for it to stay a hacked-on API.

>
> void memory_region_add_subregion(MemoryRegion *mr,
> target_phys_addr_t offset,
> MemoryRegion *subregion);
> void memory_region_del_subregion(MemoryRegion *mr,
> target_phys_addr_t offset,
> MemoryRegion *subregion);
>
> void cpu_register_memory_region(MemoryRegion *mr, target_phys_addr_t addr);
> void cpu_unregister_memory_region(MemoryRegion *mr);
>
> #endif
>
> The API is nested: you can define, say, a PCI BAR containing RAM and
> MMIO, and give it to the PCI subsystem. PCI can then enable/disable the
> BAR and move it to different addresses without calling any callbacks;
> the client code can enable or disable logging or coalescing without
> caring if the BAR is mapped or not. For example:
>
> MemoryRegion mr, mr_mmio, mr_ram;
>
> memory_region_init(&mr, 0x100000);
> memory_region_init_io(&mr_mmio, &mmio_ops, 0x1000);
> memory_region_init_ram(&mr_ram, 0x100000);
> memory_region_add_subregion(&mr, 0, &mr_ram);
> memory_region_add_subregion(&mr, 0x10000, &mr_mmio);
> memory_region_add_coalescing(&mr_ram, 0, 0x100000);
> pci_register_bar(&pci_dev, 0, &mr);
>
> at this point the PCI subsystem knows everything about the BAR and can
> enable or disable it, or move it around, without further help from the
> device code. On the other hand, the device code can change logging or
> coalescing, or even change the structure of the region, without caring
> about whether the region is currently registered or not.
>
> If we can agree on the API, then I think the way forward is to implement
> it in terms of the old API, change over all devices, then fold the old
> API into the new one.

I think it pretty much makes sense to me.  It might be worthwhile to 
allow memory region definitions to be static.  For instance:

MemoryRegion e1000_bar = { ... };

Regards,

Anthony Liguori


* Re: [Qemu-devel] [RFC] Memory API
  2011-05-18 14:36   ` Avi Kivity
@ 2011-05-18 15:11     ` Jan Kiszka
  2011-05-18 15:17       ` Peter Maydell
  2011-05-18 15:23       ` Avi Kivity
  0 siblings, 2 replies; 187+ messages in thread
From: Jan Kiszka @ 2011-05-18 15:11 UTC (permalink / raw)
  To: Avi Kivity; +Cc: qemu-devel

On 2011-05-18 16:36, Avi Kivity wrote:
>> I would add another drawback:
>>
>>   - Inability to identify the origin of region accesses and handle them
>>     differently based on the source.
>>
>>     That is at least problematic for the x86 APIC which is CPU local. Our
>>     current way to deal with it is, well, very creative and falls to
>>     dust if a guest actually tries to remap the APIC.
>>
>> However, I'm unsure if that can easily be addressed. As long as only x86
>> is affected, it's tricky to ask for a big infrastructure to handle this
>> special case. Maybe there are some other use cases, I don't know.
> 
> We could implement it with a per-cpu MemoryRegion, with each cpu's 
> MemoryRegion populated by a different APIC sub-region.

The tricky part is wiring this up efficiently for TCG, i.e. in QEMU's
softmmu. I played with passing the issuing CPUState (or NULL for
devices) down the MMIO handler chain. Not totally beautiful, as
decentralized dispatching was still required, but at least only
moderately invasive. Maybe your API allows for cleaning up the
management and dispatching part, need to rethink...
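
For reference, the callback variant from that experiment might look like
this (a hypothetical shape, not part of the RFC):

    /* like MemoryReadFunc, but carrying the access origin;
       source would be NULL for device/DMA-initiated accesses */
    typedef uint32_t (*MemoryReadFuncCPU)(MemoryRegion *mr,
                                          CPUState *source,
                                          target_phys_addr_t addr);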

> 
>>
>>>
>>>  To fix that, I propose a new API to replace the existing one:
>>>
>>>
>>>  #ifndef MEMORY_H
>>>  #define MEMORY_H
>>>
>>>  typedef struct MemoryRegionOps MemoryRegionOps;
>>>  typedef struct MemoryRegion MemoryRegion;
>>>
>>>  typedef uint32_t (*MemoryReadFunc)(MemoryRegion *mr, target_phys_addr_t
>>>  addr);
>>>  typedef void (*MemoryWriteFunc)(MemoryRegion *mr, target_phys_addr_t addr,
>>>                                  uint32_t data);
>>>
>>>  struct MemoryRegionOps {
>>>      MemoryReadFunc readb, readw, readl;
>>>      MemoryWriteFunc writeb, writew, writel;
>>>  };
>>>
>>>  struct MemoryRegion {
>>>      const MemoryRegionOps *ops;
>>>      target_phys_addr_t size;
>>>      target_phys_addr_t addr;
>>>  };
>>>
>>>  void memory_region_init(MemoryRegion *mr,
>>>                          target_phys_addr_t size);
>>
>> What use case would this abstract region cover?
> 
> An empty container, fill it with memory_region_add_subregion().

Yeah, of course.

> 
>>
>>>  void memory_region_init_io(MemoryRegion *mr,
>>>                             const MemoryRegionOps *ops,
>>>                             target_phys_addr_t size);
>>>  void memory_region_init_ram(MemoryRegion *mr,
>>>                              target_phys_addr_t size);
>>>  void memory_region_init_ram_ptr(MemoryRegion *mr,
>>>                                  target_phys_addr_t size,
>>>                                  void *ptr);
>>>  void memory_region_destroy(MemoryRegion *mr);
>>>  void memory_region_set_offset(MemoryRegion *mr, target_phys_addr_t offset);
>>>  void memory_region_set_log(MemoryRegion *mr, bool log);
>>>  void memory_region_clear_coalescing(MemoryRegion *mr);
>>>  void memory_region_add_coalescing(MemoryRegion *mr,
>>>                                    target_phys_addr_t offset,
>>>                                    target_phys_addr_t size);
>>>
>>>  void memory_region_add_subregion(MemoryRegion *mr,
>>>                                   target_phys_addr_t offset,
>>>                                   MemoryRegion *subregion);
>>>  void memory_region_del_subregion(MemoryRegion *mr,
>>>                                   target_phys_addr_t offset,
>>>                                   MemoryRegion *subregion);
>>>
>>>  void cpu_register_memory_region(MemoryRegion *mr, target_phys_addr_t addr);
>>
>> This could create overlaps. I would suggest to reject them, so we need a
>> return code.
> 
> There is nothing we can do with a return code.  You can't fail an MMIO
> access that causes an overlapping physical memory map.

We must fail such requests to make progress with the API. That may
happen either on the caller side or in cpu_register_memory_region itself
(hwerror). Otherwise the new API will just be a shiny new facade for an
old and still fragile building.

> 
> 
>>
>>>  void cpu_unregister_memory_region(MemoryRegion *mr);
> 
> Instead, we need cpu_unregister_memory_region() to restore any 
> previously hidden ranges.

I disagree. Both approaches, rejecting overlaps or restoring them, imply
subtle semantic changes that existing device models have to deal with.
We can't use either of them without some review and conversion work. So
better head for the clearer and, thus, cleaner approach.

> 
>>>
>>>  #endif
>>>
>>>  The API is nested: you can define, say, a PCI BAR containing RAM and
>>>  MMIO, and give it to the PCI subsystem.  PCI can then enable/disable the
>>>  BAR and move it to different addresses without calling any callbacks;
>>>  the client code can enable or disable logging or coalescing without
>>>  caring if the BAR is mapped or not.  For example:
>>
>> Interesting feature.
>>
>>>
>>>    MemoryRegion mr, mr_mmio, mr_ram;
>>>
>>>    memory_region_init(&mr, 0x100000);
>>>    memory_region_init_io(&mr_mmio, &mmio_ops, 0x1000);
>>>    memory_region_init_ram(&mr_ram, 0x100000);
>>>    memory_region_add_subregion(&mr, 0, &mr_ram);
>>>    memory_region_add_subregion(&mr, 0x10000, &mr_mmio);
>>>    memory_region_add_coalescing(&mr_ram, 0, 0x100000);
>>>    pci_register_bar(&pci_dev, 0, &mr);
>>>
>>>  at this point the PCI subsystem knows everything about the BAR and can
>>>  enable or disable it, or move it around, without further help from the
>>>  device code.  On the other hand, the device code can change logging or
>>>  coalescing, or even change the structure of the region, without caring
>>>  about whether the region is currently registered or not.
>>>
>>>  If we can agree on the API, then I think the way forward is to implement
>>>  it in terms of the old API, change over all devices, then fold the old
>>>  API into the new one.
>>
>> There are more aspects that should be clarified before moving forward:
>>   - How to maintain memory regions internally?
> 
> Not sure what you mean by the question, but my plan was to have the 
> client responsible for allocating the objects (and later use 
> container_of() in the callbacks - note there are no void *s any longer).
> 
>>   - Get rid of wasteful PhysPageDesc at this chance?
> 
> That's the intent, but not at this chance, rather later on.

The features you expose to the users somehow have to be mapped onto data
structures internally. Those need to support both
registration/deregistration and lookup efficiently. By postponing that
internal design until we have already switched to the facade, we risk
having exposed a suboptimal interface and done the conversion in vain.

>  But I want 
> the API to be compatible with the goal so we don't have to touch all 
> devices again.

We can't perform any proper change in the area without touching all
users, some a bit more, some only minimally.

> 
>>   - How to hook into the region maintenance (CPUPhysMemoryClient,
>>     listening vs. filtering or modifying changes)? How to simplify
>>     memory clients this way?
> 
> I'd leave things as is, at least for the beginning.  CPUPhysMemoryClient 
> is global in nature, whereas MemoryRegion is local (offsets are relative 
> to the containing region).

See [1]: We really need to get rid of slot management on the
CPUPhysMemoryClient side. Your API provides a perfect opportunity to
establish the infrastructure for slot tracking at a central place. We can
then switch from reporting cpu_registering_memory events to reporting
coalesced changes to slots, the same slots that the core uses. So a new
CPUPhysMemoryClient API needs to be considered in this API change as
well - or we change twice in the end.

Jan

[1] http://thread.gmane.org/gmane.comp.emulators.qemu/102893

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


* Re: [Qemu-devel] [RFC] Memory API
  2011-05-18 14:05 ` Jan Kiszka
  2011-05-18 14:36   ` Avi Kivity
@ 2011-05-18 15:14   ` Anthony Liguori
  2011-05-18 15:26     ` Avi Kivity
  1 sibling, 1 reply; 187+ messages in thread
From: Anthony Liguori @ 2011-05-18 15:14 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Avi Kivity, qemu-devel

On 05/18/2011 09:05 AM, Jan Kiszka wrote:
> On 2011-05-18 15:12, Avi Kivity wrote:
>> The current memory APIs (cpu_register_io_memory,
>> cpu_register_physical_memory) suffer from a number of drawbacks:
>>
>> - lack of centralized bookkeeping
>>     - a cpu_register_physical_memory() that overlaps an existing region
>> will overwrite the preexisting region; but a following
>> cpu_unregister_physical_memory() will not restore things to status quo
>> ante.
>
> Restoring is not the problem. The problem is that the current API
> deletes or truncates regions implicitly by overwriting. That makes
> tracking the layout hard, and it is also error-prone as the device that
> accidentally overlaps with some other device won't receive a
> notification of this potential conflict.
>
>> Such implicit truncation or deletion must be avoided in a new API,
> forcing the users to explicitly reference an existing region when
> dropping or modifying it. But your API goes in the right direction.
>
>>     - coalescing and dirty logging are all cleared on unregistering, so
>> the client has to re-issue them when re-registering
>> - lots of opaques
>> - no nesting
>>     - if a region has multiple subregions that need different handling
>> (different callbacks, RAM interspersed with MMIO) then client code has
>> to deal with that manually
>>     - we can't interpose code between a device and global memory handling
>
> I would add another drawback:
>
>   - Inability to identify the origin of region accesses and handle them
>     differently based on the source.
>
>     That is at least problematic for the x86 APIC which is CPU local. Our
>     current way to deal with it is, well, very creative and falls to
>     dust if a guest actually tries to remap the APIC.

This is about registration.  Right now you can only register IO 
intercepts in the chipset, not on a per-CPU basis.  We could just as 
easily have:

CPUState {
     MemoryRegion apic_region;
};

per_cpu_register_memory(env, &env->apic_region);

To make that work.

We need to revamp registration.  We should be able to register at the 
following levels:

1) per-CPU  (new API)

2) chipset  (effectively the current cpu_register_physical_memory)

3) per-BUS  (pci_register_bar())

Regards,

Anthony Liguori


* Re: [Qemu-devel] [RFC] Memory API
  2011-05-18 15:11     ` Jan Kiszka
@ 2011-05-18 15:17       ` Peter Maydell
  2011-05-18 15:30         ` Jan Kiszka
  2011-05-18 15:23       ` Avi Kivity
  1 sibling, 1 reply; 187+ messages in thread
From: Peter Maydell @ 2011-05-18 15:17 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Avi Kivity, qemu-devel

On 18 May 2011 16:11, Jan Kiszka <jan.kiszka@siemens.com> wrote:
> On 2011-05-18 16:36, Avi Kivity wrote:
>> There is nothing we can do with a return code.  You can't fail an MMIO
>> access that causes an overlapping physical memory map.
>
> We must fail such requests to make progress with the API. That may
> happen either on the caller side or in cpu_register_memory_region itself
> (hwerror). Otherwise the new API will just be a shiny new facade for an
> old and still fragile building.

If we don't allow overlapping regions, then how do you implement
things like "on startup board maps ROM into lower addresses
over top of devices, but later it is unmapped and you can see
the underlying devices" ? (You can't currently do this AFAIK,
and it would be nice if the new API supported it.)

-- PMM


* Re: [Qemu-devel] [RFC] Memory API
  2011-05-18 15:11     ` Jan Kiszka
  2011-05-18 15:17       ` Peter Maydell
@ 2011-05-18 15:23       ` Avi Kivity
  2011-05-18 15:36         ` Jan Kiszka
  2011-05-18 16:33         ` Anthony Liguori
  1 sibling, 2 replies; 187+ messages in thread
From: Avi Kivity @ 2011-05-18 15:23 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: qemu-devel

On 05/18/2011 06:11 PM, Jan Kiszka wrote:
> On 2011-05-18 16:36, Avi Kivity wrote:
> >>  I would add another drawback:
> >>
> >>    - Inability to identify the origin of region accesses and handle them
> >>      differently based on the source.
> >>
> >>      That is at least problematic for the x86 APIC which is CPU local. Our
> >>      current way to deal with it is, well, very creative and falls to
> >>      dust if a guest actually tries to remap the APIC.
> >>
> >>  However, I'm unsure if that can easily be addressed. As long as only x86
> >>  is affected, it's tricky to ask for a big infrastructure to handle this
> >>  special case. Maybe there are some other use cases, I don't know.
> >
> >  We could implement it with a per-cpu MemoryRegion, with each cpu's
> >  MemoryRegion populated by a different APIC sub-region.
>
> The tricky part is wiring this up efficiently for TCG, ie. in QEMU's
> softmmu. I played with passing the issuing CPUState (or NULL for
> devices) down the MMIO handler chain. Not totally beautiful as
> decentralized dispatching was still required, but at least only
> moderately invasive. Maybe your API allows for cleaning up the
> management and dispatching part, need to rethink...

My suggestion is the opposite - have a different MemoryRegion for each (e.g. 
CPUState::memory).  Then the TLBs will resolve to a different ram_addr_t 
for the same physical address, for the local APIC range.

> >
> >  There is nothing we can do with a return code.  You can't fail an MMIO
> >  access that causes an overlapping physical memory map.
>
> We must fail such requests to make progress with the API. That may
> happen either on the caller side or in cpu_register_memory_region itself
> (hwerror). Otherwise the new API will just be a shiny new facade for an
> old and still fragile building.

> >>
> >>>   void cpu_unregister_memory_region(MemoryRegion *mr);
> >
> >  Instead, we need cpu_unregister_memory_region() to restore any
> >  previously hidden ranges.
>
> I disagree. Both approaches, rejecting overlaps or restoring them, imply
> subtle semantic changes that existing device models have to deal with.
> We can't use either of them without some review and conversion work. So
> better head for the clearer and, thus, cleaner approach.

We need to head for the more hardware-like approach.  What happens when 
you program overlapping BARs?  I imagine the result is 
implementation-defined, but ends up with one region decoded in 
preference to the other.  There is simply no way to reject an 
overlapping mapping.

> >>    - Get rid of wasteful PhysPageDesc at this chance?
> >
> >  That's the intent, but not at this chance, rather later on.
>
> The features you expose to the users somehow have to be mapped on data
> structures internally. Those need to support both the
> registration/deregistration as well as the lookup efficiently. By
> postponing that internal design to the point when we already switched to
> facade, the risk arises that a suboptimal interface was exposed and
> conversion was done in vain.

That is true, and one of the reasons this is posted as an [RFC] (I 
usually prefer [PATCH]).  However, I don't see a way to change the 
internals without changing the API.

> >   But I want
> >  the API to be compatible with the goal so we don't have to touch all
> >  devices again.
>
> We can't perform any proper change in the area without touching all
> users, some a bit more, some only minimally.

Right.  I want to touch the devices *once*, to convert them to this API, 
then work on the internals.

> >
> >>    - How to hook into the region maintenance (CPUPhysMemoryClient,
> >>      listening vs. filtering or modifying changes)? How to simplify
> >>      memory clients this way?
> >
> >  I'd leave things as is, at least for the beginning.  CPUPhysMemoryClient
> >  is global in nature, whereas MemoryRegion is local (offsets are relative
> >  to the containing region).
>
> See [1]: We really need to get rid of slot management on the
> CPUPhysMemoryClient side. Your API provides a perfect opportunity to
> establish the infrastructure for slot tracking at a central place. We can
> then switch from reporting cpu_registering_memory events to reporting
> coalesced changes to slots, the same slots that the core uses. So a new
> CPUPhysMemoryClient API needs to be considered in this API change as
> well - or we change twice in the end.

The kvm memory slots (and hopefully future qemu memory slots) are a 
flattened view of the MemoryRegion tree.  There is no 1:1 mapping.
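
For instance, the BAR example from the RFC (assuming the MMIO subregion
takes precedence where it overlaps the RAM) flattens into three slots:

    [base+0x00000, base+0x10000)   RAM   (head of mr_ram)
    [base+0x10000, base+0x11000)   MMIO  (mr_mmio)
    [base+0x11000, base+0x100000)  RAM   (tail of mr_ram)

so a single MemoryRegion can produce several slots.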


-- 
error compiling committee.c: too many arguments to function


* Re: [Qemu-devel] [RFC] Memory API
  2011-05-18 15:14   ` Anthony Liguori
@ 2011-05-18 15:26     ` Avi Kivity
  2011-05-18 16:21       ` Avi Kivity
  0 siblings, 1 reply; 187+ messages in thread
From: Avi Kivity @ 2011-05-18 15:26 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Jan Kiszka, qemu-devel

On 05/18/2011 06:14 PM, Anthony Liguori wrote:
> On 05/18/2011 09:05 AM, Jan Kiszka wrote:
>> On 2011-05-18 15:12, Avi Kivity wrote:
>>> The current memory APIs (cpu_register_io_memory,
>>> cpu_register_physical_memory) suffer from a number of drawbacks:
>>>
>>> - lack of centralized bookkeeping
>>>     - a cpu_register_physical_memory() that overlaps an existing region
>>> will overwrite the preexisting region; but a following
>>> cpu_unregister_physical_memory() will not restore things to status quo
>>> ante.
>>
>> Restoring is not the problem. The problem is that the current API
>> deletes or truncates regions implicitly by overwriting. That makes
>> tracking the layout hard, and it is also error-prone as the device that
>> accidentally overlaps with some other device won't receive a
>> notification of this potential conflict.
>>
>> Such implicit truncation or deletion must be avoided in a new API,
>> forcing the users to explicitly reference an existing region when
>> dropping or modifying it. But your API goes in the right direction.
>>
>>>     - coalescing and dirty logging are all cleared on unregistering, so
>>> the client has to re-issue them when re-registering
>>> - lots of opaques
>>> - no nesting
>>>     - if a region has multiple subregions that need different handling
>>> (different callbacks, RAM interspersed with MMIO) then client code has
>>> to deal with that manually
>>>     - we can't interpose code between a device and global memory 
>>> handling
>>
>> I would add another drawback:
>>
>>   - Inability to identify the origin of region accesses and handle them
>>     differently based on the source.
>>
>>     That is at least problematic for the x86 APIC which is CPU local. Our
>>     current way to deal with it is, well, very creative and falls to
>>     dust if a guest actually tries to remap the APIC.
>
> This is about registration.  Right now you can only register IO 
> intercepts in the chipset, not on a per-CPU basis.  We could just as 
> easily have:
>
> CPUState {
>     MemoryRegion apic_region;
> };
>
> per_cpu_register_memory(env, &env->apic_region);
>

Right.  Or all memory per-cpu, with two sub-regions:

  - global memory
  - overlaid apic memory

For this, we need to have well-defined semantics for overlap (perhaps a 
priority argument to memory_region_add_subregion).
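
A sketch with such a hypothetical priority parameter (higher value wins
on overlap; the per-cpu fields are invented too):

   memory_region_add_subregion(&cpu->memory, 0, &global_memory,
                               0 /* priority */);
   memory_region_add_subregion(&cpu->memory, APIC_BASE, &cpu->apic_mr,
                               1 /* overlays global memory */);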

> To make that work.
>
> We need to revamp registration.  We should be able to register at the 
> following levels:
>
> 1) per-CPU  (new API)
>
> 2) chipset  (effectively the current cpu_register_physical_memory)
>
> 3) per-BUS  (pci_register_bar())

The important thing is that all of these take a MemoryRegion argument, 
so we don't have to duplicate the coalescing, logging, and other APIs.

-- 
error compiling committee.c: too many arguments to function


* Re: [Qemu-devel] [RFC] Memory API
  2011-05-18 15:17       ` Peter Maydell
@ 2011-05-18 15:30         ` Jan Kiszka
  2011-05-18 19:10           ` Anthony Liguori
  0 siblings, 1 reply; 187+ messages in thread
From: Jan Kiszka @ 2011-05-18 15:30 UTC (permalink / raw)
  To: Peter Maydell; +Cc: Avi Kivity, qemu-devel

On 2011-05-18 17:17, Peter Maydell wrote:
> On 18 May 2011 16:11, Jan Kiszka <jan.kiszka@siemens.com> wrote:
>> On 2011-05-18 16:36, Avi Kivity wrote:
>>> There is nothing we can do with a return code.  You can't fail an MMIO
>>> access that causes an overlapping physical memory map.
>>
>> We must fail such requests to make progress with the API. That may
>> happen either on the caller side or in cpu_register_memory_region itself
>> (hwerror). Otherwise the new API will just be a shiny new facade for an
>> old and still fragile building.
> 
> If we don't allow overlapping regions, then how do you implement
> things like "on startup board maps ROM into lower addresses
> over top of devices, but later it is unmapped and you can see
> the underlying devices" ? (You can't currently do this AFAIK,
> and it would be nice if the new API supported it.)

Right, we can't do this properly, and that's why the attempt in the
i440fx chipset model is so horribly broken ATM.

Just allowing overlapping does not solve this problem either. It does
not specify which region is on top and which is below (even worse if
multiple regions overlap at the same place).

We need some managing instance here, and that is e.g. the chipset, which
provides control over the overlap in reality. It could hook up a
PhysMemClient to receive and redirect updates to subregions, or only
allow registering them in a disabled state.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


* Re: [Qemu-devel] [RFC] Memory API
  2011-05-18 15:23       ` Avi Kivity
@ 2011-05-18 15:36         ` Jan Kiszka
  2011-05-18 15:42           ` Avi Kivity
  2011-05-18 16:33         ` Anthony Liguori
  1 sibling, 1 reply; 187+ messages in thread
From: Jan Kiszka @ 2011-05-18 15:36 UTC (permalink / raw)
  To: Avi Kivity; +Cc: qemu-devel

On 2011-05-18 17:23, Avi Kivity wrote:
> On 05/18/2011 06:11 PM, Jan Kiszka wrote:
>> On 2011-05-18 16:36, Avi Kivity wrote:
>>>>  I would add another drawback:
>>>>
>>>>    - Inability to identify the origin of region accesses and handle them
>>>>      differently based on the source.
>>>>
>>>>      That is at least problematic for the x86 APIC which is CPU local. Our
>>>>      current way to deal with it is, well, very creative and falls to
>>>>      dust if a guest actually tries to remap the APIC.
>>>>
>>>>  However, I'm unsure if that can easily be addressed. As long as only x86
>>>>  is affected, it's tricky to ask for a big infrastructure to handle this
>>>>  special case. Maybe there are some other use cases, I don't know.
>>>
>>>  We could implement it with a per-cpu MemoryRegion, with each cpu's
>>>  MemoryRegion populated by a different APIC sub-region.
>>
>> The tricky part is wiring this up efficiently for TCG, ie. in QEMU's
>> softmmu. I played with passing the issuing CPUState (or NULL for
>> devices) down the MMIO handler chain. Not totally beautiful as
>> decentralized dispatching was still required, but at least only
>> moderately invasive. Maybe your API allows for cleaning up the
>> management and dispatching part, need to rethink...
> 
> My suggestion is the opposite - have a different MemoryRegion for each (e.g. 
> CPUState::memory).  Then the TLBs will resolve to a different ram_addr_t 
> for the same physical address, for the local APIC range.

Need to recheck, but I think I dropped that idea due to the invasiveness
of the implementation. Not a very good argument, but without a clear
picture of how useful this per-cpu registration is beyond x86, that was
more important.

> 
>>>
>>>  There is nothing we can do with a return code.  You can't fail an MMIO
>>>  access that causes an overlapping physical memory map.
>>
>> We must fail such requests to make progress with the API. That may
>> happen either on the caller side or in cpu_register_memory_region itself
>> (hwerror). Otherwise the new API will just be a shiny new facade for an
>> old and still fragile building.
> 
>>>>
>>>>>   void cpu_unregister_memory_region(MemoryRegion *mr);
>>>
>>>  Instead, we need cpu_unregister_memory_region() to restore any
>>>  previously hidden ranges.
>>
>> I disagree. Both approaches, rejecting overlaps or restoring them, imply
>> subtle semantic changes that existing device models have to deal with.
>> We can't use either of them without some review and conversion work. So
>> better head for the clearer and, thus, cleaner approach.
> 
> We need to head for the more hardware-like approach.  What happens when 
> you program overlapping BARs?  I imagine the result is 
> implementation-defined, but ends up with one region decoded in 
> preference to the other.  There is simply no way to reject an 
> overlapping mapping.

But there is also no simple way to allow them. At least not without
exposing control over their ordering AND allowing managing code (e.g. of
the PCI bridge or the chipset) to hook in and control registrations.

...
>> See [1]: We really need to get rid of slot management on the
>> CPUPhysMemoryClient side. Your API provides a perfect opportunity to
>> establish the infrastructure for slot tracking at a central place. We can
>> then switch from reporting cpu_registering_memory events to reporting
>> coalesced changes to slots, the same slots that the core uses. So a new
>> CPUPhysMemoryClient API needs to be considered in this API change as
>> well - or we change twice in the end.
> 
> The kvm memory slots (and hopefully future qemu memory slots) are a 
> flattened view of the MemoryRegion tree.  There is no 1:1 mapping.

We need a flattened view of your memory regions during runtime as well. No
major difference here. If we share that view with PhysMemClients, they
can drop most of their creative slot tracking algorithms, focusing on
the real differences.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


* Re: [Qemu-devel] [RFC] Memory API
  2011-05-18 15:08 ` Anthony Liguori
@ 2011-05-18 15:37   ` Avi Kivity
  2011-05-18 19:36     ` Jan Kiszka
  2011-05-18 15:47   ` Stefan Weil
  1 sibling, 1 reply; 187+ messages in thread
From: Avi Kivity @ 2011-05-18 15:37 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: qemu-devel

On 05/18/2011 06:08 PM, Anthony Liguori wrote:
> On 05/18/2011 08:12 AM, Avi Kivity wrote:
>> The current memory APIs (cpu_register_io_memory,
>> cpu_register_physical_memory) suffer from a number of drawbacks:
>>
>> - lack of centralized bookkeeping
>> - a cpu_register_physical_memory() that overlaps an existing region will
>> overwrite the preexisting region; but a following
>> cpu_unregister_physical_memory() will not restore things to status quo
>> ante.
>> - coalescing and dirty logging are all cleared on unregistering, so the
>> client has to re-issue them when re-registering
>> - lots of opaques
>> - no nesting
>> - if a region has multiple subregions that need different handling
>> (different callbacks, RAM interspersed with MMIO) then client code has
>> to deal with that manually
>> - we can't interpose code between a device and global memory handling
>>
>> To fix that, I propose a new API to replace the existing one:
>>
>>
>> #ifndef MEMORY_H
>> #define MEMORY_H
>>
>> typedef struct MemoryRegionOps MemoryRegionOps;
>> typedef struct MemoryRegion MemoryRegion;
>>
>> typedef uint32_t (*MemoryReadFunc)(MemoryRegion *mr, target_phys_addr_t
>> addr);
>> typedef void (*MemoryWriteFunc)(MemoryRegion *mr, target_phys_addr_t 
>> addr,
>> uint32_t data);
>
>
> The API should be 64-bit and needs to have a size associated with it.

That will make conversion hard.  Maybe we can have a native with-size 
callback, and a helper that converts to the traditional form.  Otherwise 
the effort to fully convert the tree increases dramatically.
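
A sketch of such a helper (the name is hypothetical; it dispatches a
sized access to the fixed-width callbacks from the RFC):

    static uint64_t memory_region_read_legacy(MemoryRegion *mr,
                                              target_phys_addr_t addr,
                                              unsigned size)
    {
        switch (size) {
        case 1: return mr->ops->readb(mr, addr);
        case 2: return mr->ops->readw(mr, addr);
        case 4: return mr->ops->readl(mr, addr);
        default: abort();   /* no 64-bit callback in the RFC yet */
        }
    }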

>>
>> void memory_region_init(MemoryRegion *mr,
>> target_phys_addr_t size);
>
> What is this used for?  It's not clear to me.

Empty container, fill with memory_region_add_subregion() to taste.

>
>> void memory_region_init_io(MemoryRegion *mr,
>> const MemoryRegionOps *ops,
>> target_phys_addr_t size);
>
> How does one test a MemoryRegion to determine its type?

One doesn't.  Of course the low-level will need to do this (via a 
private API that isn't exposed to devices).
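
E.g. something along these lines in a private header (a sketch; none of
this exists):

    /* memory-internal.h - not visible to device code */
    enum MemoryRegionType { MR_CONTAINER, MR_IO, MR_RAM, MR_RAM_PTR };

    enum MemoryRegionType memory_region_type(const MemoryRegion *mr);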

>> void memory_region_init_ram(MemoryRegion *mr,
>> target_phys_addr_t size);
>> void memory_region_init_ram_ptr(MemoryRegion *mr,
>> target_phys_addr_t size,
>> void *ptr);
>> void memory_region_destroy(MemoryRegion *mr);
>> void memory_region_set_offset(MemoryRegion *mr, target_phys_addr_t 
>> offset);
>
> What's "offset" mean?

It's the equivalent of cpu_register_physical_memory_offset().

Note the intent is to have addresses always be relative to the innermost 
container.  Perhaps we can get away without offset.

>
>> void memory_region_set_log(MemoryRegion *mr, bool log);
>> void memory_region_clear_coalescing(MemoryRegion *mr);
>> void memory_region_add_coalescing(MemoryRegion *mr,
>> target_phys_addr_t offset,
>> target_phys_addr_t size);
>
> I don't think it's worth while to try to fit coalescing into this API. 
> It's a KVM specific hack.  I think it's fine to be a hacked on API.

The problem is that only the device knows about coalescing, while the 
region can be mapped, unmapped, or moved without device knowledge.  So 
if a PCI BAR is unmapped and then remapped (possibly at a different address) 
we need to tell kvm about it, but there is no device callback involved.

> I think it pretty much makes sense to me.  It might be worthwhile to 
> allow memory region definitions to be static.  For instance:
>
> MemoryRegion e1000_bar = { ... };
>

It doesn't work in general, since multiple instances of e1000 will each 
need its own e1000_bar.  So e1000_bar will be a member of the e1000 
device state structure.
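
I.e. something like this sketch (the E1000State fields and the ops name
are illustrative):

    typedef struct E1000State {
        PCIDevice dev;
        MemoryRegion bar;       /* per-instance, not static */
    } E1000State;

    static int e1000_init(PCIDevice *pci_dev)
    {
        E1000State *s = DO_UPCAST(E1000State, dev, pci_dev);

        memory_region_init_io(&s->bar, &e1000_mmio_ops, 0x20000);
        pci_register_bar(&s->dev, 0, &s->bar);
        return 0;
    }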

-- 
error compiling committee.c: too many arguments to function


* Re: [Qemu-devel] [RFC] Memory API
  2011-05-18 15:36         ` Jan Kiszka
@ 2011-05-18 15:42           ` Avi Kivity
  2011-05-18 16:00             ` Jan Kiszka
  2011-05-19  9:08             ` Gleb Natapov
  0 siblings, 2 replies; 187+ messages in thread
From: Avi Kivity @ 2011-05-18 15:42 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: qemu-devel

On 05/18/2011 06:36 PM, Jan Kiszka wrote:
> >
> >  We need to head for the more hardware-like approach.  What happens when
> >  you program overlapping BARs?  I imagine the result is
> >  implementation-defined, but ends up with one region decoded in
> >  preference to the other.  There is simply no way to reject an
> >  overlapping mapping.
>
> But there is also now simple way to allow them. At least not without
> exposing control about their ordering AND allowing to hook up managing
> code (e.g. of the PCI bridge or the chipset) that controls registrations.

What about memory_region_add_subregion(..., int priority) as I suggested 
in another message?

Regarding bridges, every registration request flows through them so they 
already have full control.

> ...
> >>  See [1]: We really need to get rid of slot management on
> >>  CPUPhysMemoryClient side. Your API provides a perfect opportunity to
> >>  establish the infrastructure of slot tracking at a central place. We can
> >>  then switch from reporting cpu_registering_memory events to reporting
> >>  coalesced changes to slots, those slot that also the core uses. So a new
> >>  CPUPhysMemoryClient API needs to be considered in this API change as
> >>  well - or we change twice in the end.
> >
> >  The kvm memory slots (and hopefully future qemu memory slots) are a
> >  flattened view of the MemoryRegion tree.  There is no 1:1 mapping.
>
> We need a flatted view of your memory regions during runtime as well. No
> major difference here. If we share that view with PhysMemClients, they
> can drop most of their creative slot tracking algorithms, focusing on
> the real differences.

We'll definitely have a flattened view (phys_desc is such a flattened 
view, hopefully we'll have a better one).

We can basically run a tree walk on each change, emitting ranges in 
order and sending them to PhysMemClients.
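
Roughly like this (sketch only: the subregion list, its ordering, and
overlap/priority resolution are all assumed, not part of the RFC; gaps
in containers are ignored):

    static void render(MemoryRegion *mr, target_phys_addr_t base,
                       void (*emit)(target_phys_addr_t addr,
                                    target_phys_addr_t size,
                                    MemoryRegion *leaf))
    {
        MemoryRegion *sub;

        if (QTAILQ_EMPTY(&mr->subregions)) {       /* assumed field */
            emit(base + mr->addr, mr->size, mr);   /* leaf: RAM or I/O */
            return;
        }
        QTAILQ_FOREACH(sub, &mr->subregions, link) {
            render(sub, base + mr->addr, emit);
        }
    }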

-- 
error compiling committee.c: too many arguments to function


* Re: [Qemu-devel] [RFC] Memory API
  2011-05-18 15:08 ` Anthony Liguori
  2011-05-18 15:37   ` Avi Kivity
@ 2011-05-18 15:47   ` Stefan Weil
  2011-05-18 16:06     ` Avi Kivity
  1 sibling, 1 reply; 187+ messages in thread
From: Stefan Weil @ 2011-05-18 15:47 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Avi Kivity, qemu-devel

On 18.05.2011 17:08, Anthony Liguori wrote:
> On 05/18/2011 08:12 AM, Avi Kivity wrote:
>> The current memory APIs (cpu_register_io_memory,
>> cpu_register_physical_memory) suffer from a number of drawbacks:
>>
>> - lack of centralized bookkeeping
>> - a cpu_register_physical_memory() that overlaps an existing region will
>> overwrite the preexisting region; but a following
>> cpu_unregister_physical_memory() will not restore things to status quo
>> ante.
>> - coalescing and dirty logging are all cleared on unregistering, so the
>> client has to re-issue them when re-registering
>> - lots of opaques
>> - no nesting
>> - if a region has multiple subregions that need different handling
>> (different callbacks, RAM interspersed with MMIO) then client code has
>> to deal with that manually
>> - we can't interpose code between a device and global memory handling
>>
>> To fix that, I propose a new API to replace the existing one:
>>
>>
>> #ifndef MEMORY_H
>> #define MEMORY_H
>>
>> typedef struct MemoryRegionOps MemoryRegionOps;
>> typedef struct MemoryRegion MemoryRegion;
>>
>> typedef uint32_t (*MemoryReadFunc)(MemoryRegion *mr, target_phys_addr_t
>> addr);
>> typedef void (*MemoryWriteFunc)(MemoryRegion *mr, target_phys_addr_t 
>> addr,
>> uint32_t data);
>
>
> The API should be 64-bit and needs to have a size associated with it.
>
>> struct MemoryRegionOps {
>> MemoryReadFunc readb, readw, readl;
>> MemoryWriteFunc writeb, writew, writel;
>> };
>
> I'm not a fan of having per-access type function pointers.

Are you thinking of something like these declarations:

typedef uint64_t (*MemoryReadFunc)(MemoryRegion *mr, target_phys_addr_t addr,
                                   size_t size);
typedef void (*MemoryWriteFunc)(MemoryRegion *mr, target_phys_addr_t addr,
                                uint64_t data, size_t size);

For 32-bit hosts/targets, this would mean some unnecessary overhead.

What about passing values by address:

typedef void (*MemoryReadFunc)(MemoryRegion *mr, target_phys_addr_t addr,
                               void *data, size_t size);
typedef void (*MemoryWriteFunc)(MemoryRegion *mr, target_phys_addr_t addr,
                                const void *data, size_t size);

If we keep per-access type function pointers, they should use individual
prototypes for the different access types:

typedef uint8_t  (*MemoryReadbFunc)(MemoryRegion *mr, target_phys_addr_t addr);
typedef uint16_t (*MemoryReadwFunc)(MemoryRegion *mr, target_phys_addr_t addr);
typedef uint32_t (*MemoryReadlFunc)(MemoryRegion *mr, target_phys_addr_t addr);
typedef uint64_t (*MemoryReadllFunc)(MemoryRegion *mr, target_phys_addr_t addr);
...

Regards,
Stefan W.


* Re: [Qemu-devel] [RFC] Memory API
  2011-05-18 13:12 [Qemu-devel] [RFC] Memory API Avi Kivity
  2011-05-18 14:05 ` Jan Kiszka
  2011-05-18 15:08 ` Anthony Liguori
@ 2011-05-18 15:58 ` Avi Kivity
  2011-05-18 16:26 ` Richard Henderson
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 187+ messages in thread
From: Avi Kivity @ 2011-05-18 15:58 UTC (permalink / raw)
  To: qemu-devel

On 05/18/2011 04:12 PM, Avi Kivity wrote:
>
> void cpu_register_memory_region(MemoryRegion *mr, target_phys_addr_t 
> addr);
> void cpu_unregister_memory_region(MemoryRegion *mr);

These two can probably be implemented as:

     extern MemoryRegion memory_map;

     void cpu_register_memory_region(MemoryRegion *mr, target_phys_addr_t addr)
     {
         memory_region_add_subregion(&memory_map, addr, mr);
     }

     etc.
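
where "etc." presumably covers the inverse, e.g. (a sketch, assuming
mr->addr records the offset the region was mapped at):

     void cpu_unregister_memory_region(MemoryRegion *mr)
     {
         memory_region_del_subregion(&memory_map, mr->addr, mr);
     }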

Eventually when everything is in a bus relationship they'll just go away.

-- 
error compiling committee.c: too many arguments to function


* Re: [Qemu-devel] [RFC] Memory API
  2011-05-18 15:42           ` Avi Kivity
@ 2011-05-18 16:00             ` Jan Kiszka
  2011-05-18 16:14               ` Avi Kivity
  2011-05-19  9:08             ` Gleb Natapov
  1 sibling, 1 reply; 187+ messages in thread
From: Jan Kiszka @ 2011-05-18 16:00 UTC (permalink / raw)
  To: Avi Kivity; +Cc: qemu-devel

On 2011-05-18 17:42, Avi Kivity wrote:
> On 05/18/2011 06:36 PM, Jan Kiszka wrote:
>>>
>>>  We need to head for the more hardware-like approach.  What happens when
>>>  you program overlapping BARs?  I imagine the result is
>>>  implementation-defined, but ends up with one region decoded in
>>>  preference to the other.  There is simply no way to reject an
>>>  overlapping mapping.
>>
>> But there is also no simple way to allow them. At least not without
>> exposing control over their ordering AND allowing managing code (e.g. of
>> the PCI bridge or the chipset) that controls registrations to be hooked up.
> 
> What about memory_region_add_subregion(..., int priority) as I suggested 
> in another message?

That's fine, but also requires a change in how, or rather where, devices
register their regions.

> 
> Regarding bridges, every registration request flows through them so they 
> already have full control.

Not everything is PCI; we also have ISA, for example. If we were able to
route such requests through a hierarchy of abstract bridges as well, then
even better.

> 
>> ...
>>>>  See [1]: We really need to get rid of slot management on
>>>>  CPUPhysMemoryClient side. Your API provides a perfect opportunity to
>>>>  establish the infrastructure of slot tracking at a central place. We can
>>>>  then switch from reporting cpu_registering_memory events to reporting
>>>>  coalesced changes to slots, those slots that the core also uses. So a new
>>>>  CPUPhysMemoryClient API needs to be considered in this API change as
>>>>  well - or we change twice in the end.
>>>
>>>  The kvm memory slots (and hopefully future qemu memory slots) are a
>>>  flattened view of the MemoryRegion tree.  There is no 1:1 mapping.
>>
>> We need a flattened view of your memory regions during runtime as well. No
>> major difference here. If we share that view with PhysMemClients, they
>> can drop most of their creative slot tracking algorithms, focusing on
>> the real differences.
> 
> We'll definitely have a flattened view (phys_desc is such a flattened 
> view, hopefully we'll have a better one).

phys_desc is not exportable. If we try (and we do from time to time...),
we end up with more slots than clients like kvm will ever be able to handle.

> 
> We can basically run a tree walk on each change, emitting ranges in 
> order and sending them to PhysMemClients.

I'm specifically thinking of fully trackable slot updates. The clients
should not have to maintain the flat layout. They should just receive
updates in the form of slot X added/modified/removed. For now, this
magic happens multiple times in the clients, and that is very bad.

Given that not only memory clients need that view but every TLB miss
(in TCG mode) also requires identifying the effective slot, it might
be worth preparing a runtime structure at registration time that
supports this efficiently - but this time without wasting memory.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


* Re: [Qemu-devel] [RFC] Memory API
  2011-05-18 15:47   ` Stefan Weil
@ 2011-05-18 16:06     ` Avi Kivity
  2011-05-18 16:51       ` Richard Henderson
  0 siblings, 1 reply; 187+ messages in thread
From: Avi Kivity @ 2011-05-18 16:06 UTC (permalink / raw)
  To: Stefan Weil; +Cc: qemu-devel

On 05/18/2011 06:47 PM, Stefan Weil wrote:
>> I'm not a fan of having per-access type function pointers.
>
>
> Are you thinking of something like these declarations:
>
> typedef uint64_t (*MemoryReadFunc)(MemoryRegion *mr, 
> target_phys_addr_t addr, size_t size);
> typedef void (*MemoryWriteFunc)(MemoryRegion *mr, target_phys_addr_t 
> addr, uint64_t data, size_t size);
>
> For 32-bit hosts/targets, this would mean some unnecessary overhead.

Frankly, the overhead is pretty low.  I think we can neglect it.

>
> What about passing values by address:
>
> typedef void (*MemoryReadFunc)(MemoryRegion *mr, target_phys_addr_t 
> addr, void *data, size_t size);
> typedef void (*MemoryWriteFunc)(MemoryRegion *mr, target_phys_addr_t 
> addr, const void *data, size_t size);

Those void *s will be quite annoying.  Especially on hosts which don't 
allow misaligned data, you'll never know how to reference them.

>
> If we keep per-access type function pointers, they should use 
> individual prototypes
> for the different access types:
>
> typedef uint8_t  (*MemoryReadbFunc)(MemoryRegion *mr, 
> target_phys_addr_t addr);
> typedef uint16_t (*MemoryReadwFunc)(MemoryRegion *mr, 
> target_phys_addr_t addr);
> typedef uint32_t (*MemoryReadlFunc)(MemoryRegion *mr, 
> target_phys_addr_t addr);
> typedef uint64_t (*MemoryReadllFunc)(MemoryRegion *mr, 
> target_phys_addr_t addr);
> ...

I prefer having size as an argument.

Something else I thought about:

    void memory_region_set_access_sizes(MemoryRegion *mr, int min, int max);

if, for example, min=2 and max=4, then byte accesses will be emulated as 
word accesses (RMW for writes) and quad accesses will be emulated as two 
long accesses.  So a device that emulates 32-bit registers can set 
min=max=4 and get all the other sizes for free.
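
A byte write into a min=2 region would then go through something like
this (a sketch, assuming the single read/write callback with a size
argument discussed above, and little-endian lane ordering):

    static void writeb_via_rmw(MemoryRegion *mr, target_phys_addr_t addr,
                               uint8_t data)
    {
        target_phys_addr_t base = addr & ~(target_phys_addr_t)1;
        unsigned shift = (addr & 1) * 8;
        uint32_t word = mr->ops->read(mr, base, 2);          /* read-   */
        word = (word & ~(0xffu << shift)) | (data << shift); /* modify- */
        mr->ops->write(mr, base, word, 2);                   /* write   */
    }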

-- 
error compiling committee.c: too many arguments to function


* Re: [Qemu-devel] [RFC] Memory API
  2011-05-18 16:00             ` Jan Kiszka
@ 2011-05-18 16:14               ` Avi Kivity
  2011-05-18 16:39                 ` Jan Kiszka
  0 siblings, 1 reply; 187+ messages in thread
From: Avi Kivity @ 2011-05-18 16:14 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: qemu-devel

On 05/18/2011 07:00 PM, Jan Kiszka wrote:
> On 2011-05-18 17:42, Avi Kivity wrote:
> >  On 05/18/2011 06:36 PM, Jan Kiszka wrote:
> >>>
> >>>   We need to head for the more hardware-like approach.  What happens when
> >>>   you program overlapping BARs?  I imagine the result is
> >>>   implementation-defined, but ends up with one region decoded in
> >>>   preference to the other.  There is simply no way to reject an
> >>>   overlapping mapping.
> >>
> >>  But there is also no simple way to allow them. At least not without
> >>  exposing control over their ordering AND allowing managing code (e.g. of
> >>  the PCI bridge or the chipset) that controls registrations to be hooked up.
> >
> >  What about memory_region_add_subregion(..., int priority) as I suggested
> >  in another message?
>
> That's fine, but also requires a change in how, or rather where, devices
> register their regions.

Lost you - please elaborate.

> >
> >  Regarding bridges, every registration request flows through them so they
> >  already have full control.
>
> Not everything is PCI; we also have ISA, for example. If we were able to
> route such requests through a hierarchy of abstract bridges as well, then even
> better.

Yes, it's a tree of nested MemoryRegions.

> >  We'll definitely have a flattened view (phys_desc is such a flattened
> >  view, hopefully we'll have a better one).
>
> phys_desc is not exportable. If we try (and we do from time to time...),
> we end up with more slots than clients like kvm will ever be able to handle.

If we coalesce adjacent phys_descs we end up with a minimal 
representation.  Of course that's not the most efficient implementation 
(a tree walk is better).

> >
> >  We can basically run a tree walk on each change, emitting ranges in
> >  order and sending them to PhysMemClients.
>
> I'm specifically thinking of fully trackable slot updates. The clients
> should not have to maintain the flat layout. They should just receive
> updates in the form of slot X added/modified/removed. For now, this
> magic happens multiple times in the clients, and that is very bad.

Slots don't have any meaning.  You can have a RAM region which is 
overlaid by a smaller mmio region -> the RAM slot is split into two.

We should just send clients a list of ranges, and they can associate 
them with slots themselves.

> Given that not only memory clients need that view but every TLB miss
> (in TCG mode) also requires identifying the effective slot, it might
> be worth preparing a runtime structure at registration time that
> supports this efficiently - but this time without wasting memory.

Yes.  Won't be easy though.  Perhaps a perfect hash table for small 
regions and a sorted-by-size array for large regions.

-- 
error compiling committee.c: too many arguments to function


* Re: [Qemu-devel] [RFC] Memory API
  2011-05-18 15:26     ` Avi Kivity
@ 2011-05-18 16:21       ` Avi Kivity
  2011-05-18 16:42         ` Jan Kiszka
  0 siblings, 1 reply; 187+ messages in thread
From: Avi Kivity @ 2011-05-18 16:21 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Jan Kiszka, qemu-devel

On 05/18/2011 06:26 PM, Avi Kivity wrote:
>> This is about registration.  Right now you can only register IO 
>> intercepts in the chipset, not on a per-CPU basis.  We could just as 
>> easily have:
>>
>> CPUState {
>>     MemoryRegion apic_region;
>> };
>>
>> per_cpu_register_memory(env, &env->apic_region);
>>
>
>
> Right.  Or all memory per-cpu, with two sub-regions:
>
>  - global memory
>  - overlaid apic memory
>
> for this, we need to have well defined semantics for overlap (perhaps 
> a priority argument to memory_region_add_subregion).

Or even

cpu_memory_region
   |
   +-- global memory map (prio 0)
   |    |
   |    +-- RAM (prio 0)
   |    |
   |    +-- PCI (prio 1)
   |
   +-- SMM memory (if active, prio 1)
   |
   +-- APIC memory (if active, prio 2)
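
Built roughly like this (a sketch, with the priority argument appended to
memory_region_add_subregion; names and addresses made up, and the SMM/APIC
regions would only be added while active):

    memory_region_add_subregion(&cpu_memory_region, 0, &global_map, 0);
    memory_region_add_subregion(&global_map, 0, &ram, 0);
    memory_region_add_subregion(&global_map, pci_base, &pci, 1);
    memory_region_add_subregion(&cpu_memory_region, smm_base, &smm_mem, 1);
    memory_region_add_subregion(&cpu_memory_region, apic_base, &apic_mem, 2);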

-- 
error compiling committee.c: too many arguments to function


* Re: [Qemu-devel] [RFC] Memory API
  2011-05-18 13:12 [Qemu-devel] [RFC] Memory API Avi Kivity
                   ` (2 preceding siblings ...)
  2011-05-18 15:58 ` Avi Kivity
@ 2011-05-18 16:26 ` Richard Henderson
  2011-05-18 16:37   ` Avi Kivity
  2011-05-18 17:17 ` Avi Kivity
  2011-05-18 19:40 ` Jan Kiszka
  5 siblings, 1 reply; 187+ messages in thread
From: Richard Henderson @ 2011-05-18 16:26 UTC (permalink / raw)
  To: Avi Kivity; +Cc: qemu-devel

On 05/18/2011 06:12 AM, Avi Kivity wrote:
> struct MemoryRegionOps {
>     MemoryReadFunc readb, readw, readl;
>     MemoryWriteFunc writeb, writew, writel;
> };

Look back to last May for a discussion between myself and Paul Brook
on this subject.  That started with me merely wanting to expand the
interface to support 8-byte reads/writes, and him wanting a fairly
substantial reorganization.

I'm not married to Paul's total reorg, but please include readq/writeq
support in any reorganization in this area.


r~


* Re: [Qemu-devel] [RFC] Memory API
  2011-05-18 15:23       ` Avi Kivity
  2011-05-18 15:36         ` Jan Kiszka
@ 2011-05-18 16:33         ` Anthony Liguori
  2011-05-18 16:41           ` Avi Kivity
  2011-05-18 16:42           ` Jan Kiszka
  1 sibling, 2 replies; 187+ messages in thread
From: Anthony Liguori @ 2011-05-18 16:33 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Jan Kiszka, qemu-devel

On 05/18/2011 10:23 AM, Avi Kivity wrote:
>> The tricky part is wiring this up efficiently for TCG, ie. in QEMU's
>> softmmu. I played with passing the issuing CPUState (or NULL for
>> devices) down the MMIO handler chain. Not totally beautiful as
>> decentralized dispatching was still required, but at least only
>> moderately invasive. Maybe your API allows for cleaning up the
>> management and dispatching part, need to rethink...
>
> My suggestion is opposite - have a different MemoryRegion for each (e.g.
> CPUState::memory). Then the TLBs will resolve to a different ram_addr_t
> for the same physical address, for the local APIC range.

I don't understand the different ram_addr_t part.

The TLB should dispatch to a per-CPU dispatch table.  The per-CPU table should 
dispatch almost everything to a global dispatch table.

The global dispatch table is the chipset (Northbridge/Southbridge).

The chipset can then dispatch to individual busses which can then 
further dispatch as appropriate.

Overlapping regions can be handled differently at each level.  For 
instance, if a PCI device registers an IO region to the same location as 
the APIC, the APIC always wins because the PCI bus will never see the 
access.

You cannot do this properly with a single dispatch table because the 
behavior depends on where in the hierarchy the I/O is being handled.

Regards,

Anthony Liguori


* Re: [Qemu-devel] [RFC] Memory API
  2011-05-18 16:26 ` Richard Henderson
@ 2011-05-18 16:37   ` Avi Kivity
  0 siblings, 0 replies; 187+ messages in thread
From: Avi Kivity @ 2011-05-18 16:37 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel

On 05/18/2011 07:26 PM, Richard Henderson wrote:
> On 05/18/2011 06:12 AM, Avi Kivity wrote:
> >  struct MemoryRegionOps {
> >      MemoryReadFunc readb, readw, readl;
> >      MemoryWriteFunc writeb, writew, writel;
> >  };
>
> Look back to last May for a discussion between myself and Paul Brook
> on this subject.  That started with me merely wanting to expand the
> interface to support 8-byte reads/writes, and him wanting a fairly
> substantial reorganization.

I wasn't able to find this conversation, sorry.

> I'm not married to Paul's total reorg, but please include readq/writeq
> support in any reorganization in this area.

Certainly.  Since I'm going with Anthony's single callback plus size, it 
will come naturally.

-- 
error compiling committee.c: too many arguments to function


* Re: [Qemu-devel] [RFC] Memory API
  2011-05-18 16:14               ` Avi Kivity
@ 2011-05-18 16:39                 ` Jan Kiszka
  2011-05-18 16:47                   ` Avi Kivity
  0 siblings, 1 reply; 187+ messages in thread
From: Jan Kiszka @ 2011-05-18 16:39 UTC (permalink / raw)
  To: Avi Kivity; +Cc: qemu-devel

On 2011-05-18 18:14, Avi Kivity wrote:
> On 05/18/2011 07:00 PM, Jan Kiszka wrote:
>> On 2011-05-18 17:42, Avi Kivity wrote:
>>>  On 05/18/2011 06:36 PM, Jan Kiszka wrote:
>>>>>
>>>>>   We need to head for the more hardware-like approach.  What happens when
>>>>>   you program overlapping BARs?  I imagine the result is
>>>>>   implementation-defined, but ends up with one region decoded in
>>>>>   preference to the other.  There is simply no way to reject an
>>>>>   overlapping mapping.
>>>>
>>>>  But there is also no simple way to allow them. At least not without
>>>>  exposing control over their ordering AND allowing managing code (e.g. of
>>>>  the PCI bridge or the chipset) that controls registrations to be hooked up.
>>>
>>>  What about memory_region_add_subregion(..., int priority) as I suggested
>>>  in another message?
>>
>> That's fine, but also requires a change in how, or rather where, devices
>> register their regions.
> 
> Lost you - please elaborate.

Devices currently register against the core; that's nothing your API
automatically changes (though it lays the foundation to do so). But
devices should rather register against the corresponding bus. The bus
(i.e. the device managing it) could then forward the request, stick it
into a subregion, or have it for dinner.

However, we are still in trouble if we want to change that, because
devices can only be on one bus - at least so far.

...
>> I'm specifically thinking of fully trackable slot updates. The clients
>> should not have to maintain the flat layout. They should just receive
>> updates in the form of slot X added/modified/removed. For now, this
>> magic happens multiple times in the clients, and that is very bad.
> 
> Slots don't have any meaning.  You can have a RAM region which is 
> overlaid by a smaller mmio region -> the RAM slot is split into two.
> 
> We should just send clients a list of ranges, and they can associate 
> them with slots themselves.

And that association logic should be as simple as matching a unique
range ID against an existing slot (for updates and deletions) or adding
a new slot for a new range and storing the ID. Anything else will not
noticeably simplify the existing code bases. That's my point.

> 
>> Given that not only memory clients need that view but every TLB miss
>> (in TCG mode) also requires identifying the effective slot, it might
>> be worth preparing a runtime structure at registration time that
>> supports this efficiently - but this time without wasting memory.
> 
> Yes.  Won't be easy though.  Perhaps a perfect hash table for small 
> regions and a sorted-by-size array for large regions.

Keep in mind that TCG will be our benchmark for these changes as it
requires many more accesses than KVM. We must avoid designing the new
infrastructure with an exclusive focus on KVM (and on x86).

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


* Re: [Qemu-devel] [RFC] Memory API
  2011-05-18 16:33         ` Anthony Liguori
@ 2011-05-18 16:41           ` Avi Kivity
  2011-05-18 17:04             ` Anthony Liguori
  2011-05-18 16:42           ` Jan Kiszka
  1 sibling, 1 reply; 187+ messages in thread
From: Avi Kivity @ 2011-05-18 16:41 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Jan Kiszka, qemu-devel

On 05/18/2011 07:33 PM, Anthony Liguori wrote:
> On 05/18/2011 10:23 AM, Avi Kivity wrote:
>>> The tricky part is wiring this up efficiently for TCG, ie. in QEMU's
>>> softmmu. I played with passing the issuing CPUState (or NULL for
>>> devices) down the MMIO handler chain. Not totally beautiful as
>>> decentralized dispatching was still required, but at least only
>>> moderately invasive. Maybe your API allows for cleaning up the
>>> management and dispatching part, need to rethink...
>>
>> My suggestion is opposite - have a different MemoryRegion for each (e.g.
>> CPUState::memory). Then the TLBs will resolve to a different ram_addr_t
>> for the same physical address, for the local APIC range.
>
> I don't understand the different ram_addr_t part.
>

The TLBs map a virtual address to a ram_addr_t.

> The TLB should dispatch to a per-CPU dispatch table.  The per-CPU table 
> should dispatch almost everything to a global dispatch table.
>
> The global dispatch table is the chipset (Northbridge/Southbridge).
>
> The chipset can then dispatch to individual busses which can then 
> further dispatch as appropriate.
>
> Overlapping regions can be handled differently at each level.  For 
> instance, if a PCI device registers an IO region to the same location 
> as the APIC, the APIC always wins because the PCI bus will never see 
> the access.
>

That's inefficient, since you always have to traverse the hierarchy.

> You cannot do this properly with a single dispatch table because the 
> behavior depends on where in the hierarchy the I/O is being handled.

You can.  When you have a TLB miss, you walk the memory hierarchy (which 
is per-cpu) and end up with a ram_addr_t which is stowed in the TLB 
entry.  Further accesses dispatch via this ram_addr_t, without taking 
the cpu into consideration (the TLB is, after all, already per-cpu).

Since each APIC will have its own ram_addr_t, we don't need per-cpu 
dispatch.

-- 
error compiling committee.c: too many arguments to function


* Re: [Qemu-devel] [RFC] Memory API
  2011-05-18 16:33         ` Anthony Liguori
  2011-05-18 16:41           ` Avi Kivity
@ 2011-05-18 16:42           ` Jan Kiszka
  2011-05-18 17:05             ` Avi Kivity
  1 sibling, 1 reply; 187+ messages in thread
From: Jan Kiszka @ 2011-05-18 16:42 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Avi Kivity, qemu-devel

On 2011-05-18 18:33, Anthony Liguori wrote:
> On 05/18/2011 10:23 AM, Avi Kivity wrote:
>>> The tricky part is wiring this up efficiently for TCG, ie. in QEMU's
>>> softmmu. I played with passing the issuing CPUState (or NULL for
>>> devices) down the MMIO handler chain. Not totally beautiful as
>>> decentralized dispatching was still required, but at least only
>>> moderately invasive. Maybe your API allows for cleaning up the
>>> management and dispatching part, need to rethink...
>>
>> My suggestion is opposite - have a different MemoryRegion for each (e.g.
>> CPUState::memory). Then the TLBs will resolve to a different ram_addr_t
>> for the same physical address, for the local APIC range.
> 
> I don't understand the different ram_addr_t part.
> 
> The TLB should dispatch to a per-CPU dispatch table.  The per-CPU table should 
> dispatch almost everything to a global dispatch table.
> 
> The global dispatch table is the chipset (Northbridge/Southbridge).
> 
> The chipset can then dispatch to individual busses which can then 
> further dispatch as appropriate.
> 
> Overlapping regions can be handled differently at each level.  For 
> instance, if a PCI device registers an IO region to the same location as 
> the APIC, the APIC always wins because the PCI bus will never see the 
> access.
> 
> You cannot do this properly with a single dispatch table because the 
> behavior depends on where in the hierarchy the I/O is being handled.

Ah, now I remember why I did not follow that path: Not invasiveness, but
performance concerns. I assume TLB refills have their share in TCG
performance, and adding another lookup layer, probably for every target,
will be measurable. I was wondering if that is worth the, granted,
cleaner design.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


* Re: [Qemu-devel] [RFC] Memory API
  2011-05-18 16:21       ` Avi Kivity
@ 2011-05-18 16:42         ` Jan Kiszka
  2011-05-18 16:49           ` Avi Kivity
  0 siblings, 1 reply; 187+ messages in thread
From: Jan Kiszka @ 2011-05-18 16:42 UTC (permalink / raw)
  To: Avi Kivity; +Cc: qemu-devel

On 2011-05-18 18:21, Avi Kivity wrote:
> On 05/18/2011 06:26 PM, Avi Kivity wrote:
>>> This is about registration.  Right now you can only register IO 
>>> intercepts in the chipset, not on a per-CPU basis.  We could just as 
>>> easily have:
>>>
>>> CPUState {
>>>     MemoryRegion apic_region;
>>> };
>>>
>>> per_cpu_register_memory(env, &env->apic_region);
>>>
>>
>>
>> Right.  Or all memory per-cpu, with two sub-regions:
>>
>>  - global memory
>>  - overlaid apic memory
>>
>> for this, we need to have well defined semantics for overlap (perhaps 
>> a priority argument to memory_region_add_subregion).
> 
> Or even
> 
> cpu_memory_region
>    |
>    +-- global memory map (prio 0)
>    |    |
>    |    +-- RAM (prio 0)
>    |    |
>    |    +-- PCI (prio 1)

It depends on the chipset and its configuration (via PAM, e.g.) which
region takes precedence. Fixed prios do not help here.

>    |
>    +-- SMM memory (if active, prio 1)
>    |
>    +-- APIC memory (if active, prio 2)

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


* Re: [Qemu-devel] [RFC] Memory API
  2011-05-18 16:39                 ` Jan Kiszka
@ 2011-05-18 16:47                   ` Avi Kivity
  2011-05-18 17:07                     ` Jan Kiszka
  2011-05-18 20:13                     ` Richard Henderson
  0 siblings, 2 replies; 187+ messages in thread
From: Avi Kivity @ 2011-05-18 16:47 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: qemu-devel

On 05/18/2011 07:39 PM, Jan Kiszka wrote:
> >>
> >>  That's fine, but also requires a change in how, or rather where, devices
> >>  register their regions.
> >
> >  Lost you - please elaborate.
>
> Devices currently register against the core; that's nothing your API
> automatically changes (though it lays the foundation to do so). But
> devices should rather register against the corresponding bus. The bus
> (i.e. the device managing it) could then forward the request, stick it
> into a subregion, or have it for dinner.

Yes.  We'd change pci_register_bar() to accept a MemoryRegion.
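
Roughly (a sketch; the exact argument list is open):

    void pci_register_bar(PCIDevice *pci_dev, int region_num,
                          uint8_t type, MemoryRegion *mr);

The BAR's size would then come from the region itself.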

> However, we are still in trouble if we want to change that, because
> devices can only be on one bus - at least so far.

Nothing prohibits a device from calling pci_register_bar() for one 
region and some other API for another.

btw, another motivation for this API is for dual ISA/PCI devices.  This 
way most of the work is bus agnostic, with just the actual registration 
being bus specific.

> ...
> >>  I'm specifically thinking of fully trackable slot updates. The clients
> >>  should not have to maintain the flat layout. They should just receive
> >>  updates in the form of slot X added/modified/removed. For now, this
> >>  magic happens multiple times in the clients, and that is very bad.
> >
> >  Slots don't have any meaning.  You can have a RAM region which is
> >  overlaid by a smaller mmio region ->  the RAM slot is split into two.
> >
> >  We should just send clients a list of ranges, and they can associate
> >  them with slots themselves.
>
> And that association logic should be as simple as matching a unique
> range ID against an existing slot (for updates and deletions) or adding
> a new slot for a new range and storing the ID. Anything else will not
> noticeably simplify the existing code bases. That's my point.

We won't have a natural ID.  But I'll see if I can have a library 
calculate the minimum difference between the previous layout and current 
layout.  Should not be too hard.
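
Something along these lines (a sketch; FlatRange is a made-up type and
both views are assumed sorted by start address):

    typedef struct FlatRange {
        target_phys_addr_t start, size;
        MemoryRegion *mr;
    } FlatRange;

    /* walk the old and new flat views in lockstep, reporting differences */
    static void diff_views(FlatRange *old_view, int n_old,
                           FlatRange *new_view, int n_new,
                           void (*del)(FlatRange *), void (*add)(FlatRange *))
    {
        int i = 0, j = 0;

        while (i < n_old || j < n_new) {
            if (j == n_new
                || (i < n_old && old_view[i].start < new_view[j].start)) {
                del(&old_view[i++]);           /* range disappeared */
            } else if (i == n_old || new_view[j].start < old_view[i].start) {
                add(&new_view[j++]);           /* range appeared */
            } else {
                if (old_view[i].size != new_view[j].size
                    || old_view[i].mr != new_view[j].mr) {
                    del(&old_view[i]);         /* range changed: replace */
                    add(&new_view[j]);
                }
                i++;
                j++;
            }
        }
    }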

> >
> >>  Given that not only memory clients need that view but every TLB miss
> >>  (in TCG mode) also requires identifying the effective slot, it might
> >>  be worth preparing a runtime structure at registration time that
> >>  supports this efficiently - but this time without wasting memory.
> >
> >  Yes.  Won't be easy though.  Perhaps a perfect hash table for small
> >  regions and a sorted-by-size array for large regions.
>
> Keep in mind that TCG will be our benchmark for these changes as it
> requires many more accesses than KVM. We must avoid designing the new
> infrastructure with an exclusive focus on KVM (and on x86).

I think TCG will not be affected much since it stores ram_addr_t's in 
its TLBs, so the lookup is only done on a TLB miss.  In fact there's a 
chance for a slight win since we'll probably be able to keep the new 
layout in cache compared to phys_desc.

-- 
error compiling committee.c: too many arguments to function


* Re: [Qemu-devel] [RFC] Memory API
  2011-05-18 16:42         ` Jan Kiszka
@ 2011-05-18 16:49           ` Avi Kivity
  2011-05-18 17:11             ` Anthony Liguori
  0 siblings, 1 reply; 187+ messages in thread
From: Avi Kivity @ 2011-05-18 16:49 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: qemu-devel

On 05/18/2011 07:42 PM, Jan Kiszka wrote:
> On 2011-05-18 18:21, Avi Kivity wrote:
> >  On 05/18/2011 06:26 PM, Avi Kivity wrote:
> >>>  This is about registration.  Right now you can only register IO
> >>>  intercepts in the chipset, not on a per-CPU basis.  We could just as
> >>>  easily have:
> >>>
> >>>  CPUState {
> >>>      MemoryRegion apic_region;
> >>>  };
> >>>
> >>>  per_cpu_register_memory(env,&env->apic_region);
> >>>
> >>
> >>
> >>  Right.  Or all memory per-cpu, with two sub-regions:
> >>
> >>   - global memory
> >>   - overlaid apic memory
> >>
> >>  for this, we need to have well defined semantics for overlap (perhaps
> >>  a priority argument to memory_region_add_subregion).
> >
> >  Or even
> >
> >  cpu_memory_region
> >     |
> >     +-- global memory map (prio 0)
> >     |    |
> >     |    +-- RAM (prio 0)
> >     |    |
> >     |    +-- PCI (prio 1)
>
> It depends on the chipset and its configuration (via PAM, e.g.) which
> region takes precedence. Fixed prios do not help here.

The priorities are determined by the device code.

-- 
error compiling committee.c: too many arguments to function


* Re: [Qemu-devel] [RFC] Memory API
  2011-05-18 16:06     ` Avi Kivity
@ 2011-05-18 16:51       ` Richard Henderson
  2011-05-18 16:53         ` Avi Kivity
  0 siblings, 1 reply; 187+ messages in thread
From: Richard Henderson @ 2011-05-18 16:51 UTC (permalink / raw)
  To: Avi Kivity; +Cc: qemu-devel

On 05/18/2011 09:06 AM, Avi Kivity wrote:
>> If we keep per-access type function pointers, they should use individual prototypes
>> for the different access types:
>>
>> typedef uint8_t  (*MemoryReadbFunc)(MemoryRegion *mr, target_phys_addr_t addr);
>> typedef uint16_t (*MemoryReadwFunc)(MemoryRegion *mr, target_phys_addr_t addr);
>> typedef uint32_t (*MemoryReadlFunc)(MemoryRegion *mr, target_phys_addr_t addr);
>> typedef uint64_t (*MemoryReadllFunc)(MemoryRegion *mr, target_phys_addr_t addr);
>> ...
> 
> I prefer having size as an argument.

The one thing that makes having these function pointers split apart nice
is that it makes it easy to set policy for what different sized reads do.

E.g. for devices for which only 4 byte reads are defined, you only fill
in readl, and let the other sizes cause a machine-check.

Alternately, for devices for which the fundamental size is 1 byte, but
which do handle larger reads in a more-or-less memory-like fashion,
you can fill in a common readw_via_readb function that does the
composition for you, without having code for that scattered
through every device.
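
E.g. (a sketch against the split-pointer prototypes; assumes little-endian
lane order):

    static uint16_t readw_via_readb(MemoryRegion *mr, target_phys_addr_t addr)
    {
        return mr->ops->readb(mr, addr)
               | (mr->ops->readb(mr, addr + 1) << 8);
    }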


r~


* Re: [Qemu-devel] [RFC] Memory API
  2011-05-18 16:51       ` Richard Henderson
@ 2011-05-18 16:53         ` Avi Kivity
  2011-05-18 17:03           ` Richard Henderson
  0 siblings, 1 reply; 187+ messages in thread
From: Avi Kivity @ 2011-05-18 16:53 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel

On 05/18/2011 07:51 PM, Richard Henderson wrote:
> >
> >  I prefer having size as an argument.
>
> The one thing that makes having these function pointers split apart nice
> is that it makes it easy to set policy for what different sized reads do.
>
> E.g. for devices for which only 4 byte reads are defined, you only fill
> in readl, and let the other sizes cause a machine-check.
>
> Alternately, for devices for which the fundamental size is 1 byte, but
> which do handle larger reads in a more-or-less memory-like fashion,
> you can fill in a common readw_via_readb function that does the
> composition for you, and without having to have code for that scattered
> through every device.

I plan to centralize this logic in the memory API.  You'll just declare 
how your device behaves.

-- 
error compiling committee.c: too many arguments to function


* Re: [Qemu-devel] [RFC] Memory API
  2011-05-18 16:53         ` Avi Kivity
@ 2011-05-18 17:03           ` Richard Henderson
  0 siblings, 0 replies; 187+ messages in thread
From: Richard Henderson @ 2011-05-18 17:03 UTC (permalink / raw)
  To: Avi Kivity; +Cc: qemu-devel

On 05/18/2011 09:53 AM, Avi Kivity wrote:
> I plan to centralize this logic in the memory API.  You'll just declare how your device behaves.

Excellent.  Those were my only two concerns then.


r~


* Re: [Qemu-devel] [RFC] Memory API
  2011-05-18 16:41           ` Avi Kivity
@ 2011-05-18 17:04             ` Anthony Liguori
  2011-05-18 17:13               ` Avi Kivity
  0 siblings, 1 reply; 187+ messages in thread
From: Anthony Liguori @ 2011-05-18 17:04 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Jan Kiszka, qemu-devel

On 05/18/2011 11:41 AM, Avi Kivity wrote:
> On 05/18/2011 07:33 PM, Anthony Liguori wrote:
>> On 05/18/2011 10:23 AM, Avi Kivity wrote:
>>>> The tricky part is wiring this up efficiently for TCG, ie. in QEMU's
>>>> softmmu. I played with passing the issuing CPUState (or NULL for
>>>> devices) down the MMIO handler chain. Not totally beautiful as
>>>> decentralized dispatching was still required, but at least only
>>>> moderately invasive. Maybe your API allows for cleaning up the
>>>> management and dispatching part, need to rethink...
>>>
>>> My suggestion is opposite - have a different MemoryRegion for each (e.g.
>>> CPUState::memory). Then the TLBs will resolve to a different ram_addr_t
>>> for the same physical address, for the local APIC range.
>>
>> I don't understand the different ram_addr_t part.
>>
>
> The TLBs map a virtual address to a ram_addr_t.

It actually maps virtual addresses to host virtual addresses.  Virtual 
addresses that map to I/O memory never get stored in the TLB.

You don't need separate I/O registration addresses in order to do 
per-CPU dispatch provided that you route the dispatch routines through 
the CPUs first.

>> Overlapping regions can be handled differently at each level. For
>> instance, if a PCI device registers an IO region to the same location
>> as the APIC, the APIC always wins because the PCI bus will never see
>> the access.
>>
>
> That's inefficient, since you always have to traverse the hierarchy.

Is efficiency really a problem here?  Besides, I don't think that's 
really correct.  You're adding at most 2-3 extra function pointer 
invocations.  I don't think you can really call that inefficient.

>> You cannot do this properly with a single dispatch table because the
>> behavior depends on where in the hierarchy the I/O is being handled.
>
> You can. When you have a TLB miss, you walk the memory hierarchy (which
> is per-cpu) and end up with a ram_addr_t which is stowed in the TLB
> entry.

I think we're overloading the term TLB.  Are you referring to 
l1_phys_map as the TLB because I thought Jan was referring to the actual 
emulated TLB that TCG uses?

> Further accesses dispatch via this ram_addr_t, without taking the
> cpu into consideration (the TLB is, after all, already per-cpu).
>
> Since each APIC will have its own ram_addr_t, we don't need per-cpu
> dispatch.

You need to have per-CPU l1_phys_maps which would result in quite a lot 
of additional memory overhead.

Regards,

Anthony Liguori


* Re: [Qemu-devel] [RFC] Memory API
  2011-05-18 16:42           ` Jan Kiszka
@ 2011-05-18 17:05             ` Avi Kivity
  0 siblings, 0 replies; 187+ messages in thread
From: Avi Kivity @ 2011-05-18 17:05 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: qemu-devel

On 05/18/2011 07:42 PM, Jan Kiszka wrote:
> >
> >  You cannot do this properly with a single dispatch table because the
> >  behavior depends on where in the hierarchy the I/O is being handled.
>
> Ah, now I remember why I did not follow that path: Not invasiveness, but
> performance concerns. I assume TLB refills have their share in TCG
> performance, and adding another lookup layer, probably for every target,
> will be measurable. I was wondering if that is worth the, granted,
> cleaner design.

We can have a per-cpu hash table and flattened slots list, though that 
seems wasteful.  I agree that a tree walk is too expensive for a tlb miss.

We'll start, however, with the existing phys_desc, so no performance 
concerns, and no per-cpu APIC address either (btw it is broken in kvm as 
well, and hard to fix - we don't want per-cpu page tables).

-- 
error compiling committee.c: too many arguments to function


* Re: [Qemu-devel] [RFC] Memory API
  2011-05-18 16:47                   ` Avi Kivity
@ 2011-05-18 17:07                     ` Jan Kiszka
  2011-05-18 17:15                       ` Avi Kivity
  2011-05-18 20:13                     ` Richard Henderson
  1 sibling, 1 reply; 187+ messages in thread
From: Jan Kiszka @ 2011-05-18 17:07 UTC (permalink / raw)
  To: Avi Kivity; +Cc: qemu-devel

On 2011-05-18 18:47, Avi Kivity wrote:
>>>>  I'm specifically thinking of fully trackable slot updates. The clients
>>>>  should not have to maintain the flat layout. They should just receive
>>>>  updates in the form of slot X added/modified/removed. For now, this
>>>>  magic happens multiple times in the clients, and that is very bad.
>>>
>>>  Slots don't have any meaning.  You can have a RAM region which is
>>>  overlaid by a smaller mmio region ->  the RAM slot is split into two.
>>>
>>>  We should just send clients a list of ranges, and they can associate
>>>  them with slots themselves.
>>
>> And that association logic should be as simple as matching a unique
>> range ID against an existing slot (for updates and deletions) or adding
>> a new slot for a new range and storing the ID. Anything else will not
>> noticeably simplify the existing code bases. That's my point.
> 
> We won't have a natural ID.

The address of the data structure describing a region could be such a
thing. Provided, of course, we prepare a flattened view ahead of time, not
on the fly.

>  But I'll see if I can have a library
> calculate the minimum difference between the previous layout and current 
> layout.  Should not be too hard.

We need an exact match on the client side with the old range. E.g. if you
put a new region over an existing one, effectively splitting it into two
halves, the core has to
 - shrink the existing range to form the first half
 - register two new ranges to reflect the rest

On unregistering of that overlap, we need the reverse. So all the client
has to do then is to decide if it is interested in some range, and then
store it internally with some additional data (and process it, of
course). No more merging, no more overlap detection and splitting up at
client level.
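
For concreteness (addresses made up): with RAM registered at [0x0, 0x100000)
and an MMIO overlay added at [0x4000, 0x5000), the core would report

  update: RAM  [0x0,    0x4000)    (shrunk first half)
  add:    MMIO [0x4000, 0x5000)    (the overlay)
  add:    RAM  [0x5000, 0x100000)  (rest of the split region)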

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


* Re: [Qemu-devel] [RFC] Memory API
  2011-05-18 16:49           ` Avi Kivity
@ 2011-05-18 17:11             ` Anthony Liguori
  2011-05-18 17:38               ` Jan Kiszka
  0 siblings, 1 reply; 187+ messages in thread
From: Anthony Liguori @ 2011-05-18 17:11 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Jan Kiszka, qemu-devel

On 05/18/2011 11:49 AM, Avi Kivity wrote:
> On 05/18/2011 07:42 PM, Jan Kiszka wrote:
>> On 2011-05-18 18:21, Avi Kivity wrote:
>> > On 05/18/2011 06:26 PM, Avi Kivity wrote:
>> >>> This is about registration. Right now you can only register IO
>> >>> intercepts in the chipset, not on a per-CPU basis. We could just as
>> >>> easily have:
>> >>>
>> >>> CPUState {
>> >>> MemoryRegion apic_region;
>> >>> };
>> >>>
>> >>> per_cpu_register_memory(env,&env->apic_region);
>> >>>
>> >>
>> >>
>> >> Right. Or all memory per-cpu, with two sub-regions:
>> >>
>> >> - global memory
>> >> - overlaid apic memory
>> >>
>> >> for this, we need to have well defined semantics for overlap (perhaps
>> >> a priority argument to memory_region_add_subregion).
>> >
>> > Or even
>> >
>> > cpu_memory_region
>> > |
>> > +-- global memory map (prio 0)
>> > | |
>> > | +-- RAM (prio 0)
>> > | |
>> > | +-- PCI (prio 1)
>>
>> It depends on the chipset and its configuration (via PAM, e.g.) which
>> region takes precedence. Fixed prios do not help here.

It's really layering.

To implement PAM in a robust way, you need a certain set of memory 
accesses to first flow through the chipset before going to the next 
location with the ability to intercept.

We do something rather weird today by changing registrations, first 
saving the current registrations.  It would be much more elegant to just 
intercept the I/O requests and redirect accordingly.
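
E.g. a chipset-level PAM region that forwards accordingly (a sketch; names
are made up, and it glosses over the fact that RAM/ROM would normally be
mapped directly rather than read through callbacks):

    typedef struct PAMRegion {
        MemoryRegion mr;          /* must be first for the cast below */
        bool reads_go_to_ram;     /* mirrors the PAM register setting */
        MemoryRegion *ram, *rom;
    } PAMRegion;

    static uint32_t pam_readl(MemoryRegion *mr, target_phys_addr_t addr)
    {
        PAMRegion *pam = (PAMRegion *)mr;
        MemoryRegion *target = pam->reads_go_to_ram ? pam->ram : pam->rom;

        return target->ops->readl(target, addr);
    }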

Regards,

Anthony Liguori

>
> The priorities are determined by the device code.
>


* Re: [Qemu-devel] [RFC] Memory API
  2011-05-18 17:04             ` Anthony Liguori
@ 2011-05-18 17:13               ` Avi Kivity
  0 siblings, 0 replies; 187+ messages in thread
From: Avi Kivity @ 2011-05-18 17:13 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Jan Kiszka, qemu-devel

On 05/18/2011 08:04 PM, Anthony Liguori wrote:
>> The TLBs map a virtual address to a ram_addr_t.
>
>
> It actually maps virtual address to host virtual addresses.  Virtual 
> addresses that map to I/O memory never get stored in the TLB.

They are stored.  Look at glue(io_read, SUFFIX) and its caller for example.

>
> You don't need separate I/O registration addresses in order to do 
> per-CPU dispatch provided that you route the dispatch routines through 
> the CPUs first.

Right, but you pay for an extra lookup which always fails.

>>> Overlapping regions can be handled differently at each level. For
>>> instance, if a PCI device registers an IO region to the same location
>>> as the APIC, the APIC always wins because the PCI bus will never see
>>> the access.
>>>
>>
>> That's inefficient, since you always have to traverse the hierarchy.
>
> Is efficiency really a problem here?  Besides, I don't think that's 
> really correct.  You're adding at most 2-3 extra function pointer 
> invocations.  I don't think you can really call that inefficient.

Well, you'll need something special for SMM as well.  A per-cpu memory map 
solves all that neatly.

>
>>> You cannot do this properly with a single dispatch table because the
>>> behavior depends on where in the hierarchy the I/O is being handled.
>>
>> You can. When you have a TLB miss, you walk the memory hierarchy (which
>> is per-cpu) and end up with a ram_addr_t which is stowed in the TLB
>> entry.
>
> I think we're overloading the term TLB.  Are you referring to 
> l1_phys_map as the TLB because I thought Jan was referring to the 
> actual emulated TLB that TCG uses?

env->iotlb.  For some reason I thought this was folded into 
env->tlb_table, but it isn't (should it be? saves a lookup).

>
>> Further accesses dispatch via this ram_addr_t, without taking the
>> cpu into consideration (the TLB is, after all, already per-cpu).
>>
>> Since each APIC will have its own ram_addr_t, we don't need per-cpu
>> dispatch.
>
> You need to have per-CPU l1_phys_maps which would result in quite a 
> lot of additional memory overhead.

This is predicated on a better lookup method, yes.

-- 
error compiling committee.c: too many arguments to function


* Re: [Qemu-devel] [RFC] Memory API
  2011-05-18 17:07                     ` Jan Kiszka
@ 2011-05-18 17:15                       ` Avi Kivity
  2011-05-18 17:40                         ` Jan Kiszka
  0 siblings, 1 reply; 187+ messages in thread
From: Avi Kivity @ 2011-05-18 17:15 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: qemu-devel

On 05/18/2011 08:07 PM, Jan Kiszka wrote:
> On 2011-05-18 18:47, Avi Kivity wrote:
> >>>>   I'm specifically thinking of fully trackable slot updates. The clients
> >>>>   should not have to maintain the flat layout. They should just receive
> >>>>   updates in the form of slot X added/modified/removed. For now, this
> >>>>   magic happens multiple times in the clients, and that is very bad.
> >>>
> >>>   Slots don't have any meaning.  You can have a RAM region which is
> >>>   overlaid by a smaller mmio region ->   the RAM slot is split into two.
> >>>
> >>>   We should just send clients a list of ranges, and they can associate
> >>>   them with slots themselves.
> >>
> >>  And that association logic should be as simple as matching a unique
> >>  range ID against an existing slot (for updates and deletions) or adding
> >>  a new slot for a new range and storing the ID. Anything else will not
> >>  noticeably simplify the existing code bases. That's my point.
> >
> >  We won't have a natural ID.
>
> The address of the data structure describing a region could be such a
> thing. Provided, of course, we prepare a flattened view ahead of time, not
> on the fly.

It will change as soon as the memory map changes.

> >   But I'll see if I can have a library
> >  calculate the minimum difference between the previous layout and current
> >  layout.  Should not be too hard.
>
> We need an exact match on the client side with the old range. E.g. if you
> put a new region over an existing one, effectively splitting it into two
> halves, the core has to
>   - shrink the existing range to form the first half
>   - register two new ranges to reflect the rest
>
> On unregistering of that overlap, we need the reverse. So all the client
> has to do then is to decide if it is interested in some range, and then
> store it internally with some additional data (and process it, of
> course). No more merging, no more overlap detection and splitting up at
> client level.

Right.  We do a symmetric set difference between the old and new maps 
and let the client know what has changed.

-- 
error compiling committee.c: too many arguments to function


* Re: [Qemu-devel] [RFC] Memory API
  2011-05-18 13:12 [Qemu-devel] [RFC] Memory API Avi Kivity
                   ` (3 preceding siblings ...)
  2011-05-18 16:26 ` Richard Henderson
@ 2011-05-18 17:17 ` Avi Kivity
  2011-05-18 19:40 ` Jan Kiszka
  5 siblings, 0 replies; 187+ messages in thread
From: Avi Kivity @ 2011-05-18 17:17 UTC (permalink / raw)
  To: qemu-devel, KVM list

Copying kvm@.

On 05/18/2011 04:12 PM, Avi Kivity wrote:
> The current memory APIs (cpu_register_io_memory, 
> cpu_register_physical_memory) suffer from a number of drawbacks:
>
> - lack of centralized bookkeeping
>    - a cpu_register_physical_memory() that overlaps an existing region 
> will overwrite the preexisting region; but a following 
> cpu_unregister_physical_memory() will not restore things to status quo 
> ante.
>    - coalescing and dirty logging are all cleared on unregistering, so 
> the client has to re-issue them when re-registering
> - lots of opaques
> - no nesting
>    - if a region has multiple subregions that need different handling 
> (different callbacks, RAM interspersed with MMIO) then client code has 
> to deal with that manually
>    - we can't interpose code between a device and global memory handling
>
> To fix that, I propose a new API to replace the existing one:
>
>
> #ifndef MEMORY_H
> #define MEMORY_H
>
> typedef struct MemoryRegionOps MemoryRegionOps;
> typedef struct MemoryRegion MemoryRegion;
>
> typedef uint32_t (*MemoryReadFunc)(MemoryRegion *mr, 
> target_phys_addr_t addr);
> typedef void (*MemoryWriteFunc)(MemoryRegion *mr, target_phys_addr_t 
> addr,
>                                 uint32_t data);
>
> struct MemoryRegionOps {
>     MemoryReadFunc readb, readw, readl;
>     MemoryWriteFunc writeb, writew, writel;
> };
>
> struct MemoryRegion {
>     const MemoryRegionOps *ops;
>     target_phys_addr_t size;
>     target_phys_addr_t addr;
> };
>
> void memory_region_init(MemoryRegion *mr,
>                         target_phys_addr_t size);
> void memory_region_init_io(MemoryRegion *mr,
>                            const MemoryRegionOps *ops,
>                            target_phys_addr_t size);
> void memory_region_init_ram(MemoryRegion *mr,
>                             target_phys_addr_t size);
> void memory_region_init_ram_ptr(MemoryRegion *mr,
>                                 target_phys_addr_t size,
>                                 void *ptr);
> void memory_region_destroy(MemoryRegion *mr);
> void memory_region_set_offset(MemoryRegion *mr, target_phys_addr_t 
> offset);
> void memory_region_set_log(MemoryRegion *mr, bool log);
> void memory_region_clear_coalescing(MemoryRegion *mr);
> void memory_region_add_coalescing(MemoryRegion *mr,
>                                   target_phys_addr_t offset,
>                                   target_phys_addr_t size);
>
> void memory_region_add_subregion(MemoryRegion *mr,
>                                  target_phys_addr_t offset,
>                                  MemoryRegion *subregion);
> void memory_region_del_subregion(MemoryRegion *mr,
>                                  target_phys_addr_t offset,
>                                  MemoryRegion *subregion);
>
> void cpu_register_memory_region(MemoryRegion *mr, target_phys_addr_t 
> addr);
> void cpu_unregister_memory_region(MemoryRegion *mr);
>
> #endif
>
> The API is nested: you can define, say, a PCI BAR containing RAM and 
> MMIO, and give it to the PCI subsystem.  PCI can then enable/disable 
> the BAR and move it to different addresses without calling any 
> callbacks; the client code can enable or disable logging or coalescing 
> without caring if the BAR is mapped or not.  For example:
>
>   MemoryRegion mr, mr_mmio, mr_ram;
>
>   memory_region_init(&mr);
>   memory_region_init_io(&mr_mmio, &mmio_ops, 0x1000);
>   memory_region_init_ram(&mr_ram, 0x100000);
>   memory_region_add_subregion(&mr, 0, &mr_ram);
>   memory_region_add_subregion(&mr, 0x10000, &mr_mmio);
>   memory_region_add_coalescing(&mr_ram, 0, 0x100000);
>   pci_register_bar(&pci_dev, 0, &mr);
>
> at this point the PCI subsystem knows everything about the BAR and can 
> enable or disable it, or move it around, without further help from the 
> device code.  On the other hand, the device code can change logging or 
> coalescing, or even change the structure of the region, without caring 
> about whether the region is currently registered or not.
>
> If we can agree on the API, then I think the way forward is to 
> implement it in terms of the old API, change over all devices, then 
> fold the old API into the new one.
>


-- 
error compiling committee.c: too many arguments to function



* Re: [Qemu-devel] [RFC] Memory API
  2011-05-18 17:11             ` Anthony Liguori
@ 2011-05-18 17:38               ` Jan Kiszka
  0 siblings, 0 replies; 187+ messages in thread
From: Jan Kiszka @ 2011-05-18 17:38 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Avi Kivity, qemu-devel

On 2011-05-18 19:11, Anthony Liguori wrote:
> On 05/18/2011 11:49 AM, Avi Kivity wrote:
>> On 05/18/2011 07:42 PM, Jan Kiszka wrote:
>>> On 2011-05-18 18:21, Avi Kivity wrote:
>>>> On 05/18/2011 06:26 PM, Avi Kivity wrote:
>>>>>> This is about registration. Right now you can only register IO
>>>>>> intercepts in the chipset, not on a per-CPU basis. We could just as
>>>>>> easily have:
>>>>>>
>>>>>> CPUState {
>>>>>> MemoryRegion apic_region;
>>>>>> };
>>>>>>
>>>>>> per_cpu_register_memory(env,&env->apic_region);
>>>>>>
>>>>>
>>>>>
>>>>> Right. Or all memory per-cpu, with two sub-regions:
>>>>>
>>>>> - global memory
>>>>> - overlaid apic memory
>>>>>
>>>>> for this, we need to have well defined semantics for overlap (perhaps
>>>>> a priority argument to memory_region_add_subregion).
>>>>
>>>> Or even
>>>>
>>>> cpu_memory_region
>>>> |
>>>> +-- global memory map (prio 0)
>>>> | |
>>>> | +-- RAM (prio 0)
>>>> | |
>>>> | +-- PCI (prio 1)
>>>
>>> It depends on the chipset and its configuration (via PAM, e.g.) which
>>> region takes precedence. Fixed prios do not help here.
> 
> It's really layering.
> 
> To implement PAM in a robust way, you need a certain set of memory 
> accesses to first flow through the chipset before going to the next 
> location with the ability to intercept.
> 
> We do something rather weird today by changing registrations, first 
> saving the current registrations.  It would be much more elegant to just 
> intercept the I/O requests and redirect accordingly.

That's what I implemented already, though using the current API with
some tweaks (filtering PhysMemClient) and then facing massive slot
fragmentation problems on the KVM side.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


* Re: [Qemu-devel] [RFC] Memory API
  2011-05-18 17:15                       ` Avi Kivity
@ 2011-05-18 17:40                         ` Jan Kiszka
  0 siblings, 0 replies; 187+ messages in thread
From: Jan Kiszka @ 2011-05-18 17:40 UTC (permalink / raw)
  To: Avi Kivity; +Cc: qemu-devel

On 2011-05-18 19:15, Avi Kivity wrote:
> On 05/18/2011 08:07 PM, Jan Kiszka wrote:
>> On 2011-05-18 18:47, Avi Kivity wrote:
>>>>>>   I'm specifically thinking of fully trackable slot updates. The clients
>>>>>>   should not have to maintain the flat layout. They should just receive
>>>>>>   updates in the form of slot X added/modified/removed. For now, this
>>>>>>   magic happens multiple times in the clients, and that is very bad.
>>>>>
>>>>>   Slots don't have any meaning.  You can have a RAM region which is
>>>>>   overlaid by a smaller mmio region ->   the RAM slot is split into two.
>>>>>
>>>>>   We should just send clients a list of ranges, and they can associate
>>>>>   them with slots themselves.
>>>>
>>>>  And that association logic should be as simple as matching a unique
>>>>  range ID against an existing slot (for updates and deletions) or adding
>>>>  a new slot for a new range and storing the ID. Anything else will not
>>>>  allow simplifying the existing code bases noticeably. That's my point.
>>>
>>>  We won't have a natural ID.
>>
>> The address of the data structure describing a region could be such a
>> thing. Provided, of course, we prepare a flattened view ahead of time, not
>> on the fly.
> 
> It will change as soon as the memory map changes.

...which is supposed to be properly reported to the client beforehand.

> 
>>>   But I'll see if I can have a library
>>>  calculate the minimum difference between the previous layout and current
>>>  layout.  Should not be too hard.
>>
>> We need exact match on the client side with the old range. E.g. if you
>> put a new region over an existing one, effectively splitting it into two
>> halves, the core has to
>>   - shrink the existing range to form the first half
>>   - register two new ranges to reflect the rest
>>
>> On unregistering of that overlap, we need the reverse. So all the client
>> has to do then is to decide if it is interested in some range, and then
>> store it internally with some additional data (and process it, of
>> course). No more merging, no more overlap detection and splitting up at
>> client level.
> 
> Right.  We do a symmetric set difference between the old and new maps 
> and let the client know what has changed.

That would be fine as well. Then the internal representation could be
anything, though I bet a flattened form would have its advantages.
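
A minimal sketch of how such a diff pass could look (the types and
callback names here are invented for illustration, not part of the
proposed API):

typedef struct FlatRange {
    target_phys_addr_t addr, size;
    MemoryRegion *mr;                  /* region backing this range */
} FlatRange;

typedef struct MemoryClient {
    void (*range_add)(struct MemoryClient *c, FlatRange *fr);
    void (*range_del)(struct MemoryClient *c, FlatRange *fr);
} MemoryClient;

/* Both maps are sorted by address and contain disjoint ranges;
 * report the symmetric difference as del/add events. */
static void flat_map_diff(FlatRange *old, int n_old,
                          FlatRange *new, int n_new,
                          MemoryClient *client)
{
    int i = 0, j = 0;

    while (i < n_old || j < n_new) {
        if (j == n_new || (i < n_old && old[i].addr < new[j].addr)) {
            client->range_del(client, &old[i++]);
        } else if (i == n_old || new[j].addr < old[i].addr) {
            client->range_add(client, &new[j++]);
        } else if (old[i].size != new[j].size || old[i].mr != new[j].mr) {
            client->range_del(client, &old[i++]);  /* changed in place */
            client->range_add(client, &new[j++]);
        } else {
            i++, j++;                              /* unchanged */
        }
    }
}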

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-18 15:30         ` Jan Kiszka
@ 2011-05-18 19:10           ` Anthony Liguori
  2011-05-18 19:27             ` Jan Kiszka
                               ` (2 more replies)
  0 siblings, 3 replies; 187+ messages in thread
From: Anthony Liguori @ 2011-05-18 19:10 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Peter Maydell, Avi Kivity, qemu-devel

On 05/18/2011 10:30 AM, Jan Kiszka wrote:
> On 2011-05-18 17:17, Peter Maydell wrote:
>> On 18 May 2011 16:11, Jan Kiszka<jan.kiszka@siemens.com>  wrote:
>>> On 2011-05-18 16:36, Avi Kivity wrote:
>>>> There is nothing we can do with a return code.  You can't fail an mmio
>>>> that causes an overlapping physical memory map.
>>>
>>> We must fail such requests to make progress with the API. That may
>>> happen either on the caller side or in cpu_register_memory_region itself
>>> (hwerror). Otherwise the new API will just be a shiny new facade for an
>>> old and still fragile building.
>>
>> If we don't allow overlapping regions, then how do you implement
>> things like "on startup board maps ROM into lower addresses
>> over top of devices, but later it is unmapped and you can see
>> the underlying devices" ? (You can't currently do this AFAIK,
>> and it would be nice if the new API supported it.)
>
>> Right, we can't do this properly, and that's why the attempt in the
> i440fx chipset model is so horribly broken ATM.
>
> Just allowing overlapping does not solve this problem either. It does
> not specify what region is on top and what is below (even worse if
> multiple regions overlap at the same place).
>
> We need some managing instance here, and that is e.g. the chipset that
> provides control over the overlap in reality. It could hook up a
> PhysMemClient to receive and redirect updates to subregions, or only
> allow registering them in a disabled state.

I think that gets ugly pretty fast.  The way this works IRL is that all 
I/O dispatches pass through the chipset.  You literally need something 
as simple as:

static void i440fx_io_intercept(void *opaque, uint64_t addr, uint32_t 
value, int size, MemRegion *next)
{
     I440FX *s = opaque;

     if (range_overlaps(addr, size, PAM_REGION)) {
         ...
     } else {
         dispatch_io(next, addr, value, size);
     }
}

There's no need for an explicit intercept mechanism if you make multiple 
levels have their own dispatch tables and register progressively larger 
regions.  In fact....

You really don't need to register 90% of the time.  In the case of a PC 
with i440fx, it's really quite simple:

if an I/O is to the APIC page,
    it's handled by the APIC
elif the I/O is in ROM regions:
    if PAM RE or WE
       redirect to RAM appropriately
    else:
       send to ROMs
elif the I/O is in the PCI windows:
    send to the PCI bus
else:
    send to the PIIX3

For x86, you could easily completely skip the explicit registration and 
just have a direct dispatch to the i440fx and implement something 
slightly more fancy than the above.

And I think this is true for most other types of boards too.

Regards,

Anthony Liguori

>
> Jan
>

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-18 19:10           ` Anthony Liguori
@ 2011-05-18 19:27             ` Jan Kiszka
  2011-05-18 19:34               ` Anthony Liguori
  2011-05-19  8:26               ` Gleb Natapov
  2011-05-19  6:31             ` Jan Kiszka
  2011-05-19  8:09             ` Avi Kivity
  2 siblings, 2 replies; 187+ messages in thread
From: Jan Kiszka @ 2011-05-18 19:27 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Peter Maydell, Avi Kivity, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 3571 bytes --]

On 2011-05-18 21:10, Anthony Liguori wrote:
> On 05/18/2011 10:30 AM, Jan Kiszka wrote:
>> On 2011-05-18 17:17, Peter Maydell wrote:
>>> On 18 May 2011 16:11, Jan Kiszka<jan.kiszka@siemens.com>  wrote:
>>>> On 2011-05-18 16:36, Avi Kivity wrote:
>>>>> There is nothing we can do with a return code.  You can't fail an mmio
>>>>> that causes an overlapping physical memory map.
>>>>
>>>> We must fail such requests to make progress with the API. That may
>>>> happen either on the caller side or in cpu_register_memory_region itself
>>>> (hwerror). Otherwise the new API will just be a shiny new facade for an
>>>> old and still fragile building.
>>>
>>> If we don't allow overlapping regions, then how do you implement
>>> things like "on startup board maps ROM into lower addresses
>>> over top of devices, but later it is unmapped and you can see
>>> the underlying devices" ? (You can't currently do this AFAIK,
>>> and it would be nice if the new API supported it.)
>>
>> Right, we can't do this properly, and that's why the attempt in the
>> i440fx chipset model is so horribly broken ATM.
>>
>> Just allowing overlapping does not solve this problem either. It does
>> not specify what region is on top and what is below (even worse if
>> multiple regions overlap at the same place).
>>
>> We need some managing instance here, and that is e.g. the chipset that
>> provides control over the overlap in reality. It could hook up a
>> PhysMemClient to receive and redirect updates to subregions, or only
>> allow registering them in a disabled state.
> 
> I think that gets ugly pretty fast.  The way this works IRL is that all
> I/O dispatches pass through the chipset.  You literally need something
> as simple as:
> 
> static void i440fx_io_intercept(void *opaque, uint64_t addr, uint32_t
> value, int size, MemRegion *next)
> {
>     I440FX *s = opaque;
> 
>     if (range_overlaps(addr, size, PAM_REGION)) {
>         ...
>     } else {
>         dispatch_io(next, addr, value, size);
>     }
> }
> 
> There's no need for an explicit intercept mechanism if you make multiple
> levels have their own dispatch tables and register progressively larger
> regions.  In fact....
> 
> You really don't need to register 90% of the time.  In the case of a PC
> with i440fx, it's really quite simple:
> 
> if an I/O is to the APIC page,
>    it's handled by the APIC

That's not that simple. We need to tell apart:
 - if a cpu issued the request, and which one => forward to APIC
 - if the range was addressed by a device (PCI or other system bus
   devices) => forward to MSI or other MMIO handlers

> elif the I/O is in ROM regions:
>    if PAM RE or WE
>       redirect to RAM appropriately
>    else:
>       send to ROMs
> elif the I/O is in the PCI windows:
>    send to the PCI bus
> else:
>    send to the PIIX3
> 
> For x86, you could easily completely skip the explicit registration and
> just have a direct dispatch to the i440fx and implement something
> slightly more fancy than the above.
> 
> And I think this is true for most other types of boards too.

This all boils down to the point that we need to stop accepting memory
region mappings from everywhere at the core level and instead dispatch
them up the device tree properly. A device should register against its
bus, which can then forward or handle the mapping directly. And that
handling requires a central tool box to avoid reinventing wheels.

That's a worthwhile change, though a much bigger one than I was
originally hoping to get away with.

Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 259 bytes --]

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-18 19:27             ` Jan Kiszka
@ 2011-05-18 19:34               ` Anthony Liguori
  2011-05-18 20:02                 ` Alex Williamson
  2011-05-18 20:07                 ` Jan Kiszka
  2011-05-19  8:26               ` Gleb Natapov
  1 sibling, 2 replies; 187+ messages in thread
From: Anthony Liguori @ 2011-05-18 19:34 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Peter Maydell, Avi Kivity, qemu-devel

On 05/18/2011 02:27 PM, Jan Kiszka wrote:
> On 2011-05-18 21:10, Anthony Liguori wrote:
>> On 05/18/2011 10:30 AM, Jan Kiszka wrote:
>> You really don't need to register 90% of the time.  In the case of a PC
>> with i440fx, it's really quite simple:
>>
>> if an I/O is to the APIC page,
>>     it's handled by the APIC
>
> That's not that simple. We need to tell apart:
>   - if a cpu issued the request, and which one =>  forward to APIC

Right, but what I'm saying is that this logic lives in 
kvm-all.c:kvm_run():case EXIT_MMIO.

Obviously for TCG, it's a bit more complicated but this should be 
handled way before there's any kind of general dispatch.

>   - if the range was addressed by a device (PCI or other system bus
>     devices) =>  forward to MSI or other MMIO handlers

The same is true for MSI.  Other MMIO handlers can be handled as 
appropriate.  For instance, once an I/O is sent to the PCI bus, you can 
walk each PCI device's BAR list to figure out which device owns the I/O 
event.

For ISA, it's a little trickier since ISA doesn't do positive decoding. 
  You need each ISA device to declare what I/O addresses it handles.
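
As a rough sketch of that walk (the bus and region fields are only
illustrative, not necessarily what pci.c really looks like):

/* Positive decoding on the PCI bus: find the device whose BAR covers
 * the access; NULL means nobody claimed it (subtractive decode). */
static PCIDevice *pci_bus_decode(PCIBus *bus, uint64_t addr, int size)
{
    int devfn, i;

    for (devfn = 0; devfn < 256; devfn++) {
        PCIDevice *dev = bus->devices[devfn];

        if (!dev) {
            continue;
        }
        for (i = 0; i < PCI_NUM_REGIONS; i++) {
            PCIIORegion *r = &dev->io_regions[i];

            if (r->size && addr >= r->addr &&
                addr + size <= r->addr + r->size) {
                return dev;
            }
        }
    }
    return NULL;
}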

>
>> elif the I/O is in ROM regions:
>>     if PAM RE or WE
>>        redirect to RAM appropriately
>>     else:
>>        send to ROMs
>> elif the I/O is in the PCI windows:
>>     send to the PCI bus
>> else:
>>     send to the PIIX3
>>
>> For x86, you could easily completely skip the explicit registration and
>> just have a direct dispatch to the i440fx and implement something
>> slightly more fancy than the above.
>>
>> And I think this is true for most other types of boards too.
>
> This all boils down to the point that we need to stop accepting memory
> region mappings from everywhere at the core level and instead dispatch
> them up the device tree properly. A device should register against its
> bus, which can then forward or handle the mapping directly. And that
> handling requires a central tool box to avoid reinventing wheels.
>
> That's a worthwhile change, though a much bigger one than I was
> originally hoping to get away with.

Agreed and as long as we've got the right long term vision in mind, we 
don't have to do everything up front.

Regards,

Anthony Liguori

> Jan
>

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-18 15:37   ` Avi Kivity
@ 2011-05-18 19:36     ` Jan Kiszka
  0 siblings, 0 replies; 187+ messages in thread
From: Jan Kiszka @ 2011-05-18 19:36 UTC (permalink / raw)
  To: Avi Kivity; +Cc: qemu-devel

[-- Attachment #1: Type: text/plain, Size: 1719 bytes --]

On 2011-05-18 17:37, Avi Kivity wrote:
>>> void memory_region_init_ram(MemoryRegion *mr,
>>> target_phys_addr_t size);
>>> void memory_region_init_ram_ptr(MemoryRegion *mr,
>>> target_phys_addr_t size,
>>> void *ptr);
>>> void memory_region_destroy(MemoryRegion *mr);
>>> void memory_region_set_offset(MemoryRegion *mr, target_phys_addr_t
>>> offset);
>>
>> What's "offset" mean?
> 
> It's the equivalent of cpu_register_physical_memory_offset().
> 
> Note the intent is to have addresses always be relative to the innermost
> container.  Perhaps we can get away without offset.

Offset is supposed to support the transition of devices that expect
absolute I/O addresses in their callbacks to those that are fine with
relative ones (based on the region start). This API change is a good
chance to finally get rid of the former group.

> 
>>
>>> void memory_region_set_log(MemoryRegion *mr, bool log);
>>> void memory_region_clear_coalescing(MemoryRegion *mr);
>>> void memory_region_add_coalescing(MemoryRegion *mr,
>>> target_phys_addr_t offset,
>>> target_phys_addr_t size);
>>
>> I don't think it's worth while to try to fit coalescing into this API.
>> It's a KVM specific hack.  I think it's fine to be a hacked on API.
> 
> The problem is that only the device knows about coalescing, while the
> region can be mapped, unmapped, or moved without device knowledge.  So
> if a PCI BAR is unmapped and then remapped (possibly at a different address)
> we need to tell kvm about it, but there is no device callback involved.

Doesn't Xen use coalescing as well, or couldn't it? It looks like a
generic optimization feature for hypervisors that want to build on top
of QEMU.

Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 259 bytes --]

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-18 13:12 [Qemu-devel] [RFC] Memory API Avi Kivity
                   ` (4 preceding siblings ...)
  2011-05-18 17:17 ` Avi Kivity
@ 2011-05-18 19:40 ` Jan Kiszka
  2011-05-19  8:06   ` Avi Kivity
  2011-05-19 13:36   ` Anthony Liguori
  5 siblings, 2 replies; 187+ messages in thread
From: Jan Kiszka @ 2011-05-18 19:40 UTC (permalink / raw)
  To: Avi Kivity; +Cc: qemu-devel

[-- Attachment #1: Type: text/plain, Size: 575 bytes --]

On 2011-05-18 15:12, Avi Kivity wrote:
> void cpu_register_memory_region(MemoryRegion *mr, target_phys_addr_t addr);

OK, let's allow overlapping, but make it explicit:

void cpu_register_memory_region_overlap(MemoryRegion *mr,
                                        target_phys_addr_t addr,
                                        int priority);

We need that ordering, so we need an interface. Regions registered via
cpu_register_memory_region must not overlap with an existing one, or we
will throw a hwerror. And they shall get a low default priority.
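
Usage could then look like this (a sketch; the APIC address and the
rule that a higher value wins are assumptions on my side):

cpu_register_memory_region(&ram_region, 0);   /* must not overlap */

/* The APIC page deliberately shadows whatever lies below it, so it
 * has to use the explicit overlap variant with a higher priority. */
cpu_register_memory_region_overlap(&apic_region, 0xfee00000, 1);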

Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 259 bytes --]

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-18 19:34               ` Anthony Liguori
@ 2011-05-18 20:02                 ` Alex Williamson
  2011-05-18 20:11                   ` Jan Kiszka
  2011-05-18 20:07                 ` Jan Kiszka
  1 sibling, 1 reply; 187+ messages in thread
From: Alex Williamson @ 2011-05-18 20:02 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Peter Maydell, Jan Kiszka, Avi Kivity, qemu-devel

On Wed, 2011-05-18 at 14:34 -0500, Anthony Liguori wrote:
> On 05/18/2011 02:27 PM, Jan Kiszka wrote:
> > On 2011-05-18 21:10, Anthony Liguori wrote:
> >> On 05/18/2011 10:30 AM, Jan Kiszka wrote:
> >> You really don't need to register 90% of the time.  In the case of a PC
> >> with i440fx, it's really quite simple:
> >>
> >> if an I/O is to the APIC page,
> >>     it's handled by the APIC
> >
> > That's not that simple. We need to tell apart:
> >   - if a cpu issued the request, and which one =>  forward to APIC
> 
> Right, but what I'm saying is that this logic lives in 
> kvm-all.c:kvm_run():case EXIT_MMIO.
> 
> Obviously for TCG, it's a bit more complicated but this should be 
> handled way before there's any kind of general dispatch.
> 
> >   - if the range was addressed by a device (PCI or other system bus
> >     devices) =>  forward to MSI or other MMIO handlers
> 
> The same is true for MSI.  Other MMIO handlers can be handled as 
> appropriate.  For instance, once an I/O is sent to the PCI bus, you can 
> walk each PCI device's BAR list to figure out which device owns the I/O 
> event.
> 
> For ISA, it's a little trickier since ISA doesn't do positive decoding. 
>   You need each ISA device to declare what I/O addresses it handles.

Do we only need to handle CPU based I/O with this API?  I would think we
would be layering memory regions and implementing them as a hierarchy
that reflects the actual hardware layout we're emulating.  An access
from an I/O device may get a different translation to memory than a CPU
(IOMMU).  We also might have a system with two VGA devices that both
register 0xa0000 with a switch in the chipset that decides which one
sees the accesses, just as real hardware does. ISTR a presentation at
one of the first KVM forums from you that talked about this type of
model.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-18 19:34               ` Anthony Liguori
  2011-05-18 20:02                 ` Alex Williamson
@ 2011-05-18 20:07                 ` Jan Kiszka
  2011-05-18 20:41                   ` Anthony Liguori
  1 sibling, 1 reply; 187+ messages in thread
From: Jan Kiszka @ 2011-05-18 20:07 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Peter Maydell, Avi Kivity, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 1217 bytes --]

On 2011-05-18 21:34, Anthony Liguori wrote:
> On 05/18/2011 02:27 PM, Jan Kiszka wrote:
>> On 2011-05-18 21:10, Anthony Liguori wrote:
>>> On 05/18/2011 10:30 AM, Jan Kiszka wrote:
>>> You really don't need to register 90% of the time.  In the case of a PC
>>> with i440fx, it's really quite simple:
>>>
>>> if an I/O is to the APIC page,
>>>     it's handled by the APIC
>>
>> That's not that simple. We need to tell apart:
>>   - if a cpu issued the request, and which one =>  forward to APIC
> 
> Right, but what I'm saying is that this logic lives in
> kvm-all.c:kvm_run():case EXIT_MMIO.
> 
> Obviously for TCG, it's a bit more complicated but this should be
> handled way before there's any kind of general dispatch.

Hmm, checking again, I think the APIC should not show up here at all. We
really need to filter it out very early at CPU level, i.e. when creating
the iotlb (or when dispatching a KVM EXIT_MMIO). It's cpu local, nothing
the chipset will ever see.

I really wonder now why I dropped the idea of handling per-cpu regions
as a special case in tlb_set_page. It looks trivial, could even be done
with a linear per-cpu list before looking at any chipset mappings.
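
Roughly like this (the per-CPU list and its link field are made up):

/* Called from tlb_set_page() before consulting the chipset map. */
static MemoryRegion *cpu_local_lookup(CPUState *env,
                                      target_phys_addr_t addr)
{
    MemoryRegion *mr;

    QLIST_FOREACH(mr, &env->local_regions, cpu_link) {
        if (addr >= mr->addr && addr - mr->addr < mr->size) {
            return mr;        /* e.g. the APIC page */
        }
    }
    return NULL;              /* fall through to the global map */
}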

Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 259 bytes --]

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-18 20:02                 ` Alex Williamson
@ 2011-05-18 20:11                   ` Jan Kiszka
  2011-05-18 20:13                     ` Alex Williamson
  0 siblings, 1 reply; 187+ messages in thread
From: Jan Kiszka @ 2011-05-18 20:11 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Peter Maydell, Avi Kivity, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 2088 bytes --]

On 2011-05-18 22:02, Alex Williamson wrote:
> On Wed, 2011-05-18 at 14:34 -0500, Anthony Liguori wrote:
>> On 05/18/2011 02:27 PM, Jan Kiszka wrote:
>>> On 2011-05-18 21:10, Anthony Liguori wrote:
>>>> On 05/18/2011 10:30 AM, Jan Kiszka wrote:
>>>> You really don't need to register 90% of the time.  In the case of a PC
>>>> with i440fx, it's really quite simple:
>>>>
>>>> if an I/O is to the APIC page,
>>>>     it's handled by the APIC
>>>
>>> That's not that simple. We need to tell apart:
>>>   - if a cpu issued the request, and which one =>  forward to APIC
>>
>> Right, but what I'm saying is that this logic lives in 
>> kvm-all.c:kvm_run():case EXIT_MMIO.
>>
>> Obviously for TCG, it's a bit more complicated but this should be 
>> handled way before there's any kind of general dispatch.
>>
>>>   - if the range was addressed by a device (PCI or other system bus
>>>     devices) =>  forward to MSI or other MMIO handlers
>>
>> The same is true for MSI.  Other MMIO handlers can be handled as 
>> appropriate.  For instance, once an I/O is sent to the PCI bus, you can 
>> walk each PCI device's BAR list to figure out which device owns the I/O 
>> event.
>>
>> For ISA, it's a little trickier since ISA doesn't do positive decoding. 
>>   You need each ISA device to declare what I/O addresses it handles.
> 
> Do we only need to handle CPU based I/O with this API?  I would think we
> would be layering memory regions and implementing them as a hierarchy
> that reflects the actual hardware layout we're emulating.  An access
> from an I/O device may get a different translation to memory than a CPU
> (IOMMU).  We also might have a system with two VGA devices that both
> register 0xa0000 with a switch in the chipset that decides which one
> sees the accesses, just as real hardware does. ISTR a presentation at
> one of the first KVM forums from you that talked about this type of
> model.  Thanks,

IIUC, that switch is present in every PCI bridge. It can forward legacy
VGA I/O requests to its devices or ignore them.

Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 259 bytes --]

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-18 16:47                   ` Avi Kivity
  2011-05-18 17:07                     ` Jan Kiszka
@ 2011-05-18 20:13                     ` Richard Henderson
  2011-05-19  8:04                       ` Avi Kivity
  1 sibling, 1 reply; 187+ messages in thread
From: Richard Henderson @ 2011-05-18 20:13 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Jan Kiszka, qemu-devel

On 05/18/2011 09:47 AM, Avi Kivity wrote:
> Yes.  We'd change pci_register_bar() to accept a MemoryRegion.

Surely this detail would be hidden on the pci_dev->bus?

>> However, we are still in trouble if we want to change that because
>> devices can only be on one bus - at least so far.
> 
> Nothing prohibits a device from calling pci_register_bar() for one region and some other API for another.

Sure, but the majority of PCI devices are plain pci, and 
that sort of complexity should be hidden by default.


r~

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-18 20:11                   ` Jan Kiszka
@ 2011-05-18 20:13                     ` Alex Williamson
  0 siblings, 0 replies; 187+ messages in thread
From: Alex Williamson @ 2011-05-18 20:13 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Peter Maydell, Avi Kivity, qemu-devel

On Wed, 2011-05-18 at 22:11 +0200, Jan Kiszka wrote:
> On 2011-05-18 22:02, Alex Williamson wrote:
> > On Wed, 2011-05-18 at 14:34 -0500, Anthony Liguori wrote:
> >> On 05/18/2011 02:27 PM, Jan Kiszka wrote:
> >>> On 2011-05-18 21:10, Anthony Liguori wrote:
> >>>> On 05/18/2011 10:30 AM, Jan Kiszka wrote:
> >>>> You really don't need to register 90% of the time.  In the case of a PC
> >>>> with i440fx, it's really quite simple:
> >>>>
> >>>> if an I/O is to the APIC page,
> >>>>     it's handled by the APIC
> >>>
> >>> That's not that simple. We need to tell apart:
> >>>   - if a cpu issued the request, and which one =>  forward to APIC
> >>
> >> Right, but what I'm saying is that this logic lives in 
> >> kvm-all.c:kvm_run():case EXIT_MMIO.
> >>
> >> Obviously for TCG, it's a bit more complicated but this should be 
> >> handled way before there's any kind of general dispatch.
> >>
> >>>   - if the range was addressed by a device (PCI or other system bus
> >>>     devices) =>  forward to MSI or other MMIO handlers
> >>
> >> The same is true for MSI.  Other MMIO handlers can be handled as 
> >> appropriate.  For instance, once an I/O is sent to the PCI bus, you can 
> >> walk each PCI device's BAR list to figure out which device owns the I/O 
> >> event.
> >>
> >> For ISA, it's a little trickier since ISA doesn't do positive decoding. 
> >>   You need each ISA device to declare what I/O addresses it handles.
> > 
> > Do we only need to handle CPU based I/O with this API?  I would think we
> > would be layering memory regions and implementing them as a hierarchy
> > that reflects the actual hardware layout we're emulating.  An access
> > from an I/O device may get a different translation to memory than a CPU
> > (IOMMU).  We also might have a system with two VGA devices that both
> > register 0xa0000 with a switch in the chipset that decides which one
> > sees the accesses, just as real hardware does. ISTR a presentation at
> > one of the first KVM forums from you that talked about this type of
> > model.  Thanks,
> 
> IIUC, that switch is present in every PCI bridge. It can forward legacy
> VGA I/O requests to its devices or ignore them.

Right, I just thought it was a good example of how we both need to have
decode done in a hierarchy that reflects actual hardware and how we need
to support overlapping ranges.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-18 20:07                 ` Jan Kiszka
@ 2011-05-18 20:41                   ` Anthony Liguori
  0 siblings, 0 replies; 187+ messages in thread
From: Anthony Liguori @ 2011-05-18 20:41 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Peter Maydell, Avi Kivity, qemu-devel

On 05/18/2011 03:07 PM, Jan Kiszka wrote:
> On 2011-05-18 21:34, Anthony Liguori wrote:
>> On 05/18/2011 02:27 PM, Jan Kiszka wrote:
>>> On 2011-05-18 21:10, Anthony Liguori wrote:
>>>> On 05/18/2011 10:30 AM, Jan Kiszka wrote:
>>>> You really don't need to register 90% of the time.  In the case of a PC
>>>> with i440fx, it's really quite simple:
>>>>
>>>> if an I/O is to the APIC page,
>>>>      it's handled by the APIC
>>>
>>> That's not that simple. We need to tell apart:
>>>    - if a cpu issued the request, and which one =>   forward to APIC
>>
>> Right, but what I'm saying is that this logic lives in
>> kvm-all.c:kvm_run():case EXIT_MMIO.
>>
>> Obviously for TCG, it's a bit more complicated but this should be
>> handled way before there's any kind of general dispatch.
>
> Hmm, checking again, I think the APIC should not show up here at all. We
> really need to filter it out very early at CPU level, i.e. when creating
> the iotlb (or when dispatching a KVM EXIT_MMIO). It's cpu local, nothing
> the chipset will ever see.
>
> I really wonder now why I dropped the idea of handling per-cpu regions
> as a special case in tlb_set_page. It looks trivial, could even be done
> with a linear per-cpu list before looking at any chipset mappings.

Yup, I agree this is the right way to do it.

Regards,

Anthony Liguori

>
> Jan
>

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-18 19:10           ` Anthony Liguori
  2011-05-18 19:27             ` Jan Kiszka
@ 2011-05-19  6:31             ` Jan Kiszka
  2011-05-19 13:23               ` Anthony Liguori
  2011-05-19  8:09             ` Avi Kivity
  2 siblings, 1 reply; 187+ messages in thread
From: Jan Kiszka @ 2011-05-19  6:31 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Peter Maydell, Avi Kivity, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 2802 bytes --]

On 2011-05-18 21:10, Anthony Liguori wrote:
> On 05/18/2011 10:30 AM, Jan Kiszka wrote:
>> On 2011-05-18 17:17, Peter Maydell wrote:
>>> On 18 May 2011 16:11, Jan Kiszka<jan.kiszka@siemens.com>  wrote:
>>>> On 2011-05-18 16:36, Avi Kivity wrote:
>>>>> There is nothing we can do with a return code.  You can't fail an mmio
>>>>> that causes an overlapping physical memory map.
>>>>
>>>> We must fail such requests to make progress with the API. That may
>>>> happen either on the caller side or in cpu_register_memory_region itself
>>>> (hwerror). Otherwise the new API will just be a shiny new facade for an
>>>> old and still fragile building.
>>>
>>> If we don't allow overlapping regions, then how do you implement
>>> things like "on startup board maps ROM into lower addresses
>>> over top of devices, but later it is unmapped and you can see
>>> the underlying devices" ? (You can't currently do this AFAIK,
>>> and it would be nice if the new API supported it.)
>>
>> Right, we can't do this properly, and that's why the attempt in the
>> i440fx chipset model is so horribly broken ATM.
>>
>> Just allowing overlapping does not solve this problem either. It does
>> not specify what region is on top and what is below (even worse if
>> multiple regions overlap at the same place).
>>
>> We need some managing instance here, and that is e.g. the chipset that
>> provides control over the overlap in reality. It could hook up a
>> PhysMemClient to receive and redirect updates to subregions, or only
>> allow registering them in a disabled state.
> 
> I think that gets ugly pretty fast.  The way this works IRL is that all
> I/O dispatches pass through the chipset.  You literally need something
> as simple as:
> 
> static void i440fx_io_intercept(void *opaque, uint64_t addr, uint32_t
> value, int size, MemRegion *next)
> {
>     I440FX *s = opaque;
> 
>     if (range_overlaps(addr, size, PAM_REGION)) {
>         ...
>     } else {
>         dispatch_io(next, addr, value, size);
>     }
> }
> 
> There's no need for an explicit intercept mechanism if you make multiple
> levels have their own dispatch tables and register progressively larger
> regions.  In fact....

Actually, things are a bit more complicated: This layer has to properly
adopt the coalescing properties of the underlying regions or we cause
performance regressions in VGA emulation. That means it has to register
dispatching slots of the corresponding size and set the coalescing flag
accordingly. And it likely needs to adjust them as the regions below change.

IOW, I don't think we get away with the simple approach above but still
need to track mappings via a PhysMemClient. But we should be able to
avoid filtering by adding overlapping regions with a higher prio.

Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 259 bytes --]

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-18 20:13                     ` Richard Henderson
@ 2011-05-19  8:04                       ` Avi Kivity
  0 siblings, 0 replies; 187+ messages in thread
From: Avi Kivity @ 2011-05-19  8:04 UTC (permalink / raw)
  To: Richard Henderson; +Cc: Jan Kiszka, qemu-devel

On 05/18/2011 11:13 PM, Richard Henderson wrote:
> On 05/18/2011 09:47 AM, Avi Kivity wrote:
> >  Yes.  We'd change pci_register_bar() to accept a MemoryRegion.
>
> Surely this detail would be hidden on the pci_dev->bus?

Not sure what you mean.

The reason I want pci_register_bar() to accept a memory region is that 
some BARs are composed of multiple subregions, for example cirrus has a 
RAM framebuffer and an mmio region in one BAR.  So the device describes 
the region relative to its start point and hands it off to the pci 
subsystem, which can then enable or disable the region, and place it 
anywhere in the bus address space it likes.
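
For cirrus that would look roughly like this (sizes illustrative;
cirrus_mmio_ops and s stand for the device's own state, and
pci_register_bar() is assumed to take a MemoryRegion as discussed):

MemoryRegion bar, fb, mmio;

memory_region_init(&bar, 0x1000000);            /* BAR container   */
memory_region_init_ram(&fb, 0x800000);          /* RAM framebuffer */
memory_region_init_io(&mmio, &cirrus_mmio_ops, 0x1000);

memory_region_add_subregion(&bar, 0x0, &fb);
memory_region_add_subregion(&bar, 0x800000, &mmio);

/* PCI can now enable, disable, or move the whole BAR without
 * calling back into the device. */
pci_register_bar(&s->dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY, &bar);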

> >>  However, we are still in trouble if we want to change that because
> >>  devices can only be on one bus - at least so far.
> >
> >  Nothing prohibits a device from calling pci_register_bar() for one region and some other API for another.
>
> Sure, but the majority of PCI devices are plain pci, and
> that sort of complexity should be hidden by default.

There is pci_register_bar_simple().

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-18 19:40 ` Jan Kiszka
@ 2011-05-19  8:06   ` Avi Kivity
  2011-05-19  8:08     ` Jan Kiszka
  2011-05-19 13:36   ` Anthony Liguori
  1 sibling, 1 reply; 187+ messages in thread
From: Avi Kivity @ 2011-05-19  8:06 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: qemu-devel

On 05/18/2011 10:40 PM, Jan Kiszka wrote:
> On 2011-05-18 15:12, Avi Kivity wrote:
> >  void cpu_register_memory_region(MemoryRegion *mr, target_phys_addr_t addr);
>
> OK, let's allow overlapping, but make it explicit:
>
> void cpu_register_memory_region_overlap(MemoryRegion *mr,
>                                          target_phys_addr_t addr,
>                                          int priority);
>
> We need that ordering, so we need an interface. Regions registered via
> cpu_register_memory_region must not overlap with an existing one, or
> we will throw a hwerror. And they shall get a low default priority.
>

PCI BARs can overlap with anything.  So any region can overlap with any 
other region.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19  8:06   ` Avi Kivity
@ 2011-05-19  8:08     ` Jan Kiszka
  2011-05-19  8:13       ` Avi Kivity
  0 siblings, 1 reply; 187+ messages in thread
From: Jan Kiszka @ 2011-05-19  8:08 UTC (permalink / raw)
  To: Avi Kivity; +Cc: qemu-devel

[-- Attachment #1: Type: text/plain, Size: 958 bytes --]

On 2011-05-19 10:06, Avi Kivity wrote:
> On 05/18/2011 10:40 PM, Jan Kiszka wrote:
>> On 2011-05-18 15:12, Avi Kivity wrote:
>> >  void cpu_register_memory_region(MemoryRegion *mr,
>> >                                  target_phys_addr_t addr);
>>
>> OK, let's allow overlapping, but make it explicit:
>>
>> void cpu_register_memory_region_overlap(MemoryRegion *mr,
>>                                          target_phys_addr_t addr,
>>                                          int priority);
>>
>> We need that ordering, so we need an interface. Regions registered via
>> cpu_register_memory_region must not overlap with an existing one, or
>> we will throw a hwerror. And they shall get a low default priority.
>>
> 
> PCI BARs can overlap with anything.  So any region can overlap with any
> other region.

I know, but that result is unspecified anyway. The user (guest OS) can't
expect any reasonable result. Rather, we need priorities for useful
overlapping.

Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 259 bytes --]

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-18 19:10           ` Anthony Liguori
  2011-05-18 19:27             ` Jan Kiszka
  2011-05-19  6:31             ` Jan Kiszka
@ 2011-05-19  8:09             ` Avi Kivity
  2 siblings, 0 replies; 187+ messages in thread
From: Avi Kivity @ 2011-05-19  8:09 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Jan Kiszka, qemu-devel, Peter Maydell

On 05/18/2011 10:10 PM, Anthony Liguori wrote:
> On 05/18/2011 10:30 AM, Jan Kiszka wrote:
>> On 2011-05-18 17:17, Peter Maydell wrote:
>>> On 18 May 2011 16:11, Jan Kiszka<jan.kiszka@siemens.com>  wrote:
>>>> On 2011-05-18 16:36, Avi Kivity wrote:
>>>>> There is nothing we can do with a return code.  You can't fail an mmio
>>>>> that causes an overlapping physical memory map.
>>>>
>>>> We must fail such requests to make progress with the API. That may
>>>> happen either on the caller side or in cpu_register_memory_region itself
>>>> (hwerror). Otherwise the new API will just be a shiny new facade for
>>>> an old and still fragile building.
>>>
>>> If we don't allow overlapping regions, then how do you implement
>>> things like "on startup board maps ROM into lower addresses
>>> over top of devices, but later it is unmapped and you can see
>>> the underlying devices" ? (You can't currently do this AFAIK,
>>> and it would be nice if the new API supported it.)
>>
>> Right, we can't do this properly, and that's why the attempt in the
>> i440fx chipset model is so horribly broken ATM.
>>
>> Just allowing overlapping does not solve this problem either. It does
>> not specify what region is on top and what is below (even worse if
>> multiple regions overlap at the same place).
>>
>> We need some managing instance here, and that is e.g. the chipset that
>> provides control over the overlap in reality. It could hook up a
>> PhysMemClient to receive and redirect updates to subregions, or only
>> allow registering them in a disabled state.
>
> I think that gets ugly pretty fast.  The way this works IRL is that 
> all I/O dispatches pass through the chipset.  You literally need 
> something as simple as:
>
> static void i440fx_io_intercept(void *opaque, uint64_t addr, uint32_t 
> value, int size, MemRegion *next)
> {
>     I440FX *s = opaque;
>
>     if (range_overlaps(addr, size, PAM_REGION)) {
>         ...
>     } else {
>         dispatch_io(next, addr, value, size);
>     }
> }
>
> There's no need for an explicit intercept mechanism if you make 
> multiple levels have their own dispatch tables and register 
> progressively larger regions.  In fact....
>
> You really don't need to register 90% of the time.  In the case of a 
> PC with i440fx, it's really quite simple:
>
> if an I/O is to the APIC page,
>    it's handled by the APIC
> elif the I/O is in ROM regions:
>    if PAM RE or WE
>       redirect to RAM appropriately
>    else:
>       send to ROMs
> elif the I/O is in the PCI windows:
>    send to the PCI bus
> else:
>    send to the PIIX3
>

Memory regions do the exact same thing, except at registration time.  
The nested if/elif/else tree is captured in the nested region tree.
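
For the PC above, the tree could be built once at init time; a sketch
(region names and the PAM details are simplified):

memory_region_add_subregion(&system, 0x00000000, &ram);
memory_region_add_subregion(&system, 0x000c0000, &pam_area);
memory_region_add_subregion(&system, 0xe0000000, &pci_window);
/* accesses claimed by none of these fall through to the PIIX3 */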

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19  8:08     ` Jan Kiszka
@ 2011-05-19  8:13       ` Avi Kivity
  2011-05-19  8:25         ` Jan Kiszka
  0 siblings, 1 reply; 187+ messages in thread
From: Avi Kivity @ 2011-05-19  8:13 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: qemu-devel

On 05/19/2011 11:08 AM, Jan Kiszka wrote:
> On 2011-05-19 10:06, Avi Kivity wrote:
> >  On 05/18/2011 10:40 PM, Jan Kiszka wrote:
> >>  On 2011-05-18 15:12, Avi Kivity wrote:
> >>  >   void cpu_register_memory_region(MemoryRegion *mr,
> >>  >                                    target_phys_addr_t addr);
> >>
> >>  OK, let's allow overlapping, but make it explicit:
> >>
> >>  void cpu_register_memory_region_overlap(MemoryRegion *mr,
> >>                                           target_phys_addr_t addr,
> >>                                           int priority);
> >>
> >>  We need that ordering, so we need an interface. Regions registered via
> >>  cpu_register_memory_region must not overlap with an existing one,
> >>  or we will throw a hwerror. And they shall get a low default priority.
> >>
> >
> >  PCI BARs can overlap with anything.  So any region can overlap with any
> >  other region.
>
> I know, but that result is unspecified anyway. The user (guest OS) can't
> expect any reasonable result. Rather, we need priorities for useful
> overlapping.

Unspecified doesn't mean abort.  It means we need to specify something 
(which translates to: we get to pick the priorities).

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19  8:13       ` Avi Kivity
@ 2011-05-19  8:25         ` Jan Kiszka
  2011-05-19  8:43           ` Avi Kivity
  0 siblings, 1 reply; 187+ messages in thread
From: Jan Kiszka @ 2011-05-19  8:25 UTC (permalink / raw)
  To: Avi Kivity; +Cc: qemu-devel

[-- Attachment #1: Type: text/plain, Size: 1581 bytes --]

On 2011-05-19 10:13, Avi Kivity wrote:
> On 05/19/2011 11:08 AM, Jan Kiszka wrote:
>> On 2011-05-19 10:06, Avi Kivity wrote:
>> >  On 05/18/2011 10:40 PM, Jan Kiszka wrote:
>> >>  On 2011-05-18 15:12, Avi Kivity wrote:
>> >>  >   void cpu_register_memory_region(MemoryRegion *mr,
>> >>  >                                    target_phys_addr_t addr);
>> >>
>> >>  OK, let's allow overlapping, but make it explicit:
>> >>
>> >>  void cpu_register_memory_region_overlap(MemoryRegion *mr,
>> >>                                           target_phys_addr_t addr,
>> >>                                           int priority);
>> >>
>> >>  We need that ordering, so we need an interface. Regions registered
>> >>  via cpu_register_memory_region must not overlap with an existing
>> >>  one, or we will throw a hwerror. And they shall get a low default
>> >>  priority.
>> >>
>> >
>> >  PCI BARs can overlap with anything.  So any region can overlap with
>> >  any other region.
>>
>> I know, but that result is unspecified anyway. The user (guest OS) can't
>> expect any reasonable result. Rather, we need priorities for useful
>> overlapping.
> 
> Unspecified doesn't mean abort.  It means we need to specify something
> (which translates to: we get to pick the priorities).

Of course, PCI bars would have to be registered via
cpu_register_memory_region_overlap, just specifying the default
priority. Here we know that overlapping can happen and is not a bug in
the board emulation. I want to avoid such use cases making overlapping
generally legal and papering over real bugs.

Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 259 bytes --]

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-18 19:27             ` Jan Kiszka
  2011-05-18 19:34               ` Anthony Liguori
@ 2011-05-19  8:26               ` Gleb Natapov
  2011-05-19  8:30                 ` Jan Kiszka
  1 sibling, 1 reply; 187+ messages in thread
From: Gleb Natapov @ 2011-05-19  8:26 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Peter Maydell, Avi Kivity, qemu-devel

On Wed, May 18, 2011 at 09:27:55PM +0200, Jan Kiszka wrote:
> > if an I/O is to the APIC page,
> >    it's handled by the APIC
> 
> That's not that simple. We need to tell apart:
>  - if a cpu issued the request, and which one => forward to APIC
And the CPU mode may affect where an access is forwarded to. If the CPU
is in SMM mode, an access to the frame buffer may be forwarded to memory
(depending on the chipset configuration).

>  - if the range was addressed by a device (PCI or other system bus
>    devices) => forward to MSI or other MMIO handlers
> 



--
			Gleb.

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19  8:26               ` Gleb Natapov
@ 2011-05-19  8:30                 ` Jan Kiszka
  2011-05-19  8:44                   ` Avi Kivity
                                     ` (2 more replies)
  0 siblings, 3 replies; 187+ messages in thread
From: Jan Kiszka @ 2011-05-19  8:30 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Peter Maydell, Avi Kivity, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 749 bytes --]

On 2011-05-19 10:26, Gleb Natapov wrote:
> On Wed, May 18, 2011 at 09:27:55PM +0200, Jan Kiszka wrote:
>>> if an I/O is to the APIC page,
>>>    it's handled by the APIC
>>
>> That's not that simple. We need to tell apart:
>>  - if a cpu issued the request, and which one => forward to APIC
> And the CPU mode may affect where an access is forwarded to. If the CPU
> is in SMM mode, an access to the frame buffer may be forwarded to memory
> (depending on the chipset configuration).

So we have a second use case for CPU-local I/O regions?

I wonder if only a single CPU can enter SMM or if all have to. Right now
only the first CPU can switch to that mode, and that affects the
behaviour of the chipset /wrt SMRAM mapping. Is that another hack?

Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 259 bytes --]

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19  8:25         ` Jan Kiszka
@ 2011-05-19  8:43           ` Avi Kivity
  2011-05-19  9:24             ` Jan Kiszka
  0 siblings, 1 reply; 187+ messages in thread
From: Avi Kivity @ 2011-05-19  8:43 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: qemu-devel

On 05/19/2011 11:25 AM, Jan Kiszka wrote:
> >
> >  Unspecified doesn't mean abort.  It means we need to specify something
> >  (which translates to: we get to pick the priorities).
>
> Of course, PCI bars would have to be registered via
> cpu_register_memory_region_overlap, just specifying the default
> priority. Here we know that overlapping can happen and is not a bug in
> the board emulation. I want to avoid such use cases making overlapping
> generally legal and papering over real bugs.

But those are the majority of regions.  There are a few extra RAM and 
ROM and fixed function regions, but these are hardly likely to clash.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19  8:30                 ` Jan Kiszka
@ 2011-05-19  8:44                   ` Avi Kivity
  2011-05-19 13:59                     ` Anthony Liguori
  2011-05-19 13:52                   ` Anthony Liguori
  2011-05-20 17:30                   ` Blue Swirl
  2 siblings, 1 reply; 187+ messages in thread
From: Avi Kivity @ 2011-05-19  8:44 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Peter Maydell, Gleb Natapov, qemu-devel

On 05/19/2011 11:30 AM, Jan Kiszka wrote:
> >>
> >>  That's not that simple. We need to tell apart:
> >>   - if a cpu issued the request, and which one =>  forward to APIC
> >  And the CPU mode may affect where an access is forwarded to. If the CPU
> >  is in SMM mode, an access to the frame buffer may be forwarded to memory
> >  (depending on the chipset configuration).
>
> So we have a second use case for CPU-local I/O regions?
>
> I wonder if only a single CPU can enter SMM or if all have to. Right now
> only the first CPU can switch to that mode, and that affects the
> behaviour of the chipset /wrt SMRAM mapping. Is that another hack?

It's a hack.  SMM is a per-cpu setting.  Effectively it's another 
address pin - it changes the meaning of (potentially) all addresses.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-18 15:42           ` Avi Kivity
  2011-05-18 16:00             ` Jan Kiszka
@ 2011-05-19  9:08             ` Gleb Natapov
  2011-05-19  9:10               ` Avi Kivity
  2011-05-19  9:24               ` Edgar E. Iglesias
  1 sibling, 2 replies; 187+ messages in thread
From: Gleb Natapov @ 2011-05-19  9:08 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Jan Kiszka, qemu-devel

On Wed, May 18, 2011 at 06:42:14PM +0300, Avi Kivity wrote:
> On 05/18/2011 06:36 PM, Jan Kiszka wrote:
> >>
> >>  We need to head for the more hardware-like approach.  What happens when
> >>  you program overlapping BARs?  I imagine the result is
> >>  implementation-defined, but ends up with one region decoded in
> >>  preference to the other.  There is simply no way to reject an
> >>  overlapping mapping.
> >
> >But there is also no simple way to allow them. At least not without
> >exposing control over their ordering AND allowing managing code (e.g.
> >of the PCI bridge or the chipset) that controls registrations to hook up.
> 
> What about memory_region_add_subregion(..., int priority) as I
> suggested in another message?
Haven't seen the other message yet, but how does the caller know about priority?


--
			Gleb.

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19  9:08             ` Gleb Natapov
@ 2011-05-19  9:10               ` Avi Kivity
  2011-05-19  9:14                 ` Gleb Natapov
  2011-05-19 13:44                 ` Anthony Liguori
  2011-05-19  9:24               ` Edgar E. Iglesias
  1 sibling, 2 replies; 187+ messages in thread
From: Avi Kivity @ 2011-05-19  9:10 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Jan Kiszka, qemu-devel

On 05/19/2011 12:08 PM, Gleb Natapov wrote:
> On Wed, May 18, 2011 at 06:42:14PM +0300, Avi Kivity wrote:
> >  On 05/18/2011 06:36 PM, Jan Kiszka wrote:
> >  >>
> >  >>   We need to head for the more hardware-like approach.  What happens when
> >  >>   you program overlapping BARs?  I imagine the result is
> >  >>   implementation-defined, but ends up with one region decoded in
> >  >>   preference to the other.  There is simply no way to reject an
> >  >>   overlapping mapping.
> >  >
> >  >But there is also no simple way to allow them. At least not without
> >  >exposing control over their ordering AND allowing managing code (e.g.
> >  >of the PCI bridge or the chipset) that controls registrations to hook up.
> >
> >  What about memory_region_add_subregion(..., int priority) as I
> >  suggested in another message?
> Haven't seen the other message yet, but how does the caller know about priority?

The caller is emulating some hub or router and should decide on priority 
like real hardware.

For example, piix gives higher priority to the vga window over RAM.
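
Something like this (assuming the priority parameter suggested
earlier, with a higher value winning):

memory_region_add_subregion(&system, 0x00000, &ram, 0);
memory_region_add_subregion(&system, 0xa0000, &vga_window, 1);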

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19  9:10               ` Avi Kivity
@ 2011-05-19  9:14                 ` Gleb Natapov
  2011-05-19 11:44                   ` Avi Kivity
  2011-05-19 13:44                 ` Anthony Liguori
  1 sibling, 1 reply; 187+ messages in thread
From: Gleb Natapov @ 2011-05-19  9:14 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Jan Kiszka, qemu-devel

On Thu, May 19, 2011 at 12:10:38PM +0300, Avi Kivity wrote:
> On 05/19/2011 12:08 PM, Gleb Natapov wrote:
> >On Wed, May 18, 2011 at 06:42:14PM +0300, Avi Kivity wrote:
> >>  On 05/18/2011 06:36 PM, Jan Kiszka wrote:
> >>  >>
> >>  >>   We need to head for the more hardware-like approach.  What happens when
> >>  >>   you program overlapping BARs?  I imagine the result is
> >>  >>   implementation-defined, but ends up with one region decoded in
> >>  >>   preference to the other.  There is simply no way to reject an
> >>  >>   overlapping mapping.
> >>  >
> >>  >But there is also no simple way to allow them. At least not without
> >>  >exposing control over their ordering AND allowing managing code (e.g.
> >>  >of the PCI bridge or the chipset) that controls registrations to hook up.
> >>
> >>  What about memory_region_add_subregion(..., int priority) as I
> >>  suggested in another message?
> >Haven't seen the other message yet, but how does the caller know about priority?
> 
> The caller is emulating some hub or router and should decide on
> priority like real hardware.
> 
> For example, piix gives higher priority to the vga window over RAM.
> 
Hmm, but if the caller of the memory_region_add_subregion() function is a
device itself, how does it know about chipset priorities? All it wants to
tell the system is that it is ready to handle mmio accesses in this phys
range, but the chipset may decide to forward those accesses elsewhere.

--
			Gleb.

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19  9:08             ` Gleb Natapov
  2011-05-19  9:10               ` Avi Kivity
@ 2011-05-19  9:24               ` Edgar E. Iglesias
  2011-05-19 14:49                 ` Peter Maydell
  1 sibling, 1 reply; 187+ messages in thread
From: Edgar E. Iglesias @ 2011-05-19  9:24 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Jan Kiszka, Avi Kivity, qemu-devel

On Thu, May 19, 2011 at 12:08:51PM +0300, Gleb Natapov wrote:
> On Wed, May 18, 2011 at 06:42:14PM +0300, Avi Kivity wrote:
> > On 05/18/2011 06:36 PM, Jan Kiszka wrote:
> > >>
> > >>  We need to head for the more hardware-like approach.  What happens when
> > >>  you program overlapping BARs?  I imagine the result is
> > >>  implementation-defined, but ends up with one region decoded in
> > >>  preference to the other.  There is simply no way to reject an
> > >>  overlapping mapping.
> > >
> > >But there is also no simple way to allow them. At least not without
> > >exposing control over their ordering AND allowing managing code (e.g.
> > >of the PCI bridge or the chipset) that controls registrations to hook up.
> > 
> > What about memory_region_add_subregion(..., int priority) as I
> > suggested in another message?
> Haven't seen the other message yet, but how does the caller know about priority?

I don't think it does; overlap behaviour should be decided by the nodes in
the bus path from master to slave. I think Anthony's call chain approach
makes sense, but it doesn't mean that one can't create a "flat" cached
representation of it for fast accesses. The bus (interconnects & bridges)
must be able to invalidate it when the configuration changes though
(leading to a new walk through the call chain and creation of an updated
cached "flat" state).
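
The invalidation side could be as simple as this sketch (the cached
array and its lazy rebuild are hypothetical):

/* Called by a bridge/interconnect when its decode config changes. */
static void bus_config_changed(BusState *bus)
{
    qemu_free(bus->flat_view);  /* drop the cached flat representation */
    bus->flat_view = NULL;      /* rebuilt lazily by the next dispatch */
}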

Some other comments:
On the CPU local aspect, I think it is increasingly common in the
embedded space to see local busses with CPU local peripherals in
addition to the "system" bus with "global" peripherals.

Another thing that was discussed was the ability for devices to know
who is accessing them. I think this is uncommon, but it does exist.
IIRC an example is the MIPS GIC, which has some internal decoding
based on which CPU is accessing it (but there are other examples).
It would be nice to be able to handle this, but the accessor info should
definitely not be passed to devices by default; it'll only lead to
hacks when not strictly necessary...

Overall I like the initiative and look forward to seeing code and testing it.

Cheers

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19  8:43           ` Avi Kivity
@ 2011-05-19  9:24             ` Jan Kiszka
  2011-05-19 11:58               ` Avi Kivity
  0 siblings, 1 reply; 187+ messages in thread
From: Jan Kiszka @ 2011-05-19  9:24 UTC (permalink / raw)
  To: Avi Kivity; +Cc: qemu-devel

[-- Attachment #1: Type: text/plain, Size: 815 bytes --]

On 2011-05-19 10:43, Avi Kivity wrote:
> On 05/19/2011 11:25 AM, Jan Kiszka wrote:
>> >
>> >  Unspecified doesn't mean abort.  It means we need to specify something
>> >  (which translates to: we get to pick the priorities).
>>
>> Of course, PCI bars would have to be registered via
>> cpu_register_memory_region_overlap, just specifying the default
>> priority. Here we know that overlapping can happen and is not a bug in
>> the board emulation. I want to avoid that such use cases make
>> overlapping generally legal, papering over real bugs.
> 
> But those are the majority of regions.  There are a few extra RAM and
> ROM and fixed function regions, but these are hardly likely to clash.

You are probably only thinking about x86, which does not provide the
majority of QEMU devices.

Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 259 bytes --]

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19  9:14                 ` Gleb Natapov
@ 2011-05-19 11:44                   ` Avi Kivity
  2011-05-19 11:54                     ` Gleb Natapov
  0 siblings, 1 reply; 187+ messages in thread
From: Avi Kivity @ 2011-05-19 11:44 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Jan Kiszka, qemu-devel

On 05/19/2011 12:14 PM, Gleb Natapov wrote:
> On Thu, May 19, 2011 at 12:10:38PM +0300, Avi Kivity wrote:
> >  On 05/19/2011 12:08 PM, Gleb Natapov wrote:
> >  >On Wed, May 18, 2011 at 06:42:14PM +0300, Avi Kivity wrote:
> >  >>   On 05/18/2011 06:36 PM, Jan Kiszka wrote:
> >  >>   >>
> >  >>   >>    We need to head for the more hardware-like approach.  What happens when
> >  >>   >>    you program overlapping BARs?  I imagine the result is
> >  >>   >>    implementation-defined, but ends up with one region decoded in
> >  >>   >>    preference to the other.  There is simply no way to reject an
> >  >>   >>    overlapping mapping.
> >  >>   >
> >  >>   >But there is also no simple way to allow them. At least not without
> >  >>   >exposing control over their ordering AND allowing managing code (e.g.
> >  >>   >of the PCI bridge or the chipset) that controls registrations to hook up.
> >  >>
> >  >>   What about memory_region_add_subregion(..., int priority) as I
> >  >>   suggested in another message?
> >  >Haven't saw another message yet, but how caller knows about priority?
> >
> >  The caller is emulating some hub or router and should decide on
> >  priority like real hardware.
> >
> >  For example, piix gives higher priority to the vga window over RAM.
> >
> Hmm, but if a caller of the memory_region_add_subregion() function is a
> device itself how does it know about chipset priorities. All it wants to
> tell to the system is that it is ready to handle mmio access in this phys
> range, but chipset may decide to forward those accesses elsewhere.

In this case the device would call a chipset function, passing the 
memory region as a parameter, and the chipset would call 
m_r_add_subregion().  Alternatively, the chipset can instantiate the 
device (if it is an embedded one) and call m_r_add_subregion() itself.
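
Roughly like this (chipset_map_device() and the field names are invented
for illustration):

    static void chipset_map_device(ChipsetState *s, target_phys_addr_t addr,
                                   MemoryRegion *mr)
    {
        /* the chipset, not the device, decides where and with what
         * priority the region is decoded */
        memory_region_add_subregion(&s->system_mr, addr, mr);
    }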

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 11:44                   ` Avi Kivity
@ 2011-05-19 11:54                     ` Gleb Natapov
  2011-05-19 11:57                       ` Jan Kiszka
  2011-05-19 11:57                       ` Avi Kivity
  0 siblings, 2 replies; 187+ messages in thread
From: Gleb Natapov @ 2011-05-19 11:54 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Jan Kiszka, qemu-devel

On Thu, May 19, 2011 at 02:44:29PM +0300, Avi Kivity wrote:
> On 05/19/2011 12:14 PM, Gleb Natapov wrote:
> >On Thu, May 19, 2011 at 12:10:38PM +0300, Avi Kivity wrote:
> >>  On 05/19/2011 12:08 PM, Gleb Natapov wrote:
> >>  >On Wed, May 18, 2011 at 06:42:14PM +0300, Avi Kivity wrote:
> >>  >>   On 05/18/2011 06:36 PM, Jan Kiszka wrote:
> >>  >>   >>
> >>  >>   >>    We need to head for the more hardware-like approach.  What happens when
> >>  >>   >>    you program overlapping BARs?  I imagine the result is
> >>  >>   >>    implementation-defined, but ends up with one region decoded in
> >>  >>   >>    preference to the other.  There is simply no way to reject an
> >>  >>   >>    overlapping mapping.
> >>  >>   >
> >>  >>   >But there is also now simple way to allow them. At least not without
> >>  >>   >exposing control about their ordering AND allowing to hook up managing
> >>  >>   >code (e.g. of the PCI bridge or the chipset) that controls registrations.
> >>  >>
> >>  >>   What about memory_region_add_subregion(..., int priority) as I
> >>  >>   suggested in another message?
> >>  >Haven't saw another message yet, but how caller knows about priority?
> >>
> >>  The caller is emulating some hub or router and should decide on
> >>  priority like real hardware.
> >>
> >>  For example, piix gives higher priority to the vga window over RAM.
> >>
> >Hmm, but if a caller of the memory_region_add_subregion() function is a
> >device itself how does it know about chipset priorities. All it wants to
> >tell to the system is that it is ready to handle mmio access in this phys
> >range, but chipset may decide to forward those accesses elsewhere.
> 
> In this case the device would call a chipset function, passing the
> memory region as a parameter, and the chipset would call
> m_r_add_subregion().
But then the chipset can resolve all overlaps by itself and register only
the regions that are actually accessible to guest software. Also, there
are devices that on some architectures are accessed through a chipset and
on others reside directly on the system bus. If they need to call a
different memory registration API depending on how they are instantiated,
the code can become messy.

>                       Alternatively, the chipset can instantiate the
> device (if it is an embedded one) and call m_r_add_subregion()
> itself.
> 

--
			Gleb.

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 11:54                     ` Gleb Natapov
@ 2011-05-19 11:57                       ` Jan Kiszka
  2011-05-19 11:58                         ` Gleb Natapov
  2011-05-19 11:57                       ` Avi Kivity
  1 sibling, 1 reply; 187+ messages in thread
From: Jan Kiszka @ 2011-05-19 11:57 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Avi Kivity, qemu-devel

On 2011-05-19 13:54, Gleb Natapov wrote:
> On Thu, May 19, 2011 at 02:44:29PM +0300, Avi Kivity wrote:
>> On 05/19/2011 12:14 PM, Gleb Natapov wrote:
>>> On Thu, May 19, 2011 at 12:10:38PM +0300, Avi Kivity wrote:
>>>>  On 05/19/2011 12:08 PM, Gleb Natapov wrote:
>>>>  >On Wed, May 18, 2011 at 06:42:14PM +0300, Avi Kivity wrote:
>>>>  >>   On 05/18/2011 06:36 PM, Jan Kiszka wrote:
>>>>  >>   >>
>>>>  >>   >>    We need to head for the more hardware-like approach.  What happens when
>>>>  >>   >>    you program overlapping BARs?  I imagine the result is
>>>>  >>   >>    implementation-defined, but ends up with one region decoded in
>>>>  >>   >>    preference to the other.  There is simply no way to reject an
>>>>  >>   >>    overlapping mapping.
>>>>  >>   >
>>>>  >>   >But there is also now simple way to allow them. At least not without
>>>>  >>   >exposing control about their ordering AND allowing to hook up managing
>>>>  >>   >code (e.g. of the PCI bridge or the chipset) that controls registrations.
>>>>  >>
>>>>  >>   What about memory_region_add_subregion(..., int priority) as I
>>>>  >>   suggested in another message?
>>>>  >Haven't saw another message yet, but how caller knows about priority?
>>>>
>>>>  The caller is emulating some hub or router and should decide on
>>>>  priority like real hardware.
>>>>
>>>>  For example, piix gives higher priority to the vga window over RAM.
>>>>
>>> Hmm, but if a caller of the memory_region_add_subregion() function is a
>>> device itself how does it know about chipset priorities. All it wants to
>>> tell to the system is that it is ready to handle mmio access in this phys
>>> range, but chipset may decide to forward those accesses elsewhere.
>>
>> In this case the device would call a chipset function, passing the
>> memory region as a parameter, and the chipset would call
>> m_r_add_subregion().
> But then chipset can resolve all overlapping by itself and register only
> regions that are actually accessible by a guest software. Also there are
> devices that on some architectures are accessed through a chipset and on
> other they resides directly on a system bus. If they will need to call
> different memory registration api depending on how they are instantiated
> the code can become messy.

Devices shall register their regions with the bus. Every device is on
some bus, so that's not a problem. And we can then provide registration
handlers at bus level that either implement specific logic or just
forward the request to the next hierarchy level (default handler).
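
A minimal sketch of such a handler; all names here are invented for
illustration (today's BusState has no such hook):

    typedef struct BusMemOps {
        void (*register_region)(BusState *bus, target_phys_addr_t addr,
                                MemoryRegion *mr);
    } BusMemOps;

    void bus_register_region(BusState *bus, target_phys_addr_t addr,
                             MemoryRegion *mr)
    {
        if (bus->mem_ops && bus->mem_ops->register_region) {
            /* bus-specific logic (PCI, ISA, ...) */
            bus->mem_ops->register_region(bus, addr, mr);
        } else if (bus->parent) {
            /* default: forward to the next hierarchy level */
            bus_register_region(bus->parent->parent_bus, addr, mr);
        } else {
            /* system bus reached: actually map the region */
            cpu_register_memory_region(mr, addr);
        }
    }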

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 11:54                     ` Gleb Natapov
  2011-05-19 11:57                       ` Jan Kiszka
@ 2011-05-19 11:57                       ` Avi Kivity
  2011-05-19 12:20                         ` Jan Kiszka
  1 sibling, 1 reply; 187+ messages in thread
From: Avi Kivity @ 2011-05-19 11:57 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Jan Kiszka, qemu-devel

On 05/19/2011 02:54 PM, Gleb Natapov wrote:
> >
> >  In this case the device would call a chipset function, passing the
> >  memory region as a parameter, and the chipset would call
> >  m_r_add_subregion().
> But then chipset can resolve all overlapping by itself and register only
> regions that are actually accessible by a guest software.

Sure it can (and it does now), but it's hard.  This API centralizes the 
logic, leaving the devices/chipsets to specify what they want.

For a PC, we have at least two such cases, the ISA bus and the PCI bus.

>   Also there are
> devices that on some architectures are accessed through a chipset and on
> other they resides directly on a system bus. If they will need to call
> different memory registration api depending on how they are instantiated
> the code can become messy.

An example is ne2000-isa and ne2000-pci.  There's no getting around some 
glue logic, but I think this API minimizes it (you can declare 
everything about your memory region in common code, the only thing that 
is different is registration).
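
For instance (the isa/pci registration helpers below are invented; only
memory_region_init_io() is from the proposed API):

    /* common code: declare the region once */
    memory_region_init_io(&s->io_mr, &ne2000_mem_ops, 0x100);

    /* ne2000-isa glue (hypothetical helper) */
    isa_register_region(isa_bus, base, &s->io_mr);

    /* ne2000-pci glue (hypothetical helper) */
    pci_register_bar_region(pci_dev, 0, &s->io_mr);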

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19  9:24             ` Jan Kiszka
@ 2011-05-19 11:58               ` Avi Kivity
  0 siblings, 0 replies; 187+ messages in thread
From: Avi Kivity @ 2011-05-19 11:58 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: qemu-devel

On 05/19/2011 12:24 PM, Jan Kiszka wrote:
> On 2011-05-19 10:43, Avi Kivity wrote:
> >  On 05/19/2011 11:25 AM, Jan Kiszka wrote:
> >>  >
> >>  >   Unspecified doesn't mean abort.  It means we need to specify something
> >>  >   (which translates to: we get to pick the priorities).
> >>
> >>  Of course, PCI bars would have to be registered via
> >>  cpu_register_memory_region_overlap, just specifying the default
> >>  priority. Here we know that overlapping can happen and is not a bug in
> >>  the board emulation. I want to avoid that such use cases make
> >>  overlapping generally legal, papering over real bugs.
> >
> >  But those are the majority of regions.  There are a few extra RAM and
> >  ROM and fixed function regions, but these are hardly likely to clash.
>
> You are probably only thinking about x86, which does not provide the
> majority of QEMU devices.

The next version of the API will include an overlap property.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 11:57                       ` Jan Kiszka
@ 2011-05-19 11:58                         ` Gleb Natapov
  2011-05-19 12:02                           ` Avi Kivity
  2011-05-19 12:02                           ` Jan Kiszka
  0 siblings, 2 replies; 187+ messages in thread
From: Gleb Natapov @ 2011-05-19 11:58 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Avi Kivity, qemu-devel

On Thu, May 19, 2011 at 01:57:15PM +0200, Jan Kiszka wrote:
> On 2011-05-19 13:54, Gleb Natapov wrote:
> > On Thu, May 19, 2011 at 02:44:29PM +0300, Avi Kivity wrote:
> >> On 05/19/2011 12:14 PM, Gleb Natapov wrote:
> >>> On Thu, May 19, 2011 at 12:10:38PM +0300, Avi Kivity wrote:
> >>>>  On 05/19/2011 12:08 PM, Gleb Natapov wrote:
> >>>>  >On Wed, May 18, 2011 at 06:42:14PM +0300, Avi Kivity wrote:
> >>>>  >>   On 05/18/2011 06:36 PM, Jan Kiszka wrote:
> >>>>  >>   >>
> >>>>  >>   >>    We need to head for the more hardware-like approach.  What happens when
> >>>>  >>   >>    you program overlapping BARs?  I imagine the result is
> >>>>  >>   >>    implementation-defined, but ends up with one region decoded in
> >>>>  >>   >>    preference to the other.  There is simply no way to reject an
> >>>>  >>   >>    overlapping mapping.
> >>>>  >>   >
> >>>>  >>   >But there is also now simple way to allow them. At least not without
> >>>>  >>   >exposing control about their ordering AND allowing to hook up managing
> >>>>  >>   >code (e.g. of the PCI bridge or the chipset) that controls registrations.
> >>>>  >>
> >>>>  >>   What about memory_region_add_subregion(..., int priority) as I
> >>>>  >>   suggested in another message?
> >>>>  >Haven't saw another message yet, but how caller knows about priority?
> >>>>
> >>>>  The caller is emulating some hub or router and should decide on
> >>>>  priority like real hardware.
> >>>>
> >>>>  For example, piix gives higher priority to the vga window over RAM.
> >>>>
> >>> Hmm, but if a caller of the memory_region_add_subregion() function is a
> >>> device itself how does it know about chipset priorities. All it wants to
> >>> tell to the system is that it is ready to handle mmio access in this phys
> >>> range, but chipset may decide to forward those accesses elsewhere.
> >>
> >> In this case the device would call a chipset function, passing the
> >> memory region as a parameter, and the chipset would call
> >> m_r_add_subregion().
> > But then chipset can resolve all overlapping by itself and register only
> > regions that are actually accessible by a guest software. Also there are
> > devices that on some architectures are accessed through a chipset and on
> > other they resides directly on a system bus. If they will need to call
> > different memory registration api depending on how they are instantiated
> > the code can become messy.
> 
> Devices shall register their regions with the bus. Every device is on
> some bus, so that's not a problem. And we can then provide registration
> handlers at bus level that either implement specific logic or just
> forward the request to the next hierarchy level (default handler).
> 
Yes, I agree with that. I just don't see the need for "priority" parameter
in this model.

--
			Gleb.

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 11:58                         ` Gleb Natapov
@ 2011-05-19 12:02                           ` Avi Kivity
  2011-05-19 12:21                             ` Gleb Natapov
  2011-05-19 12:02                           ` Jan Kiszka
  1 sibling, 1 reply; 187+ messages in thread
From: Avi Kivity @ 2011-05-19 12:02 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Jan Kiszka, qemu-devel

On 05/19/2011 02:58 PM, Gleb Natapov wrote:
> >
> >  Devices shall register their regions with the bus. Every device is on
> >  some bus, so that's not a problem. And we can then provide registration
> >  handlers at bus level that either implement specific logic or just
> >  forward the request to the next hierarchy level (default handler).
> >
> Yes, I agree with that. I just don't see the need for "priority" parameter
> in this model.

Priority allows you to register RAM from 0-EOM and overlay it with the 
ROM and VGA windows as necessary.  It also allows PCI to override RAM 
(or vice versa, however we decide).

Sure, you can let the caller chop the various regions manually in the 
first case, but it's just extra work that can be done in common code.  
And it cannot be done at all for the second case, if RAM overrides PCI 
(the PCI bus doesn't know how to chop BARs).
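
With the memory_region_add_subregion(..., int priority) variant
suggested earlier in the thread, that would look roughly like this
(priority values illustrative):

    /* RAM covers the whole address space at low priority... */
    memory_region_add_subregion(system_mr, 0, &ram_mr, 0);
    /* ...and the VGA/ROM windows overlay it where needed */
    memory_region_add_subregion(system_mr, 0xa0000, &vga_mr, 1);
    memory_region_add_subregion(system_mr, 0xc0000, &rom_mr, 1);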

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 11:58                         ` Gleb Natapov
  2011-05-19 12:02                           ` Avi Kivity
@ 2011-05-19 12:02                           ` Jan Kiszka
  1 sibling, 0 replies; 187+ messages in thread
From: Jan Kiszka @ 2011-05-19 12:02 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Avi Kivity, qemu-devel

On 2011-05-19 13:58, Gleb Natapov wrote:
> On Thu, May 19, 2011 at 01:57:15PM +0200, Jan Kiszka wrote:
>> On 2011-05-19 13:54, Gleb Natapov wrote:
>>> On Thu, May 19, 2011 at 02:44:29PM +0300, Avi Kivity wrote:
>>>> On 05/19/2011 12:14 PM, Gleb Natapov wrote:
>>>>> On Thu, May 19, 2011 at 12:10:38PM +0300, Avi Kivity wrote:
>>>>>>  On 05/19/2011 12:08 PM, Gleb Natapov wrote:
>>>>>>  >On Wed, May 18, 2011 at 06:42:14PM +0300, Avi Kivity wrote:
>>>>>>  >>   On 05/18/2011 06:36 PM, Jan Kiszka wrote:
>>>>>>  >>   >>
>>>>>>  >>   >>    We need to head for the more hardware-like approach.  What happens when
>>>>>>  >>   >>    you program overlapping BARs?  I imagine the result is
>>>>>>  >>   >>    implementation-defined, but ends up with one region decoded in
>>>>>>  >>   >>    preference to the other.  There is simply no way to reject an
>>>>>>  >>   >>    overlapping mapping.
>>>>>>  >>   >
>>>>>>  >>   >But there is also now simple way to allow them. At least not without
>>>>>>  >>   >exposing control about their ordering AND allowing to hook up managing
>>>>>>  >>   >code (e.g. of the PCI bridge or the chipset) that controls registrations.
>>>>>>  >>
>>>>>>  >>   What about memory_region_add_subregion(..., int priority) as I
>>>>>>  >>   suggested in another message?
>>>>>>  >Haven't saw another message yet, but how caller knows about priority?
>>>>>>
>>>>>>  The caller is emulating some hub or router and should decide on
>>>>>>  priority like real hardware.
>>>>>>
>>>>>>  For example, piix gives higher priority to the vga window over RAM.
>>>>>>
>>>>> Hmm, but if a caller of the memory_region_add_subregion() function is a
>>>>> device itself how does it know about chipset priorities. All it wants to
>>>>> tell to the system is that it is ready to handle mmio access in this phys
>>>>> range, but chipset may decide to forward those accesses elsewhere.
>>>>
>>>> In this case the device would call a chipset function, passing the
>>>> memory region as a parameter, and the chipset would call
>>>> m_r_add_subregion().
>>> But then chipset can resolve all overlapping by itself and register only
>>> regions that are actually accessible by a guest software. Also there are
>>> devices that on some architectures are accessed through a chipset and on
>>> other they resides directly on a system bus. If they will need to call
>>> different memory registration api depending on how they are instantiated
>>> the code can become messy.
>>
>> Devices shall register their regions with the bus. Every device is on
>> some bus, so that's not a problem. And we can then provide registration
>> handlers at bus level that either implement specific logic or just
>> forward the request to the next hierarchy level (default handler).
>>
> Yes, I agree with that. I just don't see the need for "priority" parameter
> in this model.

"Controlled" overlapping can indeed simplify the implementation of a
dispatcher that wants to flip between putting some region logically over
an existing, possibly complex layout and disabling this overlay. For
that purpose, the overlay requires a high prio than what is below (using
a default priority).
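
E.g. (sketch, again assuming the proposed priority argument):

    /* enable the overlay: it wins over the default-priority layout */
    memory_region_add_subregion(system_mr, base, &overlay_mr, 1);
    /* ... later ... */
    /* disable it again: the underlying layout becomes visible */
    memory_region_del_subregion(system_mr, base, &overlay_mr);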

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 11:57                       ` Avi Kivity
@ 2011-05-19 12:20                         ` Jan Kiszka
  2011-05-19 12:50                           ` Avi Kivity
  0 siblings, 1 reply; 187+ messages in thread
From: Jan Kiszka @ 2011-05-19 12:20 UTC (permalink / raw)
  To: Avi Kivity; +Cc: qemu-devel, Gleb Natapov

On 2011-05-19 13:57, Avi Kivity wrote:
> On 05/19/2011 02:54 PM, Gleb Natapov wrote:
>>>
>>>  In this case the device would call a chipset function, passing the
>>>  memory region as a parameter, and the chipset would call
>>>  m_r_add_subregion().
>> But then chipset can resolve all overlapping by itself and register only
>> regions that are actually accessible by a guest software.
> 
> Sure it can (and it does now), but it's hard.  This API centralizes the 
> logic, leaving the devices/chipsets to specify what they want.
> 
> For a PC, we have at least two such cases, the ISA bus and the PCI bus.
> 
>>   Also there are
>> devices that on some architectures are accessed through a chipset and on
>> other they resides directly on a system bus. If they will need to call
>> different memory registration api depending on how they are instantiated
>> the code can become messy.
> 
> An example is ne2000-isa and ne2000-pci.  There's no getting around some 
> glue logic, but I think this API minimizes it (you can declare 
> everything about your memory region in common code, the only thing that 
> is different is registration).

If devices register against the corresponding qbus (I expect we'll have
multiple ones in the future), not even that needs to be different.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 12:02                           ` Avi Kivity
@ 2011-05-19 12:21                             ` Gleb Natapov
  0 siblings, 0 replies; 187+ messages in thread
From: Gleb Natapov @ 2011-05-19 12:21 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Jan Kiszka, qemu-devel

On Thu, May 19, 2011 at 03:02:09PM +0300, Avi Kivity wrote:
> On 05/19/2011 02:58 PM, Gleb Natapov wrote:
> >>
> >>  Devices shall register their regions with the bus. Every device is on
> >>  some bus, so that's not a problem. And we can then provide registration
> >>  handlers at bus level that either implement specific logic or just
> >>  forward the request to the next hierarchy level (default handler).
> >>
> >Yes, I agree with that. I just don't see the need for "priority" parameter
> >in this model.
> 
> Priority allows you to register RAM from 0-EOM and overlay it with
> the ROM and VGA windows as necessary.  It also allows PCI to
> override RAM (or vice versa, however we decide).
> 
> Sure, you can let the caller chop the various regions manually in
> the first case, but it's just extra work that can be done in common
> code.  And it cannot be done at all for the second case, if RAM
> overrides PCI (the PCI bus doesn't know how to chop BARs).
> 

In the model described by Jan, no device or bus (except the system bus)
will call the memory API directly. Instead, each device will call the
memory registration function of its bus, the bus will call the memory
registration function of the device that provides the bus, and so on up
to the system bus (maps nicely to qdev!). Each level knows by itself
(and only it really knows!) how to resolve the overlapping, but it can't
provide a meaningful priority to the layer above, since only the layer
above knows the relative priorities between siblings (e.g. only the
chipset knows that it will forward an access to a particular address to
RAM and not PCI, no matter that the PCI bus asked for priority 100 and
RAM asked for 0).

So it boils down to this: since chopping will have to happen at each
level of the hierarchy, I do not see why the lowest level of the
hierarchy should be made special.

--
			Gleb.

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 12:20                         ` Jan Kiszka
@ 2011-05-19 12:50                           ` Avi Kivity
  2011-05-19 12:58                             ` Jan Kiszka
  0 siblings, 1 reply; 187+ messages in thread
From: Avi Kivity @ 2011-05-19 12:50 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: qemu-devel, Gleb Natapov

On 05/19/2011 03:20 PM, Jan Kiszka wrote:
> >
> >  An example is ne2000-isa and ne2000-pci.  There's no getting around some
> >  glue logic, but I think this API minimizes it (you can declare
> >  everything about your memory region in common code, the only thing that
> >  is different is registration).
>
> If devices register against the corresponding qbus (I expect we'll have
> multiple ones in the future), not even that need to be different.

Eventually we may make the memory API a sub-API of qdev.  I don't want 
to start with that however, the change is large enough already.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 12:50                           ` Avi Kivity
@ 2011-05-19 12:58                             ` Jan Kiszka
  2011-05-19 13:00                               ` Avi Kivity
  0 siblings, 1 reply; 187+ messages in thread
From: Jan Kiszka @ 2011-05-19 12:58 UTC (permalink / raw)
  To: Avi Kivity; +Cc: qemu-devel, Gleb Natapov

On 2011-05-19 14:50, Avi Kivity wrote:
> On 05/19/2011 03:20 PM, Jan Kiszka wrote:
>>>
>>>  An example is ne2000-isa and ne2000-pci.  There's no getting around some
>>>  glue logic, but I think this API minimizes it (you can declare
>>>  everything about your memory region in common code, the only thing that
>>>  is different is registration).
>>
>> If devices register against the corresponding qbus (I expect we'll have
>> multiple ones in the future), not even that need to be different.
> 
> Eventually we may make the memory API a sub-API of qdev.  I don't want 
> to start with that however, the change is large enough already.

Touching all devices again at that point to change the way they register
regions may not be the best approach. I would try to consolidate the
refactoring work that affects the majority of device models.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 12:58                             ` Jan Kiszka
@ 2011-05-19 13:00                               ` Avi Kivity
  2011-05-19 13:03                                 ` Jan Kiszka
  0 siblings, 1 reply; 187+ messages in thread
From: Avi Kivity @ 2011-05-19 13:00 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: qemu-devel, Gleb Natapov

On 05/19/2011 03:58 PM, Jan Kiszka wrote:
> >
> >  Eventually we may make the memory API a sub-API of qdev.  I don't want
> >  to start with that however, the change is large enough already.
>
> Touching all devices again at that point to change the way they register
> regions may not be the best approach. I would try to consolidate the
> refactoring work that affects the majority of device models.

The risk is that the entire work will be stalled if it requires too much 
effort.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 13:00                               ` Avi Kivity
@ 2011-05-19 13:03                                 ` Jan Kiszka
  2011-05-19 13:07                                   ` Avi Kivity
  0 siblings, 1 reply; 187+ messages in thread
From: Jan Kiszka @ 2011-05-19 13:03 UTC (permalink / raw)
  To: Avi Kivity; +Cc: qemu-devel, Gleb Natapov

On 2011-05-19 15:00, Avi Kivity wrote:
> On 05/19/2011 03:58 PM, Jan Kiszka wrote:
>>>
>>>  Eventually we may make the memory API a sub-API of qdev.  I don't want
>>>  to start with that however, the change is large enough already.
>>
>> Touching all devices again at that point to change the way they register
>> regions may not be the best approach. I would try to consolidate the
>> refactoring work that affects the majority of device models.
> 
> The risk is that the entire work will be stalled if it requires too much 
> effort.

Then we could still switch one gear down, converting at least some
exemplary devices completely to demonstrate that the API changes fit
into the big picture.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 13:03                                 ` Jan Kiszka
@ 2011-05-19 13:07                                   ` Avi Kivity
  2011-05-19 13:26                                     ` Jan Kiszka
  0 siblings, 1 reply; 187+ messages in thread
From: Avi Kivity @ 2011-05-19 13:07 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: qemu-devel, Gleb Natapov

On 05/19/2011 04:03 PM, Jan Kiszka wrote:
> On 2011-05-19 15:00, Avi Kivity wrote:
> >  On 05/19/2011 03:58 PM, Jan Kiszka wrote:
> >>>
> >>>   Eventually we may make the memory API a sub-API of qdev.  I don't want
> >>>   to start with that however, the change is large enough already.
> >>
> >>  Touching all devices again at that point to change the way they register
> >>  regions may not be the best approach. I would try to consolidate the
> >>  refactoring work that affects the majority of device models.
> >
> >  The risk is that the entire work will be stalled if it requires too much
> >  effort.
>
> Then we could still switch one gear down, converting at least some
> exemplary devices completely to demonstrate that the API changes fit
> into the big picture.

My plan is:

- post RFC v1 with the updated API in patch form
- RFC v2 with implementation + a significant fraction of PC devices converted
- PATCH v3 with full conversion and elimination of the old API

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19  6:31             ` Jan Kiszka
@ 2011-05-19 13:23               ` Anthony Liguori
  2011-05-19 13:25                 ` Jan Kiszka
  2011-05-19 13:26                 ` Avi Kivity
  0 siblings, 2 replies; 187+ messages in thread
From: Anthony Liguori @ 2011-05-19 13:23 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Peter Maydell, Avi Kivity, qemu-devel

On 05/19/2011 01:31 AM, Jan Kiszka wrote:
> On 2011-05-18 21:10, Anthony Liguori wrote:
>> On 05/18/2011 10:30 AM, Jan Kiszka wrote:
>> static void i440fx_io_intercept(void *opaque, uint64_t addr, uint32_t
>> value, int size, MemRegion *next)
>> {
>>      I440FX *s = opaque;
>>
>>      if (range_overlaps(addr, size, PAM_REGION)) {
>>          ...
>>      } else {
>>          dispatch_io(next, addr, value, size);
>>      }
>> }
>>
>> There's no need for an explicit intercept mechanism if you make multiple
>> levels have their own dispatch tables and register progressively larger
>> regions.  In fact....
>
> Actually, things are a bit more complicated: This layer has to properly
> adopt the coalescing properties of underlying regions or we cause
> performance regressions to VGA emulation. That means it has to register
> dispatching slots of the corresponding size and set the coalescing flag
> accordingly. And it likely need to adjust them as the regions below change.

As I mentioned in another thread, I don't think we want to "design" 
coalescing into the API.  Coalescing is something that breaks through 
abstraction layers and is really just a hack.

Regards,

Anthony Liguori

> IOW, I don't think we get away with that simple approach above but still
> require to track mapping via a PhysMemClient. But we should be able to
> avoid filtering by adding overlapping regions with higher prio.
>
> Jan
>

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 13:23               ` Anthony Liguori
@ 2011-05-19 13:25                 ` Jan Kiszka
  2011-05-19 13:26                 ` Avi Kivity
  1 sibling, 0 replies; 187+ messages in thread
From: Jan Kiszka @ 2011-05-19 13:25 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Peter Maydell, Avi Kivity, qemu-devel

On 2011-05-19 15:23, Anthony Liguori wrote:
> On 05/19/2011 01:31 AM, Jan Kiszka wrote:
>> On 2011-05-18 21:10, Anthony Liguori wrote:
>>> On 05/18/2011 10:30 AM, Jan Kiszka wrote:
>>> static void i440fx_io_intercept(void *opaque, uint64_t addr, uint32_t
>>> value, int size, MemRegion *next)
>>> {
>>>      I440FX *s = opaque;
>>>
>>>      if (range_overlaps(addr, size, PAM_REGION)) {
>>>          ...
>>>      } else {
>>>          dispatch_io(next, addr, value, size);
>>>      }
>>> }
>>>
>>> There's no need for an explicit intercept mechanism if you make multiple
>>> levels have their own dispatch tables and register progressively larger
>>> regions.  In fact....
>>
>> Actually, things are a bit more complicated: This layer has to properly
>> adopt the coalescing properties of underlying regions or we cause
>> performance regressions to VGA emulation. That means it has to register
>> dispatching slots of the corresponding size and set the coalescing flag
>> accordingly. And it likely need to adjust them as the regions below
>> change.
> 
> As I mentioned in another thread, I don't think we want to "design"
> coalescing into the API.  Coalescing is something that breaks through
> abstractions layers and is really just a hack.

We depend on it for speed, and even though it has no counterpart in real
HW, we need to design it in, as this example demonstrates. The pain of
not doing so, specifically when we go for hierarchical management, will
be much higher.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 13:07                                   ` Avi Kivity
@ 2011-05-19 13:26                                     ` Jan Kiszka
  2011-05-19 13:30                                       ` Avi Kivity
  0 siblings, 1 reply; 187+ messages in thread
From: Jan Kiszka @ 2011-05-19 13:26 UTC (permalink / raw)
  To: Avi Kivity; +Cc: qemu-devel, Gleb Natapov

On 2011-05-19 15:07, Avi Kivity wrote:
> On 05/19/2011 04:03 PM, Jan Kiszka wrote:
>> On 2011-05-19 15:00, Avi Kivity wrote:
>>>  On 05/19/2011 03:58 PM, Jan Kiszka wrote:
>>>>>
>>>>>   Eventually we may make the memory API a sub-API of qdev.  I don't want
>>>>>   to start with that however, the change is large enough already.
>>>>
>>>>  Touching all devices again at that point to change the way they register
>>>>  regions may not be the best approach. I would try to consolidate the
>>>>  refactoring work that affects the majority of device models.
>>>
>>>  The risk is that the entire work will be stalled if it requires too much
>>>  effort.
>>
>> Then we could still switch one gear down, converting at least some
>> exemplary devices completely to demonstrate that the API changes fit
>> into the big picture.
> 
> My plan is:
> 
> - post RFC v1 with updated API in patch form
> - RFC v2 with implementation + significant fraction of PC devices coverted
> - PATCH v3 with full conversion an elimination of the old API

And when introducing hierarchical registration, we will have to go
through all of this once again. Plus the API may have to be changed
again if it does not fulfill all requirements of the hierarchical region
management. And we have no proof that it allows an efficient core
implementation.

Before touching any device at all, what about building the
infrastructure to manage and address memory regions hierarchically via
qdev first? That could be started on a green field, then applied to the
PC architecture, and finally rolled out for all.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 13:23               ` Anthony Liguori
  2011-05-19 13:25                 ` Jan Kiszka
@ 2011-05-19 13:26                 ` Avi Kivity
  2011-05-19 13:35                   ` Anthony Liguori
  1 sibling, 1 reply; 187+ messages in thread
From: Avi Kivity @ 2011-05-19 13:26 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Peter Maydell, Jan Kiszka, qemu-devel

On 05/19/2011 04:23 PM, Anthony Liguori wrote:
>> Actually, things are a bit more complicated: This layer has to properly
>> adopt the coalescing properties of underlying regions or we cause
>> performance regressions to VGA emulation. That means it has to register
>> dispatching slots of the corresponding size and set the coalescing flag
>> accordingly. And it likely need to adjust them as the regions below 
>> change.
>
>
> As I mentioned in another thread, I don't think we want to "design" 
> coalescing into the API.  Coalescing is something that breaks through 
> abstractions layers and is really just a hack.

It's impossible not to design it into the API.  The layer which wants to 
do coalescing (the device) has no idea if and where its memory is mapped.
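
With this API the device simply marks its own region, e.g. (region name
and sizes illustrative):

    memory_region_init_io(&s->mmio_mr, &e1000_mmio_ops, 0x20000);
    /* coalesce a write-mostly window; valid regardless of where (or
     * whether) the region is currently mapped */
    memory_region_add_coalescing(&s->mmio_mr, 0, 0x1000);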

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 13:26                                     ` Jan Kiszka
@ 2011-05-19 13:30                                       ` Avi Kivity
  2011-05-19 13:43                                         ` Jan Kiszka
  2011-05-19 13:49                                         ` Anthony Liguori
  0 siblings, 2 replies; 187+ messages in thread
From: Avi Kivity @ 2011-05-19 13:30 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: qemu-devel, Gleb Natapov

On 05/19/2011 04:26 PM, Jan Kiszka wrote:
> On 2011-05-19 15:07, Avi Kivity wrote:
> >  On 05/19/2011 04:03 PM, Jan Kiszka wrote:
> >>  On 2011-05-19 15:00, Avi Kivity wrote:
> >>>   On 05/19/2011 03:58 PM, Jan Kiszka wrote:
> >>>>>
> >>>>>    Eventually we may make the memory API a sub-API of qdev.  I don't want
> >>>>>    to start with that however, the change is large enough already.
> >>>>
> >>>>   Touching all devices again at that point to change the way they register
> >>>>   regions may not be the best approach. I would try to consolidate the
> >>>>   refactoring work that affects the majority of device models.
> >>>
> >>>   The risk is that the entire work will be stalled if it requires too much
> >>>   effort.
> >>
> >>  Then we could still switch one gear down, converting at least some
> >>  exemplary devices completely to demonstrate that the API changes fit
> >>  into the big picture.
> >
> >  My plan is:
> >
> >  - post RFC v1 with updated API in patch form
> >  - RFC v2 with implementation + significant fraction of PC devices coverted
> >  - PATCH v3 with full conversion an elimination of the old API
>
> And when introducing hierarchical registration, we will have to go
> through all of this once again. Plus the API may have to be changed
> again if it does not fulfill all requirements of the hierarchical region
> management. And we have no proof that it allows an efficient core
> implementation.

This API *is* hierarchical registration.  v2 will (hopefully) prove that 
it can be done efficiently.

What is missing is to make qdev and this API the same hierarchy.

> Before touching any device at all, what about building the
> infrastructure to manage and address memory regions hierarchically via
> qdev first? That could be started on a green field, then applied to the
> PC architecture, and finally rolled out for all.

I feel a lot less comfortable about it since it introduces more 
variables.  It also means a full conversion is impossible.

While doing it in one pass reduces the total effort, it increases the 
risk IMO.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 13:26                 ` Avi Kivity
@ 2011-05-19 13:35                   ` Anthony Liguori
  2011-05-19 13:36                     ` Jan Kiszka
  2011-05-19 13:39                     ` Avi Kivity
  0 siblings, 2 replies; 187+ messages in thread
From: Anthony Liguori @ 2011-05-19 13:35 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Peter Maydell, Jan Kiszka, qemu-devel

On 05/19/2011 08:26 AM, Avi Kivity wrote:
> On 05/19/2011 04:23 PM, Anthony Liguori wrote:
>>> Actually, things are a bit more complicated: This layer has to properly
>>> adopt the coalescing properties of underlying regions or we cause
>>> performance regressions to VGA emulation. That means it has to register
>>> dispatching slots of the corresponding size and set the coalescing flag
>>> accordingly. And it likely need to adjust them as the regions below
>>> change.
>>
>>
>> As I mentioned in another thread, I don't think we want to "design"
>> coalescing into the API. Coalescing is something that breaks through
>> abstractions layers and is really just a hack.
>
> It's impossible not to design it into the API. The layer which wants to
> do coalescing (the device) has no idea if and where its memory is mapped.

There are two places where coalescing currently matters: VGA and PCI 
devices.  Since VGA is just a special PCI device, let's just focus on 
PCI devices.

The PCI bus knows exactly where the memory is mapped.  Yes, if you have 
complex IOMMU hierarchies it doesn't, but this is my point.  We don't 
need to design coalesced IO to support these sorts of complex hierarchies.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-18 19:40 ` Jan Kiszka
  2011-05-19  8:06   ` Avi Kivity
@ 2011-05-19 13:36   ` Anthony Liguori
  2011-05-19 13:37     ` Jan Kiszka
  1 sibling, 1 reply; 187+ messages in thread
From: Anthony Liguori @ 2011-05-19 13:36 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Avi Kivity, qemu-devel

On 05/18/2011 02:40 PM, Jan Kiszka wrote:
> On 2011-05-18 15:12, Avi Kivity wrote:
>> void cpu_register_memory_region(MemoryRegion *mr, target_phys_addr_t addr);
>
> OK, let's allow overlapping, but make it explicit:
>
> void cpu_register_memory_region_overlap(MemoryRegion *mr,
>                                          target_phys_addr_t addr,
>                                          int priority);

The device doesn't actually know how overlapping is handled.  This is 
based on the bus hierarchy.

Regards,

Anthony Liguori

> We need that ordering, so we need an interface. Regions registered via
> cpu_register_memory_region must not overlap with existing one or we will
> throw an hwerror. And they shall get a low default priority.
>
> Jan
>

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 13:35                   ` Anthony Liguori
@ 2011-05-19 13:36                     ` Jan Kiszka
  2011-05-19 13:43                       ` Avi Kivity
  2011-05-19 13:39                     ` Avi Kivity
  1 sibling, 1 reply; 187+ messages in thread
From: Jan Kiszka @ 2011-05-19 13:36 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Peter Maydell, Avi Kivity, qemu-devel

On 2011-05-19 15:35, Anthony Liguori wrote:
> On 05/19/2011 08:26 AM, Avi Kivity wrote:
>> On 05/19/2011 04:23 PM, Anthony Liguori wrote:
>>>> Actually, things are a bit more complicated: This layer has to properly
>>>> adopt the coalescing properties of underlying regions or we cause
>>>> performance regressions to VGA emulation. That means it has to register
>>>> dispatching slots of the corresponding size and set the coalescing flag
>>>> accordingly. And it likely need to adjust them as the regions below
>>>> change.
>>>
>>>
>>> As I mentioned in another thread, I don't think we want to "design"
>>> coalescing into the API. Coalescing is something that breaks through
>>> abstractions layers and is really just a hack.
>>
>> It's impossible not to design it into the API. The layer which wants to
>> do coalescing (the device) has no idea if and where its memory is mapped.
> 
> There's two places coalescing currently matters: VGA and PCI devices.
> Since VGA is just a special PCI device, let's just focus on PCI devices.

Every frame buffer device, PCI or not, benefits from it. Don't focus on
PCI or x86.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 13:36   ` Anthony Liguori
@ 2011-05-19 13:37     ` Jan Kiszka
  2011-05-19 13:41       ` Avi Kivity
  2011-05-19 17:39       ` Gleb Natapov
  0 siblings, 2 replies; 187+ messages in thread
From: Jan Kiszka @ 2011-05-19 13:37 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Avi Kivity, qemu-devel

On 2011-05-19 15:36, Anthony Liguori wrote:
> On 05/18/2011 02:40 PM, Jan Kiszka wrote:
>> On 2011-05-18 15:12, Avi Kivity wrote:
>>> void cpu_register_memory_region(MemoryRegion *mr, target_phys_addr_t
>>> addr);
>>
>> OK, let's allow overlapping, but make it explicit:
>>
>> void cpu_register_memory_region_overlap(MemoryRegion *mr,
>>                                          target_phys_addr_t addr,
>>                                          int priority);
> 
> The device doesn't actually know how overlapping is handled.  This is
> based on the bus hierarchy.

Devices don't register their regions, buses do.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 13:35                   ` Anthony Liguori
  2011-05-19 13:36                     ` Jan Kiszka
@ 2011-05-19 13:39                     ` Avi Kivity
  1 sibling, 0 replies; 187+ messages in thread
From: Avi Kivity @ 2011-05-19 13:39 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Peter Maydell, Jan Kiszka, qemu-devel

On 05/19/2011 04:35 PM, Anthony Liguori wrote:
> On 05/19/2011 08:26 AM, Avi Kivity wrote:
>> On 05/19/2011 04:23 PM, Anthony Liguori wrote:
>>>> Actually, things are a bit more complicated: This layer has to 
>>>> properly
>>>> adopt the coalescing properties of underlying regions or we cause
>>>> performance regressions to VGA emulation. That means it has to 
>>>> register
>>>> dispatching slots of the corresponding size and set the coalescing 
>>>> flag
>>>> accordingly. And it likely need to adjust them as the regions below
>>>> change.
>>>
>>>
>>> As I mentioned in another thread, I don't think we want to "design"
>>> coalescing into the API. Coalescing is something that breaks through
>>> abstractions layers and is really just a hack.
>>
>> It's impossible not to design it into the API. The layer which wants to
>> do coalescing (the device) has no idea if and where its memory is 
>> mapped.
>
> There's two places coalescing currently matters: VGA and PCI devices. 
> Since VGA is just a special PCI device, let's just focus on PCI devices.
>
> The PCI bus knows exactly where the memory is mapped. 

The PCI bus doesn't know about coalescing.  Only the device does.  Look 
at e1000 for an example.  Cirrus also enables/disables coalescing 
dynamically.
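
E.g. something along these lines (a sketch; the mode predicate is
invented, only the memory_region_*_coalescing calls are from the
proposed API):

    /* re-evaluate coalescing whenever the guest reprograms the device */
    memory_region_clear_coalescing(&s->vram_mr);
    if (cirrus_mode_allows_coalescing(s)) {   /* hypothetical check */
        memory_region_add_coalescing(&s->vram_mr, 0, s->vram_size);
    }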

> Yes, if you have complex IOMMU hierarchies it doesn't, but this is my 
> point.  We don't need to design coalesced IO to support these sort of 
> complex hierarchies.
>

Are we aiming at our own feet again?

Look at the API: it adds three functions that non-users need not care about.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 13:37     ` Jan Kiszka
@ 2011-05-19 13:41       ` Avi Kivity
  2011-05-19 17:39       ` Gleb Natapov
  1 sibling, 0 replies; 187+ messages in thread
From: Avi Kivity @ 2011-05-19 13:41 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: qemu-devel

On 05/19/2011 04:37 PM, Jan Kiszka wrote:
> >  The device doesn't actually know how overlapping is handled.  This is
> >  based on the bus hierarchy.
>
> Devices don't register their regions, buses do.
>

Exactly.  Devices may register sub-regions to describe complex BARs, but 
most will just create a region and hand it off to their bus.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 13:36                     ` Jan Kiszka
@ 2011-05-19 13:43                       ` Avi Kivity
  0 siblings, 0 replies; 187+ messages in thread
From: Avi Kivity @ 2011-05-19 13:43 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Peter Maydell, qemu-devel

On 05/19/2011 04:36 PM, Jan Kiszka wrote:
> >
> >  There's two places coalescing currently matters: VGA and PCI devices.
> >  Since VGA is just a special PCI device, let's just focus on PCI devices.
>
> Every frame buffer device, PCI or not, benefits from it. Don't focus on
> PCI or x86.

Actually, only framebuffers that can't be mapped as RAM benefit from 
it.  The only one I know of is VGA in planar mode (but there may be others).

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 13:30                                       ` Avi Kivity
@ 2011-05-19 13:43                                         ` Jan Kiszka
  2011-05-19 13:47                                           ` Avi Kivity
  2011-05-19 13:49                                         ` Anthony Liguori
  1 sibling, 1 reply; 187+ messages in thread
From: Jan Kiszka @ 2011-05-19 13:43 UTC (permalink / raw)
  To: Avi Kivity; +Cc: qemu-devel, Gleb Natapov

On 2011-05-19 15:30, Avi Kivity wrote:
> On 05/19/2011 04:26 PM, Jan Kiszka wrote:
>> On 2011-05-19 15:07, Avi Kivity wrote:
>>>  On 05/19/2011 04:03 PM, Jan Kiszka wrote:
>>>>  On 2011-05-19 15:00, Avi Kivity wrote:
>>>>>   On 05/19/2011 03:58 PM, Jan Kiszka wrote:
>>>>>>>
>>>>>>>    Eventually we may make the memory API a sub-API of qdev.  I don't want
>>>>>>>    to start with that however, the change is large enough already.
>>>>>>
>>>>>>   Touching all devices again at that point to change the way they register
>>>>>>   regions may not be the best approach. I would try to consolidate the
>>>>>>   refactoring work that affects the majority of device models.
>>>>>
>>>>>   The risk is that the entire work will be stalled if it requires too much
>>>>>   effort.
>>>>
>>>>  Then we could still switch one gear down, converting at least some
>>>>  exemplary devices completely to demonstrate that the API changes fit
>>>>  into the big picture.
>>>
>>>  My plan is:
>>>
>>>  - post RFC v1 with updated API in patch form
>>>  - RFC v2 with implementation + significant fraction of PC devices coverted
>>>  - PATCH v3 with full conversion an elimination of the old API
>>
>> And when introducing hierarchical registration, we will have to go
>> through all of this once again. Plus the API may have to be changed
>> again if it does not fulfill all requirements of the hierarchical region
>> management. And we have no proof that it allows an efficient core
>> implementation.
> 
> This API *is* hierarchical registration.  v2 will (hopefully) prove that 
> it can be done efficiently.

It may support it, but most users will not use it like this. That will
come with the subsequent qdev integration. PCI is just an exception here,
as it already provides some instantiation services.

> 
> What is missing is to make qdev and this API the same hierarchy.
> 
>> Before touching any device at all, what about building the
>> infrastructure to manage and address memory regions hierarchically via
>> qdev first? That could be started on a green field, then applied to the
>> PC architecture, and finally rolled out for all.
> 
> I feel a lot less comfortable about it since it introduces more 
> variables.  It also means a full conversion is impossible.
> 
> While doing it in one pass reduces the total effort, it increases the 
> risk IMO.

If we are sure we can reasonably evolve from this conversion level and
it will still fit the final model, then we may take that route. If you
want to go that way, OK, let's see what v2 will bring. We should use the
meantime to further develop the long-term APIs.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19  9:10               ` Avi Kivity
  2011-05-19  9:14                 ` Gleb Natapov
@ 2011-05-19 13:44                 ` Anthony Liguori
  2011-05-19 13:47                   ` Jan Kiszka
  2011-05-19 13:49                   ` Avi Kivity
  1 sibling, 2 replies; 187+ messages in thread
From: Anthony Liguori @ 2011-05-19 13:44 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Jan Kiszka, qemu-devel, Gleb Natapov

On 05/19/2011 04:10 AM, Avi Kivity wrote:
> On 05/19/2011 12:08 PM, Gleb Natapov wrote:
>> On Wed, May 18, 2011 at 06:42:14PM +0300, Avi Kivity wrote:
>> > On 05/18/2011 06:36 PM, Jan Kiszka wrote:
>> > >>
>> > >> We need to head for the more hardware-like approach. What happens
>> when
>> > >> you program overlapping BARs? I imagine the result is
>> > >> implementation-defined, but ends up with one region decoded in
>> > >> preference to the other. There is simply no way to reject an
>> > >> overlapping mapping.
>> > >
>> > >But there is also now simple way to allow them. At least not without
>> > >exposing control about their ordering AND allowing to hook up managing
>> > >code (e.g. of the PCI bridge or the chipset) that controls
>> registrations.
>> >
>> > What about memory_region_add_subregion(..., int priority) as I
>> > suggested in another message?
>> Haven't saw another message yet, but how caller knows about priority?
>
> The caller is emulating some hub or router and should decide on priority
> like real hardware.
>
> For example, piix gives higher priority to the vga window over RAM.

Well......

The i440fx may direct VGA accesses to RAM depending on the SMM 
registers.  By the time the PIIX gets the I/O request, we're past the 
memory controller.

This is my biggest concern about this whole notion of "priority".  These 
sorts of issues are not dealt with by a simple z-ordering.  There is 
logic in each component that may be arbitrarily complex.

We're going to end up having to dynamically change the "priority" based 
on how registers are programmed.  But priorities are relative, so it's 
unclear to me how the I440FX would prioritize RAM over dispatch to the 
PIIX (for VGA, for instance).

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 13:43                                         ` Jan Kiszka
@ 2011-05-19 13:47                                           ` Avi Kivity
  0 siblings, 0 replies; 187+ messages in thread
From: Avi Kivity @ 2011-05-19 13:47 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: qemu-devel, Gleb Natapov

On 05/19/2011 04:43 PM, Jan Kiszka wrote:
> >
> >  I feel a lot less comfortable about it since it introduces more
> >  variables.  It also means a full conversion is impossible.
> >
> >  While doing it in one pass reduces the total effort, it increases the
> >  risk IMO.
>
> If we are sure we can reasonably evolve from this conversion level and
> it will still fit the final model, then we may take that route. If you
> want to go that way, OK, let's see what v2 will bring. We should use the
> meantime and try to further develop the long-term APIs.
>

Yes.  This way people who are familiar with qdev and non-x86 can look at 
the API and the partial conversion and see if something is missing.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 13:44                 ` Anthony Liguori
@ 2011-05-19 13:47                   ` Jan Kiszka
  2011-05-19 13:50                     ` Anthony Liguori
  2011-05-19 13:49                   ` Avi Kivity
  1 sibling, 1 reply; 187+ messages in thread
From: Jan Kiszka @ 2011-05-19 13:47 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Avi Kivity, Gleb Natapov, qemu-devel

On 2011-05-19 15:44, Anthony Liguori wrote:
> On 05/19/2011 04:10 AM, Avi Kivity wrote:
>> On 05/19/2011 12:08 PM, Gleb Natapov wrote:
>>> On Wed, May 18, 2011 at 06:42:14PM +0300, Avi Kivity wrote:
>>>> On 05/18/2011 06:36 PM, Jan Kiszka wrote:
>>>>>>
>>>>>> We need to head for the more hardware-like approach. What happens
>>> when
>>>>>> you program overlapping BARs? I imagine the result is
>>>>>> implementation-defined, but ends up with one region decoded in
>>>>>> preference to the other. There is simply no way to reject an
>>>>>> overlapping mapping.
>>>>>
>>>>> But there is also now simple way to allow them. At least not without
>>>>> exposing control about their ordering AND allowing to hook up managing
>>>>> code (e.g. of the PCI bridge or the chipset) that controls
>>> registrations.
>>>>
>>>> What about memory_region_add_subregion(..., int priority) as I
>>>> suggested in another message?
>>> Haven't saw another message yet, but how caller knows about priority?
>>
>> The caller is emulating some hub or router and should decide on priority
>> like real hardware.
>>
>> For example, piix gives higher priority to the vga window over RAM.
> 
> Well......
> 
> The i440fx may direct VGA accesses to RAM depending on the SMM 
> registers.  By the time the PIIX gets the I/O request, we're past the 
> memory controller.
> 
> This is my biggest concern about this whole notion of "priority".  These 
> sort of issues are not dealt with by a simple z-ordering.  There is 
> logic in each component that may be arbitrarily complex.
> 
> We're going to end up having to dynamically change the "priority" based 
> how registers are programmed.  But priorities are relative so it's 
> unclear to me how the I440FX would prioritize RAM over dispatch to PIIX 
> (for VGA, for instance).

By creating an extra RAM window region with higher priority than the
underlying mappings.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 13:30                                       ` Avi Kivity
  2011-05-19 13:43                                         ` Jan Kiszka
@ 2011-05-19 13:49                                         ` Anthony Liguori
  2011-05-19 13:53                                           ` Avi Kivity
  1 sibling, 1 reply; 187+ messages in thread
From: Anthony Liguori @ 2011-05-19 13:49 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Jan Kiszka, qemu-devel, Gleb Natapov

On 05/19/2011 08:30 AM, Avi Kivity wrote:
> On 05/19/2011 04:26 PM, Jan Kiszka wrote:
>> On 2011-05-19 15:07, Avi Kivity wrote:

>> And when introducing hierarchical registration, we will have to go
>> through all of this once again. Plus the API may have to be changed
>> again if it does not fulfill all requirements of the hierarchical region
>> management. And we have no proof that it allows an efficient core
>> implementation.
>
> This API *is* hierarchical registration. v2 will (hopefully) prove that
> it can be done efficiently.

We also need hierarchical dispatch.  Priorities are just a weak attempt 
to emulate hierarchical dispatch, and I don't think they are an 
improvement over a single dispatch table.

Hierarchical dispatch is simpler.  You just need a simple list at each bus.

I don't see a strong need to tie anything to qdev here FWIW.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 13:44                 ` Anthony Liguori
  2011-05-19 13:47                   ` Jan Kiszka
@ 2011-05-19 13:49                   ` Avi Kivity
  1 sibling, 0 replies; 187+ messages in thread
From: Avi Kivity @ 2011-05-19 13:49 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Jan Kiszka, qemu-devel, Gleb Natapov

On 05/19/2011 04:44 PM, Anthony Liguori wrote:
>
> The i440fx may direct VGA accesses to RAM depending on the SMM 
> registers.  By the time the PIIX gets the I/O request, we're past the 
> memory controller.
>
> This is my biggest concern about this whole notion of "priority".  
> These sort of issues are not dealt with by a simple z-ordering.  There 
> is logic in each component that may be arbitrarily complex.
>
> We're going to end up having to dynamically change the "priority" 
> based on how registers are programmed.  But priorities are relative so 
> it's unclear to me how the I440FX would prioritize RAM over dispatch 
> to PIIX (for VGA, for instance).

You can change priorities by removing the region and re-adding it with a 
different priority.  In practice I don't think this is ever necessary; 
we'll have fixed priorities and dynamic addition and removal.

For the per-cpu SMM case the only reasonable solution I see is a per-cpu 
memory map (= root region).
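
A minimal sketch of that remove/re-add, assuming the priority-taking
memory_region_add_subregion() variant floated earlier in the thread (the
posted header takes no priority argument, so treat this as hypothetical):

/* Priority is a property of the (parent, subregion) link, so with no
 * set_priority operation the link is simply recreated. */
static void reprioritize(MemoryRegion *parent, MemoryRegion *mr,
                         target_phys_addr_t offset, int new_prio)
{
    memory_region_del_subregion(parent, offset, mr);
    memory_region_add_subregion(parent, offset, mr, new_prio);
}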

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 13:47                   ` Jan Kiszka
@ 2011-05-19 13:50                     ` Anthony Liguori
  2011-05-19 13:55                       ` Jan Kiszka
  2011-05-19 13:55                       ` Avi Kivity
  0 siblings, 2 replies; 187+ messages in thread
From: Anthony Liguori @ 2011-05-19 13:50 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Avi Kivity, Gleb Natapov, qemu-devel

On 05/19/2011 08:47 AM, Jan Kiszka wrote:
> On 2011-05-19 15:44, Anthony Liguori wrote:
>> Well......
>>
>> The i440fx may direct VGA accesses to RAM depending on the SMM
>> registers.  By the time the PIIX gets the I/O request, we're past the
>> memory controller.
>>
>> This is my biggest concern about this whole notion of "priority".  These
>> sort of issues are not dealt with by a simple z-ordering.  There is
>> logic in each component that may be arbitrarily complex.
>>
>> We're going to end up having to dynamically change the "priority" based
>> on how registers are programmed.  But priorities are relative so it's
>> unclear to me how the I440FX would prioritize RAM over dispatch to PIIX
>> (for VGA, for instance).
>
>> But that can be addressed by creating an extra RAM window region with a
>> higher priority than the underlying mappings.

But the i440fx doesn't register the VGA region.  The PIIX3 (ISA bus) 
does, so how does it know what the priority of that mapping is?

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19  8:30                 ` Jan Kiszka
  2011-05-19  8:44                   ` Avi Kivity
@ 2011-05-19 13:52                   ` Anthony Liguori
  2011-05-19 13:56                     ` Avi Kivity
  2011-05-19 13:57                     ` Jan Kiszka
  2011-05-20 17:30                   ` Blue Swirl
  2 siblings, 2 replies; 187+ messages in thread
From: Anthony Liguori @ 2011-05-19 13:52 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Peter Maydell, Avi Kivity, Gleb Natapov, qemu-devel

On 05/19/2011 03:30 AM, Jan Kiszka wrote:
> On 2011-05-19 10:26, Gleb Natapov wrote:
>> On Wed, May 18, 2011 at 09:27:55PM +0200, Jan Kiszka wrote:
>>>> if an I/O is to the APIC page,
>>>>     it's handled by the APIC
>>>
>>> That's not that simple. We need to tell apart:
>>>   - if a cpu issued the request, and which one =>  forward to APIC
>> And cpu mode may affect where access is forwarded to. If cpu is in SMM
>> mode, access to the frame buffer may be forwarded to memory (depends on
>> chipset configuration).
>
> So we have a second use case for CPU-local I/O regions?
>
> I wonder if only a single CPU can enter SMM or if all have to.

For the i440fx, it's a chipset register (not a per-CPU register).

Maybe it's per-CPU on more modern chipsets... I'm not really sure.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 13:49                                         ` Anthony Liguori
@ 2011-05-19 13:53                                           ` Avi Kivity
  2011-05-19 14:15                                             ` Anthony Liguori
  0 siblings, 1 reply; 187+ messages in thread
From: Avi Kivity @ 2011-05-19 13:53 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Jan Kiszka, qemu-devel, Gleb Natapov

On 05/19/2011 04:49 PM, Anthony Liguori wrote:
> On 05/19/2011 08:30 AM, Avi Kivity wrote:
>> On 05/19/2011 04:26 PM, Jan Kiszka wrote:
>>> On 2011-05-19 15:07, Avi Kivity wrote:
>
>>> And when introducing hierarchical registration, we will have to go
>>> through all of this once again. Plus the API may have to be changed
>>> again if it does not fulfill all requirements of the hierarchical 
>>> region
>>> management. And we have no proof that it allows an efficient core
>>> implementation.
>>
>> This API *is* hierarchical registration. v2 will (hopefully) prove that
>> it can be done efficiently.
>
> We also need hierarchical dispatch.  Priorities are just a weak 
> attempt to emulate hierarchical dispatch but I don't think there's an 
> improvement over a single dispatch table.
>
> Hierarchical dispatch is simpler.  You just need a simple list at each 
> bus.
>

The API itself says nothing about whether the hierarchy is evaluated at 
run-time or registration time.  We could easily have the implementation 
walk the memory hierarchy to dispatch an mmio.

However, RAM cannot be dispatched this way (we need to resolve which 
ranges are RAM when the regions are registered, not accessed) so a data 
structure that contains all of the information is mandatory.

Since the first implementation will use the existing 
cpu_register_physical_memory() API as a back-end, we'll end up with a 
flattened dispatch model (which I think is the right thing anyway).

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 13:50                     ` Anthony Liguori
@ 2011-05-19 13:55                       ` Jan Kiszka
  2011-05-19 13:55                       ` Avi Kivity
  1 sibling, 0 replies; 187+ messages in thread
From: Jan Kiszka @ 2011-05-19 13:55 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Avi Kivity, Gleb Natapov, qemu-devel

On 2011-05-19 15:50, Anthony Liguori wrote:
> On 05/19/2011 08:47 AM, Jan Kiszka wrote:
>> On 2011-05-19 15:44, Anthony Liguori wrote:
>>> Well......
>>>
>>> The i440fx may direct VGA accesses to RAM depending on the SMM
>>> registers.  By the time the PIIX gets the I/O request, we're past the
>>> memory controller.
>>>
>>> This is my biggest concern about this whole notion of "priority".  These
>>> sort of issues are not dealt with by a simple z-ordering.  There is
>>> logic in each component that may be arbitrarily complex.
>>>
>>> We're going to end up having to dynamically change the "priority" based
>>> on how registers are programmed.  But priorities are relative so it's
>>> unclear to me how the I440FX would prioritize RAM over dispatch to PIIX
>>> (for VGA, for instance).
>>
>> But that can be addressed by creating an extra RAM window region with a
>> higher priority than the underlying mappings.
> 
> But the i440fx doesn't register the VGA region.  The PIIX3 (ISA bus) 
> does, so how does it know what the priority of that mapping is?

Everything imported from "below" is of default priority for a bridge. So
it just has to add 1 to that prio value (or more if it needs to support
more layers).

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 13:50                     ` Anthony Liguori
  2011-05-19 13:55                       ` Jan Kiszka
@ 2011-05-19 13:55                       ` Avi Kivity
  2011-05-19 18:06                         ` Anthony Liguori
  1 sibling, 1 reply; 187+ messages in thread
From: Avi Kivity @ 2011-05-19 13:55 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Jan Kiszka, qemu-devel, Gleb Natapov

On 05/19/2011 04:50 PM, Anthony Liguori wrote:
>
> But the i440fx doesn't register the VGA region.  The PIIX3 (ISA bus) 
> does, so how does it know what the priority of that mapping is?
>

The PCI bridge also has a say, no?

(and it would be a VGA region over memory, not the other way around).

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 13:52                   ` Anthony Liguori
@ 2011-05-19 13:56                     ` Avi Kivity
  2011-05-19 13:57                     ` Jan Kiszka
  1 sibling, 0 replies; 187+ messages in thread
From: Avi Kivity @ 2011-05-19 13:56 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Peter Maydell, Jan Kiszka, qemu-devel, Gleb Natapov

On 05/19/2011 04:52 PM, Anthony Liguori wrote:
>> So we have a second use case for CPU-local I/O regions?
>>
>> I wonder if only a single CPU can enter SMM or if all have to.
>
>
> For the i440fx, it's a chipset register (not a per-CPU register).
>
> Maybe it's per-CPU on more modern chipsets... I'm not really sure.

IIRC the chipset register is only used during setup, so you can have 
access to this memory without entering SMM mode.  Once you are in SMM 
mode, you don't need to touch anything.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 13:52                   ` Anthony Liguori
  2011-05-19 13:56                     ` Avi Kivity
@ 2011-05-19 13:57                     ` Jan Kiszka
  2011-05-19 14:04                       ` Anthony Liguori
  1 sibling, 1 reply; 187+ messages in thread
From: Jan Kiszka @ 2011-05-19 13:57 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Peter Maydell, Avi Kivity, Gleb Natapov, qemu-devel

On 2011-05-19 15:52, Anthony Liguori wrote:
> On 05/19/2011 03:30 AM, Jan Kiszka wrote:
>> On 2011-05-19 10:26, Gleb Natapov wrote:
>>> On Wed, May 18, 2011 at 09:27:55PM +0200, Jan Kiszka wrote:
>>>>> if an I/O is to the APIC page,
>>>>>     it's handled by the APIC
>>>>
>>>> That's not that simple. We need to tell apart:
>>>>   - if a cpu issued the request, and which one =>  forward to APIC
>>> And cpu mode may affect where access is forwarded to. If cpu is in SMM
>>> mode, access to the frame buffer may be forwarded to memory (depends on
>>> chipset configuration).
>>
>> So we have a second use case for CPU-local I/O regions?
>>
>> I wonder if only a single CPU can enter SMM or if all have to.
> 
> For the i440fx, it's a chipset register (not a per-CPU register).

There are two sources: the chipset register and the mode of the first
CPU. Both things were apparently incorrectly merged into the
minimalistic i440fx model.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19  8:44                   ` Avi Kivity
@ 2011-05-19 13:59                     ` Anthony Liguori
  0 siblings, 0 replies; 187+ messages in thread
From: Anthony Liguori @ 2011-05-19 13:59 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Peter Maydell, Jan Kiszka, qemu-devel, Gleb Natapov

On 05/19/2011 03:44 AM, Avi Kivity wrote:
> On 05/19/2011 11:30 AM, Jan Kiszka wrote:
>> >>
>> >> That's not that simple. We need to tell apart:
>> >> - if a cpu issued the request, and which one => forward to APIC
>> > And cpu mode may affect where access is forwarded to. If cpu is in SMM
>> > mode, access to the frame buffer may be forwarded to memory (depends on
>> > chipset configuration).
>>
>> So we have a second use case for CPU-local I/O regions?
>>
>> I wonder if only a single CPU can enter SMM or if all have to. Right now
>> only the first CPU can switch to that mode, and that affects the
>> behaviour of the chipset /wrt SMRAM mapping. Is that another hack?
>
> It's a hack. SMM is a per-cpu setting. Effectively it's another address
> pin - it changes the meaning of (potentially) all addresses.

Hrm, this may be splitting hairs, but the chipset enables SMM globally. 
  The processor can enable it by raising a pin but there is only a 
single pin.

It may raise/lower the pin on every load/store, but from the chipset's 
perspective, it's a global setting (at least in the i440fx).

Regards,

Anthony Liguori

>

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 13:57                     ` Jan Kiszka
@ 2011-05-19 14:04                       ` Anthony Liguori
  2011-05-19 14:06                         ` Jan Kiszka
  2011-05-19 14:11                         ` Avi Kivity
  0 siblings, 2 replies; 187+ messages in thread
From: Anthony Liguori @ 2011-05-19 14:04 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Peter Maydell, Avi Kivity, Gleb Natapov, qemu-devel

On 05/19/2011 08:57 AM, Jan Kiszka wrote:
> On 2011-05-19 15:52, Anthony Liguori wrote:
>> On 05/19/2011 03:30 AM, Jan Kiszka wrote:
>>> On 2011-05-19 10:26, Gleb Natapov wrote:
>>>> On Wed, May 18, 2011 at 09:27:55PM +0200, Jan Kiszka wrote:
>>>>>> if an I/O is to the APIC page,
>>>>>>      it's handled by the APIC
>>>>>
>>>>> That's not that simple. We need to tell apart:
>>>>>    - if a cpu issued the request, and which one =>   forward to APIC
>>>> And cpu mode may affect where access is forwarded to. If cpu is in SMM
>>>> mode, access to the frame buffer may be forwarded to memory (depends on
>>>> chipset configuration).
>>>
>>> So we have a second use case for CPU-local I/O regions?
>>>
>>> I wonder if only a single CPU can enter SMM or if all have to.
>>
>> For the i440fx, it's a chipset register (not a per-CPU register).
>
> There are two sources: the chipset register and the mode of the first
> CPU. Both things were apparently incorrectly merged into the
> minimalistic i440fx model.

Right, the chipset register is mainly used to program the contents of SMM.

There is a single access pin that has effectively the same semantics as 
setting the chipset register.

It's not a per-CPU setting--that's the point.  You can't have one CPU 
reading SMM memory at exactly the same time as accessing VGA.

But I guess you can never have two simultaneous accesses anyway so 
perhaps it's splitting hairs :-)

Regards,

Anthony Liguori

> Jan
>

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 14:04                       ` Anthony Liguori
@ 2011-05-19 14:06                         ` Jan Kiszka
  2011-05-19 14:11                         ` Avi Kivity
  1 sibling, 0 replies; 187+ messages in thread
From: Jan Kiszka @ 2011-05-19 14:06 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Peter Maydell, Avi Kivity, Gleb Natapov, qemu-devel

On 2011-05-19 16:04, Anthony Liguori wrote:
> On 05/19/2011 08:57 AM, Jan Kiszka wrote:
>> On 2011-05-19 15:52, Anthony Liguori wrote:
>>> On 05/19/2011 03:30 AM, Jan Kiszka wrote:
>>>> On 2011-05-19 10:26, Gleb Natapov wrote:
>>>>> On Wed, May 18, 2011 at 09:27:55PM +0200, Jan Kiszka wrote:
>>>>>>> if an I/O is to the APIC page,
>>>>>>>      it's handled by the APIC
>>>>>>
>>>>>> That's not that simple. We need to tell apart:
>>>>>>    - if a cpu issued the request, and which one =>   forward to APIC
>>>>> And cpu mode may affect where access is forwarded to. If cpu is in SMM
>>>>> mode, access to the frame buffer may be forwarded to memory (depends on
>>>>> chipset configuration).
>>>>
>>>> So we have a second use case for CPU-local I/O regions?
>>>>
>>>> I wonder if only a single CPU can enter SMM or if all have to.
>>>
>>> For the i440fx, it's a chipset register (not a per-CPU register).
>>
>> There are two sources: the chipset register and the mode of the first
>> CPU. Both things were apparently incorrectly merged into the
>> minimalistic i440fx model.
> 
> Right, the chipset register is mainly used to program the contents of SMM.
> 
> There is a single access pin that has effectively the same semantics as 
> setting the chipset register.
> 
> It's not a per-CPU setting--that's the point.  You can't have one CPU 
> reading SMM memory at exactly the same time as accessing VGA.
> 
> But I guess you can never have two simultaneous accesses anyway so 
> perhaps it's splitting hairs :-)

Not sure. If one CPU enters SMM, it must be able to read SMRAM content,
independently of the second CPU - unless there is some magic to stop
that CPU in the meantime.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 14:04                       ` Anthony Liguori
  2011-05-19 14:06                         ` Jan Kiszka
@ 2011-05-19 14:11                         ` Avi Kivity
  2011-05-19 18:18                           ` Anthony Liguori
  1 sibling, 1 reply; 187+ messages in thread
From: Avi Kivity @ 2011-05-19 14:11 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Jan Kiszka, qemu-devel, Gleb Natapov, Peter Maydell

On 05/19/2011 05:04 PM, Anthony Liguori wrote:
>
> Right, the chipset register is mainly used to program the contents of 
> SMM.
>
> There is a single access pin that has effectively the same semantics 
> as setting the chipset register.
>
> It's not a per-CPU setting--that's the point.  You can't have one CPU 
> reading SMM memory at exactly the same time as accessing VGA.
>
> But I guess you can never have two simultaneous accesses anyway so 
> perhaps it's splitting hairs :-)

Exactly - it just works.

btw, a way to implement it would be to have two memory maps, one for SMM 
and one for non-SMM, and select between them based on the CPU mode.
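
A minimal sketch of that selection, with invented names for the per-CPU
state (nothing like this exists in the posted API):

typedef struct CPUMemoryMaps {
    MemoryRegion *smm_root;     /* machine map with SMRAM overlaid */
    MemoryRegion *normal_root;  /* the ordinary machine map */
    bool in_smm;                /* tracks the CPU's current mode */
} CPUMemoryMaps;

static MemoryRegion *cpu_memory_root(CPUMemoryMaps *maps)
{
    /* every dispatch starts from the root matching the CPU mode */
    return maps->in_smm ? maps->smm_root : maps->normal_root;
}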

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 13:53                                           ` Avi Kivity
@ 2011-05-19 14:15                                             ` Anthony Liguori
  2011-05-19 14:20                                               ` Jan Kiszka
  2011-05-19 14:21                                               ` Avi Kivity
  0 siblings, 2 replies; 187+ messages in thread
From: Anthony Liguori @ 2011-05-19 14:15 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Jan Kiszka, qemu-devel, Gleb Natapov

On 05/19/2011 08:53 AM, Avi Kivity wrote:
> On 05/19/2011 04:49 PM, Anthony Liguori wrote:
>> On 05/19/2011 08:30 AM, Avi Kivity wrote:
>>> On 05/19/2011 04:26 PM, Jan Kiszka wrote:
>>>> On 2011-05-19 15:07, Avi Kivity wrote:
>>
>>>> And when introducing hierarchical registration, we will have to go
>>>> through all of this once again. Plus the API may have to be changed
>>>> again if it does not fulfill all requirements of the hierarchical
>>>> region
>>>> management. And we have no proof that it allows an efficient core
>>>> implementation.
>>>
>>> This API *is* hierarchical registration. v2 will (hopefully) prove that
>>> it can be done efficiently.
>>
>> We also need hierarchical dispatch. Priorities are just a weak attempt
>> to emulate hierarchical dispatch but I don't think there's an
>> improvement over a single dispatch table.
>>
>> Hierarchical dispatch is simpler. You just need a simple list at each
>> bus.
>>
>
> The API itself says nothing about whether the hierarchy is evaluated at
> run-time or registration time.

Except for priorities.

If you've got a hierarchy like:

- CPU:0
  - i440fx:1
    - PIIX3:2
      - ISA:3
        - DeviceA
    - PCI:2
      - DeviceB

In your model, the default priorities are as shown, but nothing stops 
DeviceB from registering with a priority of 0, which means it can 
intercept accesses that would normally go to the i440fx.

This is impossible in a hierarchical dispatch model.  There is no 
setting that a PCI device can use to trap accesses that the i440fx would 
normally take.

I don't mind if we don't have hierarchical dispatch to start with, but 
priorities are fundamentally broken.

> We could easily have the implementation
> walk the memory hierarchy to dispatch an mmio.
>
> However, RAM cannot be dispatched this way (we need to resolve which
> ranges are RAM when the regions are registered, not accessed) so a data
> structure that contains all of the information is mandatory.

There is only one device that is capable of affecting the view of 
RAM--the i440fx PMC.  The reason is simple: the i440fx is the thing that 
sends a request from a CPU either to a DIMM or to some device.  It 
doesn't know which device it goes to and it doesn't care.

That's where the RAM mapping lives.  It doesn't matter how the PCI I/O 
window is split up.  You don't need that information to know where RAM is.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 14:15                                             ` Anthony Liguori
@ 2011-05-19 14:20                                               ` Jan Kiszka
  2011-05-19 14:25                                                 ` Anthony Liguori
  2011-05-19 14:21                                               ` Avi Kivity
  1 sibling, 1 reply; 187+ messages in thread
From: Jan Kiszka @ 2011-05-19 14:20 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Avi Kivity, Gleb Natapov, qemu-devel

On 2011-05-19 16:15, Anthony Liguori wrote:
> On 05/19/2011 08:53 AM, Avi Kivity wrote:
>> On 05/19/2011 04:49 PM, Anthony Liguori wrote:
>>> On 05/19/2011 08:30 AM, Avi Kivity wrote:
>>>> On 05/19/2011 04:26 PM, Jan Kiszka wrote:
>>>>> On 2011-05-19 15:07, Avi Kivity wrote:
>>>
>>>>> And when introducing hierarchical registration, we will have to go
>>>>> through all of this once again. Plus the API may have to be changed
>>>>> again if it does not fulfill all requirements of the hierarchical
>>>>> region
>>>>> management. And we have no proof that it allows an efficient core
>>>>> implementation.
>>>>
>>>> This API *is* hierarchical registration. v2 will (hopefully) prove that
>>>> it can be done efficiently.
>>>
>>> We also need hierarchical dispatch. Priorities are just a weak attempt
>>> to emulate hierarchical dispatch but I don't think there's an
>>> improvement over a single dispatch table.
>>>
>>> Hierarchical dispatch is simpler. You just need a simple list at each
>>> bus.
>>>
>>
>> The API itself says nothing about whether the hierarchy is evaluated at
>> run-time or registration time.
> 
> Except for priorities.
> 
> If you've got a hierarchy like:
> 
> - CPU:0
>   - i440fx:1
>     - PIIX3:2
>       - ISA:3
>         - DeviceA
>     - PCI:2
>       - DeviceB
> 
> In your model, the default priorities are as shown, but nothing stops 
> DeviceB from registering with a priority of 0 which means it can 
> intercept accesses that would normally go to the i440fx.

Priorities would be local, so the normal tree would look like this:

 - CPU:0
   - i440fx:0
     - PIIX3:0
       - DeviceA
     - PCI-DeviceB:0

If the i440fx would like to map something different over DeviceA (or
parts of it), it would create a region of prio 1 or higher.

The same would happen at CPU-level with SMRAM when SMM is entered.
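
As an illustration, under the same hypothetical priority-taking
add_subregion variant, the overlay could look like this (all names
invented):

static void i440fx_map_overlay(MemoryRegion *i440fx_mr,
                               MemoryRegion *piix3_mr,
                               MemoryRegion *overlay_mr,
                               target_phys_addr_t piix3_base,
                               target_phys_addr_t overlay_base)
{
    /* everything imported from below sits at the default priority 0 */
    memory_region_add_subregion(i440fx_mr, piix3_base, piix3_mr, 0);
    /* the higher local priority hides the overlapped part of piix3_mr */
    memory_region_add_subregion(i440fx_mr, overlay_base, overlay_mr, 1);
}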

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 14:15                                             ` Anthony Liguori
  2011-05-19 14:20                                               ` Jan Kiszka
@ 2011-05-19 14:21                                               ` Avi Kivity
  1 sibling, 0 replies; 187+ messages in thread
From: Avi Kivity @ 2011-05-19 14:21 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Jan Kiszka, qemu-devel, Gleb Natapov

On 05/19/2011 05:15 PM, Anthony Liguori wrote:
>
> Except for priorities.
>
> If you've got a hierarchy like:
>
> - CPU:0
>  - i440fx:1
>    - PIIX3:2
>      - ISA:3
>        - DeviceA
>    - PCI:2
>      - DeviceB
>
> In your model, the default priorities are as shown, but nothing stops 
> DeviceB from registering with a priority of 0 which means it can 
> intercept accesses that would normally go to the i440fx.

Priorities are only within the parent region.  DeviceB's priorities have 
no effect whatsoever since it is an only child.

(and btw the intent is to have higher priorities hide lower priorities).

The model is that of a nested window system with transparent windows.

>
> This is impossible in a hierarchical dispatch model.  There is no 
> setting that a PCI device can use to trap accesses that the i440fx 
> would normally take.
>
> I don't mind if we don't have hierarchical dispatch to start with, but 
> priorities are fundamentally broken.

Are not.

>
>> We could easily have the implementation
>> walk the memory hierarchy to dispatch an mmio.
>>
>> However, RAM cannot be dispatched this way (we need to resolve which
>> ranges are RAM when the regions are registered, not accessed) so a data
>> structure that contains all of the information is mandatory.
>
> There is only one device that is capable of affecting the view of 
> RAM--the i440fx PMC.  The reason is simple: the i440fx is the thing 
> that sends a request from a CPU either to a DIMM or to some device.  
> It doesn't know which device it goes to and it doesn't care.
>
> That's where the RAM mapping lives.  It doesn't matter how the PCI I/O 
> window is split up.  You don't need that information to know where RAM 
> is.

There is also RAM in the framebuffer and the vga windows.  It moves around.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 14:20                                               ` Jan Kiszka
@ 2011-05-19 14:25                                                 ` Anthony Liguori
  2011-05-19 14:28                                                   ` Jan Kiszka
  0 siblings, 1 reply; 187+ messages in thread
From: Anthony Liguori @ 2011-05-19 14:25 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Avi Kivity, Gleb Natapov, qemu-devel

On 05/19/2011 09:20 AM, Jan Kiszka wrote:
> On 2011-05-19 16:15, Anthony Liguori wrote:
>> On 05/19/2011 08:53 AM, Avi Kivity wrote:
>>> On 05/19/2011 04:49 PM, Anthony Liguori wrote:
>>>> On 05/19/2011 08:30 AM, Avi Kivity wrote:
>>>>> On 05/19/2011 04:26 PM, Jan Kiszka wrote:
>>>>>> On 2011-05-19 15:07, Avi Kivity wrote:
>>>>
>>>>>> And when introducing hierarchical registration, we will have to go
>>>>>> through all of this once again. Plus the API may have to be changed
>>>>>> again if it does not fulfill all requirements of the hierarchical
>>>>>> region
>>>>>> management. And we have no proof that it allows an efficient core
>>>>>> implementation.
>>>>>
>>>>> This API *is* hierarchical registration. v2 will (hopefully) prove that
>>>>> it can be done efficiently.
>>>>
>>>> We also need hierarchical dispatch. Priorities are just a weak attempt
>>>> to emulate hierarchical dispatch but I don't think there's an
>>>> improvement over a single dispatch table.
>>>>
>>>> Hierarchical dispatch is simpler. You just need a simple list at each
>>>> bus.
>>>>
>>>
>>> The API itself says nothing about whether the hierarchy is evaluated at
>>> run-time or registration time.
>>
>> Except for priorities.
>>
>> If you've got a hierarchy like:
>>
>> - CPU:0
>>    - i440fx:1
>>      - PIIX3:2
>>        - ISA:3
>>          - DeviceA
>>      - PCI:2
>>        - DeviceB
>>
>> In your model, the default priorities are as shown, but nothing stops
>> DeviceB from registering with a priority of 0 which means it can
>> intercept accesses that would normally go to the i440fx.
>
> Priorities would be local, so the normal tree would look like this:
>
>   - CPU:0
>     - i440fx:0
>       - PIIX3:0
>         - DeviceA
>       - PCI-DeviceB:0
>
> If the i440fx would like to map something different over DeviceA (or
> parts of it), it would create a region of prio 1 or higher.

If it's local, then you need a local dispatch table, no?

Regards,

Anthony Liguori

> The same would happen at CPU-level with SMRAM when SMM is entered.
>
> Jan
>

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 14:25                                                 ` Anthony Liguori
@ 2011-05-19 14:28                                                   ` Jan Kiszka
  2011-05-19 14:31                                                     ` Avi Kivity
  2011-05-19 14:37                                                     ` Anthony Liguori
  0 siblings, 2 replies; 187+ messages in thread
From: Jan Kiszka @ 2011-05-19 14:28 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Avi Kivity, Gleb Natapov, qemu-devel

On 2011-05-19 16:25, Anthony Liguori wrote:
> On 05/19/2011 09:20 AM, Jan Kiszka wrote:
>> On 2011-05-19 16:15, Anthony Liguori wrote:
>>> On 05/19/2011 08:53 AM, Avi Kivity wrote:
>>>> On 05/19/2011 04:49 PM, Anthony Liguori wrote:
>>>>> On 05/19/2011 08:30 AM, Avi Kivity wrote:
>>>>>> On 05/19/2011 04:26 PM, Jan Kiszka wrote:
>>>>>>> On 2011-05-19 15:07, Avi Kivity wrote:
>>>>>
>>>>>>> And when introducing hierarchical registration, we will have to go
>>>>>>> through all of this once again. Plus the API may have to be changed
>>>>>>> again if it does not fulfill all requirements of the hierarchical
>>>>>>> region
>>>>>>> management. And we have no proof that it allows an efficient core
>>>>>>> implementation.
>>>>>>
>>>>>> This API *is* hierarchical registration. v2 will (hopefully) prove that
>>>>>> it can be done efficiently.
>>>>>
>>>>> We also need hierarchical dispatch. Priorities are just a weak attempt
>>>>> to emulate hierarchical dispatch but I don't think there's an
>>>>> improvement over a single dispatch table.
>>>>>
>>>>> Hierarchical dispatch is simpler. You just need a simple list at each
>>>>> bus.
>>>>>
>>>>
>>>> The API itself says nothing about whether the hierarchy is evaluated at
>>>> run-time or registration time.
>>>
>>> Except for priorities.
>>>
>>> If you've got a hierarchy like:
>>>
>>> - CPU:0
>>>    - i440fx:1
>>>      - PIIX3:2
>>>        - ISA:3
>>>          - DeviceA
>>>      - PCI:2
>>>        - DeviceB
>>>
>>> In your model, the default priorities are as shown, but nothing stops
>>> DeviceB from registering with a priority of 0 which means it can
>>> intercept accesses that would normally go to the i440fx.
>>
>> Priorities would be local, so the normal tree would look like this:
>>
>>   - CPU:0
>>     - i440fx:0
>>       - PIIX3:0
>>         - DeviceA
>>       - PCI-DeviceB:0
>>
>> If the i440fx would like to map something different over DeviceA (or
>> parts of it), it would create a region of prio 1 or higher.
> 
> If it's local, then you need a local dispatch table, no?

Not working for the coalescing reason pointed out before. It's also more
handy to rely on the core to do the proper dispatching than to write your
own logic over and over again. The core has to deal with overlapping anyway.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 14:28                                                   ` Jan Kiszka
@ 2011-05-19 14:31                                                     ` Avi Kivity
  2011-05-19 14:37                                                     ` Anthony Liguori
  1 sibling, 0 replies; 187+ messages in thread
From: Avi Kivity @ 2011-05-19 14:31 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Gleb Natapov, qemu-devel

On 05/19/2011 05:28 PM, Jan Kiszka wrote:
> >>
> >>  Priorities would be local, so the normal tree would look like this:
> >>
> >>    - CPU:0
> >>      - i440fx:0
> >>        - PIIX3:0
> >>          - DeviceA
> >>        - PCI-DeviceB:0
> >>
> >>  If the i440fx would like to map something different over DeviceA (or
> >>  parts of it), it would create a region of prio 1 or higher.
> >
> >  If it's local, then you need a local dispatch table, no?
>
> Not working for the coalescing reason pointed out before.

+ RAM (including framebuffers)

> It's also more
> handy to rely on the core to do the proper dispatching than to write your
> own logic over and over again. The core has to deal with overlapping anyway.

Plus, to reiterate, if you have the information you can calculate a 
flattened global dispatch table taking into account all offsets and 
priorities.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 14:28                                                   ` Jan Kiszka
  2011-05-19 14:31                                                     ` Avi Kivity
@ 2011-05-19 14:37                                                     ` Anthony Liguori
  2011-05-19 14:40                                                       ` Avi Kivity
  1 sibling, 1 reply; 187+ messages in thread
From: Anthony Liguori @ 2011-05-19 14:37 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Avi Kivity, Gleb Natapov, qemu-devel

On 05/19/2011 09:28 AM, Jan Kiszka wrote:
> On 2011-05-19 16:25, Anthony Liguori wrote:
>> On 05/19/2011 09:20 AM, Jan Kiszka wrote:
>>> On 2011-05-19 16:15, Anthony Liguori wrote:
>>> Priorities would be local, so the normal tree would look like this:
>>>
>>>    - CPU:0
>>>      - i440fx:0
>>>        - PIIX3:0
>>>          - DeviceA
>>>        - PCI-DeviceB:0
>>>
>>> If the i440fx would like to map something different over DeviceA (or
>>> parts of it), it would create a region of prio 1 or higher.
>>
>> If it's local, then you need a local dispatch table, no?
>
> Not working for the coalescing reason pointed out before. It's also more
> handy to rely on the core to do the proper dispatching than to write your
> own logic over and over again. The core has to deal with overlapping anyway.

So....  do you do:

static void isa_register_region(ISABus *bus, MemoryRegion *mr, int priority)
{
    /* hand the region up one level, bumping the local priority */
    chipset_register_region(bus->chipset, mr, priority + 1);
}

I don't really understand how you can fold everything into one table and 
not allow devices to override their parents using priorities.

Regards,

Anthony Liguori

>
> Jan
>

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 14:37                                                     ` Anthony Liguori
@ 2011-05-19 14:40                                                       ` Avi Kivity
  2011-05-19 16:17                                                         ` Gleb Natapov
                                                                           ` (2 more replies)
  0 siblings, 3 replies; 187+ messages in thread
From: Avi Kivity @ 2011-05-19 14:40 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Jan Kiszka, qemu-devel, Gleb Natapov

On 05/19/2011 05:37 PM, Anthony Liguori wrote:
>
> So....  do you do:
>
> isa_register_region(ISABus *bus, MemoryRegion *mr, int priority)
> {
>     chipset_register_region(bus->chipset, mr, priority + 1);
> }
>
> I don't really understand how you can fold everything into one table 
> and not allow devices to override their parents using priorities.

Think of how a window manager folds windows with priorities onto a flat 
framebuffer.

You do a depth-first walk of the tree.  For each child list, you iterate 
it from the lowest to highest priority, allowing later subregions to 
override earlier subregions.
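
A toy version of that walk, with invented types and a page-granular flat
table (container regions are assumed to have no contents of their own):

#include <stdint.h>

#define PAGE_BITS 12

typedef struct Region Region;
struct Region {
    uint64_t offset, size;  /* placement within the parent */
    Region **children;      /* kept sorted lowest -> highest priority */
    int nb_children;
    void *target;           /* leaf regions: the dispatch target */
};

/* Paint 'r', placed at absolute address 'base', into 'flat'.  Children
 * are visited lowest-to-highest priority, so a later (higher-priority)
 * sibling overwrites whatever an earlier one painted. */
static void flatten(Region *r, uint64_t base, void **flat)
{
    if (r->nb_children == 0) {
        for (uint64_t a = 0; a < r->size; a += 1ULL << PAGE_BITS) {
            flat[(base + a) >> PAGE_BITS] = r->target;
        }
        return;
    }
    for (int i = 0; i < r->nb_children; i++) {
        flatten(r->children[i], base + r->children[i]->offset, flat);
    }
}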

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19  9:24               ` Edgar E. Iglesias
@ 2011-05-19 14:49                 ` Peter Maydell
  0 siblings, 0 replies; 187+ messages in thread
From: Peter Maydell @ 2011-05-19 14:49 UTC (permalink / raw)
  To: Edgar E. Iglesias; +Cc: Jan Kiszka, Avi Kivity, Gleb Natapov, qemu-devel

On 19 May 2011 10:24, Edgar E. Iglesias <edgar.iglesias@gmail.com> wrote:
> On the CPU local aspect, I think it is increasingly common in the
> embedded space to see local busses with CPU local peripherals in
> addition to the "system" bus with "global" peripherals.

Yes: newer ARM cores have per-CPU builtin peripherals (timers
and the like). The interrupt controller also typically has some
registers which are per-CPU and some which are global. At the
moment these are rather weirdly placed outside the CPU and it's
the obligation of the board model to instantiate them.

(Cache modelling would also require a bit more care about the
distinction between which core made a memory request and where
per-core peripherals live in the cache hierarchy.)

The other sort-of-per-core thing is that for TrustZone (which
we don't currently support but might want to) if a core is in
'secure' mode it can potentially see a completely different memory
map. [In hardware this works by the secure/nonsecure bit being
passed around with memory transactions and devices or fabric
behaving differently depending on its value.]

It would also be nice if the APIs supported more heterogeneous setups
(for instance a VersatileExpress model where we model a quad-core A9
main CPU and also the M3 microcontroller that does system control and
has a completely different view of the world).

> Another thing that was discussed was the ability for devices to know
> who is accessing them; I think this is uncommon but it does still
> exist.

Yes, again the ARM GIC is an example. At the moment this is
implemented by the GIC implementation looking at
cpu_single_env->cpu_index.

Debug setups also sometimes have magic peripherals that behave
differently depending on who is accessing them.

-- PMM

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 14:40                                                       ` Avi Kivity
@ 2011-05-19 16:17                                                         ` Gleb Natapov
  2011-05-19 16:25                                                           ` Jan Kiszka
  2011-05-19 16:27                                                         ` Gleb Natapov
  2011-05-19 16:32                                                         ` Anthony Liguori
  2 siblings, 1 reply; 187+ messages in thread
From: Gleb Natapov @ 2011-05-19 16:17 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Jan Kiszka, qemu-devel

On Thu, May 19, 2011 at 05:40:50PM +0300, Avi Kivity wrote:
> On 05/19/2011 05:37 PM, Anthony Liguori wrote:
> >
> >So....  do you do:
> >
> >isa_register_region(ISABus *bus, MemoryRegion *mr, int priority)
> >{
> >    chipset_register_region(bus->chipset, mr, priority + 1);
> >}
> >
> >I don't really understand how you can fold everything into one
> >table and not allow devices to override their parents using
> >priorities.
> 
> Think of how a window manager folds windows with priorities onto a
> flat framebuffer.
> 
> You do a depth-first walk of the tree.  For each child list, you
> iterate it from the lowest to highest priority, allowing later
> subregions override earlier subregions.
> 
And how do you set those priorities in a sensible way? Why would two
devices on a PCI bus want to register their memory regions with anything
but the highest priority? And if you let the PCI subsystem assign
priorities, how will it coordinate with the ISA subsystem/memory
controller on what priorities to assign to get a meaningful system?

--
			Gleb.

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 16:17                                                         ` Gleb Natapov
@ 2011-05-19 16:25                                                           ` Jan Kiszka
  2011-05-19 16:28                                                             ` Gleb Natapov
  0 siblings, 1 reply; 187+ messages in thread
From: Jan Kiszka @ 2011-05-19 16:25 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Avi Kivity, qemu-devel

On 2011-05-19 18:17, Gleb Natapov wrote:
> On Thu, May 19, 2011 at 05:40:50PM +0300, Avi Kivity wrote:
>> On 05/19/2011 05:37 PM, Anthony Liguori wrote:
>>>
>>> So....  do you do:
>>>
>>> isa_register_region(ISABus *bus, MemoryRegion *mr, int priority)
>>> {
>>>    chipset_register_region(bus->chipset, mr, priority + 1);
>>> }
>>>
>>> I don't really understand how you can fold everything into one
>>> table and not allow devices to override their parents using
>>> priorities.
>>
>> Think of how a window manager folds windows with priorities onto a
>> flat framebuffer.
>>
>> You do a depth-first walk of the tree.  For each child list, you
>> iterate it from the lowest to highest priority, allowing later
>> subregions override earlier subregions.
>>
> And how do you set those priorities in a sensible way? Why would two
> devices on a PCI bus want to register their memory regions with anything
> but the highest priority? And if you let the PCI subsystem assign
> priorities, how will it coordinate with the ISA subsystem/memory
> controller on what priorities to assign to get a meaningful system?

Priorities >default will only be used for explicit overlays, e.g. RAM
over MMIO in PAM regions. Non-default priorities won't be assigned to
normal PCI BARs or any other device's region.
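
For illustration, a PAM-style toggle might then look like this
(hypothetical helper; the 0xc0000 range and the priority argument are
assumptions, not part of the posted API):

static void pam_set_ram(MemoryRegion *system, MemoryRegion *pam_ram,
                        bool ram_enabled)
{
    if (ram_enabled) {
        /* priority 1 hides the default-priority ROM/MMIO underneath */
        memory_region_add_subregion(system, 0xc0000, pam_ram, 1);
    } else {
        /* deleting the overlay uncovers the underlying mapping again */
        memory_region_del_subregion(system, 0xc0000, pam_ram);
    }
}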

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 14:40                                                       ` Avi Kivity
  2011-05-19 16:17                                                         ` Gleb Natapov
@ 2011-05-19 16:27                                                         ` Gleb Natapov
  2011-05-20  8:59                                                           ` Avi Kivity
  2011-05-19 16:32                                                         ` Anthony Liguori
  2 siblings, 1 reply; 187+ messages in thread
From: Gleb Natapov @ 2011-05-19 16:27 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Jan Kiszka, qemu-devel

On Thu, May 19, 2011 at 05:40:50PM +0300, Avi Kivity wrote:
> On 05/19/2011 05:37 PM, Anthony Liguori wrote:
> >
> >So....  do you do:
> >
> >isa_register_region(ISABus *bus, MemoryRegion *mr, int priority)
> >{
> >    chipset_register_region(bus->chipset, mr, priority + 1);
> >}
> >
> >I don't really understand how you can fold everything into one
> >table and not allow devices to override their parents using
> >priorities.
> 
> Think of how a window manager folds windows with priorities onto a
> flat framebuffer.
> 
> You do a depth-first walk of the tree.  For each child list, you
> iterate it from the lowest to highest priority, allowing later
> subregions override earlier subregions.
> 
I do not think that a window manager is a good analogy. A window can
only overlap with its siblings. In our memory tree each final node may
overlap with any other node in the tree.
 
--
			Gleb.

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 16:25                                                           ` Jan Kiszka
@ 2011-05-19 16:28                                                             ` Gleb Natapov
  2011-05-19 16:30                                                               ` Jan Kiszka
  0 siblings, 1 reply; 187+ messages in thread
From: Gleb Natapov @ 2011-05-19 16:28 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Avi Kivity, qemu-devel

On Thu, May 19, 2011 at 06:25:14PM +0200, Jan Kiszka wrote:
> On 2011-05-19 18:17, Gleb Natapov wrote:
> > On Thu, May 19, 2011 at 05:40:50PM +0300, Avi Kivity wrote:
> >> On 05/19/2011 05:37 PM, Anthony Liguori wrote:
> >>>
> >>> So....  do you do:
> >>>
> >>> isa_register_region(ISABus *bus, MemoryRegion *mr, int priority)
> >>> {
> >>>    chipset_register_region(bus->chipset, mr, priority + 1);
> >>> }
> >>>
> >>> I don't really understand how you can fold everything into one
> >>> table and not allow devices to override their parents using
> >>> priorities.
> >>
> >> Think of how a window manager folds windows with priorities onto a
> >> flat framebuffer.
> >>
> >> You do a depth-first walk of the tree.  For each child list, you
> >> iterate it from the lowest to highest priority, allowing later
> >> subregions override earlier subregions.
> >>
> > And how do you set those priorities in a sensible way? Why would two
> > devices on a PCI bus want to register their memory regions with anything
> > but the highest priority? And if you let the PCI subsystem assign
> > priorities, how will it coordinate with the ISA subsystem/memory
> > controller on what priorities to assign to get a meaningful system?
> 
> Priorities >default will only be used for explicit overlays, e.g. RAM
> over MMIO in PAM regions. Non-default priorities won't be assigned to
> normal PCI bars or any other device's region.
> 
That does not explain who assigns those priorities, and how, in a
globally meaningful way.

--
			Gleb.

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 16:28                                                             ` Gleb Natapov
@ 2011-05-19 16:30                                                               ` Jan Kiszka
  2011-05-19 16:36                                                                 ` Anthony Liguori
  2011-05-19 16:43                                                                 ` Gleb Natapov
  0 siblings, 2 replies; 187+ messages in thread
From: Jan Kiszka @ 2011-05-19 16:30 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Avi Kivity, qemu-devel

On 2011-05-19 18:28, Gleb Natapov wrote:
> On Thu, May 19, 2011 at 06:25:14PM +0200, Jan Kiszka wrote:
>> On 2011-05-19 18:17, Gleb Natapov wrote:
>>> On Thu, May 19, 2011 at 05:40:50PM +0300, Avi Kivity wrote:
>>>> On 05/19/2011 05:37 PM, Anthony Liguori wrote:
>>>>>
>>>>> So....  do you do:
>>>>>
>>>>> isa_register_region(ISABus *bus, MemoryRegion *mr, int priority)
>>>>> {
>>>>>    chipset_register_region(bus->chipset, mr, priority + 1);
>>>>> }
>>>>>
>>>>> I don't really understand how you can fold everything into one
>>>>> table and not allow devices to override their parents using
>>>>> priorities.
>>>>
>>>> Think of how a window manager folds windows with priorities onto a
>>>> flat framebuffer.
>>>>
>>>> You do a depth-first walk of the tree.  For each child list, you
>>>> iterate it from the lowest to highest priority, allowing later
>>>> subregions override earlier subregions.
>>>>
>>> And how do you set those priorities in a sensible way? Why would two
>>> devices on a PCI bus want to register their memory regions with anything
>>> but the highest priority? And if you let the PCI subsystem assign
>>> priorities, how will it coordinate with the ISA subsystem/memory
>>> controller on what priorities to assign to get a meaningful system?
>>
>> Priorities >default will only be used for explicit overlays, e.g. RAM
>> over MMIO in PAM regions. Non-default priorities won't be assigned to
>> normal PCI bars or any other device's region.
>>
> That does not explain who assigns those priorities, and how, in a
> globally meaningful way.

There are no global priorities. Priorities are only used inside each
level of the memory region hierarchy to generate a resulting, flattened
view for the next higher level. At that level, everything imported from
below has the default prio again, i.e. the lowest one.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 14:40                                                       ` Avi Kivity
  2011-05-19 16:17                                                         ` Gleb Natapov
  2011-05-19 16:27                                                         ` Gleb Natapov
@ 2011-05-19 16:32                                                         ` Anthony Liguori
  2011-05-19 16:35                                                           ` Jan Kiszka
  2011-05-20  9:01                                                           ` Avi Kivity
  2 siblings, 2 replies; 187+ messages in thread
From: Anthony Liguori @ 2011-05-19 16:32 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Jan Kiszka, qemu-devel, Gleb Natapov

On 05/19/2011 09:40 AM, Avi Kivity wrote:
> On 05/19/2011 05:37 PM, Anthony Liguori wrote:
>>
>> So.... do you do:
>>
>> isa_register_region(ISABus *bus, MemoryRegion *mr, int priority)
>> {
>> chipset_register_region(bus->chipset, mr, priority + 1);
>> }
>>
>> I don't really understand how you can fold everything into one table
>> and not allow devices to override their parents using priorities.
>
> Think of how a window manager folds windows with priorities onto a flat
> framebuffer.
>
> You do a depth-first walk of the tree. For each child list, you iterate
> it from the lowest to highest priority, allowing later subregions
> override earlier subregions.
>

Okay, but this doesn't explain how you'll let RAM override the VGA 
mapping since RAM is not represented in the same child list as VGA (RAM 
is a child of the PMC whereas VGA is a child of ISA/PCI, both of which 
are at least one level removed from the PMC).

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 16:32                                                         ` Anthony Liguori
@ 2011-05-19 16:35                                                           ` Jan Kiszka
  2011-05-19 16:38                                                             ` Anthony Liguori
  2011-05-20  9:01                                                           ` Avi Kivity
  1 sibling, 1 reply; 187+ messages in thread
From: Jan Kiszka @ 2011-05-19 16:35 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Avi Kivity, Gleb Natapov, qemu-devel

On 2011-05-19 18:32, Anthony Liguori wrote:
> On 05/19/2011 09:40 AM, Avi Kivity wrote:
>> On 05/19/2011 05:37 PM, Anthony Liguori wrote:
>>>
>>> So.... do you do:
>>>
>>> isa_register_region(ISABus *bus, MemoryRegion *mr, int priority)
>>> {
>>> chipset_register_region(bus->chipset, mr, priority + 1);
>>> }
>>>
>>> I don't really understand how you can fold everything into one table
>>> and not allow devices to override their parents using priorities.
>>
>> Think of how a window manager folds windows with priorities onto a flat
>> framebuffer.
>>
>> You do a depth-first walk of the tree. For each child list, you iterate
>> it from the lowest to highest priority, allowing later subregions
>> override earlier subregions.
>>
> 
> Okay, but this doesn't explain how you'll let RAM override the VGA 
> mapping since RAM is not represented in the same child list as VGA (RAM 
> is a child of the PMC whereas VGA is a child of ISA/PCI, both of which 
> are at least one level removed from the PMC).

You can always create a new memory region with higher priority, pointing
to the RAM window you want to have above VGA. That's what we do today as
well, just with different effects on the internal representation.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 16:30                                                               ` Jan Kiszka
@ 2011-05-19 16:36                                                                 ` Anthony Liguori
  2011-05-19 16:49                                                                   ` Jan Kiszka
  2011-05-20  8:56                                                                   ` Avi Kivity
  2011-05-19 16:43                                                                 ` Gleb Natapov
  1 sibling, 2 replies; 187+ messages in thread
From: Anthony Liguori @ 2011-05-19 16:36 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Avi Kivity, Gleb Natapov, qemu-devel

On 05/19/2011 11:30 AM, Jan Kiszka wrote:
> On 2011-05-19 18:28, Gleb Natapov wrote:
>> On Thu, May 19, 2011 at 06:25:14PM +0200, Jan Kiszka wrote:
>>> On 2011-05-19 18:17, Gleb Natapov wrote:
>>>> On Thu, May 19, 2011 at 05:40:50PM +0300, Avi Kivity wrote:
>>>>> On 05/19/2011 05:37 PM, Anthony Liguori wrote:
>>>>>>
>>>>>> So....  do you do:
>>>>>>
>>>>>> isa_register_region(ISABus *bus, MemoryRegion *mr, int priority)
>>>>>> {
>>>>>>     chipset_register_region(bus->chipset, mr, priority + 1);
>>>>>> }
>>>>>>
>>>>>> I don't really understand how you can fold everything into one
>>>>>> table and not allow devices to override their parents using
>>>>>> priorities.
>>>>>
>>>>> Think of how a window manager folds windows with priorities onto a
>>>>> flat framebuffer.
>>>>>
>>>>> You do a depth-first walk of the tree.  For each child list, you
>>>>> iterate it from the lowest to highest priority, allowing later
>>>>> subregions override earlier subregions.
>>>>>
>>>> And how do you set those priorities in a sensible way? Why would two
>>>> devices on a PCI bus want to register their memory regions with anything
>>>> but the highest priority? And if you let the PCI subsystem assign
>>>> priorities, how will it coordinate with the ISA subsystem/memory
>>>> controller on what priorities to assign to get a meaningful system?
>>>
>>> Priorities >default will only be used for explicit overlays, e.g. RAM
>>> over MMIO in PAM regions. Non-default priorities won't be assigned to
>>> normal PCI bars or any other device's region.
>>>
>> That does not explain who and how assign those priorities in globally
>> meaningful way.
>
> There are no global priorities. Priorities are only used inside each
> level of the memory region hierarchy to generate a resulting, flattened
> view for the next higher level. At that level, everything imported from
> below has the default prio again, ie. the lowest one.

Then SMM is impossible.

Why do we need priorities at all?  There should be no overlap at each 
level in the hierarchy.

If you have overlapping BARs, the PCI bus will always send the request 
to a single device based on something that's implementation specific. 
This works because each PCI device advertises the BAR locations and 
sizes in its config space.

To dispatch a request, the PCI bus will walk the config space to find a 
match.  If you remove something that was previously causing an overlap, 
the other device will now get the I/O requests.

To model this correctly, you need to let the PCI bus decide how to 
dispatch I/O requests (again, you need hierarchical dispatch).

In the absence of this, the PCI bus needs to look at all of the devices, 
figure out the flat mapping, and register it.  When a device is added or 
removed, it needs to recalculate the flat mapping and register it.

There is no need to have centralized logic to decide this.
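
As a rough sketch of that kind of hierarchical dispatch (the PCIBus 
and PCIDevice fields and the pci_bar_covers() helper here are made up 
for illustration, not actual QEMU code):

static PCIDevice *pci_bus_resolve(PCIBus *bus, target_phys_addr_t addr)
{
    int devfn, bar;

    for (devfn = 0; devfn < 256; devfn++) {
        PCIDevice *dev = bus->devices[devfn];

        if (!dev) {
            continue;
        }
        for (bar = 0; bar < 6; bar++) {
            /* Match addr against the BAR base/size currently
               programmed in this device's config space. */
            if (pci_bar_covers(dev, bar, addr)) {
                /* First match wins - which device wins an overlap
                   is implementation specific, as on real hardware. */
                return dev;
            }
        }
    }
    return NULL; /* nobody claims the access */
}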

Regards,

Anthony Liguori

>
> Jan
>


* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 16:35                                                           ` Jan Kiszka
@ 2011-05-19 16:38                                                             ` Anthony Liguori
  2011-05-19 16:50                                                               ` Jan Kiszka
  2011-05-20  9:03                                                               ` Avi Kivity
  0 siblings, 2 replies; 187+ messages in thread
From: Anthony Liguori @ 2011-05-19 16:38 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Avi Kivity, Gleb Natapov, qemu-devel

On 05/19/2011 11:35 AM, Jan Kiszka wrote:
> On 2011-05-19 18:32, Anthony Liguori wrote:
>> On 05/19/2011 09:40 AM, Avi Kivity wrote:
>>> On 05/19/2011 05:37 PM, Anthony Liguori wrote:
>>>>
>>>> So.... do you do:
>>>>
>>>> isa_register_region(ISABus *bus, MemoryRegion *mr, int priority)
>>>> {
>>>> chipset_register_region(bus->chipset, mr, priority + 1);
>>>> }
>>>>
>>>> I don't really understand how you can fold everything into one table
>>>> and not allow devices to override their parents using priorities.
>>>
>>> Think of how a window manager folds windows with priorities onto a flat
>>> framebuffer.
>>>
>>> You do a depth-first walk of the tree. For each child list, you iterate
>>> it from the lowest to highest priority, allowing later subregions
>>> override earlier subregions.
>>>
>>
>> Okay, but this doesn't explain how you'll let RAM override the VGA
>> mapping since RAM is not represented in the same child list as VGA (RAM
>> is a child of the PMC whereas VGA is a child of ISA/PCI, both of which
>> are at least one level removed from the PMC).
>
> You can always create a new memory region with higher priority, pointing
> to the RAM window you want to have above VGA. That's what we do today as
> well, just with different effects on the internal representation.

But then we're no better off than we are today.  I thought the whole 
point of this discussion was to handle overlapping I/O regions in a 
better way than we do today?

Regards,

Anthony Liguori

>
> Jan
>


* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 16:30                                                               ` Jan Kiszka
  2011-05-19 16:36                                                                 ` Anthony Liguori
@ 2011-05-19 16:43                                                                 ` Gleb Natapov
  2011-05-19 16:51                                                                   ` Jan Kiszka
  1 sibling, 1 reply; 187+ messages in thread
From: Gleb Natapov @ 2011-05-19 16:43 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Avi Kivity, qemu-devel

On Thu, May 19, 2011 at 06:30:42PM +0200, Jan Kiszka wrote:
> On 2011-05-19 18:28, Gleb Natapov wrote:
> > On Thu, May 19, 2011 at 06:25:14PM +0200, Jan Kiszka wrote:
> >> On 2011-05-19 18:17, Gleb Natapov wrote:
> >>> On Thu, May 19, 2011 at 05:40:50PM +0300, Avi Kivity wrote:
> >>>> On 05/19/2011 05:37 PM, Anthony Liguori wrote:
> >>>>>
> >>>>> So....  do you do:
> >>>>>
> >>>>> isa_register_region(ISABus *bus, MemoryRegion *mr, int priority)
> >>>>> {
> >>>>>    chipset_register_region(bus->chipset, mr, priority + 1);
> >>>>> }
> >>>>>
> >>>>> I don't really understand how you can fold everything into one
> >>>>> table and not allow devices to override their parents using
> >>>>> priorities.
> >>>>
> >>>> Think of how a window manager folds windows with priorities onto a
> >>>> flat framebuffer.
> >>>>
> >>>> You do a depth-first walk of the tree.  For each child list, you
> >>>> iterate it from the lowest to highest priority, allowing later
> >>>> subregions override earlier subregions.
> >>>>
> >>> And how you set those priorities in a sensible way? Why two device on a
> >>> PCI bus will want to register their memory region with anything but
> >>> highest priority? And if you let PCI subsystem to assign priorities how
> >>> it will coordinate with ISA subsystem/memory controller what priorities
> >>> to assign to get meaningful system?
> >>
> >> Priorities >default will only be used for explicit overlays, e.g. RAM
> >> over MMIO in PAM regions. Non-default priorities won't be assigned to
> >> normal PCI bars or any other device's region.
> >>
> > That does not explain who and how assign those priorities in globally
> > meaningful way.
> 
> There are no global priorities. Priorities are only used inside each
> level of the memory region hierarchy to generate a resulting, flattened
> view for the next higher level. At that level, everything imported from
> below has the default prio again, ie. the lowest one.
> 
Ah, so you are advocating for filtering on each level then. Because the
highest level (the one that actually uses the memory API) will never see
two regions with different priorities, since the layer below will
flatten the memory layout. So why do you need priorities if this is the
case? The layer that does the flattening is the layer that assigns the
priorities anyway.

--
			Gleb.


* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 16:36                                                                 ` Anthony Liguori
@ 2011-05-19 16:49                                                                   ` Jan Kiszka
  2011-05-19 17:12                                                                     ` Gleb Natapov
  2011-05-20  8:58                                                                     ` Avi Kivity
  2011-05-20  8:56                                                                   ` Avi Kivity
  1 sibling, 2 replies; 187+ messages in thread
From: Jan Kiszka @ 2011-05-19 16:49 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Avi Kivity, Gleb Natapov, qemu-devel

On 2011-05-19 18:36, Anthony Liguori wrote:
> On 05/19/2011 11:30 AM, Jan Kiszka wrote:
>> On 2011-05-19 18:28, Gleb Natapov wrote:
>>> On Thu, May 19, 2011 at 06:25:14PM +0200, Jan Kiszka wrote:
>>>> On 2011-05-19 18:17, Gleb Natapov wrote:
>>>>> On Thu, May 19, 2011 at 05:40:50PM +0300, Avi Kivity wrote:
>>>>>> On 05/19/2011 05:37 PM, Anthony Liguori wrote:
>>>>>>>
>>>>>>> So....  do you do:
>>>>>>>
>>>>>>> isa_register_region(ISABus *bus, MemoryRegion *mr, int priority)
>>>>>>> {
>>>>>>>     chipset_register_region(bus->chipset, mr, priority + 1);
>>>>>>> }
>>>>>>>
>>>>>>> I don't really understand how you can fold everything into one
>>>>>>> table and not allow devices to override their parents using
>>>>>>> priorities.
>>>>>>
>>>>>> Think of how a window manager folds windows with priorities onto a
>>>>>> flat framebuffer.
>>>>>>
>>>>>> You do a depth-first walk of the tree.  For each child list, you
>>>>>> iterate it from the lowest to highest priority, allowing later
>>>>>> subregions override earlier subregions.
>>>>>>
>>>>> And how you set those priorities in a sensible way? Why two device on a
>>>>> PCI bus will want to register their memory region with anything but
>>>>> highest priority? And if you let PCI subsystem to assign priorities how
>>>>> it will coordinate with ISA subsystem/memory controller what priorities
>>>>> to assign to get meaningful system?
>>>>
>>>> Priorities>default will only be used for explicit overlays, e.g. RAM
>>>> over MMIO in PAM regions. Non-default priorities won't be assigned to
>>>> normal PCI bars or any other device's region.
>>>>
>>> That does not explain who and how assign those priorities in globally
>>> meaningful way.
>>
>> There are no global priorities. Priorities are only used inside each
>> level of the memory region hierarchy to generate a resulting, flattened
>> view for the next higher level. At that level, everything imported from
>> below has the default prio again, ie. the lowest one.
> 
> Then SMM is impossible.

For sure it is possible. The CPU and the chipset, each at their mapping
level, create a corresponding RAM region and register it with higher
prio at the SMRAM start address (CPU and chipset will need to exchange
that address or otherwise coordinate the mapping information - the price
for per-CPU SMRAM).
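
A minimal sketch of the chipset side, using the _overlap registration 
proposed earlier in this thread (the address, size, priority value and 
helper name are mine, purely for illustration):

static MemoryRegion smram;

static void chipset_smram_update(bool smram_enabled)
{
    if (smram_enabled) {
        memory_region_init_ram(&smram, 0x20000);
        /* Priority 1 beats the default (0), hiding whatever the
           host bridge maps at 0xa0000 (typically legacy VGA). */
        cpu_register_memory_region_overlap(&smram, 0xa0000, 1);
    } else {
        /* Removing the overlay uncovers the old mapping again. */
        cpu_unregister_memory_region(&smram);
        memory_region_destroy(&smram);
    }
}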

> 
> Why do we need priorities at all?  There should be no overlap at each 
> level in the hierarchy.
> 
> If you have overlapping BARs, the PCI bus will always send the request 
> to a single device based on something that's implementation specific. 
> This works because each PCI device advertises the BAR locations and 
> sizes in it's config space.

That's not a use case for priorities at all. Priorities are useful for
PAM and SMRAM-like scenarios.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 16:38                                                             ` Anthony Liguori
@ 2011-05-19 16:50                                                               ` Jan Kiszka
  2011-05-20  9:03                                                               ` Avi Kivity
  1 sibling, 0 replies; 187+ messages in thread
From: Jan Kiszka @ 2011-05-19 16:50 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Avi Kivity, Gleb Natapov, qemu-devel

On 2011-05-19 18:38, Anthony Liguori wrote:
> On 05/19/2011 11:35 AM, Jan Kiszka wrote:
>> On 2011-05-19 18:32, Anthony Liguori wrote:
>>> On 05/19/2011 09:40 AM, Avi Kivity wrote:
>>>> On 05/19/2011 05:37 PM, Anthony Liguori wrote:
>>>>>
>>>>> So.... do you do:
>>>>>
>>>>> isa_register_region(ISABus *bus, MemoryRegion *mr, int priority)
>>>>> {
>>>>> chipset_register_region(bus->chipset, mr, priority + 1);
>>>>> }
>>>>>
>>>>> I don't really understand how you can fold everything into one table
>>>>> and not allow devices to override their parents using priorities.
>>>>
>>>> Think of how a window manager folds windows with priorities onto a flat
>>>> framebuffer.
>>>>
>>>> You do a depth-first walk of the tree. For each child list, you iterate
>>>> it from the lowest to highest priority, allowing later subregions
>>>> override earlier subregions.
>>>>
>>>
>>> Okay, but this doesn't explain how you'll let RAM override the VGA
>>> mapping since RAM is not represented in the same child list as VGA (RAM
>>> is a child of the PMC whereas VGA is a child of ISA/PCI, both of which
>>> are at least one level removed from the PMC).
>>
>> You can always create a new memory region with higher priority, pointing
>> to the RAM window you want to have above VGA. That's what we do today as
>> well, just with different effects on the internal representation.
> 
> But then we're no better than we are today.  I thought the whole point 
> of this thread of discussion was to allow overlapping I/O regions to be 
> handled in a better way than we do today?

We would be much better off than today. Today some broken magic is
applied to account for the fact that overlapping registrations destroy
the information below. In the future, that information will be
preserved, and removing the overlap will restore the previous mappings -
or even those which were modified in the meantime.
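
In terms of the proposed API, the difference would look roughly like 
this (a sketch; the _overlap variant is the one suggested earlier in 
this thread, and the addresses are made up):

MemoryRegion vga_mmio, pam_ram;

memory_region_init_io(&vga_mmio, &mmio_ops, 0x20000);
cpu_register_memory_region(&vga_mmio, 0xa0000);

/* Overlay RAM on top: the MMIO mapping is hidden, not destroyed. */
memory_region_init_ram(&pam_ram, 0x20000);
cpu_register_memory_region_overlap(&pam_ram, 0xa0000, 1);

/* Drop the overlay: the MMIO mapping reappears by itself, including
   any coalescing/logging state changed in the meantime. */
cpu_unregister_memory_region(&pam_ram);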

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 16:43                                                                 ` Gleb Natapov
@ 2011-05-19 16:51                                                                   ` Jan Kiszka
  0 siblings, 0 replies; 187+ messages in thread
From: Jan Kiszka @ 2011-05-19 16:51 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Avi Kivity, qemu-devel

On 2011-05-19 18:43, Gleb Natapov wrote:
> On Thu, May 19, 2011 at 06:30:42PM +0200, Jan Kiszka wrote:
>> On 2011-05-19 18:28, Gleb Natapov wrote:
>>> On Thu, May 19, 2011 at 06:25:14PM +0200, Jan Kiszka wrote:
>>>> On 2011-05-19 18:17, Gleb Natapov wrote:
>>>>> On Thu, May 19, 2011 at 05:40:50PM +0300, Avi Kivity wrote:
>>>>>> On 05/19/2011 05:37 PM, Anthony Liguori wrote:
>>>>>>>
>>>>>>> So....  do you do:
>>>>>>>
>>>>>>> isa_register_region(ISABus *bus, MemoryRegion *mr, int priority)
>>>>>>> {
>>>>>>>    chipset_register_region(bus->chipset, mr, priority + 1);
>>>>>>> }
>>>>>>>
>>>>>>> I don't really understand how you can fold everything into one
>>>>>>> table and not allow devices to override their parents using
>>>>>>> priorities.
>>>>>>
>>>>>> Think of how a window manager folds windows with priorities onto a
>>>>>> flat framebuffer.
>>>>>>
>>>>>> You do a depth-first walk of the tree.  For each child list, you
>>>>>> iterate it from the lowest to highest priority, allowing later
>>>>>> subregions override earlier subregions.
>>>>>>
>>>>> And how you set those priorities in a sensible way? Why two device on a
>>>>> PCI bus will want to register their memory region with anything but
>>>>> highest priority? And if you let PCI subsystem to assign priorities how
>>>>> it will coordinate with ISA subsystem/memory controller what priorities
>>>>> to assign to get meaningful system?
>>>>
>>>> Priorities >default will only be used for explicit overlays, e.g. RAM
>>>> over MMIO in PAM regions. Non-default priorities won't be assigned to
>>>> normal PCI bars or any other device's region.
>>>>
>>> That does not explain who and how assign those priorities in globally
>>> meaningful way.
>>
>> There are no global priorities. Priorities are only used inside each
>> level of the memory region hierarchy to generate a resulting, flattened
>> view for the next higher level. At that level, everything imported from
>> below has the default prio again, ie. the lowest one.
>>
> Ah, so you are advocating for filtering on each level then. Because
> highest level (the one that actually uses memory API) will never see
> two regions with different priorities since layer bellow will flatten
> the memory layout. So why do you need priorities if this is the case?
> The layer that does flattening is the layer that assign priorities
> anyway.

See my reply to Anthony.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 16:49                                                                   ` Jan Kiszka
@ 2011-05-19 17:12                                                                     ` Gleb Natapov
  2011-05-19 18:11                                                                       ` Jan Kiszka
  2011-05-20  8:58                                                                     ` Avi Kivity
  1 sibling, 1 reply; 187+ messages in thread
From: Gleb Natapov @ 2011-05-19 17:12 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Avi Kivity, qemu-devel

On Thu, May 19, 2011 at 06:49:48PM +0200, Jan Kiszka wrote:
> On 2011-05-19 18:36, Anthony Liguori wrote:
> > On 05/19/2011 11:30 AM, Jan Kiszka wrote:
> >> On 2011-05-19 18:28, Gleb Natapov wrote:
> >>> On Thu, May 19, 2011 at 06:25:14PM +0200, Jan Kiszka wrote:
> >>>> On 2011-05-19 18:17, Gleb Natapov wrote:
> >>>>> On Thu, May 19, 2011 at 05:40:50PM +0300, Avi Kivity wrote:
> >>>>>> On 05/19/2011 05:37 PM, Anthony Liguori wrote:
> >>>>>>>
> >>>>>>> So....  do you do:
> >>>>>>>
> >>>>>>> isa_register_region(ISABus *bus, MemoryRegion *mr, int priority)
> >>>>>>> {
> >>>>>>>     chipset_register_region(bus->chipset, mr, priority + 1);
> >>>>>>> }
> >>>>>>>
> >>>>>>> I don't really understand how you can fold everything into one
> >>>>>>> table and not allow devices to override their parents using
> >>>>>>> priorities.
> >>>>>>
> >>>>>> Think of how a window manager folds windows with priorities onto a
> >>>>>> flat framebuffer.
> >>>>>>
> >>>>>> You do a depth-first walk of the tree.  For each child list, you
> >>>>>> iterate it from the lowest to highest priority, allowing later
> >>>>>> subregions override earlier subregions.
> >>>>>>
> >>>>> And how you set those priorities in a sensible way? Why two device on a
> >>>>> PCI bus will want to register their memory region with anything but
> >>>>> highest priority? And if you let PCI subsystem to assign priorities how
> >>>>> it will coordinate with ISA subsystem/memory controller what priorities
> >>>>> to assign to get meaningful system?
> >>>>
> >>>> Priorities>default will only be used for explicit overlays, e.g. RAM
> >>>> over MMIO in PAM regions. Non-default priorities won't be assigned to
> >>>> normal PCI bars or any other device's region.
> >>>>
> >>> That does not explain who and how assign those priorities in globally
> >>> meaningful way.
> >>
> >> There are no global priorities. Priorities are only used inside each
> >> level of the memory region hierarchy to generate a resulting, flattened
> >> view for the next higher level. At that level, everything imported from
> >> below has the default prio again, ie. the lowest one.
> > 
> > Then SMM is impossible.
> 
> For sure it is. The CPU and the chipset, each at their mapping level,
> create a corresponding RAM region and register it with higher prio at
> the SMRAM start address (CPU and chipset will need to exchange that
> address or otherwise coordinate the mapping information - the price for
> per-CPU SMRAM).
> 
So to get the priorities right, two components need to know a priori
about the overlap and coordinate the priorities?

> > 
> > Why do we need priorities at all?  There should be no overlap at each 
> > level in the hierarchy.
> > 
> > If you have overlapping BARs, the PCI bus will always send the request 
> > to a single device based on something that's implementation specific. 
> > This works because each PCI device advertises the BAR locations and 
> > sizes in it's config space.
> 
> That's not a use case for priorities at all. Priorities are useful for
> PAM and SMRAM-like scenarios.
> 
It looks like you are talking about a very shallow model where overlap
may happen only in the chipset, and the chipset directly controls all of
the physical address space. But we need a solution for more complex
topologies where PAM/SMRAM-like scenarios may happen at each level. You
are dismissing PCI as an example because all memory regions there are of
the same priority, but this is just a special case of the more generic
scenario. Why is this not a "use case for priorities"?

--
			Gleb.


* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 13:37     ` Jan Kiszka
  2011-05-19 13:41       ` Avi Kivity
@ 2011-05-19 17:39       ` Gleb Natapov
  2011-05-19 18:03         ` Anthony Liguori
  2011-05-19 18:11         ` Jan Kiszka
  1 sibling, 2 replies; 187+ messages in thread
From: Gleb Natapov @ 2011-05-19 17:39 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Avi Kivity, qemu-devel

On Thu, May 19, 2011 at 03:37:58PM +0200, Jan Kiszka wrote:
> On 2011-05-19 15:36, Anthony Liguori wrote:
> > On 05/18/2011 02:40 PM, Jan Kiszka wrote:
> >> On 2011-05-18 15:12, Avi Kivity wrote:
> >>> void cpu_register_memory_region(MemoryRegion *mr, target_phys_addr_t
> >>> addr);
> >>
> >> OK, let's allow overlapping, but make it explicit:
> >>
> >> void cpu_register_memory_region_overlap(MemoryRegion *mr,
> >>                                          target_phys_addr_t addr,
> >>                                          int priority);
> > 
> > The device doesn't actually know how overlapping is handled.  This is
> > based on the bus hierarchy.
> 
> Devices don't register their regions, buses do.
> 
Today a PCI device may register a region that overlaps with any other
registered memory region without even knowing it. The guest can write
any RAM address into a PCI BAR, and that RAM address will become an MMIO
area. More interesting is what happens when the guest reprograms the PCI
BAR to another address - the RAM that was at the previous address just
disappears. Obviously this is crazy behaviour, but the question is how
do we want to handle it? One option is to disallow such overlapping
registration; another is to restore the RAM mapping after the PCI BAR is
reprogrammed. If we choose the second one, PCI will not know that
_overlap() should be called.
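
Spelled out with the current API (a schematic sketch from memory - the 
addresses, ram_offset and mmio_index are invented):

/* Machine init puts RAM at some address. */
cpu_register_physical_memory(0xe0000000, 0x100000, ram_offset);

/* The guest programs a BAR to 0xe0000000: the PCI layer maps MMIO
   on top, silently overwriting the RAM page mappings. */
cpu_register_physical_memory(0xe0000000, 0x1000, mmio_index);

/* The guest moves the BAR away: the range becomes unassigned; the
   RAM that used to live there does not come back. */
cpu_register_physical_memory(0xe0000000, 0x1000, IO_MEM_UNASSIGNED);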

Another example may be the APIC region and PCI. They overlap, but
neither the CPU nor PCI knows about it.

--
			Gleb.


* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 17:39       ` Gleb Natapov
@ 2011-05-19 18:03         ` Anthony Liguori
  2011-05-19 18:28           ` Gleb Natapov
  2011-05-19 18:11         ` Jan Kiszka
  1 sibling, 1 reply; 187+ messages in thread
From: Anthony Liguori @ 2011-05-19 18:03 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Jan Kiszka, Avi Kivity, qemu-devel

On 05/19/2011 12:39 PM, Gleb Natapov wrote:
> On Thu, May 19, 2011 at 03:37:58PM +0200, Jan Kiszka wrote:
>> On 2011-05-19 15:36, Anthony Liguori wrote:
>>> On 05/18/2011 02:40 PM, Jan Kiszka wrote:
>>>> On 2011-05-18 15:12, Avi Kivity wrote:
>>>>> void cpu_register_memory_region(MemoryRegion *mr, target_phys_addr_t
>>>>> addr);
>>>>
>>>> OK, let's allow overlapping, but make it explicit:
>>>>
>>>> void cpu_register_memory_region_overlap(MemoryRegion *mr,
>>>>                                           target_phys_addr_t addr,
>>>>                                           int priority);
>>>
>>> The device doesn't actually know how overlapping is handled.  This is
>>> based on the bus hierarchy.
>>
>> Devices don't register their regions, buses do.
>>
> Today PCI device may register region that overlaps with any other
> registered memory region without even knowing it. Guest can write any
> RAM address into PCI BAR and this RAM address will be come mmio are.

Right, but this is not how a real machine works.

With the exception of the few regions that the chipset treats specially, 
RAM accesses don't get a chance to be intercepted by the PCI bus.

> More
> interesting is what happens when guest reprogram PCI BAR to other address
> - the RAM that was at the previous address just disappears. Obviously
>    this is crazy behaviour, but the question is how do we want to handle

The CPU should continue to access RAM at this address.  It's unclear to 
me what the right behavior is for device-to-device I/O but I'm pretty 
certain it doesn't matter.

> it? One option is to disallow such overlapping registration, another is
> to restore RAM mapping after PCI BAR is reprogrammed. If we chose second
> one the PCI will not know that _overlap() should be called.
>
> Another example may be APIC region and PCI. They overlap, but neither
> CPU nor PCI knows about it.

And APIC always wins when accesses are coming from the CPU.

Regards,

Anthony Liguori

>
> --
> 			Gleb.
>


* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 13:55                       ` Avi Kivity
@ 2011-05-19 18:06                         ` Anthony Liguori
  2011-05-19 18:21                           ` Jan Kiszka
  0 siblings, 1 reply; 187+ messages in thread
From: Anthony Liguori @ 2011-05-19 18:06 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Jan Kiszka, qemu-devel, Gleb Natapov

On 05/19/2011 08:55 AM, Avi Kivity wrote:
> On 05/19/2011 04:50 PM, Anthony Liguori wrote:
>>
>> But the i440fx doesn't register the VGA region. The PIIX3 (ISA bus)
>> does, so how does it know what the priority of that mapping is?
>>
>
> The PCI bridge also has a say, no?

For legacy VGA memory?  That's a good question.  I've always assumed 
that legacy VGA memory is handled directly in the chipset by redirecting 
writes to the first VGA adapter it encounters (which usually happens to 
be the built-in one these days).

I'm not sure it's possible to have a VGA device behind a bridge that 
also handles legacy VGA memory because the bridge pretty clearly can 
only have BARs within a certain region of memory (based on the bridge's 
config space).

Regards,

Anthony Liguori

>
> (and it would be a VGA region over memory, not the other way around).
>


* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 17:12                                                                     ` Gleb Natapov
@ 2011-05-19 18:11                                                                       ` Jan Kiszka
  0 siblings, 0 replies; 187+ messages in thread
From: Jan Kiszka @ 2011-05-19 18:11 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Avi Kivity, qemu-devel


On 2011-05-19 19:12, Gleb Natapov wrote:
> On Thu, May 19, 2011 at 06:49:48PM +0200, Jan Kiszka wrote:
>> On 2011-05-19 18:36, Anthony Liguori wrote:
>>> On 05/19/2011 11:30 AM, Jan Kiszka wrote:
>>>> On 2011-05-19 18:28, Gleb Natapov wrote:
>>>>> On Thu, May 19, 2011 at 06:25:14PM +0200, Jan Kiszka wrote:
>>>>>> On 2011-05-19 18:17, Gleb Natapov wrote:
>>>>>>> On Thu, May 19, 2011 at 05:40:50PM +0300, Avi Kivity wrote:
>>>>>>>> On 05/19/2011 05:37 PM, Anthony Liguori wrote:
>>>>>>>>>
>>>>>>>>> So....  do you do:
>>>>>>>>>
>>>>>>>>> isa_register_region(ISABus *bus, MemoryRegion *mr, int priority)
>>>>>>>>> {
>>>>>>>>>     chipset_register_region(bus->chipset, mr, priority + 1);
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> I don't really understand how you can fold everything into one
>>>>>>>>> table and not allow devices to override their parents using
>>>>>>>>> priorities.
>>>>>>>>
>>>>>>>> Think of how a window manager folds windows with priorities onto a
>>>>>>>> flat framebuffer.
>>>>>>>>
>>>>>>>> You do a depth-first walk of the tree.  For each child list, you
>>>>>>>> iterate it from the lowest to highest priority, allowing later
>>>>>>>> subregions override earlier subregions.
>>>>>>>>
>>>>>>> And how you set those priorities in a sensible way? Why two device on a
>>>>>>> PCI bus will want to register their memory region with anything but
>>>>>>> highest priority? And if you let PCI subsystem to assign priorities how
>>>>>>> it will coordinate with ISA subsystem/memory controller what priorities
>>>>>>> to assign to get meaningful system?
>>>>>>
>>>>>> Priorities>default will only be used for explicit overlays, e.g. RAM
>>>>>> over MMIO in PAM regions. Non-default priorities won't be assigned to
>>>>>> normal PCI bars or any other device's region.
>>>>>>
>>>>> That does not explain who and how assign those priorities in globally
>>>>> meaningful way.
>>>>
>>>> There are no global priorities. Priorities are only used inside each
>>>> level of the memory region hierarchy to generate a resulting, flattened
>>>> view for the next higher level. At that level, everything imported from
>>>> below has the default prio again, ie. the lowest one.
>>>
>>> Then SMM is impossible.
>>
>> For sure it is. The CPU and the chipset, each at their mapping level,
>> create a corresponding RAM region and register it with higher prio at
>> the SMRAM start address (CPU and chipset will need to exchange that
>> address or otherwise coordinate the mapping information - the price for
>> per-CPU SMRAM).
>>
> So to get priorities right two components need to know a priori about
> overlap and coordinate the priorities?

Nope, the integrator, i.e. the bridge (an abstract one, please), needs
to know that it registers possibly overlapping regions. It declares that
some region is allowed to overlap by using the corresponding service,
optionally providing a priority > default in order to get a well-defined
ordering.
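
So, roughly (a sketch with made-up names and addresses): a device with 
hard-wired addressing uses the plain call, where an accidental overlap 
is a detectable bug, while the integrator opts in explicitly:

/* Fixed-address device: an overlap here would be a machine model
   bug, which the plain registration call can detect and report. */
cpu_register_memory_region(&hpet_region, 0xfed00000);

/* Integrator-created overlay: explicitly allowed to overlap, with
   priority 1 > default to give it a well-defined ordering. */
cpu_register_memory_region_overlap(&pam_ram, 0xc0000, 1);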

> 
>>>
>>> Why do we need priorities at all?  There should be no overlap at each 
>>> level in the hierarchy.
>>>
>>> If you have overlapping BARs, the PCI bus will always send the request 
>>> to a single device based on something that's implementation specific. 
>>> This works because each PCI device advertises the BAR locations and 
>>> sizes in it's config space.
>>
>> That's not a use case for priorities at all. Priorities are useful for
>> PAM and SMRAM-like scenarios.
>>
> It looks like you are talking about very shallow model were overlap may
> happen only in chipset and chipset directly controls all of the physical
> address space. But we need to have solution for more complex topologies
> where PAM/SMRAM like scenarios may happen on each level.

That's precisely my goal. PAM/SMRAM is just one example of such
overlays at any bridge level, not just the chipset.

> You are dismissing
> PCI as an example because all memory regions there are of the same
> priority, but this is just the special case of more generic scenario.
> Why this is not the "use case for priorities"?

Because we know that PCI BARs can overlap and are allowed to, and that
this will generate some random result. So we don't need to worry about
assigning priorities, we just need to declare the overlaps valid.

Jan




* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 17:39       ` Gleb Natapov
  2011-05-19 18:03         ` Anthony Liguori
@ 2011-05-19 18:11         ` Jan Kiszka
  2011-05-19 18:22           ` Gleb Natapov
  1 sibling, 1 reply; 187+ messages in thread
From: Jan Kiszka @ 2011-05-19 18:11 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Avi Kivity, qemu-devel


On 2011-05-19 19:39, Gleb Natapov wrote:
> On Thu, May 19, 2011 at 03:37:58PM +0200, Jan Kiszka wrote:
>> On 2011-05-19 15:36, Anthony Liguori wrote:
>>> On 05/18/2011 02:40 PM, Jan Kiszka wrote:
>>>> On 2011-05-18 15:12, Avi Kivity wrote:
>>>>> void cpu_register_memory_region(MemoryRegion *mr, target_phys_addr_t
>>>>> addr);
>>>>
>>>> OK, let's allow overlapping, but make it explicit:
>>>>
>>>> void cpu_register_memory_region_overlap(MemoryRegion *mr,
>>>>                                          target_phys_addr_t addr,
>>>>                                          int priority);
>>>
>>> The device doesn't actually know how overlapping is handled.  This is
>>> based on the bus hierarchy.
>>
>> Devices don't register their regions, buses do.
>>
> Today PCI device may register region that overlaps with any other
> registered memory region without even knowing it. Guest can write any
> RAM address into PCI BAR and this RAM address will be come mmio are. More
> interesting is what happens when guest reprogram PCI BAR to other address
> - the RAM that was at the previous address just disappears. Obviously
>   this is crazy behaviour, but the question is how do we want to handle
> it? One option is to disallow such overlapping registration, another is
> to restore RAM mapping after PCI BAR is reprogrammed. If we chose second
> one the PCI will not know that _overlap() should be called.

BARs may overlap with other BARs or with RAM. That's well-known, so PCI
bridges need to register their regions with the _overlap variant
unconditionally. In contrast to the current PhysPageDesc mechanism, the
new region management will not cause any harm to overlapping regions, so
that they can "recover" when the overlap is gone.

> 
> Another example may be APIC region and PCI. They overlap, but neither
> CPU nor PCI knows about it.

And they do not need to. The APIC regions will be managed by the per-CPU
region management, reusing the toolbox we need for all bridges. It will
register the APIC page with a priority higher than the default one, thus
overriding everything that comes from the host bridge. I think that
reflects real machine behaviour pretty well.
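
A sketch of what I mean for the APIC case (the per-CPU registration 
entry point and apic_ops are hypothetical; only the priority idea is 
from this discussion):

static MemoryRegion apic_page;

static void cpu_map_apic(target_phys_addr_t apic_base)
{
    memory_region_init_io(&apic_page, &apic_ops, 0x1000);
    /* Registered by the per-CPU region management with a priority
       above the default, so it shadows whatever the host bridge
       (and thus PCI) exports at apic_base. */
    cpu_register_memory_region_overlap(&apic_page, apic_base, 1);
}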

Jan




* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 14:11                         ` Avi Kivity
@ 2011-05-19 18:18                           ` Anthony Liguori
  2011-05-19 18:50                             ` Jan Kiszka
  2011-05-20  9:15                             ` Avi Kivity
  0 siblings, 2 replies; 187+ messages in thread
From: Anthony Liguori @ 2011-05-19 18:18 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Jan Kiszka, qemu-devel, Gleb Natapov, Peter Maydell

On 05/19/2011 09:11 AM, Avi Kivity wrote:
> On 05/19/2011 05:04 PM, Anthony Liguori wrote:
>>
>> Right, the chipset register is mainly used to program the contents of
>> SMM.
>>
>> There is a single access pin that has effectively the same semantics
>> as setting the chipset register.
>>
>> It's not a per-CPU setting--that's the point. You can't have one CPU
>> reading SMM memory at the exactly same time as accessing VGA.
>>
>> But I guess you can never have two simultaneous accesses anyway so
>> perhaps it's splitting hairs :-)
>
> Exactly - it just works.

Well, not really.

kvm.ko has a global mapping of RAM regions and currently only allows 
code execution from RAM.

This means the only way for QEMU to enable SMM support is to program the 
global RAM regions table to allow RAM access for the VGA region.

The problem with this is that it's perfectly conceivable to have CPU 0 
in SMM mode while CPU 1 is doing MMIO to the VGA planar.

The same problem exists with PAM.  It would be much easier to implement 
PAM correctly in QEMU if it were possible to execute code via MMIO as we 
could just mark the BIOS memory as non-RAM and deal with the dispatch 
ourselves.

Would it be fundamentally hard to support this in KVM?  I guess you 
would need to put the VCPU in single-step mode and maintain a page to 
copy the results into.
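
If KVM grew something like the two-map scheme Avi suggests below, the 
per-VCPU selection could be as simple as this (all names here are 
hypothetical):

/* Hypothetical: two flattened memory maps, selected per VCPU. */
static MemoryMap smm_map;     /* view with SMRAM visible */
static MemoryMap default_map; /* normal view             */

static MemoryMap *vcpu_memory_map(CPUState *env)
{
    /* CPU 0 can be in SMM while CPU 1 touches the VGA planar:
       each VCPU dispatches through the map for its own mode. */
    return env->smm_active ? &smm_map : &default_map;
}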

Regards,

Anthony Liguori

>
> btw, a way to implement it would be to have two memory maps, one for SMM
> and one for non-SMM, and select between them based on the CPU mode.
>


* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 18:06                         ` Anthony Liguori
@ 2011-05-19 18:21                           ` Jan Kiszka
  0 siblings, 0 replies; 187+ messages in thread
From: Jan Kiszka @ 2011-05-19 18:21 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Avi Kivity, Gleb Natapov, qemu-devel


On 2011-05-19 20:06, Anthony Liguori wrote:
> On 05/19/2011 08:55 AM, Avi Kivity wrote:
>> On 05/19/2011 04:50 PM, Anthony Liguori wrote:
>>>
>>> But the i440fx doesn't register the VGA region. The PIIX3 (ISA bus)
>>> does, so how does it know what the priority of that mapping is?
>>>
>>
>> The PCI bridge also has a say, no?
> 
> For legacy VGA memory?  That's a good question.  I've always assumed
> that legacy VGA memory is handled directly in the chipset by redirecting
> writes to the first VGA adapter it encounters (which usually happens to
> be the builtin one these days).

Nope. It's well defined in the PCI specs that every PCI-PCI bridge can
(or has to? I need to check) control the flow of legacy VGA accesses to
its downstream devices.

> 
> I'm not sure it's possible to have a VGA device behind a bridge that
> also handles legacy VGA memory because the bridge pretty clearly can
> only have BARs within a certain region of memory (based on the bridge's
> config space).

That's part of my notebook's PCI tree; I bet you have something similar:

 \-[0000:00]-+-00.0  Intel Corporation Core Processor DRAM Controller
             +-01.0-[01]--+-00.0  nVidia Corporation GT216 [Quadro FX 880M]
             |            \-00.1  nVidia Corporation High Definition Audio Controller

So even this single, albeit not built-in, VGA adapter is behind a bridge.

And if you look closer, you can find:

00:01.0 PCI bridge: Intel Corporation Core Processor PCI Express x16 Root Port (rev 02) (prog-if 00 [Normal decode])
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
        I/O behind bridge: 00002000-00002fff
        Memory behind bridge: cc000000-cdefffff
        Prefetchable memory behind bridge: 00000000ce000000-00000000dfffffff
        Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
        BridgeCtl: Parity- SERR+ NoISA- VGA+ MAbort- >Reset- FastB2B-
                PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
...

Note the 'VGA+' in BridgeCtl. It allows the NVIDIA adapter to handle
legacy VGA.

Jan




* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 18:11         ` Jan Kiszka
@ 2011-05-19 18:22           ` Gleb Natapov
  2011-05-19 18:27             ` Jan Kiszka
  2011-05-20  9:10             ` Avi Kivity
  0 siblings, 2 replies; 187+ messages in thread
From: Gleb Natapov @ 2011-05-19 18:22 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Avi Kivity, qemu-devel

On Thu, May 19, 2011 at 08:11:39PM +0200, Jan Kiszka wrote:
> On 2011-05-19 19:39, Gleb Natapov wrote:
> > On Thu, May 19, 2011 at 03:37:58PM +0200, Jan Kiszka wrote:
> >> On 2011-05-19 15:36, Anthony Liguori wrote:
> >>> On 05/18/2011 02:40 PM, Jan Kiszka wrote:
> >>>> On 2011-05-18 15:12, Avi Kivity wrote:
> >>>>> void cpu_register_memory_region(MemoryRegion *mr, target_phys_addr_t
> >>>>> addr);
> >>>>
> >>>> OK, let's allow overlapping, but make it explicit:
> >>>>
> >>>> void cpu_register_memory_region_overlap(MemoryRegion *mr,
> >>>>                                          target_phys_addr_t addr,
> >>>>                                          int priority);
> >>>
> >>> The device doesn't actually know how overlapping is handled.  This is
> >>> based on the bus hierarchy.
> >>
> >> Devices don't register their regions, buses do.
> >>
> > Today PCI device may register region that overlaps with any other
> > registered memory region without even knowing it. Guest can write any
> > RAM address into PCI BAR and this RAM address will be come mmio are. More
> > interesting is what happens when guest reprogram PCI BAR to other address
> > - the RAM that was at the previous address just disappears. Obviously
> >   this is crazy behaviour, but the question is how do we want to handle
> > it? One option is to disallow such overlapping registration, another is
> > to restore RAM mapping after PCI BAR is reprogrammed. If we chose second
> > one the PCI will not know that _overlap() should be called.
> 
> BARs may overlap with other BARs or with RAM. That's well-known, so PCI
> bridged need to register their regions with the _overlap variant
> unconditionally. In contrast to the current PhysPageDesc mechanism, the
With what priority? If it needs to call _overlap unconditionally, why
not always call _overlap and drop the non-_overlap variant?

> new region management will not cause any harm to overlapping regions so
> that they can "recover" when the overlap is gone.
> 
> > 
> > Another example may be APIC region and PCI. They overlap, but neither
> > CPU nor PCI knows about it.
> 
> And they do not need to. The APIC regions will be managed by the per-CPU
> region management, reusing the tool box we need for all bridges. It will
> register the APIC page with a priority higher than the default one, thus
> overriding everything that comes from the host bridge. I think that
> reflects pretty well real machine behaviour.
> 
What is "higher"? How does it know that priority is high enough? I
thought, from reading other replies, that priorities are meaningful
only on the same hierarchy level (which kinda make sense), but now you
are saying that you will override PCI address from another part of
the topology?

--
			Gleb.


* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 18:22           ` Gleb Natapov
@ 2011-05-19 18:27             ` Jan Kiszka
  2011-05-19 18:40               ` Gleb Natapov
  2011-05-20  9:10             ` Avi Kivity
  1 sibling, 1 reply; 187+ messages in thread
From: Jan Kiszka @ 2011-05-19 18:27 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Avi Kivity, qemu-devel


On 2011-05-19 20:22, Gleb Natapov wrote:
> On Thu, May 19, 2011 at 08:11:39PM +0200, Jan Kiszka wrote:
>> On 2011-05-19 19:39, Gleb Natapov wrote:
>>> On Thu, May 19, 2011 at 03:37:58PM +0200, Jan Kiszka wrote:
>>>> On 2011-05-19 15:36, Anthony Liguori wrote:
>>>>> On 05/18/2011 02:40 PM, Jan Kiszka wrote:
>>>>>> On 2011-05-18 15:12, Avi Kivity wrote:
>>>>>>> void cpu_register_memory_region(MemoryRegion *mr, target_phys_addr_t
>>>>>>> addr);
>>>>>>
>>>>>> OK, let's allow overlapping, but make it explicit:
>>>>>>
>>>>>> void cpu_register_memory_region_overlap(MemoryRegion *mr,
>>>>>>                                          target_phys_addr_t addr,
>>>>>>                                          int priority);
>>>>>
>>>>> The device doesn't actually know how overlapping is handled.  This is
>>>>> based on the bus hierarchy.
>>>>
>>>> Devices don't register their regions, buses do.
>>>>
>>> Today PCI device may register region that overlaps with any other
>>> registered memory region without even knowing it. Guest can write any
>>> RAM address into PCI BAR and this RAM address will be come mmio are. More
>>> interesting is what happens when guest reprogram PCI BAR to other address
>>> - the RAM that was at the previous address just disappears. Obviously
>>>   this is crazy behaviour, but the question is how do we want to handle
>>> it? One option is to disallow such overlapping registration, another is
>>> to restore RAM mapping after PCI BAR is reprogrammed. If we chose second
>>> one the PCI will not know that _overlap() should be called.
>>
>> BARs may overlap with other BARs or with RAM. That's well-known, so PCI
>> bridged need to register their regions with the _overlap variant
>> unconditionally. In contrast to the current PhysPageDesc mechanism, the
> With what priority? If it needs to call _overlap unconditionally why not
> always call _overlap and drop not _overlap variant?

Because we should catch accidental overlaps in all those non-PCI devices
with hard-wired addressing. That's a bug in the device/machine model and
should be reported as such by QEMU.

> 
>> new region management will not cause any harm to overlapping regions so
>> that they can "recover" when the overlap is gone.
>>
>>>
>>> Another example may be APIC region and PCI. They overlap, but neither
>>> CPU nor PCI knows about it.
>>
>> And they do not need to. The APIC regions will be managed by the per-CPU
>> region management, reusing the tool box we need for all bridges. It will
>> register the APIC page with a priority higher than the default one, thus
>> overriding everything that comes from the host bridge. I think that
>> reflects pretty well real machine behaviour.
>>
> What is "higher"? How does it know that priority is high enough?

Because no one else manages priorities at a specific hierarchy level.
There is only one manager per level.

> I
> thought, from reading other replies, that priorities are meaningful
> only on the same hierarchy level (which kinda make sense), but now you
> are saying that you will override PCI address from another part of
> the topology?

Everything from below in the hierarchy is fed in with default priority,
the lowest one. So to let some region created at this level override
those regions, just pick default+1. If you want to create more overlay
levels (can't imagine a good scenario, though), pick
default+1..default+n. It's really that simple.
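
Concretely, in terms of the proposed _overlap call (a sketch; the 
region names and base address are placeholders):

MemoryRegion from_below, overlay;

/* A region imported from the level below arrives at the default
   (lowest) priority. */
cpu_register_memory_region(&from_below, 0xc0000);

/* A local overlay only needs default+1 to override it. */
cpu_register_memory_region_overlap(&overlay, 0xc0000, 1);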

Jan




* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 18:03         ` Anthony Liguori
@ 2011-05-19 18:28           ` Gleb Natapov
  2011-05-19 18:33             ` Anthony Liguori
  0 siblings, 1 reply; 187+ messages in thread
From: Gleb Natapov @ 2011-05-19 18:28 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Jan Kiszka, Avi Kivity, qemu-devel

On Thu, May 19, 2011 at 01:03:01PM -0500, Anthony Liguori wrote:
> On 05/19/2011 12:39 PM, Gleb Natapov wrote:
> >On Thu, May 19, 2011 at 03:37:58PM +0200, Jan Kiszka wrote:
> >>On 2011-05-19 15:36, Anthony Liguori wrote:
> >>>On 05/18/2011 02:40 PM, Jan Kiszka wrote:
> >>>>On 2011-05-18 15:12, Avi Kivity wrote:
> >>>>>void cpu_register_memory_region(MemoryRegion *mr, target_phys_addr_t
> >>>>>addr);
> >>>>
> >>>>OK, let's allow overlapping, but make it explicit:
> >>>>
> >>>>void cpu_register_memory_region_overlap(MemoryRegion *mr,
> >>>>                                          target_phys_addr_t addr,
> >>>>                                          int priority);
> >>>
> >>>The device doesn't actually know how overlapping is handled.  This is
> >>>based on the bus hierarchy.
> >>
> >>Devices don't register their regions, buses do.
> >>
> >Today PCI device may register region that overlaps with any other
> >registered memory region without even knowing it. Guest can write any
> >RAM address into PCI BAR and this RAM address will be come mmio are.
> 
> Right, but this is not how a real machine works.
> 
Very likely.

> With the exception of the few regions that the chipset treats
> specially, RAM accesses don't get a chance to be intercepted by the
> PCI bus.
> 
> >More
> >interesting is what happens when guest reprogram PCI BAR to other address
> >- the RAM that was at the previous address just disappears. Obviously
> >   this is crazy behaviour, but the question is how do we want to handle
> 
> The CPU should continue to access RAM at this address.
I think it should continue using it even while the PCI BAR is still
mapped over the RAM address.

>                                                         It's unclear
> to me what the right behavior is for device-to-device I/O but I'm
> pretty certain it doesn't matter.
> 
For PC, maybe. But IIRC this thread already had a request to support
different memory views for different devices.

> >it? One option is to disallow such overlapping registration, another is
> >to restore RAM mapping after PCI BAR is reprogrammed. If we chose second
> >one the PCI will not know that _overlap() should be called.
> >
> >Another example may be APIC region and PCI. They overlap, but neither
> >CPU nor PCI knows about it.
> 
> And APIC always wins when accesses are coming from the CPU.
> 
Of course, my question is how the proposed API handles this.

--
			Gleb.


* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 18:28           ` Gleb Natapov
@ 2011-05-19 18:33             ` Anthony Liguori
  0 siblings, 0 replies; 187+ messages in thread
From: Anthony Liguori @ 2011-05-19 18:33 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Jan Kiszka, Avi Kivity, qemu-devel

On 05/19/2011 01:28 PM, Gleb Natapov wrote:
> On Thu, May 19, 2011 at 01:03:01PM -0500, Anthony Liguori wrote:
>> On 05/19/2011 12:39 PM, Gleb Natapov wrote:
>> With the exception of the few regions that the chipset treats
>> specially, RAM accesses don't get a chance to be intercepted by the
>> PCI bus.
>>
>>> More
>>> interesting is what happens when guest reprogram PCI BAR to other address
>>> - the RAM that was at the previous address just disappears. Obviously
>>>    this is crazy behaviour, but the question is how do we want to handle
>>
>> The CPU should continue to access RAM at this address.
> I think it should continue using it even when PCI BAR is still mapped to the
> RAM address.

Agreed.  Minus PAM and SMM, RAM is always RAM.  When a CPU accesses it, 
it's always the same.

This is not the case for device accesses to RAM but this is probably a 
separate topic.

>>                                                          It's unclear
>> to me what the right behavior is for device-to-device I/O but I'm
>> pretty certain it doesn't matter.
>>
> For PC may be. But IIRC this thread already had request to support
> different memory view from different devices.

All platforms have in common the fact that I/O is dispatched in a 
hierarchical fashion.  The reason is quite simple: physics only allows 
so many point-to-point connections between components.  So I/O access is 
always going to fan out.  The components closer to the source of the I/O 
access are always able to overrule the lower components.

Regards,

Anthony Liguori

>>> it? One option is to disallow such overlapping registration, another is
>>> to restore RAM mapping after PCI BAR is reprogrammed. If we chose second
>>> one the PCI will not know that _overlap() should be called.
>>>
>>> Another example may be APIC region and PCI. They overlap, but neither
>>> CPU nor PCI knows about it.
>>
>> And APIC always wins when accesses are coming from the CPU.
>>
> Of course, my question is how proposed API handles this.



>
> --
> 			Gleb.
>


* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 18:27             ` Jan Kiszka
@ 2011-05-19 18:40               ` Gleb Natapov
  2011-05-19 18:45                 ` Jan Kiszka
  0 siblings, 1 reply; 187+ messages in thread
From: Gleb Natapov @ 2011-05-19 18:40 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Avi Kivity, qemu-devel

On Thu, May 19, 2011 at 08:27:50PM +0200, Jan Kiszka wrote:
> On 2011-05-19 20:22, Gleb Natapov wrote:
> > On Thu, May 19, 2011 at 08:11:39PM +0200, Jan Kiszka wrote:
> >> On 2011-05-19 19:39, Gleb Natapov wrote:
> >>> On Thu, May 19, 2011 at 03:37:58PM +0200, Jan Kiszka wrote:
> >>>> On 2011-05-19 15:36, Anthony Liguori wrote:
> >>>>> On 05/18/2011 02:40 PM, Jan Kiszka wrote:
> >>>>>> On 2011-05-18 15:12, Avi Kivity wrote:
> >>>>>>> void cpu_register_memory_region(MemoryRegion *mr, target_phys_addr_t
> >>>>>>> addr);
> >>>>>>
> >>>>>> OK, let's allow overlapping, but make it explicit:
> >>>>>>
> >>>>>> void cpu_register_memory_region_overlap(MemoryRegion *mr,
> >>>>>>                                          target_phys_addr_t addr,
> >>>>>>                                          int priority);
> >>>>>
> >>>>> The device doesn't actually know how overlapping is handled.  This is
> >>>>> based on the bus hierarchy.
> >>>>
> >>>> Devices don't register their regions, buses do.
> >>>>
> >>> Today PCI device may register region that overlaps with any other
> >>> registered memory region without even knowing it. Guest can write any
> >>> RAM address into PCI BAR and this RAM address will be come mmio are. More
> >>> interesting is what happens when guest reprogram PCI BAR to other address
> >>> - the RAM that was at the previous address just disappears. Obviously
> >>>   this is crazy behaviour, but the question is how do we want to handle
> >>> it? One option is to disallow such overlapping registration, another is
> >>> to restore RAM mapping after PCI BAR is reprogrammed. If we chose second
> >>> one the PCI will not know that _overlap() should be called.
> >>
> >> BARs may overlap with other BARs or with RAM. That's well-known, so PCI
> >> bridged need to register their regions with the _overlap variant
> >> unconditionally. In contrast to the current PhysPageDesc mechanism, the
> > With what priority? If it needs to call _overlap unconditionally why not
> > always call _overlap and drop not _overlap variant?
> 
> Because we should catch accidental overlaps in all those non PCI devices
> with hard-wired addressing. That's a bug in the device/machine model and
> should be reported as such by QEMU.
Why should we complicate the API to catch unlikely errors? If you want
to debug that, add a capability to dump the memory map from the monitor.

> 
> > 
> >> new region management will not cause any harm to overlapping regions so
> >> that they can "recover" when the overlap is gone.
> >>
> >>>
> >>> Another example may be APIC region and PCI. They overlap, but neither
> >>> CPU nor PCI knows about it.
> >>
> >> And they do not need to. The APIC regions will be managed by the per-CPU
> >> region management, reusing the tool box we need for all bridges. It will
> >> register the APIC page with a priority higher than the default one, thus
> >> overriding everything that comes from the host bridge. I think that
> >> reflects pretty well real machine behaviour.
> >>
> > What is "higher"? How does it know that priority is high enough?
> 
> Because no one else manages priorities at a specific hierarchy level.
> There is only one.
> 
PCI and the CPU are on different hierarchy levels. PCI is under the
PIIX and the CPU is on the system bus.

> > I
> > thought, from reading other replies, that priorities are meaningful
> > only on the same hierarchy level (which kinda make sense), but now you
> > are saying that you will override PCI address from another part of
> > the topology?
> 
> Everything from below in the hierarchy is fed in with default priority,
> the lowest one. So to let some region created at this level override
> those regions, just pick default+1. If you want to create more overlay
> levels (can't imagine a good scenario, though), pick
> default+1..default+n. It's really that simple.
> 
Except that PCI and CPU are not on the same level.

--
			Gleb.


* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 18:40               ` Gleb Natapov
@ 2011-05-19 18:45                 ` Jan Kiszka
  2011-05-19 18:50                   ` Gleb Natapov
  0 siblings, 1 reply; 187+ messages in thread
From: Jan Kiszka @ 2011-05-19 18:45 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Avi Kivity, qemu-devel


On 2011-05-19 20:40, Gleb Natapov wrote:
> On Thu, May 19, 2011 at 08:27:50PM +0200, Jan Kiszka wrote:
>> On 2011-05-19 20:22, Gleb Natapov wrote:
>>> On Thu, May 19, 2011 at 08:11:39PM +0200, Jan Kiszka wrote:
>>>> On 2011-05-19 19:39, Gleb Natapov wrote:
>>>>> On Thu, May 19, 2011 at 03:37:58PM +0200, Jan Kiszka wrote:
>>>>>> On 2011-05-19 15:36, Anthony Liguori wrote:
>>>>>>> On 05/18/2011 02:40 PM, Jan Kiszka wrote:
>>>>>>>> On 2011-05-18 15:12, Avi Kivity wrote:
>>>>>>>>> void cpu_register_memory_region(MemoryRegion *mr, target_phys_addr_t
>>>>>>>>> addr);
>>>>>>>>
>>>>>>>> OK, let's allow overlapping, but make it explicit:
>>>>>>>>
>>>>>>>> void cpu_register_memory_region_overlap(MemoryRegion *mr,
>>>>>>>>                                          target_phys_addr_t addr,
>>>>>>>>                                          int priority);
>>>>>>>
>>>>>>> The device doesn't actually know how overlapping is handled.  This is
>>>>>>> based on the bus hierarchy.
>>>>>>
>>>>>> Devices don't register their regions, buses do.
>>>>>>
>>>>> Today PCI device may register region that overlaps with any other
>>>>> registered memory region without even knowing it. Guest can write any
>>>>> RAM address into PCI BAR and this RAM address will be come mmio are. More
>>>>> interesting is what happens when guest reprogram PCI BAR to other address
>>>>> - the RAM that was at the previous address just disappears. Obviously
>>>>>   this is crazy behaviour, but the question is how do we want to handle
>>>>> it? One option is to disallow such overlapping registration, another is
>>>>> to restore RAM mapping after PCI BAR is reprogrammed. If we chose second
>>>>> one the PCI will not know that _overlap() should be called.
>>>>
>>>> BARs may overlap with other BARs or with RAM. That's well-known, so PCI
>>>> bridged need to register their regions with the _overlap variant
>>>> unconditionally. In contrast to the current PhysPageDesc mechanism, the
>>> With what priority? If it needs to call _overlap unconditionally why not
>>> always call _overlap and drop not _overlap variant?
>>
>> Because we should catch accidental overlaps in all those non PCI devices
>> with hard-wired addressing. That's a bug in the device/machine model and
>> should be reported as such by QEMU.
> Why should we complicate API to catch unlikely errors? If you want to
> debug that add capability to dump memory map from the monitor.

Because we would need to convert tons of code that has so far seen a
fairly different reaction from the core to overlapping regions.

> 
>>
>>>
>>>> new region management will not cause any harm to overlapping regions so
>>>> that they can "recover" when the overlap is gone.
>>>>
>>>>>
>>>>> Another example may be APIC region and PCI. They overlap, but neither
>>>>> CPU nor PCI knows about it.
>>>>
>>>> And they do not need to. The APIC regions will be managed by the per-CPU
>>>> region management, reusing the tool box we need for all bridges. It will
>>>> register the APIC page with a priority higher than the default one, thus
>>>> overriding everything that comes from the host bridge. I think that
>>>> reflects pretty well real machine behaviour.
>>>>
>>> What is "higher"? How does it know that priority is high enough?
>>
>> Because no one else manages priorities at a specific hierarchy level.
>> There is only one.
>>
> PCI and CPU are on different hierarchy levels. PCI is under the PIIX and
> CPU is on a system BUS.

The priority for the APIC mapping will be applied at CPU level, of
course. So it will override everything, not just PCI.

> 
>>> I
>>> thought, from reading other replies, that priorities are meaningful
>>> only on the same hierarchy level (which kinda make sense), but now you
>>> are saying that you will override PCI address from another part of
>>> the topology?
>>
>> Everything from below in the hierarchy is fed in with default priority,
>> the lowest one. So to let some region created at this level override
>> those regions, just pick default+1. If you want to create more overlay
>> levels (can't imagine a good scenario, though), pick
>> default+1..default+n. It's really that simple.
>>
> Except that PCI and CPU are not on the same level.

See above.

Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 259 bytes --]

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 18:45                 ` Jan Kiszka
@ 2011-05-19 18:50                   ` Gleb Natapov
  2011-05-19 18:55                     ` Jan Kiszka
  0 siblings, 1 reply; 187+ messages in thread
From: Gleb Natapov @ 2011-05-19 18:50 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Avi Kivity, qemu-devel

On Thu, May 19, 2011 at 08:45:23PM +0200, Jan Kiszka wrote:
> On 2011-05-19 20:40, Gleb Natapov wrote:
> > On Thu, May 19, 2011 at 08:27:50PM +0200, Jan Kiszka wrote:
> >> On 2011-05-19 20:22, Gleb Natapov wrote:
> >>> On Thu, May 19, 2011 at 08:11:39PM +0200, Jan Kiszka wrote:
> >>>> On 2011-05-19 19:39, Gleb Natapov wrote:
> >>>>> On Thu, May 19, 2011 at 03:37:58PM +0200, Jan Kiszka wrote:
> >>>>>> On 2011-05-19 15:36, Anthony Liguori wrote:
> >>>>>>> On 05/18/2011 02:40 PM, Jan Kiszka wrote:
> >>>>>>>> On 2011-05-18 15:12, Avi Kivity wrote:
> >>>>>>>>> void cpu_register_memory_region(MemoryRegion *mr, target_phys_addr_t
> >>>>>>>>> addr);
> >>>>>>>>
> >>>>>>>> OK, let's allow overlapping, but make it explicit:
> >>>>>>>>
> >>>>>>>> void cpu_register_memory_region_overlap(MemoryRegion *mr,
> >>>>>>>>                                          target_phys_addr_t addr,
> >>>>>>>>                                          int priority);
> >>>>>>>
> >>>>>>> The device doesn't actually know how overlapping is handled.  This is
> >>>>>>> based on the bus hierarchy.
> >>>>>>
> >>>>>> Devices don't register their regions, buses do.
> >>>>>>
> >>>>> Today a PCI device may register a region that overlaps with any other
> >>>>> registered memory region without even knowing it. The guest can write
> >>>>> any RAM address into a PCI BAR, and that RAM address will become an MMIO
> >>>>> area. More interesting is what happens when the guest reprograms the PCI
> >>>>> BAR to another address - the RAM that was at the previous address just
> >>>>> disappears. Obviously this is crazy behaviour, but the question is how
> >>>>> do we want to handle it? One option is to disallow such overlapping
> >>>>> registration; another is to restore the RAM mapping after the PCI BAR is
> >>>>> reprogrammed. If we choose the second one, PCI will not know that
> >>>>> _overlap() should be called.
> >>>>
> >>>> BARs may overlap with other BARs or with RAM. That's well-known, so PCI
> >>>> bridges need to register their regions with the _overlap variant
> >>>> unconditionally. In contrast to the current PhysPageDesc mechanism, the
> >>> With what priority? If it needs to call _overlap unconditionally, why not
> >>> always call _overlap and drop the non-_overlap variant?
> >>
> >> Because we should catch accidental overlaps in all those non-PCI devices
> >> with hard-wired addressing. That's a bug in the device/machine model and
> >> should be reported as such by QEMU.
> > Why should we complicate API to catch unlikely errors? If you want to
> > debug that add capability to dump memory map from the monitor.
> 
> Because we need to switch tons of code that so far saw a fairly
> different reaction of the core to overlapping regions.
> 
How so? Today, if there is an accidental overlap, the device will not function properly.
With the new API it will be the same.

> > 
> >>
> >>>
> >>>> new region management will not cause any harm to overlapping regions so
> >>>> that they can "recover" when the overlap is gone.
> >>>>
> >>>>>
> >>>>> Another example may be APIC region and PCI. They overlap, but neither
> >>>>> CPU nor PCI knows about it.
> >>>>
> >>>> And they do not need to. The APIC regions will be managed by the per-CPU
> >>>> region management, reusing the tool box we need for all bridges. It will
> >>>> register the APIC page with a priority higher than the default one, thus
> >>>> overriding everything that comes from the host bridge. I think that
> >>>> reflects pretty well real machine behaviour.
> >>>>
> >>> What is "higher"? How does it know that priority is high enough?
> >>
> >> Because no one else manages priorities at a specific hierarchy level.
> >> There is only one.
> >>
> > PCI and CPU are on different hierarchy levels. PCI is under the PIIX and
> > CPU is on a system BUS.
> 
> The priority for the APIC mapping will be applied at CPU level, of
> course. So it will override everything, not just PCI.
> 
So you do not need explicit priority because the place in hierarchy
implicitly provides you with one.

> > 
> >>> I
> >>> thought, from reading other replies, that priorities are meaningful
> >>> only on the same hierarchy level (which kinda make sense), but now you
> >>> are saying that you will override PCI address from another part of
> >>> the topology?
> >>
> >> Everything from below in the hierarchy is fed in with default priority,
> >> the lowest one. So to let some region created at this level override
> >> those regions, just pick default+1. If you want to create more overlay
> >> levels (can't imagine a good scenario, though), pick
> >> default+1..default+n. It's really that simple.
> >>
> > Except that PCI and CPU are not on the same level.
> 
> See above.
> 
> Jan
> 



--
			Gleb.

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 18:18                           ` Anthony Liguori
@ 2011-05-19 18:50                             ` Jan Kiszka
  2011-05-19 19:02                               ` Anthony Liguori
  2011-05-20  9:15                             ` Avi Kivity
  1 sibling, 1 reply; 187+ messages in thread
From: Jan Kiszka @ 2011-05-19 18:50 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Peter Maydell, Avi Kivity, Gleb Natapov, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 1618 bytes --]

On 2011-05-19 20:18, Anthony Liguori wrote:
> On 05/19/2011 09:11 AM, Avi Kivity wrote:
>> On 05/19/2011 05:04 PM, Anthony Liguori wrote:
>>>
>>> Right, the chipset register is mainly used to program the contents of
>>> SMM.
>>>
>>> There is a single access pin that has effectively the same semantics
>>> as setting the chipset register.
>>>
>>> It's not a per-CPU setting--that's the point. You can't have one CPU
>>> reading SMM memory at the exactly same time as accessing VGA.
>>>
>>> But I guess you can never have two simultaneous accesses anyway so
>>> perhaps it's splitting hairs :-)
>>
>> Exactly - it just works.
> 
> Well, not really.
> 
> kvm.ko has a global mapping of RAM regions and currently only allows
> code execution from RAM.
> 
> This means the only way for QEMU to enable SMM support is to program the
> global RAM regions table to allow RAM access for the VGA region.
> 
> The problem with this is that it's perfectly conceivable to have CPU 0
> in SMM mode while CPU 1 is doing MMIO to the VGA planar.
> 
> The same problem exists with PAM.  It would be much easier to implement
> PAM correctly in QEMU if it were possible to execute code via MMIO as we
> could just mark the BIOS memory as non-RAM and deal with the dispatch
> ourselves.

If we already have to change KVM (I guess we have to), let's better add
per-CPU memory slot support. That will allow switching between VGA and
SMRAM without costly dispatching. While at it, I think we also need
some support for half-MMIO (MMIO on write, RAM on read) for proper flash
support.
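
To illustrate, a minimal sketch of such half-MMIO handling on top of the
RFC's MemoryRegionOps; flash_ptr and flash_handle_command() are made-up
names, and the RFC itself has no dedicated support for this yet:

static uint8_t *flash_ptr;   /* host mapping of the flash contents */

static uint32_t flash_readl(MemoryRegion *mr, target_phys_addr_t addr)
{
    uint32_t val;

    memcpy(&val, flash_ptr + addr, sizeof(val));  /* RAM-like read path */
    return val;
}

static void flash_writel(MemoryRegion *mr, target_phys_addr_t addr,
                         uint32_t data)
{
    flash_handle_command(addr, data);             /* program/erase via MMIO */
}

static const MemoryRegionOps flash_ops = {
    .readl  = flash_readl,
    .writel = flash_writel,
};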

Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 259 bytes --]

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 18:50                   ` Gleb Natapov
@ 2011-05-19 18:55                     ` Jan Kiszka
  2011-05-19 19:02                       ` Jan Kiszka
  2011-05-20  7:23                       ` Gleb Natapov
  0 siblings, 2 replies; 187+ messages in thread
From: Jan Kiszka @ 2011-05-19 18:55 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Avi Kivity, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 4453 bytes --]

On 2011-05-19 20:50, Gleb Natapov wrote:
> On Thu, May 19, 2011 at 08:45:23PM +0200, Jan Kiszka wrote:
>> On 2011-05-19 20:40, Gleb Natapov wrote:
>>> On Thu, May 19, 2011 at 08:27:50PM +0200, Jan Kiszka wrote:
>>>> On 2011-05-19 20:22, Gleb Natapov wrote:
>>>>> On Thu, May 19, 2011 at 08:11:39PM +0200, Jan Kiszka wrote:
>>>>>> On 2011-05-19 19:39, Gleb Natapov wrote:
>>>>>>> On Thu, May 19, 2011 at 03:37:58PM +0200, Jan Kiszka wrote:
>>>>>>>> On 2011-05-19 15:36, Anthony Liguori wrote:
>>>>>>>>> On 05/18/2011 02:40 PM, Jan Kiszka wrote:
>>>>>>>>>> On 2011-05-18 15:12, Avi Kivity wrote:
>>>>>>>>>>> void cpu_register_memory_region(MemoryRegion *mr, target_phys_addr_t
>>>>>>>>>>> addr);
>>>>>>>>>>
>>>>>>>>>> OK, let's allow overlapping, but make it explicit:
>>>>>>>>>>
>>>>>>>>>> void cpu_register_memory_region_overlap(MemoryRegion *mr,
>>>>>>>>>>                                          target_phys_addr_t addr,
>>>>>>>>>>                                          int priority);
>>>>>>>>>
>>>>>>>>> The device doesn't actually know how overlapping is handled.  This is
>>>>>>>>> based on the bus hierarchy.
>>>>>>>>
>>>>>>>> Devices don't register their regions, buses do.
>>>>>>>>
>>>>>>> Today a PCI device may register a region that overlaps with any other
>>>>>>> registered memory region without even knowing it. The guest can write
>>>>>>> any RAM address into a PCI BAR, and that RAM address will become an MMIO
>>>>>>> area. More interesting is what happens when the guest reprograms the PCI
>>>>>>> BAR to another address - the RAM that was at the previous address just
>>>>>>> disappears. Obviously this is crazy behaviour, but the question is how
>>>>>>> do we want to handle it? One option is to disallow such overlapping
>>>>>>> registration; another is to restore the RAM mapping after the PCI BAR is
>>>>>>> reprogrammed. If we choose the second one, PCI will not know that
>>>>>>> _overlap() should be called.
>>>>>>
>>>>>> BARs may overlap with other BARs or with RAM. That's well-known, so PCI
>>>>>> bridges need to register their regions with the _overlap variant
>>>>>> unconditionally. In contrast to the current PhysPageDesc mechanism, the
>>>>> With what priority? If it needs to call _overlap unconditionally, why not
>>>>> always call _overlap and drop the non-_overlap variant?
>>>>
>>>> Because we should catch accidental overlaps in all those non-PCI devices
>>>> with hard-wired addressing. That's a bug in the device/machine model and
>>>> should be reported as such by QEMU.
>>> Why should we complicate API to catch unlikely errors? If you want to
>>> debug that add capability to dump memory map from the monitor.
>>
>> Because we need to switch tons of code that so far saw a fairly
>> different reaction of the core to overlapping regions.
>>
> How so? Today, if there is an accidental overlap, the device will not function properly.
> With the new API it will be the same.

I rather expect subtle differences, as overlapping registration changes
existing regions today; in the future those will recover.

> 
>>>
>>>>
>>>>>
>>>>>> new region management will not cause any harm to overlapping regions so
>>>>>> that they can "recover" when the overlap is gone.
>>>>>>
>>>>>>>
>>>>>>> Another example may be APIC region and PCI. They overlap, but neither
>>>>>>> CPU nor PCI knows about it.
>>>>>>
>>>>>> And they do not need to. The APIC regions will be managed by the per-CPU
>>>>>> region management, reusing the tool box we need for all bridges. It will
>>>>>> register the APIC page with a priority higher than the default one, thus
>>>>>> overriding everything that comes from the host bridge. I think that
>>>>>> reflects pretty well real machine behaviour.
>>>>>>
>>>>> What is "higher"? How does it know that priority is high enough?
>>>>
>>>> Because no one else manages priorities at a specific hierarchy level.
>>>> There is only one.
>>>>
>>> PCI and CPU are on different hierarchy levels. PCI is under the PIIX and
>>> CPU is on a system BUS.
>>
>> The priority for the APIC mapping will be applied at CPU level, of
>> course. So it will override everything, not just PCI.
>>
> So you do not need explicit priority because the place in hierarchy
> implicitly provides you with one.

Yes. Alternatively, you could add a prio offset to all mappings when
climbing one level up, provided that offset is smaller than the prio
range locally available to each level.

Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 259 bytes --]

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 18:50                             ` Jan Kiszka
@ 2011-05-19 19:02                               ` Anthony Liguori
  2011-05-19 19:10                                 ` Jan Kiszka
  0 siblings, 1 reply; 187+ messages in thread
From: Anthony Liguori @ 2011-05-19 19:02 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Peter Maydell, Avi Kivity, Gleb Natapov, qemu-devel

On 05/19/2011 01:50 PM, Jan Kiszka wrote:
> On 2011-05-19 20:18, Anthony Liguori wrote:
>> Well, not really.
>>
>> kvm.ko has a global mapping of RAM regions and currently only allows
>> code execution from RAM.
>>
>> This means the only way for QEMU to enable SMM support is to program the
>> global RAM regions table to allow RAM access for the VGA region.
>>
>> The problem with this is that it's perfectly conceivable to have CPU 0
>> in SMM mode while CPU 1 is doing MMIO to the VGA planar.
>>
>> The same problem exists with PAM.  It would be much easier to implement
>> PAM correctly in QEMU if it were possible to execute code via MMIO as we
>> could just mark the BIOS memory as non-RAM and deal with the dispatch
>> ourselves.
>
> If we already have to change KVM (I guess we have to), let's better add
>> per-CPU memory slot support. That will allow switching between VGA and
>> SMRAM without costly dispatching. While at it, I think we also need
> some support for half-MMIO (MMIO on write, RAM on read) for proper flash
> support.

This is needed for PAM too.

But RAM isn't mapped per-CPU so this is at best an optimization.  You 
can (and do) execute instructions out of non-RAM memory though.  I think 
if we lifted this restriction in KVM, it would allow us to handle 
SMRAM/PAM in a more thorough way.

Regards,

Anthony Liguori

>
> Jan
>

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 18:55                     ` Jan Kiszka
@ 2011-05-19 19:02                       ` Jan Kiszka
  2011-05-20  7:23                       ` Gleb Natapov
  1 sibling, 0 replies; 187+ messages in thread
From: Jan Kiszka @ 2011-05-19 19:02 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Avi Kivity, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 315 bytes --]

On 2011-05-19 20:55, Jan Kiszka wrote:
> Alternatively, you could add a prio offset to all mappings when
> climbing one level up, provided that offset is smaller than the prio
> range locally available to each level.

Err, forget that, wrong analogy. It's more of a round-up after flattening
the view.

Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 259 bytes --]

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 19:02                               ` Anthony Liguori
@ 2011-05-19 19:10                                 ` Jan Kiszka
  0 siblings, 0 replies; 187+ messages in thread
From: Jan Kiszka @ 2011-05-19 19:10 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Peter Maydell, Avi Kivity, Gleb Natapov, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 1980 bytes --]

On 2011-05-19 21:02, Anthony Liguori wrote:
> On 05/19/2011 01:50 PM, Jan Kiszka wrote:
>> On 2011-05-19 20:18, Anthony Liguori wrote:
>>> Well, not really.
>>>
>>> kvm.ko has a global mapping of RAM regions and currently only allows
>>> code execution from RAM.
>>>
>>> This means the only way for QEMU to enable SMM support is to program the
>>> global RAM regions table to allow RAM access for the VGA region.
>>>
>>> The problem with this is that it's perfectly conceivable to have CPU 0
>>> in SMM mode while CPU 1 is doing MMIO to the VGA planar.
>>>
>>> The same problem exists with PAM.  It would be much easier to implement
>>> PAM correctly in QEMU if it were possible to execute code via MMIO as we
>>> could just mark the BIOS memory as non-RAM and deal with the dispatch
>>> ourselves.
>>
>> If we already have to change KVM (I guess we have to), let's better add
>> per-CPU memory slot support. That will allow switching between VGA and
>> SMRAM without costly dispatching. While at it, I think we also need
>> some support for half-MMIO (MMIO on write, RAM on read) for proper flash
>> support.
> 
> This is needed for PAM too.

Yeah, right, there were some to-do comments I strictly ignored.

> 
> But RAM isn't mapped per-CPU so this is at best an optimization.

SMRAM is mapped per CPU, depending on the execution mode. That is the point.

>  You
> can (and do) execute instructions out of non-RAM memory though.  I think
> if we lifted this restriction in KVM, it would allow us to handle
> SMRAM/PAM in a more thorough way.

I do not disagree that having such a feature would be nice for certain
corner cases. But executing code by jumping to user space on every
instruction fetch, maybe just to dispatch between SMRAM and normal MMIO
on a per-CPU basis, will give you horrible performance. That can at best
be an intermediate step, though I bet it's better to address both at
roughly the same time.

Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 259 bytes --]

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 18:55                     ` Jan Kiszka
  2011-05-19 19:02                       ` Jan Kiszka
@ 2011-05-20  7:23                       ` Gleb Natapov
  2011-05-20  7:40                         ` Jan Kiszka
  1 sibling, 1 reply; 187+ messages in thread
From: Gleb Natapov @ 2011-05-20  7:23 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Avi Kivity, qemu-devel

On Thu, May 19, 2011 at 08:55:49PM +0200, Jan Kiszka wrote:
> >>>> Because we should catch accidental overlaps in all those non-PCI devices
> >>>> with hard-wired addressing. That's a bug in the device/machine model and
> >>>> should be reported as such by QEMU.
> >>> Why should we complicate API to catch unlikely errors? If you want to
> >>> debug that add capability to dump memory map from the monitor.
> >>
> >> Because we need to switch tons of code that so far saw a fairly
> >> different reaction of the core to overlapping regions.
> >>
> > How so? Today if there is accidental overlap device will not function properly.
> > With new API it will be the same.
> 
> I rather expect subtle differences as overlapping registration changes
> existing regions, in the future those will recover.
> 
Where do you expect the differences to come from? Conversion to the new
API shouldn't change the order of registration, and if the last
registration overrides the previous one, the end result should be the
same as we have today.

> >>>>>> new region management will not cause any harm to overlapping regions so
> >>>>>> that they can "recover" when the overlap is gone.
> >>>>>>
> >>>>>>>
> >>>>>>> Another example may be APIC region and PCI. They overlap, but neither
> >>>>>>> CPU nor PCI knows about it.
> >>>>>>
> >>>>>> And they do not need to. The APIC regions will be managed by the per-CPU
> >>>>>> region management, reusing the tool box we need for all bridges. It will
> >>>>>> register the APIC page with a priority higher than the default one, thus
> >>>>>> overriding everything that comes from the host bridge. I think that
> >>>>>> reflects pretty well real machine behaviour.
> >>>>>>
> >>>>> What is "higher"? How does it know that priority is high enough?
> >>>>
> >>>> Because no one else manages priorities at a specific hierarchy level.
> >>>> There is only one.
> >>>>
> >>> PCI and CPU are on different hierarchy levels. PCI is under the PIIX and
> >>> CPU is on a system BUS.
> >>
> >> The priority for the APIC mapping will be applied at CPU level, of
> >> course. So it will override everything, not just PCI.
> >>
> > So you do not need explicit priority because the place in hierarchy
> > implicitly provides you with one.
> 
> Yes.
OK :) So you agree that we can do without priorities :)

>       Alternatively, you could add a prio offset to all mappings when
> climbing one level up, provided that offset is smaller than the prio
> range locally available to each level.
> 
Then a memory region's final priority will depend on the tree height. If
two disjoint tree branches of different heights claim the same memory
region, the higher one will have the higher priority. I think this
priority management is a can of worms.

Only the lowest level (aka system bus) will use memory API directly. PCI
device will call PCI subsystem. PCI subsystem, instead of assigning
arbitrary priorities to all overlappings, may just resolve them and pass
a flattened view to the chipset. The chipset in turn will look for
overlappings between PCI memory areas and RAM/ISA/other memory areas that
are outside of the PCI windows and resolve all of those, passing the
flattened view to the system bus, where the APIC/PCI conflict will be
resolved and finally the memory API will be used to create the memory map.
In such a model I do not see the need for priorities. All overlappings are
resolved in the most logical place, the one that has the best knowledge
about how to resolve the conflict. There will be no code duplication. The
overlapping resolution code will live in a separate library used by all
layers.
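
As a purely hypothetical sketch of that alternative (none of these types
exist in the proposal; only MemoryRegion and target_phys_addr_t come from
the RFC header), each layer could hand a flat list of resolved ranges
upward:

/* A range of guest-physical address space and the region that won it. */
typedef struct FlatRange {
    target_phys_addr_t start, size;
    MemoryRegion *mr;
} FlatRange;

typedef struct FlatView {
    FlatRange *ranges;
    int nr;
} FlatView;

/* Each layer (PCI, chipset, system bus) resolves its own overlappings
 * by its own rules and passes the result to the next layer up. */
FlatView *layer_flatten(MemoryRegion **regions, int n);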

--
			Gleb.

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-20  7:23                       ` Gleb Natapov
@ 2011-05-20  7:40                         ` Jan Kiszka
  2011-05-20 11:25                           ` Gleb Natapov
  0 siblings, 1 reply; 187+ messages in thread
From: Jan Kiszka @ 2011-05-20  7:40 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Avi Kivity, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 5109 bytes --]

On 2011-05-20 09:23, Gleb Natapov wrote:
> On Thu, May 19, 2011 at 08:55:49PM +0200, Jan Kiszka wrote:
>>>>>> Because we should catch accidental overlaps in all those non-PCI devices
>>>>>> with hard-wired addressing. That's a bug in the device/machine model and
>>>>>> should be reported as such by QEMU.
>>>>> Why should we complicate API to catch unlikely errors? If you want to
>>>>> debug that add capability to dump memory map from the monitor.
>>>>
>>>> Because we need to switch tons of code that so far saw a fairly
>>>> different reaction of the core to overlapping regions.
>>>>
>>> How so? Today, if there is an accidental overlap, the device will not function properly.
>>> With the new API it will be the same.
>>
>> I rather expect subtle differences, as overlapping registration changes
>> existing regions today; in the future those will recover.
>>
> Where do you expect the differences to come from? Conversion to the new
> API shouldn't change the order of registration, and if the last
> registration overrides the previous one, the end result should be the
> same as we have today.

A) Removing regions will change significantly. So far this is done by
setting a region to IO_MEM_UNASSIGNED, keeping truncation. With the new
API that will be a true removal, which will additionally restore hidden
regions (see the sketch below).

B) Uncontrolled overlapping is a bug that should be caught by the core,
and a new API is a perfect chance to do this.
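
To make point A concrete, a sketch in terms of the proposed calls (bar_ops
is a placeholder; the _overlap variant is the one suggested earlier in this
thread):

MemoryRegion ram, bar;

memory_region_init_ram(&ram, 0x100000);
cpu_register_memory_region(&ram, 0);

memory_region_init_io(&bar, &bar_ops, 0x1000);
cpu_register_memory_region_overlap(&bar, 0x8000, 1); /* hides part of RAM */

cpu_unregister_memory_region(&bar); /* true removal: RAM at 0x8000 recovers */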

> 
>>>>>>>> new region management will not cause any harm to overlapping regions so
>>>>>>>> that they can "recover" when the overlap is gone.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Another example may be APIC region and PCI. They overlap, but neither
>>>>>>>>> CPU nor PCI knows about it.
>>>>>>>>
>>>>>>>> And they do not need to. The APIC regions will be managed by the per-CPU
>>>>>>>> region management, reusing the tool box we need for all bridges. It will
>>>>>>>> register the APIC page with a priority higher than the default one, thus
>>>>>>>> overriding everything that comes from the host bridge. I think that
>>>>>>>> reflects pretty well real machine behaviour.
>>>>>>>>
>>>>>>> What is "higher"? How does it know that priority is high enough?
>>>>>>
>>>>>> Because no one else manages priorities at a specific hierarchy level.
>>>>>> There is only one.
>>>>>>
>>>>> PCI and CPU are on different hierarchy levels. PCI is under the PIIX and
>>>>> CPU is on a system BUS.
>>>>
>>>> The priority for the APIC mapping will be applied at CPU level, of
>>>> course. So it will override everything, not just PCI.
>>>>
>>> So you do not need explicit priority because the place in hierarchy
>>> implicitly provides you with one.
>>
>> Yes.
> OK :) So you agree that we can do without priorities :)

Nope, see below how your own example depends on them.

> 
>>       Alternatively, you could add a prio offset to all mappings when
>> climbing one level up, provided that offset is smaller than the prio
>> range locally available to each level.
>>
> Then a memory region's final priority will depend on the tree height. If
> two disjoint tree branches of different heights claim the same memory
> region, the higher one will have the higher priority. I think this
> priority management is a can of worms.

It is not, as it remains a purely local thing and helps implement the
sketched scenarios. Believe me, I tried to fix PAM/SMRAM already.

> 
> Only the lowest level (aka system bus) will use memory API directly.

Not necessarily. It depends on how much added value buses like PCI or
ISA or whatever can offer for managing I/O regions. For some purposes,
it may as well be fine to just call the memory_* service directly and
pass the result of some operation to the bus API later on.

> PCI
> device will call PCI subsystem. PCI subsystem, instead of assigning
> arbitrary priorities to all overlappings,

Again: PCI will _not_ assign arbitrary priorities but only
MEMORY_REGION_DEFAULT_PRIORITY, likely 0.

> may just resolve them and pass
> a flattened view to the chipset. The chipset in turn will look for
> overlappings between PCI memory areas and RAM/ISA/other memory areas that
> are outside of the PCI windows and resolve all of those, passing the
> flattened view to the system bus, where the APIC/PCI conflict will be
> resolved and finally the memory API will be used to create the memory map.
> In such a model I do not see the need for priorities. All overlappings are
> resolved in the most logical place, the one that has the best knowledge
> about how to resolve the conflict. There will be no code duplication. The
> overlapping resolution code will live in a separate library used by all
> layers.

That does not specify how the PCI bridge or the chipset will tell that
overlapping resolution lib _how_ overlapping regions shall be translated
into a flat representation. And precisely here is where priorities come
into play. It is the way to tell that lib either "region A shall override
region B" if A has a higher prio, or "if regions A and B overlap, do
whatever you want" if both have the same prio.
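
A tiny sketch of that contract; hypothetical, since the RFC's MemoryRegion
has no priority field, so the priorities are passed in alongside the
regions here:

/* Decide which of two overlapping regions is visible. */
static MemoryRegion *pick_visible(MemoryRegion *a, int prio_a,
                                  MemoryRegion *b, int prio_b)
{
    if (prio_a > prio_b)
        return a;    /* "region A shall override region B" */
    if (prio_b > prio_a)
        return b;
    return a;        /* same prio: unspecified, either will do */
}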

Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 259 bytes --]

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 16:36                                                                 ` Anthony Liguori
  2011-05-19 16:49                                                                   ` Jan Kiszka
@ 2011-05-20  8:56                                                                   ` Avi Kivity
  2011-05-20 14:51                                                                     ` Anthony Liguori
  1 sibling, 1 reply; 187+ messages in thread
From: Avi Kivity @ 2011-05-20  8:56 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Jan Kiszka, qemu-devel, Gleb Natapov

On 05/19/2011 07:36 PM, Anthony Liguori wrote:
>> There are no global priorities. Priorities are only used inside each
>> level of the memory region hierarchy to generate a resulting, flattened
>> view for the next higher level. At that level, everything imported from
>> below has the default prio again, ie. the lowest one.
>
>
> Then SMM is impossible.
>

It doesn't follow.

> Why do we need priorities at all?  There should be no overlap at each 
> level in the hierarchy.

Of course there is overlap.  PCI BARs overlap each other, the VGA 
windows and ROM overlap RAM.

>
> If you have overlapping BARs, the PCI bus will always send the request 
> to a single device based on something that's implementation specific. 
> This works because each PCI device advertises the BAR locations and 
> sizes in its config space.

BARs in general don't need priority, except we need to decide if BARs 
overlap RAM or vice versa.

>
> To dispatch a request, the PCI bus will walk the config space to find 
> a match.  If you remove something that was previously causing an 
> overlap, the other device will now get the I/O requests.

That's *exactly* what priority means: which device is in front, and 
which is in the back.

>
> To model this correctly, you need to let the PCI bus decide how to 
> dispatch I/O requests (again, you need hierarchical dispatch).

And again, this API gives you hierarchical dispatch, with the addition 
that some of it is done at registration time so we can prepare the RAM 
slots.

>
> In the absence of this, the PCI bus needs to look at all of the 
> devices, figure out the flat mapping, and register it.  When a device 
> is added or removed, it needs to recalculate the flat mapping and 
> register it.

However we do this, we need to look at all devices.

>
> There is no need to have centralized logic to decide this.
>

I think you're completely missing the point of my proposal.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 16:49                                                                   ` Jan Kiszka
  2011-05-19 17:12                                                                     ` Gleb Natapov
@ 2011-05-20  8:58                                                                     ` Avi Kivity
  1 sibling, 0 replies; 187+ messages in thread
From: Avi Kivity @ 2011-05-20  8:58 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Gleb Natapov, qemu-devel

On 05/19/2011 07:49 PM, Jan Kiszka wrote:
> >
> >  If you have overlapping BARs, the PCI bus will always send the request
> >  to a single device based on something that's implementation specific.
> >  This works because each PCI device advertises the BAR locations and
> >  sizes in its config space.
>
> That's not a use case for priorities at all. Priorities are useful for
> PAM and SMRAM-like scenarios.

Correct.  Priorities are also useful to decide if BARs hide RAM or 
vice-versa (determined by the PCI container's priority vs. the RAM 
container priorities, not individual BARs' priorities).

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 16:27                                                         ` Gleb Natapov
@ 2011-05-20  8:59                                                           ` Avi Kivity
  2011-05-20 11:57                                                             ` Gleb Natapov
  0 siblings, 1 reply; 187+ messages in thread
From: Avi Kivity @ 2011-05-20  8:59 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Jan Kiszka, qemu-devel

On 05/19/2011 07:27 PM, Gleb Natapov wrote:
> >  Think of how a window manager folds windows with priorities onto a
> >  flat framebuffer.
> >
> >  You do a depth-first walk of the tree.  For each child list, you
> >  iterate it from the lowest to highest priority, allowing later
> >  subregions override earlier subregions.
> >
> >I do not think that the window manager is a good analogy. A window can
> >overlap only with its siblings. In our memory tree each final node may
> >overlap with any other node in the tree.
>

Transparent windows.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 16:32                                                         ` Anthony Liguori
  2011-05-19 16:35                                                           ` Jan Kiszka
@ 2011-05-20  9:01                                                           ` Avi Kivity
  2011-05-20 15:33                                                             ` Anthony Liguori
  1 sibling, 1 reply; 187+ messages in thread
From: Avi Kivity @ 2011-05-20  9:01 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Jan Kiszka, qemu-devel, Gleb Natapov

On 05/19/2011 07:32 PM, Anthony Liguori wrote:
>> Think of how a window manager folds windows with priorities onto a flat
>> framebuffer.
>>
>> You do a depth-first walk of the tree. For each child list, you iterate
>> it from the lowest to highest priority, allowing later subregions
>> override earlier subregions.
>>
>
>
> Okay, but this doesn't explain how you'll let RAM override the VGA 
> mapping since RAM is not represented in the same child list as VGA 
> (RAM is a child of the PMC whereas VGA is a child of ISA/PCI, both of 
> which are at least one level removed from the PMC).

VGA will override RAM.

Memory controller
  |
  +-- RAM container (prio 0)
  |
  +-- PCI container (prio 1)
       |
       +--- vga window
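
In code, assuming a hypothetical priority-taking subregion variant along
the lines of the _overlap registration call discussed above (vga_ops is a
placeholder as well):

MemoryRegion sysmem, ram, pci, vga_window;

memory_region_init(&sysmem, 1ULL << 32);
memory_region_init_ram(&ram, 0x8000000);
memory_region_init(&pci, 1ULL << 32);
memory_region_init_io(&vga_window, &vga_ops, 0x20000);

memory_region_add_subregion_overlap(&sysmem, 0, &ram, 0);  /* prio 0 */
memory_region_add_subregion_overlap(&sysmem, 0, &pci, 1);  /* prio 1 */
memory_region_add_subregion(&pci, 0xa0000, &vga_window);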


-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 16:38                                                             ` Anthony Liguori
  2011-05-19 16:50                                                               ` Jan Kiszka
@ 2011-05-20  9:03                                                               ` Avi Kivity
  1 sibling, 0 replies; 187+ messages in thread
From: Avi Kivity @ 2011-05-20  9:03 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Jan Kiszka, qemu-devel, Gleb Natapov

On 05/19/2011 07:38 PM, Anthony Liguori wrote:
>> You can always create a new memory region with higher priority, pointing
>> to the RAM window you want to have above VGA. That's what we do today as
>> well, just with different effects on the internal representation.
>
>
> But then we're no better than we are today.  I thought the whole point 
> of this thread of discussion was to allow overlapping I/O regions to 
> be handled in a better way than we do today?

It is, and the goal is achieved.  Right now the code saves the old 
contents in isa_page_descs.  With the new approach it calls 
memory_region_del_subregion() and the previous contents magically appear 
(or new contents if they changed in the meanwhile).
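
For instance, with isa and vga_window as placeholder regions:

memory_region_add_subregion(&isa, 0xa0000, &vga_window);
/* guest uses VGA; whatever was at 0xa0000 is hidden, not destroyed */
memory_region_del_subregion(&isa, 0xa0000, &vga_window);
/* the previous contents are visible again, with no manual
 * save/restore bookkeeping in isa_page_descs */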

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 18:22           ` Gleb Natapov
  2011-05-19 18:27             ` Jan Kiszka
@ 2011-05-20  9:10             ` Avi Kivity
  2011-05-20 12:08               ` Gleb Natapov
  1 sibling, 1 reply; 187+ messages in thread
From: Avi Kivity @ 2011-05-20  9:10 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Jan Kiszka, qemu-devel

On 05/19/2011 09:22 PM, Gleb Natapov wrote:
> >
> >  BARs may overlap with other BARs or with RAM. That's well-known, so PCI
> >  bridges need to register their regions with the _overlap variant
> >  unconditionally. In contrast to the current PhysPageDesc mechanism, the
> With what priority?

It doesn't matter, since the spec doesn't define priorities among PCI BARs.

> >If it needs to call _overlap unconditionally, why not
> >always call _overlap and drop the non-_overlap variant?

Other uses need non-overlapping registration.

> >
> >  And they do not need to. The APIC regions will be managed by the per-CPU
> >  region management, reusing the tool box we need for all bridges. It will
> >  register the APIC page with a priority higher than the default one, thus
> >  overriding everything that comes from the host bridge. I think that
> >  reflects pretty well real machine behaviour.
> >
> What is "higher"? How does it know that priority is high enough?

It is well known that 1 > 0, for example.

> I
> thought, from reading other replies, that priorities are meaningful
> only on the same hierarchy level (which kinda make sense), but now you
> are saying that you will override PCI address from another part of
> the topology?

-- per-cpu memory
     |
     +--- apic page (prio 1)
     |
     +--- global memory (prio 0)
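
Sketched with the same assumed priority-taking subregion variant as in the
earlier sketches (apic_ops and global_memory are placeholders):

MemoryRegion percpu_mem, apic_page;

memory_region_init(&percpu_mem, 1ULL << 32);
memory_region_add_subregion_overlap(&percpu_mem, 0, &global_memory, 0);

memory_region_init_io(&apic_page, &apic_ops, 0x1000);
memory_region_add_subregion_overlap(&percpu_mem, 0xfee00000, &apic_page, 1);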

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19 18:18                           ` Anthony Liguori
  2011-05-19 18:50                             ` Jan Kiszka
@ 2011-05-20  9:15                             ` Avi Kivity
  1 sibling, 0 replies; 187+ messages in thread
From: Avi Kivity @ 2011-05-20  9:15 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Jan Kiszka, qemu-devel, Gleb Natapov, Peter Maydell

On 05/19/2011 09:18 PM, Anthony Liguori wrote:
> On 05/19/2011 09:11 AM, Avi Kivity wrote:
>> On 05/19/2011 05:04 PM, Anthony Liguori wrote:
>>>
>>> Right, the chipset register is mainly used to program the contents of
>>> SMM.
>>>
>>> There is a single access pin that has effectively the same semantics
>>> as setting the chipset register.
>>>
>>> It's not a per-CPU setting--that's the point. You can't have one CPU
>>> reading SMM memory at the exactly same time as accessing VGA.
>>>
>>> But I guess you can never have two simultaneous accesses anyway so
>>> perhaps it's splitting hairs :-)
>>
>> Exactly - it just works.
>
> Well, not really.
>
> kvm.ko has a global mapping of RAM regions and currently only allows 
> code execution from RAM.
>
> This means the only way for QEMU to enable SMM support is to program 
> the global RAM regions table to allow RAM access for the VGA 
> region.
>
> The problem with this is that it's perfectly conceivable to have CPU 0 
> in SMM mode while CPU 1 is doing MMIO to the VGA planar.

kvm needs updates to support SMM; I already outlined them several months 
ago.

>
> The same problem exists with PAM. 

PAM is a completely different problem.  The changes are global and fit 
kvm slot management.

> It would be much easier to implement PAM correctly in QEMU if it were 
> possible to execute code via MMIO as we could just mark the BIOS 
> memory as non-RAM and deal with the dispatch ourselves.
>
> Would it be fundamentally hard to support this in KVM?  I guess you 
> would need to put the VCPU in single step mode and maintain a page to 
> copy the results into.

You need to emulate everything.  We're probably not far from that.  
However there may be a significant performance loss.


-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-20  7:40                         ` Jan Kiszka
@ 2011-05-20 11:25                           ` Gleb Natapov
  2011-05-22  7:50                             ` Avi Kivity
  0 siblings, 1 reply; 187+ messages in thread
From: Gleb Natapov @ 2011-05-20 11:25 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Avi Kivity, qemu-devel

On Fri, May 20, 2011 at 09:40:13AM +0200, Jan Kiszka wrote:
> On 2011-05-20 09:23, Gleb Natapov wrote:
> > On Thu, May 19, 2011 at 08:55:49PM +0200, Jan Kiszka wrote:
> >>>>>> Because we should catch accidental overlaps in all those non-PCI devices
> >>>>>> with hard-wired addressing. That's a bug in the device/machine model and
> >>>>>> should be reported as such by QEMU.
> >>>>> Why should we complicate API to catch unlikely errors? If you want to
> >>>>> debug that add capability to dump memory map from the monitor.
> >>>>
> >>>> Because we need to switch tons of code that so far saw a fairly
> >>>> different reaction of the core to overlapping regions.
> >>>>
> >>> How so? Today, if there is an accidental overlap, the device will not function properly.
> >>> With the new API it will be the same.
> >>
> >> I rather expect subtle differences, as overlapping registration changes
> >> existing regions today; in the future those will recover.
> >>
> > Where do you expect the differences to come from? Conversion to the new
> > API shouldn't change the order of registration, and if the last
> > registration overrides the previous one, the end result should be the
> > same as we have today.
> 
> A) Removing regions will change significantly. So far this is done by
> setting a region to IO_MEM_UNASSIGNED, keeping truncation. With the new
> API that will be a true removal, which will additionally restore hidden
> regions.
> 
And what problem do you expect may arise from that? Currently accessing
such a region after unassigning it results in undefined behaviour, so this
code is non-working today; you can't make it worse.

> B) Uncontrolled overlapping is a bug that should be caught by the core,
> and a new API is a perfect chance to do this.
> 
Well, this will indeed introduce a difference in behaviour :) The guest
that ran before will abort now. Are you actually aware of any such
overlaps in the current code base?

But if priorities are going to stay, why not fail if two regions with the
same priority overlap? If that happens, it means that the memory map
creation didn't pass the point where the conflict should have been resolved
(by assigning different priorities), and this means that the overlap is
unintentional, no?

> > 
> >>>>>>>> new region management will not cause any harm to overlapping regions so
> >>>>>>>> that they can "recover" when the overlap is gone.
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Another example may be APIC region and PCI. They overlap, but neither
> >>>>>>>>> CPU nor PCI knows about it.
> >>>>>>>>
> >>>>>>>> And they do not need to. The APIC regions will be managed by the per-CPU
> >>>>>>>> region management, reusing the tool box we need for all bridges. It will
> >>>>>>>> register the APIC page with a priority higher than the default one, thus
> >>>>>>>> overriding everything that comes from the host bridge. I think that
> >>>>>>>> reflects pretty well real machine behaviour.
> >>>>>>>>
> >>>>>>> What is "higher"? How does it know that priority is high enough?
> >>>>>>
> >>>>>> Because no one else manages priorities at a specific hierarchy level.
> >>>>>> There is only one.
> >>>>>>
> >>>>> PCI and CPU are on different hierarchy levels. PCI is under the PIIX and
> >>>>> CPU is on a system BUS.
> >>>>
> >>>> The priority for the APIC mapping will be applied at CPU level, of
> >>>> course. So it will override everything, not just PCI.
> >>>>
> >>> So you do not need explicit priority because the place in hierarchy
> >>> implicitly provides you with one.
> >>
> >> Yes.
> > OK :) So you agree that we can do without priorities :)
> 
> Nope, see below how your own example depends on them.
> 
It depends on them in a very defined way. Only the layer that knows
exactly what is going on defines priorities. The priorities do not leak to
any other level or a global database. It is different from propagating a
priority from a PCI BAR to the core memory API.

I am starting to see how you can represent all these local decisions as
priority numbers and then traverse this weighted tree to find which memory
region should be accessed (memory registration _has_ to be hierarchical
for that to work in a meaningful way). I still don't see why it is better
than flattening the tree at the point of conflict.
 
> > 
> >>       Alternatively, you could add a prio offset to all mappings when
> >> climbing one level up, provided that offset is smaller than the prio
> >> range locally available to each level.
> >>
> > Then a memory region's final priority will depend on the tree height. If
> > two disjoint tree branches of different heights claim the same memory
> > region, the higher one will have the higher priority. I think this
> > priority management is a can of worms.
> 
> It is not, as it remains a purely local thing and helps implement the
> sketched scenarios. Believe me, I tried to fix PAM/SMRAM already.
If it remains a local thing, then I misunderstand what you mean by
"could add a prio offset to all mappings when climbing one level up".
Doesn't sound like a local thing to me any more.

What problem did you have with PAM except the low number of KVM slots, btw?

> 
> > 
> > Only the lowest level (aka system bus) will use memory API directly.
> 
> Not necessarily. It depends on how much added value buses like PCI or
> ISA or whatever can offer for managing I/O regions. For some purposes,
> it may as well be fine to just call the memory_* service directly and
> pass the result of some operation to the bus API later on.
Depends on what memory_* service you are talking about. Just creating an
unattached memory region is OK. But if two independent pieces of code
want to map two different memory regions at the same physical address, I
do not see who will resolve the conflict.

> 
> > PCI
> > device will call PCI subsystem. PCI subsystem, instead of assigning
> > arbitrary priorities to all overlappings,
> 
> Again: PCI will _not_ assign arbitrary priorities but only
> MEMORY_REGION_DEFAULT_PRIORITY, likely 0.
That is as arbitrary as it can get. Just assigning
MEMORY_REGION_DEFAULT_PRIORITY/2^0xfff will work equally well, so what
is not arbitrary about that number?

BTW, why wouldn't the PCI layer assign different priorities to overlapping
regions to let the core know which one should actually be available? Why
leave this decision to the core if it clearly belongs to PCI?

> 
> > may just resolve them and pass
> > a flattened view to the chipset. The chipset in turn will look for
> > overlappings between PCI memory areas and RAM/ISA/other memory areas that
> > are outside of the PCI windows and resolve all of those, passing the
> > flattened view to the system bus, where the APIC/PCI conflict will be
> > resolved and finally the memory API will be used to create the memory map.
> > In such a model I do not see the need for priorities. All overlappings are
> > resolved in the most logical place, the one that has the best knowledge
> > about how to resolve the conflict. There will be no code duplication. The
> > overlapping resolution code will live in a separate library used by all
> > layers.
> 
> That does not specify how the PCI bridge or the chipset will tell that
> overlapping resolution lib _how_ overlapping regions shall be translated
> into a flat representation. And precisely here is where priorities come
> into play. It is the way to tell that lib either "region A shall override
> region B" if A has a higher prio, or "if regions A and B overlap, do
> whatever you want" if both have the same prio.
> 
Yep! And the question is why this shouldn't be done on the level that
knows most about the conflict instead of being propagated to the core. I
am not arguing that priorities do not exist! Obviously they do. I am
questioning the usefulness of priorities being part of the core memory API.

--
			Gleb.

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-20  8:59                                                           ` Avi Kivity
@ 2011-05-20 11:57                                                             ` Gleb Natapov
  2011-05-22  7:37                                                               ` Avi Kivity
  0 siblings, 1 reply; 187+ messages in thread
From: Gleb Natapov @ 2011-05-20 11:57 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Jan Kiszka, qemu-devel

On Fri, May 20, 2011 at 11:59:58AM +0300, Avi Kivity wrote:
> On 05/19/2011 07:27 PM, Gleb Natapov wrote:
> >>  Think of how a window manager folds windows with priorities onto a
> >>  flat framebuffer.
> >>
> >>  You do a depth-first walk of the tree.  For each child list, you
> >>  iterate it from the lowest to highest priority, allowing later
> >>  subregions override earlier subregions.
> >>
> >I do not think that the window manager is a good analogy. A window can
> >overlap only with its siblings. In our memory tree each final node may
> >overlap with any other node in the tree.
> >
> 
> Transparent windows.
> 
No, still not that. Think about a child window that resides outside of its
parent window on screen. In our memory region terms, think about a PCI BAR
that is registered to overlap with RAM at address 0x1000, for instance. The
PCI BAR memory region and the RAM memory region are on very different
branches of the global tree.

--
			Gleb.

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-20  9:10             ` Avi Kivity
@ 2011-05-20 12:08               ` Gleb Natapov
  2011-05-22  7:56                 ` Avi Kivity
  0 siblings, 1 reply; 187+ messages in thread
From: Gleb Natapov @ 2011-05-20 12:08 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Jan Kiszka, qemu-devel

On Fri, May 20, 2011 at 12:10:22PM +0300, Avi Kivity wrote:
> On 05/19/2011 09:22 PM, Gleb Natapov wrote:
> >>
> >>  BARs may overlap with other BARs or with RAM. That's well-known, so PCI
> >>  bridges need to register their regions with the _overlap variant
> >>  unconditionally. In contrast to the current PhysPageDesc mechanism, the
> >With what priority?
> 
> It doesn't matter, since the spec doesn't define priorities among PCI BARs.
> 
And between a PCI BAR and memory (the case the question above referred to).

> >If it needs to call _overlap unconditionally, why not
> >always call _overlap and drop the non-_overlap variant?
> 
> Other uses need non-overlapping registration.
And who prohibits them from creating one?

> 
> >>
> >>  And they do not need to. The APIC regions will be managed by the per-CPU
> >>  region management, reusing the tool box we need for all bridges. It will
> >>  register the APIC page with a priority higher than the default one, thus
> >>  overriding everything that comes from the host bridge. I think that
> >>  reflects pretty well real machine behaviour.
> >>
> >What is "higher"? How does it know that priority is high enough?
> 
> It is well known that 1 > 0, for example.
> 
That is if you have a global scale. In the case I am asking about, you do
not. Even if PCI registers a memory region that overlaps the APIC address
with priority 1000, the APIC memory region should still be able to override
it even with priority 0. Voila, 1000 < 0? Where is your sarcasm now? :)

But Jan already answered this one. Actually, what really matters is the
place of the node in the topology, not the priority. But then, for all of
this to make sense, registration has to be hierarchical.

> >I
> >thought, from reading other replies, that priorities are meaningful
> >only on the same hierarchy level (which kinda make sense), but now you
> >are saying that you will override PCI address from another part of
> >the topology?
> 
> -- per-cpu memory
>     |
>     +--- apic page (prio 1)
>     |
>     +--- global memory (prio 0)
> 
> -- 
> I have a truly marvellous patch that fixes the bug which this
> signature is too narrow to contain.

--
			Gleb.

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-20  8:56                                                                   ` Avi Kivity
@ 2011-05-20 14:51                                                                     ` Anthony Liguori
  2011-05-20 16:43                                                                       ` Olivier Galibert
  2011-05-22  7:36                                                                       ` Avi Kivity
  0 siblings, 2 replies; 187+ messages in thread
From: Anthony Liguori @ 2011-05-20 14:51 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Jan Kiszka, qemu-devel, Gleb Natapov

On 05/20/2011 03:56 AM, Avi Kivity wrote:
> On 05/19/2011 07:36 PM, Anthony Liguori wrote:
>>> There are no global priorities. Priorities are only used inside each
>>> level of the memory region hierarchy to generate a resulting, flattened
>>> view for the next higher level. At that level, everything imported from
>>> below has the default prio again, ie. the lowest one.
>>
>>
>> Then SMM is impossible.
>>
>
> It doesn't follow.
>
>> Why do we need priorities at all? There should be no overlap at each
>> level in the hierarchy.
>
> Of course there is overlap. PCI BARs overlap each other, the VGA windows
> and ROM overlap RAM.

Here's what I'm still struggling with:

If children normally overlap their parents, but child priorities are 
always less than their parents', then what's the benefit of having 
anything more than two priority settings?

As far as I can understand it, a priority of 0 means "let child 
windows overlap" whereas a priority of 1 means "don't let child 
windows overlap".

Is there a use-case for a priority above 1 and if so, what does it mean?

>> If you have overlapping BARs, the PCI bus will always send the request
>> to a single device based on something that's implementation specific.
>> This works because each PCI device advertises the BAR locations and
>> sizes in its config space.
>
> BARs in general don't need priority, except we need to decide if BARs
> overlap RAM or vice versa.
>
>>
>> To dispatch a request, the PCI bus will walk the config space to find
>> a match. If you remove something that was previously causing an
>> overlap, the other device will now get the I/O requests.
>
> That's *exactly* what priority means: which device is in front, and
> which is in the back.

Why not use registration order to resolve this type of conflict?  What 
are the use cases for priorities where registration order wouldn't be 
adequate?

>> There is no need to have centralized logic to decide this.
>>
>
> I think you're completely missing the point of my proposal.

I'm struggling to find the mental model for priorities.  I may just be 
dense here but the analogy of transparent window ordering isn't helping me.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-20  9:01                                                           ` Avi Kivity
@ 2011-05-20 15:33                                                             ` Anthony Liguori
  2011-05-20 15:59                                                               ` Jan Kiszka
  0 siblings, 1 reply; 187+ messages in thread
From: Anthony Liguori @ 2011-05-20 15:33 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Jan Kiszka, qemu-devel, Gleb Natapov

On 05/20/2011 04:01 AM, Avi Kivity wrote:
> On 05/19/2011 07:32 PM, Anthony Liguori wrote:
>>> Think of how a window manager folds windows with priorities onto a flat
>>> framebuffer.
>>>
>>> You do a depth-first walk of the tree. For each child list, you iterate
>>> it from the lowest to highest priority, allowing later subregions
>>> override earlier subregions.
>>>
>>
>>
>> Okay, but this doesn't explain how you'll let RAM override the VGA
>> mapping since RAM is not represented in the same child list as VGA
>> (RAM is a child of the PMC whereas VGA is a child of ISA/PCI, both of
>> which are at least one level removed from the PMC).
>
> VGA will override RAM.
>
> Memory controller
> |
> +-- RAM container (prio 0)
> |
> +-- PCI container (prio 1)
> |
> +--- vga window

Unless the RAM controller increases its priority, right?  That's how 
you would implement SMM, by doing priority++?

But if you have:

Memory controller
|
+-- RAM container (prio 0)
|
+-- PCI container (prio 1)
|
+-- PCI-X container (prio 2)
|
+--- vga window

Now you need to do priority = 3?

Jan had mentioned previously about registering a new temporary window. 
I assume the registration always gets highest_priority++, or do you have 
to explicitly specify that PCI container gets priority=1?

Regards,

Anthony Liguori

>
>

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-20 15:33                                                             ` Anthony Liguori
@ 2011-05-20 15:59                                                               ` Jan Kiszka
  2011-05-22  7:38                                                                 ` Avi Kivity
  0 siblings, 1 reply; 187+ messages in thread
From: Jan Kiszka @ 2011-05-20 15:59 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Avi Kivity, Gleb Natapov, qemu-devel

On 2011-05-20 17:33, Anthony Liguori wrote:
> On 05/20/2011 04:01 AM, Avi Kivity wrote:
>> On 05/19/2011 07:32 PM, Anthony Liguori wrote:
>>>> Think of how a window manager folds windows with priorities onto a flat
>>>> framebuffer.
>>>>
>>>> You do a depth-first walk of the tree. For each child list, you iterate
>>>> it from the lowest to highest priority, allowing later subregions
>>>> override earlier subregions.
>>>>
>>>
>>>
>>> Okay, but this doesn't explain how you'll let RAM override the VGA
>>> mapping since RAM is not represented in the same child list as VGA
>>> (RAM is a child of the PMC whereas VGA is a child of ISA/PCI, both of
>>> which are at least one level removed from the PMC).
>>
>> VGA will override RAM.
>>
>> Memory controller
>> |
>> +-- RAM container (prio 0)
>> |
>> +-- PCI container (prio 1)
>> |
>> +--- vga window
> 
> Unless the RAM controller increases its priority, right?  That's how 
> you would implement SMM, by doing priority++?
> 
> But if you have:
> 
> Memory controller
> |
> +-- RAM container (prio 0)
> |
> +-- PCI container (prio 1)
> |
> +-- PCI-X container (prio 2)
> |
> +--- vga window
> 
> Now you need to do priority = 3?
> 
> Jan had mentioned previously about registering a new temporary window. 
> I assume the registration always gets highest_priority++, or do you have 
> to explicitly specify that PCI container gets priority=1?

The latter.

And I really prefer to have this explicit over deriving the priority
from the registration order. That's way too fragile/unhandy. If you
decide to replace a region of lower priority later on, you need to
reregister everything at that level.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-20 14:51                                                                     ` Anthony Liguori
@ 2011-05-20 16:43                                                                       ` Olivier Galibert
  2011-05-20 17:32                                                                         ` Anthony Liguori
  2011-05-22  7:36                                                                       ` Avi Kivity
  1 sibling, 1 reply; 187+ messages in thread
From: Olivier Galibert @ 2011-05-20 16:43 UTC (permalink / raw)
  To: qemu-devel

On Fri, May 20, 2011 at 09:51:41AM -0500, Anthony Liguori wrote:
> Is there a use-case for a priority above 1 and if so, what does it mean?

In a modern northbridge, mmconfig has priority over external access,
and other internal registers (the APIC, for instance) have priority over
mmconfig.  BIOSes sometimes map mmconfig in a zone that collides with
something else, but where it doesn't really matter because the whole
range is not needed.

  OG.

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-19  8:30                 ` Jan Kiszka
  2011-05-19  8:44                   ` Avi Kivity
  2011-05-19 13:52                   ` Anthony Liguori
@ 2011-05-20 17:30                   ` Blue Swirl
  2011-05-22  7:23                     ` Avi Kivity
  2 siblings, 1 reply; 187+ messages in thread
From: Blue Swirl @ 2011-05-20 17:30 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Peter Maydell, Avi Kivity, Gleb Natapov, qemu-devel

On Thu, May 19, 2011 at 11:30 AM, Jan Kiszka <jan.kiszka@web.de> wrote:
> On 2011-05-19 10:26, Gleb Natapov wrote:
>> On Wed, May 18, 2011 at 09:27:55PM +0200, Jan Kiszka wrote:
>>>> if an I/O is to the APIC page,
>>>>    it's handled by the APIC
>>>
>>> That's not that simple. We need to tell apart:
>>>  - if a cpu issued the request, and which one => forward to APIC
>> And cpu mode may affect where access is forwarded to. If cpu is in SMM
>> mode, access to the frame buffer may be forwarded to memory (depends on
>> chipset configuration).
>
> So we have a second use case for CPU-local I/O regions?

SuperSparc MXCC (memory cache controller) should be CPU specific.
Currently we handle this for accesses via ASI, but the registers could
be mapped with the MMU, and then the ASI-less access would not be handled.

Another case would be the cache-as-ram mode for some x86 CPUs, which
Coreboot people would like to see IIRC.

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-20 16:43                                                                       ` Olivier Galibert
@ 2011-05-20 17:32                                                                         ` Anthony Liguori
  0 siblings, 0 replies; 187+ messages in thread
From: Anthony Liguori @ 2011-05-20 17:32 UTC (permalink / raw)
  To: Olivier Galibert; +Cc: qemu-devel

On 05/20/2011 11:43 AM, Olivier Galibert wrote:
> On Fri, May 20, 2011 at 09:51:41AM -0500, Anthony Liguori wrote:
>> Is there a use-case for a priority above 1 and if so, what does it mean?
>
> In a modern northbridge, mmconfig has priority over external access,
> and other internal registers (the APIC, for instance) have priority over
> mmconfig.  BIOSes sometimes map mmconfig in a zone that collides with
> something else, but where it doesn't really matter because the whole
> range is not needed.

But priority and registration order are roughly equivalent for 
things like this, no?

And the question "how do I control what order to register in?" is 
equivalent to the question "how do I determine which region gets which 
priority?".

Both have the same answer: something has to understand all of the 
regions that may overlap and assign an explicit order/priority.

Regards,

Anthony Liguori

>
>    OG.
>
>

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-20 17:30                   ` Blue Swirl
@ 2011-05-22  7:23                     ` Avi Kivity
  0 siblings, 0 replies; 187+ messages in thread
From: Avi Kivity @ 2011-05-22  7:23 UTC (permalink / raw)
  To: Blue Swirl; +Cc: Peter Maydell, Jan Kiszka, qemu-devel, Gleb Natapov

On 05/20/2011 08:30 PM, Blue Swirl wrote:
> Another case would be the cache-as-ram mode for some x86 CPUs, which
> Coreboot people would like to see IIRC.

That's probably best handled as a cache emulation layer, as this is not 
associated with any specific address range.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-20 14:51                                                                     ` Anthony Liguori
  2011-05-20 16:43                                                                       ` Olivier Galibert
@ 2011-05-22  7:36                                                                       ` Avi Kivity
  1 sibling, 0 replies; 187+ messages in thread
From: Avi Kivity @ 2011-05-22  7:36 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Jan Kiszka, qemu-devel, Gleb Natapov

On 05/20/2011 05:51 PM, Anthony Liguori wrote:
>> Of course there is overlap. PCI BARs overlap each other, the VGA windows
>> and ROM overlap RAM.
>
>
> Here's what I'm still struggling with:
>
> If children normally overlap their parents, but child priorities are 
> always less than their parents', then what's the benefit of having 
> anything more than two priority settings?
>
> As far as I can understand it, a priority of 0 means "let child 
> windows overlap" whereas a priority of 1 means "don't let child 
> windows overlap".
>
> Is there a use-case for a priority above 1 and if so, what does it mean?

Children always overlap their parents.  Priorities are among children of 
the same parent.

I expect there won't be a use case for priority 2, but that's not 
because of some inherent property of the design; the hardware designers 
simply haven't got around to designing such whacky hardware.

Note that a container is transparent, so layering several containers on 
top of each other simply generates a flattening of the tree.   You can 
have an opaque container by having a lowest-priority child that covers 
the entire address space.
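
A minimal sketch of that opaque-container trick, using the RFC's types 
plus a hypothetical memory_region_add_subregion_overlap() that takes a 
priority argument (the _overlap variant discussed elsewhere in this 
thread; the names here are illustrative, not part of the posted API):

   MemoryRegion container, background, child;
   extern const MemoryRegionOps unassigned_ops;  /* catch-all callbacks */

   memory_region_init(&container, 0x100000);

   /* Lowest priority (0), covering the whole container: anything the
    * other children don't claim hits this region instead of falling
    * through to the container's lower-priority siblings. */
   memory_region_init_io(&background, &unassigned_ops, 0x100000);
   memory_region_add_subregion_overlap(&container, 0, &background, 0);

   /* An ordinary child sits above the background. */
   memory_region_init_ram(&child, 0x1000);
   memory_region_add_subregion_overlap(&container, 0x4000, &child, 1);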

>> That's *exactly* what priority means. Which device is in front, and
>> which is in the back.
>
> Why not use registration order to resolve this type of conflict?  What 
> are the use cases for priorities where registration order wouldn't 
> be adequate?

Registration order is not something you want in a declarative API.  
There is a non-priority equivalent, and that is to unregister 
(_del_subregion) all subregions in the contended region, except the one 
you want.

That doesn't work for PCI, since the "contended region" isn't known.  
If you want RAM to hide PCI BARs (or perhaps vice versa), you need to tell 
the system which one takes precedence.


>
>>> There is no need to have centralized logic to decide this.
>>>
>>
>> I think you're completely missing the point of my proposal.
>
> I'm struggling to find the mental model for priorities.  I may just be 
> dense here but the analogy of transparent window ordering isn't 
> helping me.
>

If you like, you can think of "priority" as "explicit registration 
order".  The advantage is that is works dynamically, if a region is 
unregistered and re-registered, that doesn't mess up your registration 
order.


-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-20 11:57                                                             ` Gleb Natapov
@ 2011-05-22  7:37                                                               ` Avi Kivity
  2011-05-22  8:06                                                                 ` Gleb Natapov
  0 siblings, 1 reply; 187+ messages in thread
From: Avi Kivity @ 2011-05-22  7:37 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Jan Kiszka, qemu-devel

On 05/20/2011 02:57 PM, Gleb Natapov wrote:
> On Fri, May 20, 2011 at 11:59:58AM +0300, Avi Kivity wrote:
> >  On 05/19/2011 07:27 PM, Gleb Natapov wrote:
> >  >>   Think of how a window manager folds windows with priorities onto a
> >  >>   flat framebuffer.
> >  >>
> >  >>   You do a depth-first walk of the tree.  For each child list, you
> >  >>   iterate it from the lowest to highest priority, allowing later
> >  >>   subregions override earlier subregions.
> >  >>
> >  >I do not think that window manager is a good analogy. Window can
> >  >overlap with only its siblings. In our memory tree each final node may
> >  >overlap with any other node in the tree.
> >  >
> >
> >  Transparent windows.
> >
> No, still not that. Think about a child window that resides outside of its
> parent window on screen. In our memory region terms, think of a PCI BAR
> registered to overlap with RAM at address 0x1000, for instance. The PCI
> BAR memory region and the RAM memory region are on very different branches
> of the global tree.

Right.  But what's the problem with that?

Which one takes precedence is determined by the priorities of the RAM 
subregion vs. the PCI bus subregion.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-20 15:59                                                               ` Jan Kiszka
@ 2011-05-22  7:38                                                                 ` Avi Kivity
  2011-05-22 15:42                                                                   ` Anthony Liguori
  0 siblings, 1 reply; 187+ messages in thread
From: Avi Kivity @ 2011-05-22  7:38 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Gleb Natapov, qemu-devel

On 05/20/2011 06:59 PM, Jan Kiszka wrote:
> >
> >  Jan had mentioned previously about registering a new temporary window.
> >  I assume the registration always gets highest_priority++, or do you have
> >  to explicitly specify that PCI container gets priority=1?
>
> The latter.
>
> And I really prefer to have this explicit over deriving the priority
> from the registration order. That's way too fragile/unhandy. If you
> decide to replace a region of lower priority later on, you need to
> reregister everything at that level.

Exactly.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-20 11:25                           ` Gleb Natapov
@ 2011-05-22  7:50                             ` Avi Kivity
  2011-05-22  8:41                               ` Gleb Natapov
  0 siblings, 1 reply; 187+ messages in thread
From: Avi Kivity @ 2011-05-22  7:50 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Jan Kiszka, qemu-devel

On 05/20/2011 02:25 PM, Gleb Natapov wrote:
> >
> >  A) Removing regions will change significantly. So far this is done by
> >  setting a region to IO_MEM_UNASSIGNED, keeping truncation. With the new
> >  API that will be a true removal which will additionally restore hidden
> >  regions.
> >
> And what problem do you expect may arise from that? Currently accessing
> such a region after unassign will result in undefined behaviour, so this
> code is non-working today; you can't make it worse.
>

If the conversion were perfect then yes.  However there is a possibility 
that the conversion will not be perfect.

It's also good to have the code document its intentions.  If you 
see _overlap() you know there is dynamic address decoding going on, or 
something clever.

> >  B) Uncontrolled overlapping is a bug that should be caught by the core,
> >  and a new API is a perfect chance to do this.
> >
> Well, this will indeed introduce the difference in behaviour :) The guest
> that ran before will abort now. Are you actually aware of any such
> overlaps in the current code base?

Put a BAR over another BAR, then unmap it.

> But if priorities are going to stay, why not fail if two regions with the
> same priority overlap? If that happens it means that the memory creation
> didn't pass the point where conflict should have been resolved (by
> assigning different priorities) and this means that overlap is
> unintentional, no?

It may be intentional, as in the case of PCI, or PAM (though you can do 
PAM without priorities, by removing all but one of the subregions in the 
area).

> I am starting to see how you can represent all these local decisions as
> priority numbers and then traverse this weighted tree to find what memory
> region should be accessed (memory registration _has_ to be hierarchical
> for that to work in a meaningful way).

Priorities don't travel up the tree.  They're used to resolve local 
conflicts *only*.

>   I still don't see why it is better
> than flattening the tree at the point of conflict.

How do you decide which subregion wins?
> >  Not necessarily. It depends on how much added value buses like PCI or
> >  ISA or whatever can offer for managing I/O regions. For some purposes,
> >  it may as well be fine to just call the memory_* service directly and
> >  pass the result of some operation to the bus API later on.
> Depends on what memory_* service you are talking about. Just creating
> unattached memory region is OK. But if two independent pieces of code
> want to map two different memory regions into the same phys address I do
> not see who will resolve the conflict.

They have to ask the bus to _add_subregion().  Only the bus knows about 
the priorities (or the bus can ask them to create the subregions).

> >
> >  >  PCI
> >  >  device will call PCI subsystem. PCI subsystem, instead of assigning
> >  >  arbitrary priorities to all overlappings,
> >
> >  Again: PCI will _not_ assign arbitrary priorities but only
> >  MEMORY_REGION_DEFAULT_PRIORITY, likely 0.
> That is as arbitrary as it can get. Just assigning
> MEMORY_REGION_DEFAULT_PRIORITY/2^0xfff will work equally well, so what
> is not arbitrary about that number?

That's just splitting hairs.  Array indexes start from zero, an 
arbitrary but convenient number.

> BTW why wouldn't PCI layer assign different priorities to overlapping
> regions to let the core know which one should be actually available? Why
> leave this decision to the core if it clearly belongs to PCI?

You mean overlapping BARs?  If PCI wants BAR 1 to override BAR 2, then 
it can indicate it with priorities.  If it doesn't want to, it can use 
the same priority for all regions.

> >
> >  That does not specify how the PCI bridge or the chipset will tell that
> >  overlapping resolution lib _how_ overlapping regions shall be translated
> >  into a flat representation. And precisely here come priorities into
> >  play. It is the way to tell that lib either "region A shall override
> >  region B" if A has higher prio or "if region A and B overlap, do
> >  whatever you want" if both have the same prio.
> >
> Yep! And the question is why shouldn't this be done at the level that
> knows most about the conflict rather than propagated to the core. I am not
> arguing that priorities do not exist! Obviously they do. I am
> questioning the usefulness of priorities being part of the memory core API.
>

The chipset knows about the priorities.  How to communicate them to the 
core?

- at runtime, with hierarchical dispatch of ->read() and ->write(): 
slow, and doesn't work at all for RAM.
- using registration order: fragile
- using priorities

We need to get the information out of the chipset and into the core, so 
the core can make use of it (like flattening the tree to produce kvm slots).
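
To make that concrete, here is a rough sketch of a flattening pass in 
the "window manager" style described earlier in the thread: paint the 
lowest priority first and let higher priorities overwrite.  The tree 
fields (terminates, nr_children, children[]) are invented for 
illustration and are not part of the posted API; sub-page granularity 
is elided:

   #define PAGE_BITS 12
   #define PAGE_SIZE (1 << PAGE_BITS)

   /* Paint a subtree into a flat per-page dispatch table.  children[]
    * is assumed sorted by ascending priority, so later (higher
    * priority) children overwrite what earlier ones painted. */
   static void render(MemoryRegion *mr, target_phys_addr_t base,
                      MemoryRegion **page_table)
   {
       target_phys_addr_t off;
       int i;

       if (mr->terminates) {                     /* RAM or MMIO leaf */
           for (off = 0; off < mr->size; off += PAGE_SIZE)
               page_table[(base + off) >> PAGE_BITS] = mr;
           return;
       }
       for (i = 0; i < mr->nr_children; i++) {   /* transparent container */
           MemoryRegion *child = mr->children[i];
           render(child, base + child->addr, page_table);
       }
   }

A kvm slot list would fall out of the same walk by emitting maximal 
runs of identical entries instead of filling a page table.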

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-20 12:08               ` Gleb Natapov
@ 2011-05-22  7:56                 ` Avi Kivity
  0 siblings, 0 replies; 187+ messages in thread
From: Avi Kivity @ 2011-05-22  7:56 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Jan Kiszka, qemu-devel

On 05/20/2011 03:08 PM, Gleb Natapov wrote:
> On Fri, May 20, 2011 at 12:10:22PM +0300, Avi Kivity wrote:
> >  On 05/19/2011 09:22 PM, Gleb Natapov wrote:
> >  >>
> >  >>   BARs may overlap with other BARs or with RAM. That's well-known, so PCI
> >  >>   bridges need to register their regions with the _overlap variant
> >  >>   unconditionally. In contrast to the current PhysPageDesc mechanism, the
> >  >With what priority?
> >
> >  It doesn't matter, since the spec doesn't define priorities among PCI BARs.
> >
> And among a PCI BAR and memory (the case the question above referred to).

One of them gets priority 0, the other gets priority 1.  Depending on 
who should win according to the spec.

>
> >  >If it needs to call _overlap unconditionally why not
> >  >always call _overlap and drop the non-_overlap variant?
> >
> >  Other uses need non-overlapping registration.
> And who prohibits them from creating one?

Nothing.

> >
> >  >>
> >  >>   And they do not need to. The APIC regions will be managed by the per-CPU
> >  >>   region management, reusing the tool box we need for all bridges. It will
> >  >>   register the APIC page with a priority higher than the default one, thus
> >  >>   overriding everything that comes from the host bridge. I think that
> >  >>   reflects pretty well real machine behaviour.
> >  >>
> >  >What is "higher"? How does it know that priority is high enough?
> >
>  It is well known that 1 > 0, for example.
> >
> That is if you have a global scale. In the case I am asking about, you do
> not. Even if PCI registers a memory region that overlaps the APIC address
> with priority 1000, the APIC memory region should still be able to override
> it even with priority 0. Voila, 1000 < 0? Where is your sarcasm now? :)

This can be done in two ways:

1. Assign APIC the highest priority.  Priority is not determined by the 
guest, but by qemu. Problem solved.
2. Use a hierarchy:


root
   |
   +-- RAM (prio 0)
   |
   +-- PCI (prio 1, lots of children)
   |
   +-- APIC (prio 2)


Nothing under the PCI subregion can obscure the APIC.
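
In code, option 2 might look something like this, again assuming a 
hypothetical memory_region_add_subregion_overlap() that takes a 
priority argument; ram_size, apic_ops and the 4G example sizes are made 
up for the sketch:

   MemoryRegion root, ram, pci, apic;
   extern const MemoryRegionOps apic_ops;
   target_phys_addr_t ram_size = 0x8000000;      /* made-up 128M */
   target_phys_addr_t apic_base = 0xfee00000;

   memory_region_init(&root, 0x100000000ULL);    /* example: 4G space */

   memory_region_init_ram(&ram, ram_size);
   memory_region_add_subregion_overlap(&root, 0, &ram, 0);

   memory_region_init(&pci, 0x100000000ULL);     /* transparent container */
   memory_region_add_subregion_overlap(&root, 0, &pci, 1);

   memory_region_init_io(&apic, &apic_ops, 0x1000);
   memory_region_add_subregion_overlap(&root, apic_base, &apic, 2);

Whatever BARs the devices hand to the PCI container can never shadow 
the APIC page, because the APIC sits at a higher priority in the same 
child list as the PCI container.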

>
> But Jan already answered this one. Actually what really matters is the
> place of the node in a topology, not priority.

You need priority for the children of the same parent.

>   But then for all of this
> to make sense registration has to be hierarchical.

Well, that's the whole point.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-22  7:37                                                               ` Avi Kivity
@ 2011-05-22  8:06                                                                 ` Gleb Natapov
  2011-05-22  8:09                                                                   ` Avi Kivity
  0 siblings, 1 reply; 187+ messages in thread
From: Gleb Natapov @ 2011-05-22  8:06 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Jan Kiszka, qemu-devel

On Sun, May 22, 2011 at 10:37:48AM +0300, Avi Kivity wrote:
> On 05/20/2011 02:57 PM, Gleb Natapov wrote:
> >On Fri, May 20, 2011 at 11:59:58AM +0300, Avi Kivity wrote:
> >>  On 05/19/2011 07:27 PM, Gleb Natapov wrote:
> >>  >>   Think of how a window manager folds windows with priorities onto a
> >>  >>   flat framebuffer.
> >>  >>
> >>  >>   You do a depth-first walk of the tree.  For each child list, you
> >>  >>   iterate it from the lowest to highest priority, allowing later
> >>  >>   subregions override earlier subregions.
> >>  >>
> >>  >I do not think that window manager is a good analogy. Window can
> >>  >overlap with only its siblings. In our memory tree each final node may
> >>  >overlap with any other node in the tree.
> >>  >
> >>
> >>  Transparent windows.
> >>
> >No, still not that. Think about a child window that resides outside of its
> >parent window on screen. In our memory region terms, think of a PCI BAR
> >registered to overlap with RAM at address 0x1000, for instance. The PCI
> >BAR memory region and the RAM memory region are on very different branches
> >of the global tree.
> 
> Right.  But what's the problem with that?
> 
None, unless you want to make the PCI BAR visible at address 0x1000 (like
what will happen today) in the case above.

> Which one takes precedence is determined by the priorities of the
> RAM subregion vs. the PCI bus subregion.
> 
Yes, and that is why the PCI subsystem or platform code can't directly use the
memory API to register PCI memory regions/RAM memory regions respectively,
because they wouldn't know what priorities to specify. Only chipset code
knows, so the RAM/PCI memory registration should go through chipset
code, but even chipset code doesn't know everything. It knows nothing
about cpu-local memory regions, for instance, so all registrations should
go through the system bus in the end. Is this how the API is supposed to be used?

--
			Gleb.

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-22  8:06                                                                 ` Gleb Natapov
@ 2011-05-22  8:09                                                                   ` Avi Kivity
  2011-05-22  8:59                                                                     ` Gleb Natapov
  0 siblings, 1 reply; 187+ messages in thread
From: Avi Kivity @ 2011-05-22  8:09 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Jan Kiszka, qemu-devel

On 05/22/2011 11:06 AM, Gleb Natapov wrote:
> On Sun, May 22, 2011 at 10:37:48AM +0300, Avi Kivity wrote:
> >  On 05/20/2011 02:57 PM, Gleb Natapov wrote:
> >  >On Fri, May 20, 2011 at 11:59:58AM +0300, Avi Kivity wrote:
> >  >>   On 05/19/2011 07:27 PM, Gleb Natapov wrote:
> >  >>   >>    Think of how a window manager folds windows with priorities onto a
> >  >>   >>    flat framebuffer.
> >  >>   >>
> >  >>   >>    You do a depth-first walk of the tree.  For each child list, you
> >  >>   >>    iterate it from the lowest to highest priority, allowing later
> >  >>   >>    subregions override earlier subregions.
> >  >>   >>
> >  >>   >I do not think that window manager is a good analogy. Window can
> >  >>   >overlap with only its siblings. In our memory tree each final node may
> >  >>   >overlap with any other node in the tree.
> >  >>   >
> >  >>
> >  >>   Transparent windows.
> >  >>
> >  >No, still not that. Think about a child window that resides outside of its
> >  >parent window on screen. In our memory region terms, think of a PCI BAR
> >  >registered to overlap with RAM at address 0x1000, for instance. The PCI
> >  >BAR memory region and the RAM memory region are on very different branches
> >  >of the global tree.
> >
> >  Right.  But what's the problem with that?
> >
> None, unless you want to make the PCI BAR visible at address 0x1000 (like
> what will happen today) in the case above.

There is no problem.  If the PCI bus priority is higher than RAM 
priority, then PCI BARs will override RAM.

> >  Which one takes precedence is determined by the priorities of the
> >  RAM subregion vs. the PCI bus subregion.
> >
> Yes, and that is why the PCI subsystem or platform code can't directly use the
> memory API to register PCI memory regions/RAM memory regions respectively,
> because they wouldn't know what priorities to specify. Only chipset code
> knows, so the RAM/PCI memory registration should go through chipset
> code,

Correct.  Chipset code creates RAM and PCI regions, and gives the PCI 
region to the PCI bus.  Devices give BAR subregions to the PCI bus.  The 
PCI bus does the registration.

>   but even chipset code doesn't know everything. It knows nothing
> about cpu-local memory regions, for instance, so all registrations should
> go through the system bus in the end. Is this how the API is supposed to be used?

Yes.  Every point where a decision is made on how to route memory 
accesses is modelled as a container node.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-22  7:50                             ` Avi Kivity
@ 2011-05-22  8:41                               ` Gleb Natapov
  2011-05-22 10:53                                 ` Jan Kiszka
  2011-05-23 22:29                                 ` Jamie Lokier
  0 siblings, 2 replies; 187+ messages in thread
From: Gleb Natapov @ 2011-05-22  8:41 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Jan Kiszka, qemu-devel

On Sun, May 22, 2011 at 10:50:22AM +0300, Avi Kivity wrote:
> On 05/20/2011 02:25 PM, Gleb Natapov wrote:
> >>
> >>  A) Removing regions will change significantly. So far this is done by
> >>  setting a region to IO_MEM_UNASSIGNED, keeping truncation. With the new
> >>  API that will be a true removal which will additionally restore hidden
> >>  regions.
> >>
> >And what problem do you expect may arise from that? Currently accessing
> >such a region after unassign will result in undefined behaviour, so this
> >code is non-working today; you can't make it worse.
> >
> 
> If the conversion were perfect then yes.  However there is a
> possibility that the conversion will not be perfect.
> 
> It's also good to have the code document its intentions.  If
> you see _overlap() you know there is dynamic address decoding going
> on, or something clever.
> 
> >>  B) Uncontrolled overlapping is a bug that should be caught by the core,
> >>  and a new API is a perfect chance to do this.
> >>
> >Well, this will indeed introduce the difference in behaviour :) The guest
> >that ran before will abort now. Are you actually aware of any such
> >overlaps in the current code base?
> 
> Put a BAR over another BAR, then unmap it.
> 
_overlap will not help with that. PCI BARs can overlap, so _overlap will
be used to register them. You do not want to abort qemu when a guest
configures overlapping PCI BARs, do you?

> >But if priorities are going to stay, why not fail if two regions with the
> >same priority overlap? If that happens it means that the memory creation
> >didn't pass the point where conflict should have been resolved (by
> >assigning different priorities) and this means that overlap is
> >unintentional, no?
> 
> It may be intentional, as in the case of PCI, or PAM (though you can
> do PAM without priorities, by removing all but one of the subregions
> in the area).
> 
When two PCI BARs overlap, somebody somewhere has to decide which one
of them will be visible. You can register both of them with the same priority
and let the core decide this at the time it calculates the flattened view, or
you can assign different priorities at the PCI layer. In the latter case you do
not need the _overlap API.

> >I am starting to see how you can represent all these local decisions as
> >priority numbers and then traverse this weighted tree to find what memory
> >region should be accessed (memory registration _has_ to be hierarchical
> >for that to work in a meaningful way).
> 
> Priorities don't travel up the tree.  They're used to resolve local
> conflicts *only*.
Yes, that's what I mean by "traversing the weighted tree". At each node you
look at local priorities to decide where to move.

> 
> >  I still don't see why it is better
> >than flattening the tree at the point of conflict.
> 
> How do you decide which subregion wins?
The same way you decide what priority to assign to each region. You
do that at the point where you know the priorities. Suppose the chipset has
RAM from 0x0000 to 0x1fff and PCI from 0x2000 to 0x2fff, and the PCI layer
tries to map a BAR from 0x1500 to 0x3500. Since the chipset code knows that
RAM has higher priority, and it knows where the PCI window ends, it can
create two memory regions from that: RAM from 0x0000 to 0x1fff, and the PCI
BAR from 0x2000 to 0x2fff. How would your tree look in this case, BTW?
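
For illustration, the arithmetic of that alternative is plain interval 
clipping at the chipset level (this is a sketch of the scheme proposed 
here, not of the RFC API; the types and names are invented):

   #include <stdint.h>

   typedef struct Range { uint64_t start, end; } Range;  /* end inclusive */

   static const Range pci_window = { 0x2000, 0x2fff };   /* chipset knowledge */

   /* Clip a BAR against the PCI window before registering anything with
    * the core; returns 0 if the BAR is entirely hidden. */
   static int clip_bar(Range bar, Range *visible)
   {
       uint64_t lo = bar.start > pci_window.start ? bar.start : pci_window.start;
       uint64_t hi = bar.end < pci_window.end ? bar.end : pci_window.end;

       if (lo > hi)
           return 0;
       visible->start = lo;
       visible->end = hi;
       return 1;
   }

Clipping the BAR 0x1500-0x3500 yields 0x2000-0x2fff, i.e. exactly the 
two-region flat view above: RAM 0x0000-0x1fff, BAR 0x2000-0x2fff.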


> >>  Not necessarily. It depends on how much added value buses like PCI or
> >>  ISA or whatever can offer for managing I/O regions. For some purposes,
> >>  it may as well be fine to just call the memory_* service directly and
> >>  pass the result of some operation to the bus API later on.
> >Depends on what memory_* service you are talking about. Just creating
> >unattached memory region is OK. But if two independent pieces of code
> >want to map two different memory regions into the same phys address I do
> >not see who will resolve the conflict.
> 
> They have to ask the bus to _add_subregion().  Only the bus knows
> about the priorities (or the bus can ask them to create the
> subregions).
Yes, that's what I wrote and Jan responded to. Buses all the way down.

> 
> >>
> >>  >  PCI
> >>  >  device will call PCI subsystem. PCI subsystem, instead of assigning
> >>  >  arbitrary priorities to all overlappings,
> >>
> >>  Again: PCI will _not_ assign arbitrary priorities but only
> >>  MEMORY_REGION_DEFAULT_PRIORITY, likely 0.
> >That is as arbitrary as it can get. Just assigning
> >MEMORY_REGION_DEFAULT_PRIORITY/2^0xfff will work equally well, so what
> >is not arbitrary about that number?
> 
> That's just splitting hairs.  Array indexes start from zero, an
> arbitrary but convenient number.
The point is that instead of assigning priorities, the PCI subsystem may
resolve the conflict before passing the registration to the chipset.

> 
> >BTW why wouldn't PCI layer assign different priorities to overlapping
> >regions to let the core know which one should be actually available? Why
> >leave this decision to the core if it clearly belongs to PCI?
> 
> You mean overlapping BARs?  If PCI wants BAR 1 to override BAR 2,
> then it can indicate it with priorities.  If it doesn't want to, it
> can use the same priority for all regions.
> 
> >>
> >>  That does not specify how the PCI bridge or the chipset will tell that
> >>  overlapping resolution lib _how_ overlapping regions shall be translated
> >>  into a flat representation. And precisely here come priorities into
> >>  play. It is the way to tell that lib either "region A shall override
> >>  region B" if A has higher prio or "if region A and B overlap, do
> >>  whatever you want" if both have the same prio.
> >>
> >Yep! And the question is why shouldn't this be done at the level that
> >knows most about the conflict rather than propagated to the core. I am not
> >arguing that priorities do not exist! Obviously they do. I am
> >questioning the usefulness of priorities being part of the memory core API.
> >
> 
> The chipset knows about the priorities.  How to communicate them to
> the core?
> 
> - at runtime, with hierarchical dispatch of ->read() and ->write():
> slow, and doesn't work at all for RAM.
> - using registration order: fragile
> - using priorities
> 
- by resolving overlaps and registering a flattened list with the core.
  (See example above).

> We need to get the information out of the chipset and into the core,
> so the core can make use of it (like flattening the tree to produce
> kvm slots).
> 
> -- 
> I have a truly marvellous patch that fixes the bug which this
> signature is too narrow to contain.

--
			Gleb.

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-22  8:09                                                                   ` Avi Kivity
@ 2011-05-22  8:59                                                                     ` Gleb Natapov
  2011-05-22 12:26                                                                       ` Avi Kivity
  0 siblings, 1 reply; 187+ messages in thread
From: Gleb Natapov @ 2011-05-22  8:59 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Jan Kiszka, qemu-devel

On Sun, May 22, 2011 at 11:09:08AM +0300, Avi Kivity wrote:
> On 05/22/2011 11:06 AM, Gleb Natapov wrote:
> >On Sun, May 22, 2011 at 10:37:48AM +0300, Avi Kivity wrote:
> >>  On 05/20/2011 02:57 PM, Gleb Natapov wrote:
> >>  >On Fri, May 20, 2011 at 11:59:58AM +0300, Avi Kivity wrote:
> >>  >>   On 05/19/2011 07:27 PM, Gleb Natapov wrote:
> >>  >>   >>    Think of how a window manager folds windows with priorities onto a
> >>  >>   >>    flat framebuffer.
> >>  >>   >>
> >>  >>   >>    You do a depth-first walk of the tree.  For each child list, you
> >>  >>   >>    iterate it from the lowest to highest priority, allowing later
> >>  >>   >>    subregions override earlier subregions.
> >>  >>   >>
> >>  >>   >I do not think that window manager is a good analogy. Window can
> >>  >>   >overlap with only its siblings. In our memory tree each final node may
> >>  >>   >overlap with any other node in the tree.
> >>  >>   >
> >>  >>
> >>  >>   Transparent windows.
> >>  >>
> >>  >No, still not that. Think about a child window that resides outside of its
> >>  >parent window on screen. In our memory region terms, think of a PCI BAR
> >>  >registered to overlap with RAM at address 0x1000, for instance. The PCI
> >>  >BAR memory region and the RAM memory region are on very different branches
> >>  >of the global tree.
> >>
> >>  Right.  But what's the problem with that?
> >>
> >None, unless you want to make the PCI BAR visible at address 0x1000 (like
> >what will happen today) in the case above.
> 
> There is no problem.  If the PCI bus priority is higher than RAM
> priority, then PCI BARs will override RAM.
> 
So if a memory region has no subregion that covers part of the range, the
lower-prio region is used? Now the same, with pictures:

-- root
  -- PCI 0x00000 - 0x2ffff (prio 0)
    -- BAR A 0x1000 - 0x1fff
    -- BAR B 0x20000 - 0x20fff
  -- RAM 0x00000 - 0x1ffff (prio 1)
  
In the tree above, at address 0x0 PCI has higher priority, but since
there is no subregion at this range, the next-prio region is used instead
(RAM). Is this correct? If yes, how does the core know that the container is
transparent like that (the RAM container is not)?

> >>  Which one takes precedence is determined by the priorities of the
> >>  RAM subregion vs. the PCI bus subregion.
> >>
> >Yes, and that is why the PCI subsystem or platform code can't directly use the
> >memory API to register PCI memory regions/RAM memory regions respectively,
> >because they wouldn't know what priorities to specify. Only chipset code
> >knows, so the RAM/PCI memory registration should go through chipset
> >code,
> 
> Correct.  Chipset code creates RAM and PCI regions, and gives the
> PCI region to the PCI bus.  Devices give BAR subregions to the PCI
> bus.  The PCI bus does the registration.
OK. What happens if a device tries to create a subregion outside of the PCI
region provided by the chipset to the PCI bus?

> 
> >  but even chipset code doesn't know everything. It knows nothing
> >about cpu-local memory regions, for instance, so all registrations should
> >go through the system bus in the end. Is this how the API is supposed to be used?
> 
> Yes.  Every point where a decision is made on how to route memory
> accesses is modelled as a container node.
> 
Excellent. I would argue that this is exactly the point where an
overlap can be resolved too :)

--
			Gleb.

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-22  8:41                               ` Gleb Natapov
@ 2011-05-22 10:53                                 ` Jan Kiszka
  2011-05-22 11:29                                   ` Avi Kivity
  2011-05-23 22:29                                 ` Jamie Lokier
  1 sibling, 1 reply; 187+ messages in thread
From: Jan Kiszka @ 2011-05-22 10:53 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Avi Kivity, qemu-devel

On 2011-05-22 10:41, Gleb Natapov wrote:
>> The chipset knows about the priorities.  How to communicate them to
>> the core?
>>
>> - at runtime, with hierarchical dispatch of ->read() and ->write():
>> slow, and doesn't work at all for RAM.
>> - using registration order: fragile
>> - using priorities
>>
> - by resolving overlaps and registering a flattened list with the core.
>   (See example above).

[Registration would happen with the help of the core against the next
higher layer.]

To do this, you need to
 - open-code the resolution logic at every level (very bad idea)
 - provide library services to obtain a flattened representation

Please try to specify such an API without any parameters that are
priority-like.

Jan



^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-22 10:53                                 ` Jan Kiszka
@ 2011-05-22 11:29                                   ` Avi Kivity
  2011-05-23  8:45                                     ` Gleb Natapov
  0 siblings, 1 reply; 187+ messages in thread
From: Avi Kivity @ 2011-05-22 11:29 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Gleb Natapov, qemu-devel

On 05/22/2011 01:53 PM, Jan Kiszka wrote:
> On 2011-05-22 10:41, Gleb Natapov wrote:
> >>  The chipset knows about the priorities.  How to communicate them to
> >>  the core?
> >>
> >>  - at runtime, with hierarchical dispatch of ->read() and ->write():
> >>  slow, and doesn't work at all for RAM.
> >>  - using registration order: fragile
> >>  - using priorities
> >>
> >  - by resolving overlaps and registering a flattened list with the core.
> >    (See example above).
>
> [Registration would happen with the help of the core against the next
> higher layer.]
>
> To do this, you need to
>   - open-code the resolution logic at every level (very bad idea)
>   - provide library services to obtain a flattened representation
>
> Please try to specify such an API without any parameters that are
> priority-like.

Another way of saying the same thing:  having the chipset code resolve 
conflicts, and having the chipset code assign priorities, are 
equivalent.  But having priorities allows flattening to take place 
without further involvement of the chipset code.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-22  8:59                                                                     ` Gleb Natapov
@ 2011-05-22 12:26                                                                       ` Avi Kivity
  0 siblings, 0 replies; 187+ messages in thread
From: Avi Kivity @ 2011-05-22 12:26 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Jan Kiszka, qemu-devel

On 05/22/2011 11:59 AM, Gleb Natapov wrote:
> >  There is no problem.  If the PCI bus priority is higher than RAM
> >  priority, then PCI BARs will override RAM.
> >
> So if a memory region has no subregion that covers part of the range, the
> lower-prio region is used? Now the same, with pictures:
>
> -- root
>    -- PCI 0x00000 - 0x2ffff (prio 0)
>      -- BAR A 0x1000 - 0x1fff
>      -- BAR B 0x20000 - 0x20fff
>    -- RAM 0x00000 - 0x1ffff (prio 1)
>
> In the tree above at address 0x0 PCI has higher priority,

In a previous mail, I asserted that 1 > 0.  Under this assumption, PCI 
has lower priority.

> but since
> there is no subregion at this range, the next-prio region is used instead
> (RAM). Is this correct?

Assuming pci prio = 1 and ram prio = 0, yes.

> If yes, how does the core know that the container is
> transparent like that (the RAM container is not)?

All containers are transparent.  Only RAM and MMIO regions are not.
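
As a sketch of what "transparent" means for dispatch (tree fields again 
invented for illustration, not part of the posted API): containers never 
claim an address themselves; the walk tries children from highest to 
lowest priority and falls through to the next sibling when a container 
resolves to nothing.

   /* Return the RAM/MMIO leaf responsible for addr, or NULL if this
    * subtree is transparent at addr. */
   static MemoryRegion *resolve(MemoryRegion *mr, target_phys_addr_t addr)
   {
       int i;

       if (mr->terminates)
           return mr;                                /* RAM or MMIO leaf */

       for (i = mr->nr_children - 1; i >= 0; i--) {  /* high -> low prio */
           MemoryRegion *child = mr->children[i];
           if (addr >= child->addr && addr - child->addr < child->size) {
               MemoryRegion *hit = resolve(child, addr - child->addr);
               if (hit)
                   return hit;
           }
       }
       return NULL;                                  /* fall through */
   }

With the picture above (PCI at prio 0, RAM at prio 1), an access to 0x0 
tries RAM first and hits it directly; with the priorities swapped, the 
PCI container is tried first, resolves to nothing at 0x0, and the access 
falls through to RAM.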

> >  >>   Which one takes precedence is determined by the priorities of the
> >  >>   RAM subregion vs. the PCI bus subregion.
> >  >>
> >  >Yes, and that is why the PCI subsystem or platform code can't directly use the
> >  >memory API to register PCI memory regions/RAM memory regions respectively,
> >  >because they wouldn't know what priorities to specify. Only chipset code
> >  >knows, so the RAM/PCI memory registration should go through chipset
> >  >code,
> >
> >  Correct.  Chipset code creates RAM and PCI regions, and gives the
> >  PCI region to the PCI bus.  Devices give BAR subregions to the PCI
> >  bus.  The PCI bus does the registration.
> OK. What happens if a device tries to create a subregion outside of the PCI
> region provided by the chipset to the PCI bus?

What does real hardware do?

The region API has an offset and size, and clips anything outside that.  
However, PCI may be more complicated due to 64-bit BARs; we may need an 
extra API to control clipping.

> >
> >  >   but even chipset code doesn't know everything. It knows nothing
> >  >about cpu-local memory regions, for instance, so all registrations should
> >  >go through the system bus in the end. Is this how the API is supposed to be used?
> >
> >  Yes.  Every point where a decision is made on how to route memory
> >  accesses is modelled as a container node.
> >
> Excellent. I would argue that this is exactly the point where an
> overlap can be resolved too :)

It is.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-22  7:38                                                                 ` Avi Kivity
@ 2011-05-22 15:42                                                                   ` Anthony Liguori
  0 siblings, 0 replies; 187+ messages in thread
From: Anthony Liguori @ 2011-05-22 15:42 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Jan Kiszka, qemu-devel, Gleb Natapov

On 05/22/2011 02:38 AM, Avi Kivity wrote:
> On 05/20/2011 06:59 PM, Jan Kiszka wrote:
>> >
>> > Jan had mentioned previously about registering a new temporary window.
>> > I assume the registration always gets highest_priority++, or do you
>> have
>> > to explicitly specify that PCI container gets priority=1?
>>
>> The latter.
>>
>> And I really prefer to have this explicit over deriving the priority
>> from the registration order. That's way too fragile/unhandy. If you
>> decide to replace a region of lower priority later on, you need to
>> reregister everything at that level.
>
> Exactly.

So priority is equivalent to registration order, except it's explicit?

This is more palatable to me, but I do prefer registration order since 
it promises less.

That said, this is a relatively minor feature IMHO, given that I now 
understand the scope, so I'm not going to object strongly.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-22 11:29                                   ` Avi Kivity
@ 2011-05-23  8:45                                     ` Gleb Natapov
  0 siblings, 0 replies; 187+ messages in thread
From: Gleb Natapov @ 2011-05-23  8:45 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Jan Kiszka, qemu-devel

On Sun, May 22, 2011 at 02:29:27PM +0300, Avi Kivity wrote:
> On 05/22/2011 01:53 PM, Jan Kiszka wrote:
> >On 2011-05-22 10:41, Gleb Natapov wrote:
> >>>  The chipset knows about the priorities.  How to communicate them to
> >>>  the core?
> >>>
> >>>  - at runtime, with hierarchical dispatch of ->read() and ->write():
> >>>  slow, and doesn't work at all for RAM.
> >>>  - using registration order: fragile
> >>>  - using priorities
> >>>
> >>  - by resolving overlaps and registering a flattened list with the core.
> >>    (See example above).
> >
> >[Registration would happen with the help of the core against the next
> >higher layer.]
> >
> >To do this, you need to
> >  - open-code the resolution logic at every level (very bad idea)
> >  - provide library services to obtain a flattened representation
> >
> >Please try to specify such an API without any parameters that are
> >priority-like.
> 
> Another way of saying the same thing:  having the chipset code
> resolve conflicts, and having the chipset code assign priorities,
> are equivalent.  But having priorities allows flattening to take
> place without further involvement of the chipset code.
> 
Yes, that basically sums it up. My main concern with the API was that it
was not clear how two independent pieces of code would know how to
assign priorities. But if all registrations happen hierarchically, there
will be no such "independent pieces of code".


--
			Gleb.

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [Qemu-devel] [RFC] Memory API
  2011-05-22  8:41                               ` Gleb Natapov
  2011-05-22 10:53                                 ` Jan Kiszka
@ 2011-05-23 22:29                                 ` Jamie Lokier
  1 sibling, 0 replies; 187+ messages in thread
From: Jamie Lokier @ 2011-05-23 22:29 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Jan Kiszka, Avi Kivity, qemu-devel

Gleb Natapov wrote:
> On Sun, May 22, 2011 at 10:50:22AM +0300, Avi Kivity wrote:
> > On 05/20/2011 02:25 PM, Gleb Natapov wrote:
> > >>
> > >>  A) Removing regions will change significantly. So far this is done by
> > >>  setting a region to IO_MEM_UNASSIGNED, keeping truncation. With the new
> > >>  API that will be a true removal which will additionally restore hidden
> > >>  regions.
> > >>
> > >And what problem do you expect may arise from that? Currently accessing
> > >such a region after unassign will result in undefined behaviour, so this
> > >code is non-working today; you can't make it worse.
> > >
> > 
> > If the conversion were perfect then yes.  However there is a
> > possibility that the conversion will not be perfect.
> > 
> > It's also good to have the code document its intentions.  If
> > you see _overlap() you know there is dynamic address decoding going
> > on, or something clever.
> > 
> > >>  B) Uncontrolled overlapping is a bug that should be caught by the core,
> > >>  and a new API is a perfect chance to do this.
> > >>
> > >Well, this will indeed introduce the difference in behaviour :) The guest
> > >that ran before will abort now. Are you actually aware of any such
> > >overlaps in the current code base?
> > 
> > Put a BAR over another BAR, then unmap it.
> > 
> _overlap will not help with that. PCI BARs can overlap, so _overlap will
> be used to register them. You do not want to abort qemu when a guest
> configures overlapping PCI BARs, do you?

I'd rather guests have no way to abort qemu, except by explicit
agreement... even if they program BARs randomly or do anything else.
Right now my virtual server provider won't let me run my own kernels
because they are paranoid that a non-approved kernel might crash KVM.
Which is reasonable.  Even so, it's possible to reprogram BARs from
guest userspace.

Hot-adding devices, including ones with MMIO or IO addresses that
overlap another existing device, shouldn't make qemu abort either.
Perhaps disable the device, perhaps respond with an error, that's all.

Even then, if hot-adding some ISA device overlaps an existing PCI BAR,
it would be preferable if the devices (probably both of them) simply
didn't receive any bus cycles until the BARs were moved elsewhere,
maybe triggered PCI bus errors or MCEs or something like that, rather
than introducing never-tested-in-practice management-visible state
such as a "disabled" or "refused" device.

I don't know if qemu has devices like this, but many real ISA devices
have software-configurable IO, MMIO and IRQ settings (ISAPNP) - it's
not just PCI.

I thoroughly approve of the plan to keep track of overlapping regions
so that adding/removing them has no side effect.  When they conflict
at equal priorities I suggest a good behaviour would be:

   - No access to the underlying device
   - MCE interrupt or equivalent, signalling a bus error

Then the order of registration doesn't make any difference, which is good.

-- Jamie

^ permalink raw reply	[flat|nested] 187+ messages in thread

end of thread, other threads:[~2011-05-23 22:29 UTC | newest]

Thread overview: 187+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-05-18 13:12 [Qemu-devel] [RFC] Memory API Avi Kivity
2011-05-18 14:05 ` Jan Kiszka
2011-05-18 14:36   ` Avi Kivity
2011-05-18 15:11     ` Jan Kiszka
2011-05-18 15:17       ` Peter Maydell
2011-05-18 15:30         ` Jan Kiszka
2011-05-18 19:10           ` Anthony Liguori
2011-05-18 19:27             ` Jan Kiszka
2011-05-18 19:34               ` Anthony Liguori
2011-05-18 20:02                 ` Alex Williamson
2011-05-18 20:11                   ` Jan Kiszka
2011-05-18 20:13                     ` Alex Williamson
2011-05-18 20:07                 ` Jan Kiszka
2011-05-18 20:41                   ` Anthony Liguori
2011-05-19  8:26               ` Gleb Natapov
2011-05-19  8:30                 ` Jan Kiszka
2011-05-19  8:44                   ` Avi Kivity
2011-05-19 13:59                     ` Anthony Liguori
2011-05-19 13:52                   ` Anthony Liguori
2011-05-19 13:56                     ` Avi Kivity
2011-05-19 13:57                     ` Jan Kiszka
2011-05-19 14:04                       ` Anthony Liguori
2011-05-19 14:06                         ` Jan Kiszka
2011-05-19 14:11                         ` Avi Kivity
2011-05-19 18:18                           ` Anthony Liguori
2011-05-19 18:50                             ` Jan Kiszka
2011-05-19 19:02                               ` Anthony Liguori
2011-05-19 19:10                                 ` Jan Kiszka
2011-05-20  9:15                             ` Avi Kivity
2011-05-20 17:30                   ` Blue Swirl
2011-05-22  7:23                     ` Avi Kivity
2011-05-19  6:31             ` Jan Kiszka
2011-05-19 13:23               ` Anthony Liguori
2011-05-19 13:25                 ` Jan Kiszka
2011-05-19 13:26                 ` Avi Kivity
2011-05-19 13:35                   ` Anthony Liguori
2011-05-19 13:36                     ` Jan Kiszka
2011-05-19 13:43                       ` Avi Kivity
2011-05-19 13:39                     ` Avi Kivity
2011-05-19  8:09             ` Avi Kivity
2011-05-18 15:23       ` Avi Kivity
2011-05-18 15:36         ` Jan Kiszka
2011-05-18 15:42           ` Avi Kivity
2011-05-18 16:00             ` Jan Kiszka
2011-05-18 16:14               ` Avi Kivity
2011-05-18 16:39                 ` Jan Kiszka
2011-05-18 16:47                   ` Avi Kivity
2011-05-18 17:07                     ` Jan Kiszka
2011-05-18 17:15                       ` Avi Kivity
2011-05-18 17:40                         ` Jan Kiszka
2011-05-18 20:13                     ` Richard Henderson
2011-05-19  8:04                       ` Avi Kivity
2011-05-19  9:08             ` Gleb Natapov
2011-05-19  9:10               ` Avi Kivity
2011-05-19  9:14                 ` Gleb Natapov
2011-05-19 11:44                   ` Avi Kivity
2011-05-19 11:54                     ` Gleb Natapov
2011-05-19 11:57                       ` Jan Kiszka
2011-05-19 11:58                         ` Gleb Natapov
2011-05-19 12:02                           ` Avi Kivity
2011-05-19 12:21                             ` Gleb Natapov
2011-05-19 12:02                           ` Jan Kiszka
2011-05-19 11:57                       ` Avi Kivity
2011-05-19 12:20                         ` Jan Kiszka
2011-05-19 12:50                           ` Avi Kivity
2011-05-19 12:58                             ` Jan Kiszka
2011-05-19 13:00                               ` Avi Kivity
2011-05-19 13:03                                 ` Jan Kiszka
2011-05-19 13:07                                   ` Avi Kivity
2011-05-19 13:26                                     ` Jan Kiszka
2011-05-19 13:30                                       ` Avi Kivity
2011-05-19 13:43                                         ` Jan Kiszka
2011-05-19 13:47                                           ` Avi Kivity
2011-05-19 13:49                                         ` Anthony Liguori
2011-05-19 13:53                                           ` Avi Kivity
2011-05-19 14:15                                             ` Anthony Liguori
2011-05-19 14:20                                               ` Jan Kiszka
2011-05-19 14:25                                                 ` Anthony Liguori
2011-05-19 14:28                                                   ` Jan Kiszka
2011-05-19 14:31                                                     ` Avi Kivity
2011-05-19 14:37                                                     ` Anthony Liguori
2011-05-19 14:40                                                       ` Avi Kivity
2011-05-19 16:17                                                         ` Gleb Natapov
2011-05-19 16:25                                                           ` Jan Kiszka
2011-05-19 16:28                                                             ` Gleb Natapov
2011-05-19 16:30                                                               ` Jan Kiszka
2011-05-19 16:36                                                                 ` Anthony Liguori
2011-05-19 16:49                                                                   ` Jan Kiszka
2011-05-19 17:12                                                                     ` Gleb Natapov
2011-05-19 18:11                                                                       ` Jan Kiszka
2011-05-20  8:58                                                                     ` Avi Kivity
2011-05-20  8:56                                                                   ` Avi Kivity
2011-05-20 14:51                                                                     ` Anthony Liguori
2011-05-20 16:43                                                                       ` Olivier Galibert
2011-05-20 17:32                                                                         ` Anthony Liguori
2011-05-22  7:36                                                                       ` Avi Kivity
2011-05-19 16:43                                                                 ` Gleb Natapov
2011-05-19 16:51                                                                   ` Jan Kiszka
2011-05-19 16:27                                                         ` Gleb Natapov
2011-05-20  8:59                                                           ` Avi Kivity
2011-05-20 11:57                                                             ` Gleb Natapov
2011-05-22  7:37                                                               ` Avi Kivity
2011-05-22  8:06                                                                 ` Gleb Natapov
2011-05-22  8:09                                                                   ` Avi Kivity
2011-05-22  8:59                                                                     ` Gleb Natapov
2011-05-22 12:26                                                                       ` Avi Kivity
2011-05-19 16:32                                                         ` Anthony Liguori
2011-05-19 16:35                                                           ` Jan Kiszka
2011-05-19 16:38                                                             ` Anthony Liguori
2011-05-19 16:50                                                               ` Jan Kiszka
2011-05-20  9:03                                                               ` Avi Kivity
2011-05-20  9:01                                                           ` Avi Kivity
2011-05-20 15:33                                                             ` Anthony Liguori
2011-05-20 15:59                                                               ` Jan Kiszka
2011-05-22  7:38                                                                 ` Avi Kivity
2011-05-22 15:42                                                                   ` Anthony Liguori
2011-05-19 14:21                                               ` Avi Kivity
2011-05-19 13:44                 ` Anthony Liguori
2011-05-19 13:47                   ` Jan Kiszka
2011-05-19 13:50                     ` Anthony Liguori
2011-05-19 13:55                       ` Jan Kiszka
2011-05-19 13:55                       ` Avi Kivity
2011-05-19 18:06                         ` Anthony Liguori
2011-05-19 18:21                           ` Jan Kiszka
2011-05-19 13:49                   ` Avi Kivity
2011-05-19  9:24               ` Edgar E. Iglesias
2011-05-19 14:49                 ` Peter Maydell
2011-05-18 16:33         ` Anthony Liguori
2011-05-18 16:41           ` Avi Kivity
2011-05-18 17:04             ` Anthony Liguori
2011-05-18 17:13               ` Avi Kivity
2011-05-18 16:42           ` Jan Kiszka
2011-05-18 17:05             ` Avi Kivity
2011-05-18 15:14   ` Anthony Liguori
2011-05-18 15:26     ` Avi Kivity
2011-05-18 16:21       ` Avi Kivity
2011-05-18 16:42         ` Jan Kiszka
2011-05-18 16:49           ` Avi Kivity
2011-05-18 17:11             ` Anthony Liguori
2011-05-18 17:38               ` Jan Kiszka
2011-05-18 15:08 ` Anthony Liguori
2011-05-18 15:37   ` Avi Kivity
2011-05-18 19:36     ` Jan Kiszka
2011-05-18 15:47   ` Stefan Weil
2011-05-18 16:06     ` Avi Kivity
2011-05-18 16:51       ` Richard Henderson
2011-05-18 16:53         ` Avi Kivity
2011-05-18 17:03           ` Richard Henderson
2011-05-18 15:58 ` Avi Kivity
2011-05-18 16:26 ` Richard Henderson
2011-05-18 16:37   ` Avi Kivity
2011-05-18 17:17 ` Avi Kivity
2011-05-18 19:40 ` Jan Kiszka
2011-05-19  8:06   ` Avi Kivity
2011-05-19  8:08     ` Jan Kiszka
2011-05-19  8:13       ` Avi Kivity
2011-05-19  8:25         ` Jan Kiszka
2011-05-19  8:43           ` Avi Kivity
2011-05-19  9:24             ` Jan Kiszka
2011-05-19 11:58               ` Avi Kivity
2011-05-19 13:36   ` Anthony Liguori
2011-05-19 13:37     ` Jan Kiszka
2011-05-19 13:41       ` Avi Kivity
2011-05-19 17:39       ` Gleb Natapov
2011-05-19 18:03         ` Anthony Liguori
2011-05-19 18:28           ` Gleb Natapov
2011-05-19 18:33             ` Anthony Liguori
2011-05-19 18:11         ` Jan Kiszka
2011-05-19 18:22           ` Gleb Natapov
2011-05-19 18:27             ` Jan Kiszka
2011-05-19 18:40               ` Gleb Natapov
2011-05-19 18:45                 ` Jan Kiszka
2011-05-19 18:50                   ` Gleb Natapov
2011-05-19 18:55                     ` Jan Kiszka
2011-05-19 19:02                       ` Jan Kiszka
2011-05-20  7:23                       ` Gleb Natapov
2011-05-20  7:40                         ` Jan Kiszka
2011-05-20 11:25                           ` Gleb Natapov
2011-05-22  7:50                             ` Avi Kivity
2011-05-22  8:41                               ` Gleb Natapov
2011-05-22 10:53                                 ` Jan Kiszka
2011-05-22 11:29                                   ` Avi Kivity
2011-05-23  8:45                                     ` Gleb Natapov
2011-05-23 22:29                                 ` Jamie Lokier
2011-05-20  9:10             ` Avi Kivity
2011-05-20 12:08               ` Gleb Natapov
2011-05-22  7:56                 ` Avi Kivity
