* [RFC 0/2] Identify aliased maps in vdpa SVQ iova_tree
@ 2024-04-10 10:03 Eugenio Pérez
  2024-04-10 10:03 ` [RFC 1/2] iova_tree: add an id member to DMAMap Eugenio Pérez
                   ` (2 more replies)
  0 siblings, 3 replies; 37+ messages in thread
From: Eugenio Pérez @ 2024-04-10 10:03 UTC (permalink / raw)
  To: qemu-devel
  Cc: Si-Wei Liu, Michael S. Tsirkin, Lei Yang, Peter Xu, Jonah Palmer,
	Dragos Tatulea, Jason Wang

The guest may have overlapping memory regions, where different GPAs lead
to the same HVA.  This causes a problem when overlapping regions
(different GPA but same translated HVA) exist in the tree, as looking
them up by HVA will return them twice.

To solve this, track the GPA in the DMA entry, where it acts as a unique
identifier for the map.  When a map needs to be removed, the iova tree is
able to find the right one.

Users that do not need this extra layer of indirection can use the
iova tree as usual, with id = 0.
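
As an example, this is how the vdpa memory listener ends up filling the
entry after patch 2/2 (shown here as a plain initializer for clarity; at
this point the iova variable still holds the GPA of the section):

    DMAMap mem_region = {
        .translated_addr = (hwaddr)(uintptr_t)vaddr,
        .size = int128_get64(llsize) - 1,   /* DMAMap.size is inclusive */
        .perm = IOMMU_ACCESS_FLAG(true, section->readonly),
        .id = iova,   /* the GPA stays unique even when the HVA aliases */
    };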

This was found by Si-Wei Liu <si-wei.liu@oracle.com>, but I'm having a hard
time reproducing the issue.  This has been tested only without overlapping
maps.  If it works with overlapping maps, it will be integrated in the main
series.

Comments are welcome.  Thanks!

Eugenio Pérez (2):
  iova_tree: add an id member to DMAMap
  vdpa: identify aliased maps in iova_tree

 hw/virtio/vhost-vdpa.c   | 2 ++
 include/qemu/iova-tree.h | 5 +++--
 util/iova-tree.c         | 3 ++-
 3 files changed, 7 insertions(+), 3 deletions(-)

-- 
2.44.0




* [RFC 1/2] iova_tree: add an id member to DMAMap
  2024-04-10 10:03 [RFC 0/2] Identify aliased maps in vdpa SVQ iova_tree Eugenio Pérez
@ 2024-04-10 10:03 ` Eugenio Pérez
  2024-04-18 20:46   ` Si-Wei Liu
  2024-04-10 10:03 ` [RFC 2/2] vdpa: identify aliased maps in iova_tree Eugenio Pérez
  2024-04-12  6:46 ` [RFC 0/2] Identify aliased maps in vdpa SVQ iova_tree Jason Wang
  2 siblings, 1 reply; 37+ messages in thread
From: Eugenio Pérez @ 2024-04-10 10:03 UTC (permalink / raw)
  To: qemu-devel
  Cc: Si-Wei Liu, Michael S. Tsirkin, Lei Yang, Peter Xu, Jonah Palmer,
	Dragos Tatulea, Jason Wang

The IOVA tree is also used to track the mappings of the virtio-net shadow
virtqueue.  These mappings may not match the GPA->HVA ones.

This causes a problem when overlapping regions (different GPA but same
translated HVA) exist in the tree, as looking them up by HVA will return
them twice.  To solve this, create an id member so we can assign unique
identifiers (GPA) to the maps.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 include/qemu/iova-tree.h | 5 +++--
 util/iova-tree.c         | 3 ++-
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
index 2a10a7052e..34ee230e7d 100644
--- a/include/qemu/iova-tree.h
+++ b/include/qemu/iova-tree.h
@@ -36,6 +36,7 @@ typedef struct DMAMap {
     hwaddr iova;
     hwaddr translated_addr;
     hwaddr size;                /* Inclusive */
+    uint64_t id;
     IOMMUAccessFlags perm;
 } QEMU_PACKED DMAMap;
 typedef gboolean (*iova_tree_iterator)(DMAMap *map);
@@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const DMAMap *map);
  * @map: the mapping to search
  *
  * Search for a mapping in the iova tree that translated_addr overlaps with the
- * mapping range specified.  Only the first found mapping will be
- * returned.
+ * mapping range specified and map->id is equal.  Only the first found
+ * mapping will be returned.
  *
  * Return: DMAMap pointer if found, or NULL if not found.  Note that
  * the returned DMAMap pointer is maintained internally.  User should
diff --git a/util/iova-tree.c b/util/iova-tree.c
index 536789797e..0863e0a3b8 100644
--- a/util/iova-tree.c
+++ b/util/iova-tree.c
@@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, gpointer value,
 
     needle = args->needle;
     if (map->translated_addr + map->size < needle->translated_addr ||
-        needle->translated_addr + needle->size < map->translated_addr) {
+        needle->translated_addr + needle->size < map->translated_addr ||
+        needle->id != map->id) {
         return false;
     }
 
-- 
2.44.0




* [RFC 2/2] vdpa: identify aliased maps in iova_tree
  2024-04-10 10:03 [RFC 0/2] Identify aliased maps in vdpa SVQ iova_tree Eugenio Pérez
  2024-04-10 10:03 ` [RFC 1/2] iova_tree: add an id member to DMAMap Eugenio Pérez
@ 2024-04-10 10:03 ` Eugenio Pérez
  2024-04-12  6:46 ` [RFC 0/2] Identify aliased maps in vdpa SVQ iova_tree Jason Wang
  2 siblings, 0 replies; 37+ messages in thread
From: Eugenio Pérez @ 2024-04-10 10:03 UTC (permalink / raw)
  To: qemu-devel
  Cc: Si-Wei Liu, Michael S. Tsirkin, Lei Yang, Peter Xu, Jonah Palmer,
	Dragos Tatulea, Jason Wang

The guest may have overlapping memory regions, where different GPAs lead
to the same HVA.  This causes a problem when overlapping regions
(different GPA but same translated HVA) exist in the tree, as looking
them up by HVA will return them twice.

To solve this, track the GPA in the DMA entry, where it acts as a unique
identifier for the map.  When a map needs to be removed, the iova tree is
able to find the right one.

Users that do not need this extra layer of indirection can use the
iova tree as usual, with id = 0.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-vdpa.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index e827b9175f..90adff597c 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -361,6 +361,7 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
         mem_region.translated_addr = (hwaddr)(uintptr_t)vaddr,
         mem_region.size = int128_get64(llsize) - 1,
         mem_region.perm = IOMMU_ACCESS_FLAG(true, section->readonly),
+        mem_region.id = iova;
 
         r = vhost_iova_tree_map_alloc(s->iova_tree, &mem_region);
         if (unlikely(r != IOVA_OK)) {
@@ -443,6 +444,7 @@ static void vhost_vdpa_listener_region_del(MemoryListener *listener,
         DMAMap mem_region = {
             .translated_addr = (hwaddr)(uintptr_t)vaddr,
             .size = int128_get64(llsize) - 1,
+            .id = iova,
         };
 
         result = vhost_iova_tree_find_iova(s->iova_tree, &mem_region);
-- 
2.44.0




* Re: [RFC 0/2] Identify aliased maps in vdpa SVQ iova_tree
  2024-04-10 10:03 [RFC 0/2] Identify aliased maps in vdpa SVQ iova_tree Eugenio Pérez
  2024-04-10 10:03 ` [RFC 1/2] iova_tree: add an id member to DMAMap Eugenio Pérez
  2024-04-10 10:03 ` [RFC 2/2] vdpa: identify aliased maps in iova_tree Eugenio Pérez
@ 2024-04-12  6:46 ` Jason Wang
  2024-04-12  7:56   ` Eugenio Perez Martin
  2 siblings, 1 reply; 37+ messages in thread
From: Jason Wang @ 2024-04-12  6:46 UTC (permalink / raw)
  To: Eugenio Pérez
  Cc: qemu-devel, Si-Wei Liu, Michael S. Tsirkin, Lei Yang, Peter Xu,
	Jonah Palmer, Dragos Tatulea

On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez <eperezma@redhat.com> wrote:
>
> The guest may have overlapping memory regions, where different GPAs lead
> to the same HVA.  This causes a problem when overlapping regions
> (different GPA but same translated HVA) exist in the tree, as looking
> them up by HVA will return them twice.

I think I don't understand: is there any side effect for the shadow virtqueue?

Thanks

>
> To solve this, track the GPA in the DMA entry, where it acts as a unique
> identifier for the map.  When a map needs to be removed, the iova tree is
> able to find the right one.
>
> Users that do not need this extra layer of indirection can use the
> iova tree as usual, with id = 0.
>
> This was found by Si-Wei Liu <si-wei.liu@oracle.com>, but I'm having a hard
> time reproducing the issue.  This has been tested only without overlapping
> maps.  If it works with overlapping maps, it will be integrated in the main
> series.
>
> Comments are welcome.  Thanks!
>
> Eugenio Pérez (2):
>   iova_tree: add an id member to DMAMap
>   vdpa: identify aliased maps in iova_tree
>
>  hw/virtio/vhost-vdpa.c   | 2 ++
>  include/qemu/iova-tree.h | 5 +++--
>  util/iova-tree.c         | 3 ++-
>  3 files changed, 7 insertions(+), 3 deletions(-)
>
> --
> 2.44.0
>




* Re: [RFC 0/2] Identify aliased maps in vdpa SVQ iova_tree
  2024-04-12  6:46 ` [RFC 0/2] Identify aliased maps in vdpa SVQ iova_tree Jason Wang
@ 2024-04-12  7:56   ` Eugenio Perez Martin
  2024-05-07  7:29     ` Jason Wang
  0 siblings, 1 reply; 37+ messages in thread
From: Eugenio Perez Martin @ 2024-04-12  7:56 UTC (permalink / raw)
  To: Jason Wang
  Cc: qemu-devel, Si-Wei Liu, Michael S. Tsirkin, Lei Yang, Peter Xu,
	Jonah Palmer, Dragos Tatulea

On Fri, Apr 12, 2024 at 8:47 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> >
> > The guest may have overlapping memory regions, where different GPAs lead
> > to the same HVA.  This causes a problem when overlapping regions
> > (different GPA but same translated HVA) exist in the tree, as looking
> > them up by HVA will return them twice.
>
> I think I don't understand: is there any side effect for the shadow virtqueue?
>

My bad, I totally forgot to put a reference to where this comes from.

Si-Wei found that during initialization this sequence of maps /
unmaps happens [1]:

HVA                                 GPA                           IOVA
--------------------------------------------------------------------------------------------
Map
[0x7f7903e00000, 0x7f7983e00000)    [0x0, 0x80000000)             [0x1000, 0x80000000)
[0x7f7983e00000, 0x7f9903e00000)    [0x100000000, 0x2080000000)   [0x80001000, 0x2000001000)
[0x7f7903ea0000, 0x7f7903ec0000)    [0xfeda0000, 0xfedc0000)      [0x2000001000, 0x2000021000)

Unmap
[0x7f7903ea0000, 0x7f7903ec0000)    [0xfeda0000, 0xfedc0000)      [0x1000, 0x20000) ???

The third HVA range is contained in the first one, but exposed under a
different GPA (aliased). This is not "flattened" by QEMU, as the GPAs do
not overlap, only the HVAs do.

At the unmap of the third chunk, the current algorithm finds the first
matching chunk, not the second match, which is the one actually being
unmapped. This series is the way to tell the difference at unmap time.

[1] https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html
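
To make the ambiguity concrete, this is roughly what the removal lookup
sees with the trace above (sketch only; iova_tree_find_iova() searches by
HVA):

    /* needle built from the third map's HVA range */
    const DMAMap needle = {
        .translated_addr = 0x7f7903ea0000,
        .size = 0x20000 - 1,
    };
    /*
     * Both the first map (HVA [0x7f7903e00000, 0x7f7983e00000), IOVA
     * 0x1000) and the third map (HVA [0x7f7903ea0000, 0x7f7903ec0000),
     * IOVA 0x2000001000) overlap the needle, and the search returns the
     * first one, hence the bogus [0x1000, 0x20000) unmap.  With
     * id == GPA (0xfeda0000 here), only the third map matches.
     */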

Thanks!

> Thanks
>
> >
> > To solve this, track the GPA in the DMA entry, where it acts as a unique
> > identifier for the map.  When a map needs to be removed, the iova tree is
> > able to find the right one.
> >
> > Users that do not need this extra layer of indirection can use the
> > iova tree as usual, with id = 0.
> >
> > This was found by Si-Wei Liu <si-wei.liu@oracle.com>, but I'm having a hard
> > time reproducing the issue.  This has been tested only without overlapping
> > maps.  If it works with overlapping maps, it will be integrated in the main
> > series.
> >
> > Comments are welcome.  Thanks!
> >
> > Eugenio Pérez (2):
> >   iova_tree: add an id member to DMAMap
> >   vdpa: identify aliased maps in iova_tree
> >
> >  hw/virtio/vhost-vdpa.c   | 2 ++
> >  include/qemu/iova-tree.h | 5 +++--
> >  util/iova-tree.c         | 3 ++-
> >  3 files changed, 7 insertions(+), 3 deletions(-)
> >
> > --
> > 2.44.0
> >
>




* Re: [RFC 1/2] iova_tree: add an id member to DMAMap
  2024-04-10 10:03 ` [RFC 1/2] iova_tree: add an id member to DMAMap Eugenio Pérez
@ 2024-04-18 20:46   ` Si-Wei Liu
  2024-04-19  8:29     ` Eugenio Perez Martin
  0 siblings, 1 reply; 37+ messages in thread
From: Si-Wei Liu @ 2024-04-18 20:46 UTC (permalink / raw)
  To: Eugenio Pérez, qemu-devel
  Cc: Michael S. Tsirkin, Lei Yang, Peter Xu, Jonah Palmer,
	Dragos Tatulea, Jason Wang



On 4/10/2024 3:03 AM, Eugenio Pérez wrote:
> The IOVA tree is also used to track the mappings of the virtio-net shadow
> virtqueue.  These mappings may not match the GPA->HVA ones.
>
> This causes a problem when overlapping regions (different GPA but same
> translated HVA) exist in the tree, as looking them up by HVA will return
> them twice.  To solve this, create an id member so we can assign unique
> identifiers (GPA) to the maps.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>   include/qemu/iova-tree.h | 5 +++--
>   util/iova-tree.c         | 3 ++-
>   2 files changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
> index 2a10a7052e..34ee230e7d 100644
> --- a/include/qemu/iova-tree.h
> +++ b/include/qemu/iova-tree.h
> @@ -36,6 +36,7 @@ typedef struct DMAMap {
>       hwaddr iova;
>       hwaddr translated_addr;
>       hwaddr size;                /* Inclusive */
> +    uint64_t id;
>       IOMMUAccessFlags perm;
>   } QEMU_PACKED DMAMap;
>   typedef gboolean (*iova_tree_iterator)(DMAMap *map);
> @@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const DMAMap *map);
>    * @map: the mapping to search
>    *
>    * Search for a mapping in the iova tree that translated_addr overlaps with the
> - * mapping range specified.  Only the first found mapping will be
> - * returned.
> + * mapping range specified and map->id is equal.  Only the first found
> + * mapping will be returned.
>    *
>    * Return: DMAMap pointer if found, or NULL if not found.  Note that
>    * the returned DMAMap pointer is maintained internally.  User should
> diff --git a/util/iova-tree.c b/util/iova-tree.c
> index 536789797e..0863e0a3b8 100644
> --- a/util/iova-tree.c
> +++ b/util/iova-tree.c
> @@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, gpointer value,
>   
>       needle = args->needle;
>       if (map->translated_addr + map->size < needle->translated_addr ||
> -        needle->translated_addr + needle->size < map->translated_addr) {
> +        needle->translated_addr + needle->size < map->translated_addr ||
> +        needle->id != map->id) {

It looks like this iterator can also be invoked by SVQ from
vhost_svq_translate_addr() -> iova_tree_find_iova(), where the guest GPA
space will be searched without passing in the ID (GPA), and an exact
match for the same GPA range is not actually needed, unlike the mapping
removal case. Could we create an API variant for the SVQ lookup case
specifically? Or alternatively, add a special flag, say skip_id_match, to
DMAMap, and the id match check may look like below:

(!needle->skip_id_match && needle->id != map->id)

I think vhost_svq_translate_addr() could just call the API variant or
pass a DMAMap with skip_id_match set to true to svq_iova_tree_find_iova().
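
Something along these lines for the iterator, I mean (sketch only;
skip_id_match would be a new DMAMap field defaulting to false, so current
users keep the exact-id behavior):

    /* in iova_tree_find_address_iterator() */
    if (map->translated_addr + map->size < needle->translated_addr ||
        needle->translated_addr + needle->size < map->translated_addr ||
        (!needle->skip_id_match && needle->id != map->id)) {
        return false;
    }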

Thanks,
-Siwei
>           return false;
>       }
>   




* Re: [RFC 1/2] iova_tree: add an id member to DMAMap
  2024-04-18 20:46   ` Si-Wei Liu
@ 2024-04-19  8:29     ` Eugenio Perez Martin
  2024-04-19 23:49       ` Si-Wei Liu
  0 siblings, 1 reply; 37+ messages in thread
From: Eugenio Perez Martin @ 2024-04-19  8:29 UTC (permalink / raw)
  To: Si-Wei Liu
  Cc: qemu-devel, Michael S. Tsirkin, Lei Yang, Peter Xu, Jonah Palmer,
	Dragos Tatulea, Jason Wang

On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
>
>
> On 4/10/2024 3:03 AM, Eugenio Pérez wrote:
> > The IOVA tree is also used to track the mappings of the virtio-net shadow
> > virtqueue.  These mappings may not match the GPA->HVA ones.
> >
> > This causes a problem when overlapping regions (different GPA but same
> > translated HVA) exist in the tree, as looking them up by HVA will return
> > them twice.  To solve this, create an id member so we can assign unique
> > identifiers (GPA) to the maps.
> >
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> >   include/qemu/iova-tree.h | 5 +++--
> >   util/iova-tree.c         | 3 ++-
> >   2 files changed, 5 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
> > index 2a10a7052e..34ee230e7d 100644
> > --- a/include/qemu/iova-tree.h
> > +++ b/include/qemu/iova-tree.h
> > @@ -36,6 +36,7 @@ typedef struct DMAMap {
> >       hwaddr iova;
> >       hwaddr translated_addr;
> >       hwaddr size;                /* Inclusive */
> > +    uint64_t id;
> >       IOMMUAccessFlags perm;
> >   } QEMU_PACKED DMAMap;
> >   typedef gboolean (*iova_tree_iterator)(DMAMap *map);
> > @@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const DMAMap *map);
> >    * @map: the mapping to search
> >    *
> >    * Search for a mapping in the iova tree that translated_addr overlaps with the
> > - * mapping range specified.  Only the first found mapping will be
> > - * returned.
> > + * mapping range specified and map->id is equal.  Only the first found
> > + * mapping will be returned.
> >    *
> >    * Return: DMAMap pointer if found, or NULL if not found.  Note that
> >    * the returned DMAMap pointer is maintained internally.  User should
> > diff --git a/util/iova-tree.c b/util/iova-tree.c
> > index 536789797e..0863e0a3b8 100644
> > --- a/util/iova-tree.c
> > +++ b/util/iova-tree.c
> > @@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, gpointer value,
> >
> >       needle = args->needle;
> >       if (map->translated_addr + map->size < needle->translated_addr ||
> > -        needle->translated_addr + needle->size < map->translated_addr) {
> > +        needle->translated_addr + needle->size < map->translated_addr ||
> > +        needle->id != map->id) {
>
> It looks this iterator can also be invoked by SVQ from
> vhost_svq_translate_addr() -> iova_tree_find_iova(), where guest GPA
> space will be searched on without passing in the ID (GPA), and exact
> match for the same GPA range is not actually needed unlike the mapping
> removal case. Could we create an API variant, for the SVQ lookup case
> specifically? Or alternatively, add a special flag, say skip_id_match to
> DMAMap, and the id match check may look like below:
>
> (!needle->skip_id_match && needle->id != map->id)
>
> I think vhost_svq_translate_addr() could just call the API variant or
> pass DMAmap with skip_id_match set to true to svq_iova_tree_find_iova().
>

I think you're totally right. But I'd really like to not complicate
the API of the iova_tree more.

I think we can look up the region with memory_region_from_host and
then get the hwaddr from it. It is another lookup though...
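
Something in this direction, I mean (untested sketch; recovering the
absolute GPA from the (mr, offset) pair is the part that still needs
care):

    ram_addr_t offset;
    MemoryRegion *mr = memory_region_from_host((void *)vaddr, &offset);

    if (!mr) {
        /* Host-only memory like the SVQ vrings: keep the HVA search */
    } else {
        /* Guest memory: we could recover the GPA and use it as the id */
    }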

> Thanks,
> -Siwei
> >           return false;
> >       }
> >
>




* Re: [RFC 1/2] iova_tree: add an id member to DMAMap
  2024-04-19  8:29     ` Eugenio Perez Martin
@ 2024-04-19 23:49       ` Si-Wei Liu
  2024-04-22  8:49         ` Eugenio Perez Martin
  0 siblings, 1 reply; 37+ messages in thread
From: Si-Wei Liu @ 2024-04-19 23:49 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: qemu-devel, Michael S. Tsirkin, Lei Yang, Peter Xu, Jonah Palmer,
	Dragos Tatulea, Jason Wang



On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote:
> On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>
>>
>> On 4/10/2024 3:03 AM, Eugenio Pérez wrote:
>>> The IOVA tree is also used to track the mappings of the virtio-net shadow
>>> virtqueue.  These mappings may not match the GPA->HVA ones.
>>>
>>> This causes a problem when overlapping regions (different GPA but same
>>> translated HVA) exist in the tree, as looking them up by HVA will return
>>> them twice.  To solve this, create an id member so we can assign unique
>>> identifiers (GPA) to the maps.
>>>
>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>> ---
>>>    include/qemu/iova-tree.h | 5 +++--
>>>    util/iova-tree.c         | 3 ++-
>>>    2 files changed, 5 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
>>> index 2a10a7052e..34ee230e7d 100644
>>> --- a/include/qemu/iova-tree.h
>>> +++ b/include/qemu/iova-tree.h
>>> @@ -36,6 +36,7 @@ typedef struct DMAMap {
>>>        hwaddr iova;
>>>        hwaddr translated_addr;
>>>        hwaddr size;                /* Inclusive */
>>> +    uint64_t id;
>>>        IOMMUAccessFlags perm;
>>>    } QEMU_PACKED DMAMap;
>>>    typedef gboolean (*iova_tree_iterator)(DMAMap *map);
>>> @@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const DMAMap *map);
>>>     * @map: the mapping to search
>>>     *
>>>     * Search for a mapping in the iova tree that translated_addr overlaps with the
>>> - * mapping range specified.  Only the first found mapping will be
>>> - * returned.
>>> + * mapping range specified and map->id is equal.  Only the first found
>>> + * mapping will be returned.
>>>     *
>>>     * Return: DMAMap pointer if found, or NULL if not found.  Note that
>>>     * the returned DMAMap pointer is maintained internally.  User should
>>> diff --git a/util/iova-tree.c b/util/iova-tree.c
>>> index 536789797e..0863e0a3b8 100644
>>> --- a/util/iova-tree.c
>>> +++ b/util/iova-tree.c
>>> @@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, gpointer value,
>>>
>>>        needle = args->needle;
>>>        if (map->translated_addr + map->size < needle->translated_addr ||
>>> -        needle->translated_addr + needle->size < map->translated_addr) {
>>> +        needle->translated_addr + needle->size < map->translated_addr ||
>>> +        needle->id != map->id) {
>> It looks this iterator can also be invoked by SVQ from
>> vhost_svq_translate_addr() -> iova_tree_find_iova(), where guest GPA
>> space will be searched on without passing in the ID (GPA), and exact
>> match for the same GPA range is not actually needed unlike the mapping
>> removal case. Could we create an API variant, for the SVQ lookup case
>> specifically? Or alternatively, add a special flag, say skip_id_match to
>> DMAMap, and the id match check may look like below:
>>
>> (!needle->skip_id_match && needle->id != map->id)
>>
>> I think vhost_svq_translate_addr() could just call the API variant or
>> pass DMAmap with skip_id_match set to true to svq_iova_tree_find_iova().
>>
> I think you're totally right. But I'd really like to not complicate
> the API of the iova_tree more.
>
> I think we can look for the hwaddr using memory_region_from_host and
> then get the hwaddr. It is another lookup though...
Yeah, that will be another means of doing translation without having to
complicate the API around iova_tree. I wonder how the lookup through
memory_region_from_host() may perform compared to the iova tree one: the
former looks to be an O(N) linear search on a linked list, while the
latter would be roughly O(log N) on an AVL tree? Of course,
memory_region_from_host() won't search out of the guest memory space for
sure. As this could be on the hot data path, I have a little bit of
hesitance over the potential cost or performance regression this change
could bring in, but maybe I'm overthinking it too much...

Thanks,
-Siwei

>
>> Thanks,
>> -Siwei
>>>            return false;
>>>        }
>>>




* Re: [RFC 1/2] iova_tree: add an id member to DMAMap
  2024-04-19 23:49       ` Si-Wei Liu
@ 2024-04-22  8:49         ` Eugenio Perez Martin
  2024-04-23 22:20           ` Si-Wei Liu
  0 siblings, 1 reply; 37+ messages in thread
From: Eugenio Perez Martin @ 2024-04-22  8:49 UTC (permalink / raw)
  To: Si-Wei Liu
  Cc: qemu-devel, Michael S. Tsirkin, Lei Yang, Peter Xu, Jonah Palmer,
	Dragos Tatulea, Jason Wang

On Sat, Apr 20, 2024 at 1:50 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
>
>
> On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote:
> > On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>
> >>
> >> On 4/10/2024 3:03 AM, Eugenio Pérez wrote:
> >>> The IOVA tree is also used to track the mappings of the virtio-net shadow
> >>> virtqueue.  These mappings may not match the GPA->HVA ones.
> >>>
> >>> This causes a problem when overlapping regions (different GPA but same
> >>> translated HVA) exist in the tree, as looking them up by HVA will return
> >>> them twice.  To solve this, create an id member so we can assign unique
> >>> identifiers (GPA) to the maps.
> >>>
> >>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> >>> ---
> >>>    include/qemu/iova-tree.h | 5 +++--
> >>>    util/iova-tree.c         | 3 ++-
> >>>    2 files changed, 5 insertions(+), 3 deletions(-)
> >>>
> >>> diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
> >>> index 2a10a7052e..34ee230e7d 100644
> >>> --- a/include/qemu/iova-tree.h
> >>> +++ b/include/qemu/iova-tree.h
> >>> @@ -36,6 +36,7 @@ typedef struct DMAMap {
> >>>        hwaddr iova;
> >>>        hwaddr translated_addr;
> >>>        hwaddr size;                /* Inclusive */
> >>> +    uint64_t id;
> >>>        IOMMUAccessFlags perm;
> >>>    } QEMU_PACKED DMAMap;
> >>>    typedef gboolean (*iova_tree_iterator)(DMAMap *map);
> >>> @@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const DMAMap *map);
> >>>     * @map: the mapping to search
> >>>     *
> >>>     * Search for a mapping in the iova tree that translated_addr overlaps with the
> >>> - * mapping range specified.  Only the first found mapping will be
> >>> - * returned.
> >>> + * mapping range specified and map->id is equal.  Only the first found
> >>> + * mapping will be returned.
> >>>     *
> >>>     * Return: DMAMap pointer if found, or NULL if not found.  Note that
> >>>     * the returned DMAMap pointer is maintained internally.  User should
> >>> diff --git a/util/iova-tree.c b/util/iova-tree.c
> >>> index 536789797e..0863e0a3b8 100644
> >>> --- a/util/iova-tree.c
> >>> +++ b/util/iova-tree.c
> >>> @@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, gpointer value,
> >>>
> >>>        needle = args->needle;
> >>>        if (map->translated_addr + map->size < needle->translated_addr ||
> >>> -        needle->translated_addr + needle->size < map->translated_addr) {
> >>> +        needle->translated_addr + needle->size < map->translated_addr ||
> >>> +        needle->id != map->id) {
> >> It looks this iterator can also be invoked by SVQ from
> >> vhost_svq_translate_addr() -> iova_tree_find_iova(), where guest GPA
> >> space will be searched on without passing in the ID (GPA), and exact
> >> match for the same GPA range is not actually needed unlike the mapping
> >> removal case. Could we create an API variant, for the SVQ lookup case
> >> specifically? Or alternatively, add a special flag, say skip_id_match to
> >> DMAMap, and the id match check may look like below:
> >>
> >> (!needle->skip_id_match && needle->id != map->id)
> >>
> >> I think vhost_svq_translate_addr() could just call the API variant or
> >> pass DMAmap with skip_id_match set to true to svq_iova_tree_find_iova().
> >>
> > I think you're totally right. But I'd really like to not complicate
> > the API of the iova_tree more.
> >
> > I think we can look for the hwaddr using memory_region_from_host and
> > then get the hwaddr. It is another lookup though...
> Yeah, that will be another means of doing translation without having to
> complicate the API around iova_tree. I wonder how the lookup through
> memory_region_from_host() may perform compared to the iova tree one, the
> former looks to be an O(N) linear search on a linked list while the
> latter would be roughly O(log N) on an AVL tree?

Even worse: the reverse lookup (from QEMU vaddr to SVQ IOVA) is
linear too. It is not even ordered.

But apart from this detail you're right, I have the same concerns with
this solution too. If we see a hard performance regression we could go
for more complicated solutions, like maintaining a reverse IOVATree in
vhost-iova-tree too. The first RFCs of SVQ actually did that.
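
The reverse tree idea would be roughly this (sketch only; taddr_iova_map
is hypothetical, and the other members of VhostIOVATree are omitted):

    typedef struct VhostIOVATree {
        /* IOVA -> HVA tree, as today */
        IOVATree *iova_taddr_map;
        /*
         * Hypothetical: HVA -> SVQ IOVA reverse tree, kept in sync on
         * every map / unmap so the reverse lookup becomes O(log N)
         * instead of a full traversal.
         */
        IOVATree *taddr_iova_map;
    } VhostIOVATree;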

Thanks!

> Of course,
> memory_region_from_host() won't search out of the guest memory space for
> sure. As this could be on the hot data path I have a little bit
> hesitance over the potential cost or performance regression this change
> could bring in, but maybe I'm overthinking it too much...
>
> Thanks,
> -Siwei
>
> >
> >> Thanks,
> >> -Siwei
> >>>            return false;
> >>>        }
> >>>
>




* Re: [RFC 1/2] iova_tree: add an id member to DMAMap
  2024-04-22  8:49         ` Eugenio Perez Martin
@ 2024-04-23 22:20           ` Si-Wei Liu
  2024-04-24  7:33             ` Eugenio Perez Martin
  0 siblings, 1 reply; 37+ messages in thread
From: Si-Wei Liu @ 2024-04-23 22:20 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: qemu-devel, Michael S. Tsirkin, Lei Yang, Peter Xu, Jonah Palmer,
	Dragos Tatulea, Jason Wang



On 4/22/2024 1:49 AM, Eugenio Perez Martin wrote:
> On Sat, Apr 20, 2024 at 1:50 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>
>>
>> On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote:
>>> On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>
>>>> On 4/10/2024 3:03 AM, Eugenio Pérez wrote:
>>>>> The IOVA tree is also used to track the mappings of the virtio-net shadow
>>>>> virtqueue.  These mappings may not match the GPA->HVA ones.
>>>>>
>>>>> This causes a problem when overlapping regions (different GPA but same
>>>>> translated HVA) exist in the tree, as looking them up by HVA will return
>>>>> them twice.  To solve this, create an id member so we can assign unique
>>>>> identifiers (GPA) to the maps.
>>>>>
>>>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>>>> ---
>>>>>     include/qemu/iova-tree.h | 5 +++--
>>>>>     util/iova-tree.c         | 3 ++-
>>>>>     2 files changed, 5 insertions(+), 3 deletions(-)
>>>>>
>>>>> diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
>>>>> index 2a10a7052e..34ee230e7d 100644
>>>>> --- a/include/qemu/iova-tree.h
>>>>> +++ b/include/qemu/iova-tree.h
>>>>> @@ -36,6 +36,7 @@ typedef struct DMAMap {
>>>>>         hwaddr iova;
>>>>>         hwaddr translated_addr;
>>>>>         hwaddr size;                /* Inclusive */
>>>>> +    uint64_t id;
>>>>>         IOMMUAccessFlags perm;
>>>>>     } QEMU_PACKED DMAMap;
>>>>>     typedef gboolean (*iova_tree_iterator)(DMAMap *map);
>>>>> @@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const DMAMap *map);
>>>>>      * @map: the mapping to search
>>>>>      *
>>>>>      * Search for a mapping in the iova tree that translated_addr overlaps with the
>>>>> - * mapping range specified.  Only the first found mapping will be
>>>>> - * returned.
>>>>> + * mapping range specified and map->id is equal.  Only the first found
>>>>> + * mapping will be returned.
>>>>>      *
>>>>>      * Return: DMAMap pointer if found, or NULL if not found.  Note that
>>>>>      * the returned DMAMap pointer is maintained internally.  User should
>>>>> diff --git a/util/iova-tree.c b/util/iova-tree.c
>>>>> index 536789797e..0863e0a3b8 100644
>>>>> --- a/util/iova-tree.c
>>>>> +++ b/util/iova-tree.c
>>>>> @@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, gpointer value,
>>>>>
>>>>>         needle = args->needle;
>>>>>         if (map->translated_addr + map->size < needle->translated_addr ||
>>>>> -        needle->translated_addr + needle->size < map->translated_addr) {
>>>>> +        needle->translated_addr + needle->size < map->translated_addr ||
>>>>> +        needle->id != map->id) {
>>>> It looks this iterator can also be invoked by SVQ from
>>>> vhost_svq_translate_addr() -> iova_tree_find_iova(), where guest GPA
>>>> space will be searched on without passing in the ID (GPA), and exact
>>>> match for the same GPA range is not actually needed unlike the mapping
>>>> removal case. Could we create an API variant, for the SVQ lookup case
>>>> specifically? Or alternatively, add a special flag, say skip_id_match to
>>>> DMAMap, and the id match check may look like below:
>>>>
>>>> (!needle->skip_id_match && needle->id != map->id)
>>>>
>>>> I think vhost_svq_translate_addr() could just call the API variant or
>>>> pass DMAmap with skip_id_match set to true to svq_iova_tree_find_iova().
>>>>
>>> I think you're totally right. But I'd really like to not complicate
>>> the API of the iova_tree more.
>>>
>>> I think we can look for the hwaddr using memory_region_from_host and
>>> then get the hwaddr. It is another lookup though...
>> Yeah, that will be another means of doing translation without having to
>> complicate the API around iova_tree. I wonder how the lookup through
>> memory_region_from_host() may perform compared to the iova tree one, the
>> former looks to be an O(N) linear search on a linked list while the
>> latter would be roughly O(log N) on an AVL tree?
> Even worse, as the reverse lookup (from QEMU vaddr to SVQ IOVA) is
> linear too. It is not even ordered.
Oh sorry, I misread the code; I should have looked for g_tree_foreach()
instead of g_tree_search_node(). So the former is indeed a linear
iteration, but it looks to be ordered?

https://github.com/GNOME/glib/blob/main/glib/gtree.c#L1115
>
> But apart from this detail you're right, I have the same concerns with
> this solution too. If we see a hard performance regression we could go
> to more complicated solutions, like maintaining a reverse IOVATree in
> vhost-iova-tree too. First RFCs of SVQ did that actually.
Agreed, yep, we can use memory_region_from_host for now.  Any reason why
the reverse IOVATree was dropped, lack of users? But now we have one!

Thanks,
-Siwei
>
> Thanks!
>
>> Of course,
>> memory_region_from_host() won't search out of the guest memory space for
>> sure. As this could be on the hot data path I have a little bit
>> hesitance over the potential cost or performance regression this change
>> could bring in, but maybe I'm overthinking it too much...
>>
>> Thanks,
>> -Siwei
>>
>>>> Thanks,
>>>> -Siwei
>>>>>             return false;
>>>>>         }
>>>>>




* Re: [RFC 1/2] iova_tree: add an id member to DMAMap
  2024-04-23 22:20           ` Si-Wei Liu
@ 2024-04-24  7:33             ` Eugenio Perez Martin
  2024-04-25 17:43               ` Si-Wei Liu
  0 siblings, 1 reply; 37+ messages in thread
From: Eugenio Perez Martin @ 2024-04-24  7:33 UTC (permalink / raw)
  To: Si-Wei Liu
  Cc: qemu-devel, Michael S. Tsirkin, Lei Yang, Peter Xu, Jonah Palmer,
	Dragos Tatulea, Jason Wang

On Wed, Apr 24, 2024 at 12:21 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
>
>
> On 4/22/2024 1:49 AM, Eugenio Perez Martin wrote:
> > On Sat, Apr 20, 2024 at 1:50 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>
> >>
> >> On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote:
> >>> On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>
> >>>> On 4/10/2024 3:03 AM, Eugenio Pérez wrote:
> >>>>> The IOVA tree is also used to track the mappings of the virtio-net shadow
> >>>>> virtqueue.  These mappings may not match the GPA->HVA ones.
> >>>>>
> >>>>> This causes a problem when overlapping regions (different GPA but same
> >>>>> translated HVA) exist in the tree, as looking them up by HVA will return
> >>>>> them twice.  To solve this, create an id member so we can assign unique
> >>>>> identifiers (GPA) to the maps.
> >>>>>
> >>>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> >>>>> ---
> >>>>>     include/qemu/iova-tree.h | 5 +++--
> >>>>>     util/iova-tree.c         | 3 ++-
> >>>>>     2 files changed, 5 insertions(+), 3 deletions(-)
> >>>>>
> >>>>> diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
> >>>>> index 2a10a7052e..34ee230e7d 100644
> >>>>> --- a/include/qemu/iova-tree.h
> >>>>> +++ b/include/qemu/iova-tree.h
> >>>>> @@ -36,6 +36,7 @@ typedef struct DMAMap {
> >>>>>         hwaddr iova;
> >>>>>         hwaddr translated_addr;
> >>>>>         hwaddr size;                /* Inclusive */
> >>>>> +    uint64_t id;
> >>>>>         IOMMUAccessFlags perm;
> >>>>>     } QEMU_PACKED DMAMap;
> >>>>>     typedef gboolean (*iova_tree_iterator)(DMAMap *map);
> >>>>> @@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const DMAMap *map);
> >>>>>      * @map: the mapping to search
> >>>>>      *
> >>>>>      * Search for a mapping in the iova tree that translated_addr overlaps with the
> >>>>> - * mapping range specified.  Only the first found mapping will be
> >>>>> - * returned.
> >>>>> + * mapping range specified and map->id is equal.  Only the first found
> >>>>> + * mapping will be returned.
> >>>>>      *
> >>>>>      * Return: DMAMap pointer if found, or NULL if not found.  Note that
> >>>>>      * the returned DMAMap pointer is maintained internally.  User should
> >>>>> diff --git a/util/iova-tree.c b/util/iova-tree.c
> >>>>> index 536789797e..0863e0a3b8 100644
> >>>>> --- a/util/iova-tree.c
> >>>>> +++ b/util/iova-tree.c
> >>>>> @@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, gpointer value,
> >>>>>
> >>>>>         needle = args->needle;
> >>>>>         if (map->translated_addr + map->size < needle->translated_addr ||
> >>>>> -        needle->translated_addr + needle->size < map->translated_addr) {
> >>>>> +        needle->translated_addr + needle->size < map->translated_addr ||
> >>>>> +        needle->id != map->id) {
> >>>> It looks this iterator can also be invoked by SVQ from
> >>>> vhost_svq_translate_addr() -> iova_tree_find_iova(), where guest GPA
> >>>> space will be searched on without passing in the ID (GPA), and exact
> >>>> match for the same GPA range is not actually needed unlike the mapping
> >>>> removal case. Could we create an API variant, for the SVQ lookup case
> >>>> specifically? Or alternatively, add a special flag, say skip_id_match to
> >>>> DMAMap, and the id match check may look like below:
> >>>>
> >>>> (!needle->skip_id_match && needle->id != map->id)
> >>>>
> >>>> I think vhost_svq_translate_addr() could just call the API variant or
> >>>> pass DMAmap with skip_id_match set to true to svq_iova_tree_find_iova().
> >>>>
> >>> I think you're totally right. But I'd really like to not complicate
> >>> the API of the iova_tree more.
> >>>
> >>> I think we can look for the hwaddr using memory_region_from_host and
> >>> then get the hwaddr. It is another lookup though...
> >> Yeah, that will be another means of doing translation without having to
> >> complicate the API around iova_tree. I wonder how the lookup through
> >> memory_region_from_host() may perform compared to the iova tree one, the
> >> former looks to be an O(N) linear search on a linked list while the
> >> latter would be roughly O(log N) on an AVL tree?
> > Even worse, as the reverse lookup (from QEMU vaddr to SVQ IOVA) is
> > linear too. It is not even ordered.
> Oh Sorry, I misread the code and I should look for g_tree_foreach ()
> instead of g_tree_search_node(). So the former is indeed linear
> iteration, but it looks to be ordered?
>
> https://github.com/GNOME/glib/blob/main/glib/gtree.c#L1115

The GPA / IOVA are ordered, but we're looking them up by QEMU's vaddr.

If we have these translations:
[0x1000, 0x2000] -> [0x10000, 0x11000]
[0x2000, 0x3000] -> [0x6000, 0x7000]

We will see them in this order, so we cannot stop the search at the first node.

> >
> > But apart from this detail you're right, I have the same concerns with
> > this solution too. If we see a hard performance regression we could go
> > to more complicated solutions, like maintaining a reverse IOVATree in
> > vhost-iova-tree too. First RFCs of SVQ did that actually.
> Agreed, yeap we can use memory_region_from_host for now.  Any reason why
> reverse IOVATree was dropped, lack of users? But now we have one!
>

No, it is just simplicity. We already have a user in the hot path in
the master branch, vhost_svq_vring_write_descs. But I never profiled
enough to find out if it is a bottleneck or not, to be honest.

I'll send the new series by today, thank you for finding these issues!

> Thanks,
> -Siwei
> >
> > Thanks!
> >
> >> Of course,
> >> memory_region_from_host() won't search out of the guest memory space for
> >> sure. As this could be on the hot data path I have a little bit
> >> hesitance over the potential cost or performance regression this change
> >> could bring in, but maybe I'm overthinking it too much...
> >>
> >> Thanks,
> >> -Siwei
> >>
> >>>> Thanks,
> >>>> -Siwei
> >>>>>             return false;
> >>>>>         }
> >>>>>
>




* Re: [RFC 1/2] iova_tree: add an id member to DMAMap
  2024-04-24  7:33             ` Eugenio Perez Martin
@ 2024-04-25 17:43               ` Si-Wei Liu
  2024-04-29  8:14                 ` Eugenio Perez Martin
  0 siblings, 1 reply; 37+ messages in thread
From: Si-Wei Liu @ 2024-04-25 17:43 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: qemu-devel, Michael S. Tsirkin, Lei Yang, Peter Xu, Jonah Palmer,
	Dragos Tatulea, Jason Wang



On 4/24/2024 12:33 AM, Eugenio Perez Martin wrote:
> On Wed, Apr 24, 2024 at 12:21 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>
>>
>> On 4/22/2024 1:49 AM, Eugenio Perez Martin wrote:
>>> On Sat, Apr 20, 2024 at 1:50 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>
>>>> On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote:
>>>>> On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>>> On 4/10/2024 3:03 AM, Eugenio Pérez wrote:
>>>>>>> The IOVA tree is also used to track the mappings of the virtio-net shadow
>>>>>>> virtqueue.  These mappings may not match the GPA->HVA ones.
>>>>>>>
>>>>>>> This causes a problem when overlapping regions (different GPA but same
>>>>>>> translated HVA) exist in the tree, as looking them up by HVA will return
>>>>>>> them twice.  To solve this, create an id member so we can assign unique
>>>>>>> identifiers (GPA) to the maps.
>>>>>>>
>>>>>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>>>>>> ---
>>>>>>>      include/qemu/iova-tree.h | 5 +++--
>>>>>>>      util/iova-tree.c         | 3 ++-
>>>>>>>      2 files changed, 5 insertions(+), 3 deletions(-)
>>>>>>>
>>>>>>> diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
>>>>>>> index 2a10a7052e..34ee230e7d 100644
>>>>>>> --- a/include/qemu/iova-tree.h
>>>>>>> +++ b/include/qemu/iova-tree.h
>>>>>>> @@ -36,6 +36,7 @@ typedef struct DMAMap {
>>>>>>>          hwaddr iova;
>>>>>>>          hwaddr translated_addr;
>>>>>>>          hwaddr size;                /* Inclusive */
>>>>>>> +    uint64_t id;
>>>>>>>          IOMMUAccessFlags perm;
>>>>>>>      } QEMU_PACKED DMAMap;
>>>>>>>      typedef gboolean (*iova_tree_iterator)(DMAMap *map);
>>>>>>> @@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const DMAMap *map);
>>>>>>>       * @map: the mapping to search
>>>>>>>       *
>>>>>>>       * Search for a mapping in the iova tree that translated_addr overlaps with the
>>>>>>> - * mapping range specified.  Only the first found mapping will be
>>>>>>> - * returned.
>>>>>>> + * mapping range specified and map->id is equal.  Only the first found
>>>>>>> + * mapping will be returned.
>>>>>>>       *
>>>>>>>       * Return: DMAMap pointer if found, or NULL if not found.  Note that
>>>>>>>       * the returned DMAMap pointer is maintained internally.  User should
>>>>>>> diff --git a/util/iova-tree.c b/util/iova-tree.c
>>>>>>> index 536789797e..0863e0a3b8 100644
>>>>>>> --- a/util/iova-tree.c
>>>>>>> +++ b/util/iova-tree.c
>>>>>>> @@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, gpointer value,
>>>>>>>
>>>>>>>          needle = args->needle;
>>>>>>>          if (map->translated_addr + map->size < needle->translated_addr ||
>>>>>>> -        needle->translated_addr + needle->size < map->translated_addr) {
>>>>>>> +        needle->translated_addr + needle->size < map->translated_addr ||
>>>>>>> +        needle->id != map->id) {
>>>>>> It looks this iterator can also be invoked by SVQ from
>>>>>> vhost_svq_translate_addr() -> iova_tree_find_iova(), where guest GPA
>>>>>> space will be searched on without passing in the ID (GPA), and exact
>>>>>> match for the same GPA range is not actually needed unlike the mapping
>>>>>> removal case. Could we create an API variant, for the SVQ lookup case
>>>>>> specifically? Or alternatively, add a special flag, say skip_id_match to
>>>>>> DMAMap, and the id match check may look like below:
>>>>>>
>>>>>> (!needle->skip_id_match && needle->id != map->id)
>>>>>>
>>>>>> I think vhost_svq_translate_addr() could just call the API variant or
>>>>>> pass DMAmap with skip_id_match set to true to svq_iova_tree_find_iova().
>>>>>>
>>>>> I think you're totally right. But I'd really like to not complicate
>>>>> the API of the iova_tree more.
>>>>>
>>>>> I think we can look for the hwaddr using memory_region_from_host and
>>>>> then get the hwaddr. It is another lookup though...
>>>> Yeah, that will be another means of doing translation without having to
>>>> complicate the API around iova_tree. I wonder how the lookup through
>>>> memory_region_from_host() may perform compared to the iova tree one, the
>>>> former looks to be an O(N) linear search on a linked list while the
>>>> latter would be roughly O(log N) on an AVL tree?
>>> Even worse, as the reverse lookup (from QEMU vaddr to SVQ IOVA) is
>>> linear too. It is not even ordered.
>> Oh Sorry, I misread the code and I should look for g_tree_foreach ()
>> instead of g_tree_search_node(). So the former is indeed linear
>> iteration, but it looks to be ordered?
>>
>> https://github.com/GNOME/glib/blob/main/glib/gtree.c#L1115
> The GPA / IOVA are ordered but we're looking by QEMU's vaddr.
>
> If we have these translations:
> [0x1000, 0x2000] -> [0x10000, 0x11000]
> [0x2000, 0x3000] -> [0x6000, 0x7000]
>
> We will see them in this order, so we cannot stop the search at the first node.
Yeah, reverse lookup is unordered indeed, anyway.

>
>>> But apart from this detail you're right, I have the same concerns with
>>> this solution too. If we see a hard performance regression we could go
>>> to more complicated solutions, like maintaining a reverse IOVATree in
>>> vhost-iova-tree too. First RFCs of SVQ did that actually.
>> Agreed, yeap we can use memory_region_from_host for now.  Any reason why
>> reverse IOVATree was dropped, lack of users? But now we have one!
>>
> No, it is just simplicity. We already have an user in the hot patch in
> the master branch, vhost_svq_vring_write_descs. But I never profiled
> enough to find if it is a bottleneck or not to be honest.
Right, without vIOMMU or a lot of virtqueues / mappings, it's hard to 
profile and see the difference.
>
> I'll send the new series by today, thank you for finding these issues!
Thanks! In case you don't have the bandwidth to add back the reverse IOVA
tree, Jonah (cc'ed) may have interest in looking into it.

-Siwei


>
>> Thanks,
>> -Siwei
>>> Thanks!
>>>
>>>> Of course,
>>>> memory_region_from_host() won't search out of the guest memory space for
>>>> sure. As this could be on the hot data path I have a little bit
>>>> hesitance over the potential cost or performance regression this change
>>>> could bring in, but maybe I'm overthinking it too much...
>>>>
>>>> Thanks,
>>>> -Siwei
>>>>
>>>>>> Thanks,
>>>>>> -Siwei
>>>>>>>              return false;
>>>>>>>          }
>>>>>>>




* Re: [RFC 1/2] iova_tree: add an id member to DMAMap
  2024-04-25 17:43               ` Si-Wei Liu
@ 2024-04-29  8:14                 ` Eugenio Perez Martin
  2024-04-29 11:19                   ` Jonah Palmer
  2024-04-30  5:54                   ` Si-Wei Liu
  0 siblings, 2 replies; 37+ messages in thread
From: Eugenio Perez Martin @ 2024-04-29  8:14 UTC (permalink / raw)
  To: Si-Wei Liu
  Cc: qemu-devel, Michael S. Tsirkin, Lei Yang, Peter Xu, Jonah Palmer,
	Dragos Tatulea, Jason Wang

On Thu, Apr 25, 2024 at 7:44 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
>
>
> On 4/24/2024 12:33 AM, Eugenio Perez Martin wrote:
> > On Wed, Apr 24, 2024 at 12:21 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>
> >>
> >> On 4/22/2024 1:49 AM, Eugenio Perez Martin wrote:
> >>> On Sat, Apr 20, 2024 at 1:50 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>
> >>>> On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote:
> >>>>> On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>>> On 4/10/2024 3:03 AM, Eugenio Pérez wrote:
> >>>>>>> The IOVA tree is also used to track the mappings of the virtio-net shadow
> >>>>>>> virtqueue.  These mappings may not match the GPA->HVA ones.
> >>>>>>>
> >>>>>>> This causes a problem when overlapping regions (different GPA but same
> >>>>>>> translated HVA) exist in the tree, as looking them up by HVA will return
> >>>>>>> them twice.  To solve this, create an id member so we can assign unique
> >>>>>>> identifiers (GPA) to the maps.
> >>>>>>>
> >>>>>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> >>>>>>> ---
> >>>>>>>      include/qemu/iova-tree.h | 5 +++--
> >>>>>>>      util/iova-tree.c         | 3 ++-
> >>>>>>>      2 files changed, 5 insertions(+), 3 deletions(-)
> >>>>>>>
> >>>>>>> diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
> >>>>>>> index 2a10a7052e..34ee230e7d 100644
> >>>>>>> --- a/include/qemu/iova-tree.h
> >>>>>>> +++ b/include/qemu/iova-tree.h
> >>>>>>> @@ -36,6 +36,7 @@ typedef struct DMAMap {
> >>>>>>>          hwaddr iova;
> >>>>>>>          hwaddr translated_addr;
> >>>>>>>          hwaddr size;                /* Inclusive */
> >>>>>>> +    uint64_t id;
> >>>>>>>          IOMMUAccessFlags perm;
> >>>>>>>      } QEMU_PACKED DMAMap;
> >>>>>>>      typedef gboolean (*iova_tree_iterator)(DMAMap *map);
> >>>>>>> @@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const DMAMap *map);
> >>>>>>>       * @map: the mapping to search
> >>>>>>>       *
> >>>>>>>       * Search for a mapping in the iova tree that translated_addr overlaps with the
> >>>>>>> - * mapping range specified.  Only the first found mapping will be
> >>>>>>> - * returned.
> >>>>>>> + * mapping range specified and map->id is equal.  Only the first found
> >>>>>>> + * mapping will be returned.
> >>>>>>>       *
> >>>>>>>       * Return: DMAMap pointer if found, or NULL if not found.  Note that
> >>>>>>>       * the returned DMAMap pointer is maintained internally.  User should
> >>>>>>> diff --git a/util/iova-tree.c b/util/iova-tree.c
> >>>>>>> index 536789797e..0863e0a3b8 100644
> >>>>>>> --- a/util/iova-tree.c
> >>>>>>> +++ b/util/iova-tree.c
> >>>>>>> @@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, gpointer value,
> >>>>>>>
> >>>>>>>          needle = args->needle;
> >>>>>>>          if (map->translated_addr + map->size < needle->translated_addr ||
> >>>>>>> -        needle->translated_addr + needle->size < map->translated_addr) {
> >>>>>>> +        needle->translated_addr + needle->size < map->translated_addr ||
> >>>>>>> +        needle->id != map->id) {
> >>>>>> It looks this iterator can also be invoked by SVQ from
> >>>>>> vhost_svq_translate_addr() -> iova_tree_find_iova(), where guest GPA
> >>>>>> space will be searched on without passing in the ID (GPA), and exact
> >>>>>> match for the same GPA range is not actually needed unlike the mapping
> >>>>>> removal case. Could we create an API variant, for the SVQ lookup case
> >>>>>> specifically? Or alternatively, add a special flag, say skip_id_match to
> >>>>>> DMAMap, and the id match check may look like below:
> >>>>>>
> >>>>>> (!needle->skip_id_match && needle->id != map->id)
> >>>>>>
> >>>>>> I think vhost_svq_translate_addr() could just call the API variant or
> >>>>>> pass DMAmap with skip_id_match set to true to svq_iova_tree_find_iova().
> >>>>>>
> >>>>> I think you're totally right. But I'd really like to not complicate
> >>>>> the API of the iova_tree more.
> >>>>>
> >>>>> I think we can look for the hwaddr using memory_region_from_host and
> >>>>> then get the hwaddr. It is another lookup though...
> >>>> Yeah, that will be another means of doing translation without having to
> >>>> complicate the API around iova_tree. I wonder how the lookup through
> >>>> memory_region_from_host() may perform compared to the iova tree one, the
> >>>> former looks to be an O(N) linear search on a linked list while the
> >>>> latter would be roughly O(log N) on an AVL tree?
> >>> Even worse, as the reverse lookup (from QEMU vaddr to SVQ IOVA) is
> >>> linear too. It is not even ordered.
> >> Oh Sorry, I misread the code and I should look for g_tree_foreach ()
> >> instead of g_tree_search_node(). So the former is indeed linear
> >> iteration, but it looks to be ordered?
> >>
> >> https://github.com/GNOME/glib/blob/main/glib/gtree.c#L1115
> > The GPA / IOVA are ordered but we're looking by QEMU's vaddr.
> >
> > If we have these translations:
> > [0x1000, 0x2000] -> [0x10000, 0x11000]
> > [0x2000, 0x3000] -> [0x6000, 0x7000]
> >
> > We will see them in this order, so we cannot stop the search at the first node.
> Yeah, reverse lookup is unordered indeed, anyway.
>
> >
> >>> But apart from this detail you're right, I have the same concerns with
> >>> this solution too. If we see a hard performance regression we could go
> >>> to more complicated solutions, like maintaining a reverse IOVATree in
> >>> vhost-iova-tree too. First RFCs of SVQ did that actually.
> >> Agreed, yeap we can use memory_region_from_host for now.  Any reason why
> >> reverse IOVATree was dropped, lack of users? But now we have one!
> >>
> > No, it is just simplicity. We already have an user in the hot patch in
> > the master branch, vhost_svq_vring_write_descs. But I never profiled
> > enough to find if it is a bottleneck or not to be honest.
> Right, without vIOMMU or a lot of virtqueues / mappings, it's hard to
> profile and see the difference.
> >
> > I'll send the new series by today, thank you for finding these issues!
> Thanks! In case you don't have bandwidth to add back reverse IOVA tree,
> Jonah (cc'ed) may have interest in looking into it.
>

Actually, yes. I've tried to solve it using:
- memory_region_get_ram_ptr -> it's hard to get this pointer to work
  without messing a lot with IOVATree.
- memory_region_find -> I'm totally unable to make it return sections
  that make sense.
- flatview_for_each_range -> it does not return the same
  MemoryRegionSection as the listener, not sure why.

The only progress I have is that memory_region_from_host is able to
tell whether the vaddr is from the guest or not.

So I'm convinced there must be a way to do it with the memory
subsystem, but I think the best way to do it ATM is to store a
parallel tree with GPA -> SVQ IOVA translations. At removal time, if we
find the entry in this new tree, we can directly remove it by GPA. If
not, assume it is a host-only address like the SVQ vrings, and remove it
by iterating on vaddr as we do now. It is guaranteed that the guest does
not translate to that vaddr and that the vaddr is unique in the tree
anyway.
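
In rough pseudo-C, the removal path would look something like this
(sketch; gpa_iova_map and the needle variable are made-up names):

    /* needle_gpa.iova holds the GPA of the region being unmapped */
    const DMAMap *entry = iova_tree_find(tree->gpa_iova_map, &needle_gpa);
    if (entry) {
        /*
         * Guest memory: the GPA is unique, so entry tells us the exact
         * SVQ IOVA range to drop from iova_taddr_map (e.g. stored in
         * entry->translated_addr).
         */
    } else {
        /*
         * Host-only address (e.g. SVQ vrings): fall back to the current
         * HVA iteration; such a vaddr is guaranteed to be unique in the
         * tree.
         */
    }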

Does it sound reasonable? Jonah, would you be interested in moving this forward?

Thanks!

> -Siwei
>
>
> >
> >> Thanks,
> >> -Siwei
> >>> Thanks!
> >>>
> >>>> Of course,
> >>>> memory_region_from_host() won't search out of the guest memory space for
> >>>> sure. As this could be on the hot data path I have a little bit
> >>>> hesitance over the potential cost or performance regression this change
> >>>> could bring in, but maybe I'm overthinking it too much...
> >>>>
> >>>> Thanks,
> >>>> -Siwei
> >>>>
> >>>>>> Thanks,
> >>>>>> -Siwei
> >>>>>>>              return false;
> >>>>>>>          }
> >>>>>>>
>




* Re: [RFC 1/2] iova_tree: add an id member to DMAMap
  2024-04-29  8:14                 ` Eugenio Perez Martin
@ 2024-04-29 11:19                   ` Jonah Palmer
  2024-04-30 18:11                     ` Eugenio Perez Martin
  2024-04-30  5:54                   ` Si-Wei Liu
  1 sibling, 1 reply; 37+ messages in thread
From: Jonah Palmer @ 2024-04-29 11:19 UTC (permalink / raw)
  To: Eugenio Perez Martin, Si-Wei Liu
  Cc: qemu-devel, Michael S. Tsirkin, Lei Yang, Peter Xu,
	Dragos Tatulea, Jason Wang



On 4/29/24 4:14 AM, Eugenio Perez Martin wrote:
> On Thu, Apr 25, 2024 at 7:44 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>
>>
>>
>> On 4/24/2024 12:33 AM, Eugenio Perez Martin wrote:
>>> On Wed, Apr 24, 2024 at 12:21 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>
>>>>
>>>> On 4/22/2024 1:49 AM, Eugenio Perez Martin wrote:
>>>>> On Sat, Apr 20, 2024 at 1:50 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>>>
>>>>>> On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote:
>>>>>>> On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>>>>> On 4/10/2024 3:03 AM, Eugenio Pérez wrote:
>>>>>>>>> The IOVA tree is also used to track the mappings of the virtio-net shadow
>>>>>>>>> virtqueue.  These mappings may not match the GPA->HVA ones.
>>>>>>>>>
>>>>>>>>> This causes a problem when overlapping regions (different GPA but same
>>>>>>>>> translated HVA) exist in the tree, as looking them up by HVA will return
>>>>>>>>> them twice.  To solve this, create an id member so we can assign unique
>>>>>>>>> identifiers (GPA) to the maps.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>>>>>>>> ---
>>>>>>>>>       include/qemu/iova-tree.h | 5 +++--
>>>>>>>>>       util/iova-tree.c         | 3 ++-
>>>>>>>>>       2 files changed, 5 insertions(+), 3 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
>>>>>>>>> index 2a10a7052e..34ee230e7d 100644
>>>>>>>>> --- a/include/qemu/iova-tree.h
>>>>>>>>> +++ b/include/qemu/iova-tree.h
>>>>>>>>> @@ -36,6 +36,7 @@ typedef struct DMAMap {
>>>>>>>>>           hwaddr iova;
>>>>>>>>>           hwaddr translated_addr;
>>>>>>>>>           hwaddr size;                /* Inclusive */
>>>>>>>>> +    uint64_t id;
>>>>>>>>>           IOMMUAccessFlags perm;
>>>>>>>>>       } QEMU_PACKED DMAMap;
>>>>>>>>>       typedef gboolean (*iova_tree_iterator)(DMAMap *map);
>>>>>>>>> @@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const DMAMap *map);
>>>>>>>>>        * @map: the mapping to search
>>>>>>>>>        *
>>>>>>>>>        * Search for a mapping in the iova tree that translated_addr overlaps with the
>>>>>>>>> - * mapping range specified.  Only the first found mapping will be
>>>>>>>>> - * returned.
>>>>>>>>> + * mapping range specified and map->id is equal.  Only the first found
>>>>>>>>> + * mapping will be returned.
>>>>>>>>>        *
>>>>>>>>>        * Return: DMAMap pointer if found, or NULL if not found.  Note that
>>>>>>>>>        * the returned DMAMap pointer is maintained internally.  User should
>>>>>>>>> diff --git a/util/iova-tree.c b/util/iova-tree.c
>>>>>>>>> index 536789797e..0863e0a3b8 100644
>>>>>>>>> --- a/util/iova-tree.c
>>>>>>>>> +++ b/util/iova-tree.c
>>>>>>>>> @@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, gpointer value,
>>>>>>>>>
>>>>>>>>>           needle = args->needle;
>>>>>>>>>           if (map->translated_addr + map->size < needle->translated_addr ||
>>>>>>>>> -        needle->translated_addr + needle->size < map->translated_addr) {
>>>>>>>>> +        needle->translated_addr + needle->size < map->translated_addr ||
>>>>>>>>> +        needle->id != map->id) {
>>>>>>>> It looks this iterator can also be invoked by SVQ from
>>>>>>>> vhost_svq_translate_addr() -> iova_tree_find_iova(), where guest GPA
>>>>>>>> space will be searched on without passing in the ID (GPA), and exact
>>>>>>>> match for the same GPA range is not actually needed unlike the mapping
>>>>>>>> removal case. Could we create an API variant, for the SVQ lookup case
>>>>>>>> specifically? Or alternatively, add a special flag, say skip_id_match to
>>>>>>>> DMAMap, and the id match check may look like below:
>>>>>>>>
>>>>>>>> (!needle->skip_id_match && needle->id != map->id)
>>>>>>>>
>>>>>>>> I think vhost_svq_translate_addr() could just call the API variant or
>>>>>>>> pass DMAmap with skip_id_match set to true to svq_iova_tree_find_iova().
>>>>>>>>
>>>>>>> I think you're totally right. But I'd really like to not complicate
>>>>>>> the API of the iova_tree more.
>>>>>>>
>>>>>>> I think we can look for the hwaddr using memory_region_from_host and
>>>>>>> then get the hwaddr. It is another lookup though...
>>>>>> Yeah, that will be another means of doing translation without having to
>>>>>> complicate the API around iova_tree. I wonder how the lookup through
>>>>>> memory_region_from_host() may perform compared to the iova tree one, the
>>>>>> former looks to be an O(N) linear search on a linked list while the
>>>>>> latter would be roughly O(log N) on an AVL tree?
>>>>> Even worse, as the reverse lookup (from QEMU vaddr to SVQ IOVA) is
>>>>> linear too. It is not even ordered.
>>>> Oh Sorry, I misread the code and I should look for g_tree_foreach ()
>>>> instead of g_tree_search_node(). So the former is indeed linear
>>>> iteration, but it looks to be ordered?
>>>>
>>>> https://github.com/GNOME/glib/blob/main/glib/gtree.c#L1115
>>> The GPA / IOVA are ordered but we're looking by QEMU's vaddr.
>>>
>>> If we have these translations:
>>> [0x1000, 0x2000] -> [0x10000, 0x11000]
>>> [0x2000, 0x3000] -> [0x6000, 0x7000]
>>>
>>> We will see them in this order, so we cannot stop the search at the first node.
>> Yeah, reverse lookup is unordered indeed, anyway.
>>
>>>
>>>>> But apart from this detail you're right, I have the same concerns with
>>>>> this solution too. If we see a hard performance regression we could go
>>>>> to more complicated solutions, like maintaining a reverse IOVATree in
>>>>> vhost-iova-tree too. First RFCs of SVQ did that actually.
>>>> Agreed, yeap we can use memory_region_from_host for now.  Any reason why
>>>> reverse IOVATree was dropped, lack of users? But now we have one!
>>>>
>>> No, it is just simplicity. We already have an user in the hot patch in
>>> the master branch, vhost_svq_vring_write_descs. But I never profiled
>>> enough to find if it is a bottleneck or not to be honest.
>> Right, without vIOMMU or a lot of virtqueues / mappings, it's hard to
>> profile and see the difference.
>>>
>>> I'll send the new series by today, thank you for finding these issues!
>> Thanks! In case you don't have bandwidth to add back reverse IOVA tree,
>> Jonah (cc'ed) may have interest in looking into it.
>>
> 
> Actually, yes. I've tried to solve it using:
> memory_region_get_ram_ptr -> It's hard to get this pointer to work
> without messing a lot with IOVATree.
> memory_region_find -> I'm totally unable to make it return sections
> that make sense
> flatview_for_each_range -> It does not return the same
> MemoryRegionsection as the listener, not sure why.
> 
> The only advance I have is that memory_region_from_host is able to
> tell if the vaddr is from the guest or not.
> 
> So I'm convinced there must be a way to do it with the memory
> subsystem, but I think the best way to do it ATM is to store a
> parallel tree with GPA-> SVQ IOVA translations. At removal time, if we
> find the entry in this new tree, we can directly remove it by GPA. If
> not, assume it is a host-only address like SVQ vrings, and remove by
> iterating on vaddr as we do now. It is guaranteed the guest does not
> translate to that vaddr and that that vaddr is unique in the tree
> anyway.
> 
> Does it sound reasonable? Jonah, would you be interested in moving this forward?
> 
> Thanks!
> 

Sure, I'd be more than happy to work on this stuff! I can probably get 
started on this either today or tomorrow.

Si-Wei mentioned something about these "reverse IOVATree" patches that 
were dropped; is this relevant to what you're asking here? Is it 
something I should base my work off of?

If there's any other relevant information about this issue that you 
think I should know, let me know. I'll start digging into this ASAP and 
will reach out if I need any guidance. :)

Jonah

>> -Siwei
>>
>>
>>>
>>>> Thanks,
>>>> -Siwei
>>>>> Thanks!
>>>>>
>>>>>> Of course,
>>>>>> memory_region_from_host() won't search out of the guest memory space for
>>>>>> sure. As this could be on the hot data path I have a little bit
>>>>>> hesitance over the potential cost or performance regression this change
>>>>>> could bring in, but maybe I'm overthinking it too much...
>>>>>>
>>>>>> Thanks,
>>>>>> -Siwei
>>>>>>
>>>>>>>> Thanks,
>>>>>>>> -Siwei
>>>>>>>>>               return false;
>>>>>>>>>           }
>>>>>>>>>
>>
> 


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC 1/2] iova_tree: add an id member to DMAMap
  2024-04-29  8:14                 ` Eugenio Perez Martin
  2024-04-29 11:19                   ` Jonah Palmer
@ 2024-04-30  5:54                   ` Si-Wei Liu
  2024-04-30 17:19                     ` Eugenio Perez Martin
  1 sibling, 1 reply; 37+ messages in thread
From: Si-Wei Liu @ 2024-04-30  5:54 UTC (permalink / raw)
  To: Eugenio Perez Martin, Jonah Palmer
  Cc: qemu-devel, Michael S. Tsirkin, Lei Yang, Peter Xu,
	Dragos Tatulea, Jason Wang



On 4/29/2024 1:14 AM, Eugenio Perez Martin wrote:
> On Thu, Apr 25, 2024 at 7:44 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>
>>
>> On 4/24/2024 12:33 AM, Eugenio Perez Martin wrote:
>>> On Wed, Apr 24, 2024 at 12:21 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>
>>>> On 4/22/2024 1:49 AM, Eugenio Perez Martin wrote:
>>>>> On Sat, Apr 20, 2024 at 1:50 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>>> On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote:
>>>>>>> On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>>>>> On 4/10/2024 3:03 AM, Eugenio Pérez wrote:
>>>>>>>>> IOVA tree is also used to track the mappings of virtio-net shadow
>>>>>>>>> virtqueue.  This mappings may not match with the GPA->HVA ones.
>>>>>>>>>
>>>>>>>>> This causes a problem when overlapped regions (different GPA but same
>>>>>>>>> translated HVA) exists in the tree, as looking them by HVA will return
>>>>>>>>> them twice.  To solve this, create an id member so we can assign unique
>>>>>>>>> identifiers (GPA) to the maps.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>>>>>>>> ---
>>>>>>>>>       include/qemu/iova-tree.h | 5 +++--
>>>>>>>>>       util/iova-tree.c         | 3 ++-
>>>>>>>>>       2 files changed, 5 insertions(+), 3 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
>>>>>>>>> index 2a10a7052e..34ee230e7d 100644
>>>>>>>>> --- a/include/qemu/iova-tree.h
>>>>>>>>> +++ b/include/qemu/iova-tree.h
>>>>>>>>> @@ -36,6 +36,7 @@ typedef struct DMAMap {
>>>>>>>>>           hwaddr iova;
>>>>>>>>>           hwaddr translated_addr;
>>>>>>>>>           hwaddr size;                /* Inclusive */
>>>>>>>>> +    uint64_t id;
>>>>>>>>>           IOMMUAccessFlags perm;
>>>>>>>>>       } QEMU_PACKED DMAMap;
>>>>>>>>>       typedef gboolean (*iova_tree_iterator)(DMAMap *map);
>>>>>>>>> @@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const DMAMap *map);
>>>>>>>>>        * @map: the mapping to search
>>>>>>>>>        *
>>>>>>>>>        * Search for a mapping in the iova tree that translated_addr overlaps with the
>>>>>>>>> - * mapping range specified.  Only the first found mapping will be
>>>>>>>>> - * returned.
>>>>>>>>> + * mapping range specified and map->id is equal.  Only the first found
>>>>>>>>> + * mapping will be returned.
>>>>>>>>>        *
>>>>>>>>>        * Return: DMAMap pointer if found, or NULL if not found.  Note that
>>>>>>>>>        * the returned DMAMap pointer is maintained internally.  User should
>>>>>>>>> diff --git a/util/iova-tree.c b/util/iova-tree.c
>>>>>>>>> index 536789797e..0863e0a3b8 100644
>>>>>>>>> --- a/util/iova-tree.c
>>>>>>>>> +++ b/util/iova-tree.c
>>>>>>>>> @@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, gpointer value,
>>>>>>>>>
>>>>>>>>>           needle = args->needle;
>>>>>>>>>           if (map->translated_addr + map->size < needle->translated_addr ||
>>>>>>>>> -        needle->translated_addr + needle->size < map->translated_addr) {
>>>>>>>>> +        needle->translated_addr + needle->size < map->translated_addr ||
>>>>>>>>> +        needle->id != map->id) {
>>>>>>>> It looks this iterator can also be invoked by SVQ from
>>>>>>>> vhost_svq_translate_addr() -> iova_tree_find_iova(), where guest GPA
>>>>>>>> space will be searched on without passing in the ID (GPA), and exact
>>>>>>>> match for the same GPA range is not actually needed unlike the mapping
>>>>>>>> removal case. Could we create an API variant, for the SVQ lookup case
>>>>>>>> specifically? Or alternatively, add a special flag, say skip_id_match to
>>>>>>>> DMAMap, and the id match check may look like below:
>>>>>>>>
>>>>>>>> (!needle->skip_id_match && needle->id != map->id)
>>>>>>>>
>>>>>>>> I think vhost_svq_translate_addr() could just call the API variant or
>>>>>>>> pass DMAmap with skip_id_match set to true to svq_iova_tree_find_iova().
>>>>>>>>
>>>>>>> I think you're totally right. But I'd really like to not complicate
>>>>>>> the API of the iova_tree more.
>>>>>>>
>>>>>>> I think we can look for the hwaddr using memory_region_from_host and
>>>>>>> then get the hwaddr. It is another lookup though...
>>>>>> Yeah, that will be another means of doing translation without having to
>>>>>> complicate the API around iova_tree. I wonder how the lookup through
>>>>>> memory_region_from_host() may perform compared to the iova tree one, the
>>>>>> former looks to be an O(N) linear search on a linked list while the
>>>>>> latter would be roughly O(log N) on an AVL tree?
>>>>> Even worse, as the reverse lookup (from QEMU vaddr to SVQ IOVA) is
>>>>> linear too. It is not even ordered.
>>>> Oh Sorry, I misread the code and I should look for g_tree_foreach ()
>>>> instead of g_tree_search_node(). So the former is indeed linear
>>>> iteration, but it looks to be ordered?
>>>>
>>>> https://github.com/GNOME/glib/blob/main/glib/gtree.c#L1115
>>> The GPA / IOVA are ordered but we're looking by QEMU's vaddr.
>>>
>>> If we have these translations:
>>> [0x1000, 0x2000] -> [0x10000, 0x11000]
>>> [0x2000, 0x3000] -> [0x6000, 0x7000]
>>>
>>> We will see them in this order, so we cannot stop the search at the first node.
>> Yeah, reverse lookup is unordered indeed, anyway.
>>
>>>>> But apart from this detail you're right, I have the same concerns with
>>>>> this solution too. If we see a hard performance regression we could go
>>>>> to more complicated solutions, like maintaining a reverse IOVATree in
>>>>> vhost-iova-tree too. First RFCs of SVQ did that actually.
>>>> Agreed, yeap we can use memory_region_from_host for now.  Any reason why
>>>> reverse IOVATree was dropped, lack of users? But now we have one!
>>>>
>>> No, it is just simplicity. We already have an user in the hot patch in
>>> the master branch, vhost_svq_vring_write_descs. But I never profiled
>>> enough to find if it is a bottleneck or not to be honest.
>> Right, without vIOMMU or a lot of virtqueues / mappings, it's hard to
>> profile and see the difference.
>>> I'll send the new series by today, thank you for finding these issues!
>> Thanks! In case you don't have bandwidth to add back reverse IOVA tree,
>> Jonah (cc'ed) may have interest in looking into it.
>>
> Actually, yes. I've tried to solve it using:
> memory_region_get_ram_ptr -> It's hard to get this pointer to work
> without messing a lot with IOVATree.
> memory_region_find -> I'm totally unable to make it return sections
> that make sense
> flatview_for_each_range -> It does not return the same
> MemoryRegionsection as the listener, not sure why.
Ouch, thank you for the summary of the attempts that were made earlier.
> The only advance I have is that memory_region_from_host is able to
> tell if the vaddr is from the guest or not.
Hmmm, then it won't be too useful without a direct means of identifying 
the exact memory region associated with the iova that is being mapped. 
And this additional indirection seems to introduce a tiny bit more 
latency in the reverse lookup routine (it should not be a scalability 
issue, though, even if it's a linear search)?

> So I'm convinced there must be a way to do it with the memory
> subsystem, but I think the best way to do it ATM is to store a
> parallel tree with GPA-> SVQ IOVA translations. At removal time, if we
> find the entry in this new tree, we can directly remove it by GPA. If
> not, assume it is a host-only address like SVQ vrings, and remove by
> iterating on vaddr as we do now.
Yeah, this could work, I think. On the other hand, given that we are now 
trying to improve it, I wonder if it's possible to come up with a fast 
version for the SVQ (host-only address) case without having to look up 
twice? SVQ callers should be able to tell this case apart from the guest 
case, where a GPA -> IOVA translation doesn't exist? Or just maintain a 
parallel tree with HVA -> IOVA translations for SVQ reverse lookups 
only? I feel SVQ mappings may be worth a separate fast lookup path - 
unlike guest mappings, the insertion, lookup and removal of SVQ mappings 
seem unavoidable during the migration downtime path.
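
To illustrate that last idea, a minimal sketch of such a parallel
reverse map (the names are hypothetical; only DMAMap and the GLib calls
are real, and the comparator deliberately mirrors how util/iova-tree.c
orders entries by IOVA):

#include <glib.h>
#include "qemu/iova-tree.h"   /* DMAMap */

/* Compare DMAMap entries by HVA range; overlap counts as a match */
static gint svq_hva_cmp(gconstpointer a, gconstpointer b, gpointer data)
{
    const DMAMap *m1 = a, *m2 = b;

    if (m1->translated_addr + m1->size < m2->translated_addr) {
        return -1;
    }
    if (m2->translated_addr + m2->size < m1->translated_addr) {
        return 1;
    }
    return 0;
}

/* Reverse map for SVQ host-only buffers only: HVA -> IOVA, making the
 * reverse lookup O(log N) instead of a full-tree walk */
static GTree *svq_hva_tree_new(void)
{
    return g_tree_new_full(svq_hva_cmp, NULL, g_free, NULL);
}

Insertion would store the same DMAMap as both key and value, the same
trick util/iova-tree.c already uses for its IOVA-keyed tree.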

>   It is guaranteed the guest does not
> translate to that vaddr and that that vaddr is unique in the tree
> anyway.
>
> Does it sound reasonable? Jonah, would you be interested in moving this forward?
My thought would be that the reverse IOVA tree can be added as a 
follow-up optimization for extended scalability, but for now, as the 
interim, we may still need some form of simple fix, so as to quickly 
unblock the other dependent work built on top of this one and the early 
pinning series [1]. With that said, I'm completely fine with performing 
the reverse lookup through a linear tree walk, e.g. g_tree_foreach(); 
that should suffice for small VM configs with just a couple of queues 
and a limited number of memory regions. Going forward, to address the 
scalability bottleneck, Jonah could just replace the corresponding API 
call with one built on top of the reverse IOVA tree (I presume the use 
of these iova tree APIs is internal and limited to the SVQ and 
vhost-vdpa subsystems) once he gets there, and then eliminate the other 
API variants that will no longer be in use. What do you think about this 
idea / plan?

Thanks,
-Siwei

[1] https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html

>
> Thanks!
>
>> -Siwei
>>
>>
>>>> Thanks,
>>>> -Siwei
>>>>> Thanks!
>>>>>
>>>>>> Of course,
>>>>>> memory_region_from_host() won't search out of the guest memory space for
>>>>>> sure. As this could be on the hot data path I have a little bit
>>>>>> hesitance over the potential cost or performance regression this change
>>>>>> could bring in, but maybe I'm overthinking it too much...
>>>>>>
>>>>>> Thanks,
>>>>>> -Siwei
>>>>>>
>>>>>>>> Thanks,
>>>>>>>> -Siwei
>>>>>>>>>               return false;
>>>>>>>>>           }
>>>>>>>>>



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC 1/2] iova_tree: add an id member to DMAMap
  2024-04-30  5:54                   ` Si-Wei Liu
@ 2024-04-30 17:19                     ` Eugenio Perez Martin
  2024-05-01 23:13                       ` Si-Wei Liu
  0 siblings, 1 reply; 37+ messages in thread
From: Eugenio Perez Martin @ 2024-04-30 17:19 UTC (permalink / raw)
  To: Si-Wei Liu
  Cc: Jonah Palmer, qemu-devel, Michael S. Tsirkin, Lei Yang, Peter Xu,
	Dragos Tatulea, Jason Wang

On Tue, Apr 30, 2024 at 7:55 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
>
>
> On 4/29/2024 1:14 AM, Eugenio Perez Martin wrote:
> > On Thu, Apr 25, 2024 at 7:44 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>
> >>
> >> On 4/24/2024 12:33 AM, Eugenio Perez Martin wrote:
> >>> On Wed, Apr 24, 2024 at 12:21 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>
> >>>> On 4/22/2024 1:49 AM, Eugenio Perez Martin wrote:
> >>>>> On Sat, Apr 20, 2024 at 1:50 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>>> On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote:
> >>>>>>> On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>>>>> On 4/10/2024 3:03 AM, Eugenio Pérez wrote:
> >>>>>>>>> IOVA tree is also used to track the mappings of virtio-net shadow
> >>>>>>>>> virtqueue.  This mappings may not match with the GPA->HVA ones.
> >>>>>>>>>
> >>>>>>>>> This causes a problem when overlapped regions (different GPA but same
> >>>>>>>>> translated HVA) exists in the tree, as looking them by HVA will return
> >>>>>>>>> them twice.  To solve this, create an id member so we can assign unique
> >>>>>>>>> identifiers (GPA) to the maps.
> >>>>>>>>>
> >>>>>>>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> >>>>>>>>> ---
> >>>>>>>>>       include/qemu/iova-tree.h | 5 +++--
> >>>>>>>>>       util/iova-tree.c         | 3 ++-
> >>>>>>>>>       2 files changed, 5 insertions(+), 3 deletions(-)
> >>>>>>>>>
> >>>>>>>>> diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
> >>>>>>>>> index 2a10a7052e..34ee230e7d 100644
> >>>>>>>>> --- a/include/qemu/iova-tree.h
> >>>>>>>>> +++ b/include/qemu/iova-tree.h
> >>>>>>>>> @@ -36,6 +36,7 @@ typedef struct DMAMap {
> >>>>>>>>>           hwaddr iova;
> >>>>>>>>>           hwaddr translated_addr;
> >>>>>>>>>           hwaddr size;                /* Inclusive */
> >>>>>>>>> +    uint64_t id;
> >>>>>>>>>           IOMMUAccessFlags perm;
> >>>>>>>>>       } QEMU_PACKED DMAMap;
> >>>>>>>>>       typedef gboolean (*iova_tree_iterator)(DMAMap *map);
> >>>>>>>>> @@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const DMAMap *map);
> >>>>>>>>>        * @map: the mapping to search
> >>>>>>>>>        *
> >>>>>>>>>        * Search for a mapping in the iova tree that translated_addr overlaps with the
> >>>>>>>>> - * mapping range specified.  Only the first found mapping will be
> >>>>>>>>> - * returned.
> >>>>>>>>> + * mapping range specified and map->id is equal.  Only the first found
> >>>>>>>>> + * mapping will be returned.
> >>>>>>>>>        *
> >>>>>>>>>        * Return: DMAMap pointer if found, or NULL if not found.  Note that
> >>>>>>>>>        * the returned DMAMap pointer is maintained internally.  User should
> >>>>>>>>> diff --git a/util/iova-tree.c b/util/iova-tree.c
> >>>>>>>>> index 536789797e..0863e0a3b8 100644
> >>>>>>>>> --- a/util/iova-tree.c
> >>>>>>>>> +++ b/util/iova-tree.c
> >>>>>>>>> @@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, gpointer value,
> >>>>>>>>>
> >>>>>>>>>           needle = args->needle;
> >>>>>>>>>           if (map->translated_addr + map->size < needle->translated_addr ||
> >>>>>>>>> -        needle->translated_addr + needle->size < map->translated_addr) {
> >>>>>>>>> +        needle->translated_addr + needle->size < map->translated_addr ||
> >>>>>>>>> +        needle->id != map->id) {
> >>>>>>>> It looks this iterator can also be invoked by SVQ from
> >>>>>>>> vhost_svq_translate_addr() -> iova_tree_find_iova(), where guest GPA
> >>>>>>>> space will be searched on without passing in the ID (GPA), and exact
> >>>>>>>> match for the same GPA range is not actually needed unlike the mapping
> >>>>>>>> removal case. Could we create an API variant, for the SVQ lookup case
> >>>>>>>> specifically? Or alternatively, add a special flag, say skip_id_match to
> >>>>>>>> DMAMap, and the id match check may look like below:
> >>>>>>>>
> >>>>>>>> (!needle->skip_id_match && needle->id != map->id)
> >>>>>>>>
> >>>>>>>> I think vhost_svq_translate_addr() could just call the API variant or
> >>>>>>>> pass DMAmap with skip_id_match set to true to svq_iova_tree_find_iova().
> >>>>>>>>
> >>>>>>> I think you're totally right. But I'd really like to not complicate
> >>>>>>> the API of the iova_tree more.
> >>>>>>>
> >>>>>>> I think we can look for the hwaddr using memory_region_from_host and
> >>>>>>> then get the hwaddr. It is another lookup though...
> >>>>>> Yeah, that will be another means of doing translation without having to
> >>>>>> complicate the API around iova_tree. I wonder how the lookup through
> >>>>>> memory_region_from_host() may perform compared to the iova tree one, the
> >>>>>> former looks to be an O(N) linear search on a linked list while the
> >>>>>> latter would be roughly O(log N) on an AVL tree?
> >>>>> Even worse, as the reverse lookup (from QEMU vaddr to SVQ IOVA) is
> >>>>> linear too. It is not even ordered.
> >>>> Oh Sorry, I misread the code and I should look for g_tree_foreach ()
> >>>> instead of g_tree_search_node(). So the former is indeed linear
> >>>> iteration, but it looks to be ordered?
> >>>>
> >>>> https://github.com/GNOME/glib/blob/main/glib/gtree.c#L1115
> >>> The GPA / IOVA are ordered but we're looking by QEMU's vaddr.
> >>>
> >>> If we have these translations:
> >>> [0x1000, 0x2000] -> [0x10000, 0x11000]
> >>> [0x2000, 0x3000] -> [0x6000, 0x7000]
> >>>
> >>> We will see them in this order, so we cannot stop the search at the first node.
> >> Yeah, reverse lookup is unordered indeed, anyway.
> >>
> >>>>> But apart from this detail you're right, I have the same concerns with
> >>>>> this solution too. If we see a hard performance regression we could go
> >>>>> to more complicated solutions, like maintaining a reverse IOVATree in
> >>>>> vhost-iova-tree too. First RFCs of SVQ did that actually.
> >>>> Agreed, yeap we can use memory_region_from_host for now.  Any reason why
> >>>> reverse IOVATree was dropped, lack of users? But now we have one!
> >>>>
> >>> No, it is just simplicity. We already have an user in the hot patch in
> >>> the master branch, vhost_svq_vring_write_descs. But I never profiled
> >>> enough to find if it is a bottleneck or not to be honest.
> >> Right, without vIOMMU or a lot of virtqueues / mappings, it's hard to
> >> profile and see the difference.
> >>> I'll send the new series by today, thank you for finding these issues!
> >> Thanks! In case you don't have bandwidth to add back reverse IOVA tree,
> >> Jonah (cc'ed) may have interest in looking into it.
> >>
> > Actually, yes. I've tried to solve it using:
> > memory_region_get_ram_ptr -> It's hard to get this pointer to work
> > without messing a lot with IOVATree.
> > memory_region_find -> I'm totally unable to make it return sections
> > that make sense
> > flatview_for_each_range -> It does not return the same
> > MemoryRegionsection as the listener, not sure why.
> Ouch, thank you for the summary of attempts that were done earlier.
> > The only advance I have is that memory_region_from_host is able to
> > tell if the vaddr is from the guest or not.
> Hmmm, then it won't be too useful without a direct means to identifying
> the exact memory region associated with the iova that is being mapped.
> And, this additional indirection seems introduce a tiny bit of more
> latency in the reverse lookup routine (should not be a scalability issue
> though if it's a linear search)?
>

I didn't measure, but I guess it might, yes. OTOH these structs may be
cached, because virtqueue_pop has just looked them up.

> > So I'm convinced there must be a way to do it with the memory
> > subsystem, but I think the best way to do it ATM is to store a
> > parallel tree with GPA-> SVQ IOVA translations. At removal time, if we
> > find the entry in this new tree, we can directly remove it by GPA. If
> > not, assume it is a host-only address like SVQ vrings, and remove by
> > iterating on vaddr as we do now.
> Yeah, this could work I think. On the other hand, given that we are now
> trying to improve it, I wonder if possible to come up with a fast
> version for the SVQ (host-only address) case without having to look up
> twice? SVQ callers should be able to tell apart from the guest case
> where GPA -> IOVA translation doesn't exist? Or just maintain a parallel
> tree with HVA -> IOVA translations for SVQ reverse lookup only? I feel
> SVQ mappings may be worth a separate fast lookup path - unlike guest
> mappings, the insertion, lookup and removal for SVQ mappings seem
> unavoidable during the migration downtime path.
>

I think the ideal order is actually the opposite. So:
1) Try to get the NIC to support _F_VRING_ASID, so no translation is
needed by QEMU at all.
2) Try a reverse lookup from HVA to GPA. Since the dataplane should fit
this case, we should test it first.
3) Look in the SVQ host-only entries (SVQ vrings, current shadow CVQ).
That is the control VQ path, so speed is not so important.

Overlapping regions may return the wrong SVQ IOVA, though. We should
take extra care to make sure these are handled correctly. I mean,
there are valid translations in the tree unless the driver is buggy;
a buffer just may need to span many translations.
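
A rough sketch of what 2) + 3) could look like (untested;
svq_gpa_tree_find() and svq_hva_find() are invented helper names, while
memory_region_from_host() is the real API; step 1 needs no lookup at
all, so it is not shown):

static bool svq_translate_one(const VhostShadowVirtqueue *svq,
                              const void *vaddr, hwaddr *iova)
{
    ram_addr_t offset;
    MemoryRegion *mr = memory_region_from_host((void *)vaddr, &offset);

    if (mr) {
        /* 2) vaddr belongs to guest memory: dataplane fast path,
         *    resolve it to GPA and then to SVQ IOVA */
        return svq_gpa_tree_find(svq->iova_tree, mr, offset, iova);
    }

    /* 3) Host-only buffer (SVQ vrings, shadow CVQ): keep the linear
     *    search by HVA we do today; control path, speed matters less */
    return svq_hva_find(svq->iova_tree, vaddr, iova);
}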

> >   It is guaranteed the guest does not
> > translate to that vaddr and that that vaddr is unique in the tree
> > anyway.
> >
> > Does it sound reasonable? Jonah, would you be interested in moving this forward?
> My thought would be that the reverse IOVA tree stuff can be added as a
> follow-up optimization right after for extended scalability, but for now
> as the interim, we may still need some form of simple fix, so as to
> quickly unblock the other dependent work built on top of this one and
> the early pinning series [1]. With it said, I'm completely fine if
> performing the reverse lookup through linear tree walk e.g.
> g_tree_foreach(), that should suffice small VM configs with just a
> couple of queues and limited number of memory regions. Going forward, to
> address the scalability bottleneck, Jonah could just replace the
> corresponding API call with the one built on top of reverse IOVA tree (I
> presume the use of these iova tree APIs is kind of internal that only
> limits to SVQ and vhost-vdpa subsystems) once he gets there, and then
> eliminate the other API variants that will no longer be in use. What do
> you think about this idea / plan?
>

Yeah it makes sense to me. Hopefully we can even get rid of the id member.

> Thanks,
> -Siwei
>
> [1] https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html
>
> >
> > Thanks!
> >
> >> -Siwei
> >>
> >>
> >>>> Thanks,
> >>>> -Siwei
> >>>>> Thanks!
> >>>>>
> >>>>>> Of course,
> >>>>>> memory_region_from_host() won't search out of the guest memory space for
> >>>>>> sure. As this could be on the hot data path I have a little bit
> >>>>>> hesitance over the potential cost or performance regression this change
> >>>>>> could bring in, but maybe I'm overthinking it too much...
> >>>>>>
> >>>>>> Thanks,
> >>>>>> -Siwei
> >>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> -Siwei
> >>>>>>>>>               return false;
> >>>>>>>>>           }
> >>>>>>>>>
>



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC 1/2] iova_tree: add an id member to DMAMap
  2024-04-29 11:19                   ` Jonah Palmer
@ 2024-04-30 18:11                     ` Eugenio Perez Martin
  2024-05-01 22:08                       ` Si-Wei Liu
  0 siblings, 1 reply; 37+ messages in thread
From: Eugenio Perez Martin @ 2024-04-30 18:11 UTC (permalink / raw)
  To: Jonah Palmer
  Cc: Si-Wei Liu, qemu-devel, Michael S. Tsirkin, Lei Yang, Peter Xu,
	Dragos Tatulea, Jason Wang, David Hildenbrand

On Mon, Apr 29, 2024 at 1:19 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>
>
>
> On 4/29/24 4:14 AM, Eugenio Perez Martin wrote:
> > On Thu, Apr 25, 2024 at 7:44 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>
> >>
> >>
> >> On 4/24/2024 12:33 AM, Eugenio Perez Martin wrote:
> >>> On Wed, Apr 24, 2024 at 12:21 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>
> >>>>
> >>>> On 4/22/2024 1:49 AM, Eugenio Perez Martin wrote:
> >>>>> On Sat, Apr 20, 2024 at 1:50 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>>>
> >>>>>> On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote:
> >>>>>>> On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>>>>> On 4/10/2024 3:03 AM, Eugenio Pérez wrote:
> >>>>>>>>> IOVA tree is also used to track the mappings of virtio-net shadow
> >>>>>>>>> virtqueue.  This mappings may not match with the GPA->HVA ones.
> >>>>>>>>>
> >>>>>>>>> This causes a problem when overlapped regions (different GPA but same
> >>>>>>>>> translated HVA) exists in the tree, as looking them by HVA will return
> >>>>>>>>> them twice.  To solve this, create an id member so we can assign unique
> >>>>>>>>> identifiers (GPA) to the maps.
> >>>>>>>>>
> >>>>>>>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> >>>>>>>>> ---
> >>>>>>>>>       include/qemu/iova-tree.h | 5 +++--
> >>>>>>>>>       util/iova-tree.c         | 3 ++-
> >>>>>>>>>       2 files changed, 5 insertions(+), 3 deletions(-)
> >>>>>>>>>
> >>>>>>>>> diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
> >>>>>>>>> index 2a10a7052e..34ee230e7d 100644
> >>>>>>>>> --- a/include/qemu/iova-tree.h
> >>>>>>>>> +++ b/include/qemu/iova-tree.h
> >>>>>>>>> @@ -36,6 +36,7 @@ typedef struct DMAMap {
> >>>>>>>>>           hwaddr iova;
> >>>>>>>>>           hwaddr translated_addr;
> >>>>>>>>>           hwaddr size;                /* Inclusive */
> >>>>>>>>> +    uint64_t id;
> >>>>>>>>>           IOMMUAccessFlags perm;
> >>>>>>>>>       } QEMU_PACKED DMAMap;
> >>>>>>>>>       typedef gboolean (*iova_tree_iterator)(DMAMap *map);
> >>>>>>>>> @@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const DMAMap *map);
> >>>>>>>>>        * @map: the mapping to search
> >>>>>>>>>        *
> >>>>>>>>>        * Search for a mapping in the iova tree that translated_addr overlaps with the
> >>>>>>>>> - * mapping range specified.  Only the first found mapping will be
> >>>>>>>>> - * returned.
> >>>>>>>>> + * mapping range specified and map->id is equal.  Only the first found
> >>>>>>>>> + * mapping will be returned.
> >>>>>>>>>        *
> >>>>>>>>>        * Return: DMAMap pointer if found, or NULL if not found.  Note that
> >>>>>>>>>        * the returned DMAMap pointer is maintained internally.  User should
> >>>>>>>>> diff --git a/util/iova-tree.c b/util/iova-tree.c
> >>>>>>>>> index 536789797e..0863e0a3b8 100644
> >>>>>>>>> --- a/util/iova-tree.c
> >>>>>>>>> +++ b/util/iova-tree.c
> >>>>>>>>> @@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, gpointer value,
> >>>>>>>>>
> >>>>>>>>>           needle = args->needle;
> >>>>>>>>>           if (map->translated_addr + map->size < needle->translated_addr ||
> >>>>>>>>> -        needle->translated_addr + needle->size < map->translated_addr) {
> >>>>>>>>> +        needle->translated_addr + needle->size < map->translated_addr ||
> >>>>>>>>> +        needle->id != map->id) {
> >>>>>>>> It looks this iterator can also be invoked by SVQ from
> >>>>>>>> vhost_svq_translate_addr() -> iova_tree_find_iova(), where guest GPA
> >>>>>>>> space will be searched on without passing in the ID (GPA), and exact
> >>>>>>>> match for the same GPA range is not actually needed unlike the mapping
> >>>>>>>> removal case. Could we create an API variant, for the SVQ lookup case
> >>>>>>>> specifically? Or alternatively, add a special flag, say skip_id_match to
> >>>>>>>> DMAMap, and the id match check may look like below:
> >>>>>>>>
> >>>>>>>> (!needle->skip_id_match && needle->id != map->id)
> >>>>>>>>
> >>>>>>>> I think vhost_svq_translate_addr() could just call the API variant or
> >>>>>>>> pass DMAmap with skip_id_match set to true to svq_iova_tree_find_iova().
> >>>>>>>>
> >>>>>>> I think you're totally right. But I'd really like to not complicate
> >>>>>>> the API of the iova_tree more.
> >>>>>>>
> >>>>>>> I think we can look for the hwaddr using memory_region_from_host and
> >>>>>>> then get the hwaddr. It is another lookup though...
> >>>>>> Yeah, that will be another means of doing translation without having to
> >>>>>> complicate the API around iova_tree. I wonder how the lookup through
> >>>>>> memory_region_from_host() may perform compared to the iova tree one, the
> >>>>>> former looks to be an O(N) linear search on a linked list while the
> >>>>>> latter would be roughly O(log N) on an AVL tree?
> >>>>> Even worse, as the reverse lookup (from QEMU vaddr to SVQ IOVA) is
> >>>>> linear too. It is not even ordered.
> >>>> Oh Sorry, I misread the code and I should look for g_tree_foreach ()
> >>>> instead of g_tree_search_node(). So the former is indeed linear
> >>>> iteration, but it looks to be ordered?
> >>>>
> >>>> https://github.com/GNOME/glib/blob/main/glib/gtree.c#L1115
> >>> The GPA / IOVA are ordered but we're looking by QEMU's vaddr.
> >>>
> >>> If we have these translations:
> >>> [0x1000, 0x2000] -> [0x10000, 0x11000]
> >>> [0x2000, 0x3000] -> [0x6000, 0x7000]
> >>>
> >>> We will see them in this order, so we cannot stop the search at the first node.
> >> Yeah, reverse lookup is unordered indeed, anyway.
> >>
> >>>
> >>>>> But apart from this detail you're right, I have the same concerns with
> >>>>> this solution too. If we see a hard performance regression we could go
> >>>>> to more complicated solutions, like maintaining a reverse IOVATree in
> >>>>> vhost-iova-tree too. First RFCs of SVQ did that actually.
> >>>> Agreed, yeap we can use memory_region_from_host for now.  Any reason why
> >>>> reverse IOVATree was dropped, lack of users? But now we have one!
> >>>>
> >>> No, it is just simplicity. We already have an user in the hot patch in
> >>> the master branch, vhost_svq_vring_write_descs. But I never profiled
> >>> enough to find if it is a bottleneck or not to be honest.
> >> Right, without vIOMMU or a lot of virtqueues / mappings, it's hard to
> >> profile and see the difference.
> >>>
> >>> I'll send the new series by today, thank you for finding these issues!
> >> Thanks! In case you don't have bandwidth to add back reverse IOVA tree,
> >> Jonah (cc'ed) may have interest in looking into it.
> >>
> >
> > Actually, yes. I've tried to solve it using:
> > memory_region_get_ram_ptr -> It's hard to get this pointer to work
> > without messing a lot with IOVATree.
> > memory_region_find -> I'm totally unable to make it return sections
> > that make sense
> > flatview_for_each_range -> It does not return the same
> > MemoryRegionsection as the listener, not sure why.
> >
> > The only advance I have is that memory_region_from_host is able to
> > tell if the vaddr is from the guest or not.
> >
> > So I'm convinced there must be a way to do it with the memory
> > subsystem, but I think the best way to do it ATM is to store a
> > parallel tree with GPA-> SVQ IOVA translations. At removal time, if we
> > find the entry in this new tree, we can directly remove it by GPA. If
> > not, assume it is a host-only address like SVQ vrings, and remove by
> > iterating on vaddr as we do now. It is guaranteed the guest does not
> > translate to that vaddr and that that vaddr is unique in the tree
> > anyway.
> >
> > Does it sound reasonable? Jonah, would you be interested in moving this forward?
> >
> > Thanks!
> >
>
> Sure, I'd be more than happy to work on this stuff! I can probably get
> started on this either today or tomorrow.
>
> Si-Wei mentioned something about these "reverse IOVATree" patches that
> were dropped;

The patches implementing the reverse IOVA tree were never created /
posted, just in case you try to look for them.


> is this relevant to what you're asking here? Is it
> something I should base my work off of?
>

So these patches work OK for adding and removing maps. We assign ids,
which are the GPAs of the memory regions that the listener receives. The
bad news is that SVQ also needs this id to look for the right
translation at vhost_svq_translate_addr, so this series is not
complete. You can find the
vhost_iova_tree_find_iova()->iova_tree_find_iova() call there.

The easiest solution is the reverse IOVA tree of HVA -> SVQ IOVA. It
is also the least elegant and (potentially) the least performant one, as
it means duplicating information that QEMU already has, plus a
potentially linear search.

David Hildenbrand (CCed) proposed to try iterating through RAMBlocks.
I guess qemu_ram_block_from_host() could return a block where
block->offset is the id of the map?

It would be great to try this approach. If you don't have the bandwidth
for this, going directly for the reverse IOVA tree is also a valid
solution.
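
In case it helps, a rough sketch of the RAMBlock idea (untested; the
function name is made up, qemu_ram_block_from_host() is the real API,
and whether block->offset really matches the id the listener stored is
exactly the open question):

#include "exec/cpu-common.h"   /* qemu_ram_block_from_host() */
#include "exec/ramblock.h"     /* RAMBlock */

/* Try to recover the map id for a QEMU vaddr via its RAMBlock */
static bool svq_vaddr_to_id(const void *vaddr, uint64_t *id)
{
    ram_addr_t offset;
    RAMBlock *block = qemu_ram_block_from_host((void *)vaddr, false,
                                               &offset);

    if (!block) {
        /* Not guest memory: a host-only address like the SVQ vrings */
        return false;
    }

    *id = block->offset; /* assumption to be verified */
    return true;
}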

Thanks!

> If there's any other relevant information about this issue that you
> think I should know, let me know. I'll start digging into this ASAP and
> will reach out if I need any guidance. :)
>
> Jonah
>
> >> -Siwei
> >>
> >>
> >>>
> >>>> Thanks,
> >>>> -Siwei
> >>>>> Thanks!
> >>>>>
> >>>>>> Of course,
> >>>>>> memory_region_from_host() won't search out of the guest memory space for
> >>>>>> sure. As this could be on the hot data path I have a little bit
> >>>>>> hesitance over the potential cost or performance regression this change
> >>>>>> could bring in, but maybe I'm overthinking it too much...
> >>>>>>
> >>>>>> Thanks,
> >>>>>> -Siwei
> >>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> -Siwei
> >>>>>>>>>               return false;
> >>>>>>>>>           }
> >>>>>>>>>
> >>
> >
>



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC 1/2] iova_tree: add an id member to DMAMap
  2024-04-30 18:11                     ` Eugenio Perez Martin
@ 2024-05-01 22:08                       ` Si-Wei Liu
  2024-05-02  6:18                         ` Eugenio Perez Martin
  0 siblings, 1 reply; 37+ messages in thread
From: Si-Wei Liu @ 2024-05-01 22:08 UTC (permalink / raw)
  To: Eugenio Perez Martin, Jonah Palmer
  Cc: qemu-devel, Michael S. Tsirkin, Lei Yang, Peter Xu,
	Dragos Tatulea, Jason Wang, David Hildenbrand



On 4/30/2024 11:11 AM, Eugenio Perez Martin wrote:
> On Mon, Apr 29, 2024 at 1:19 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>>
>>
>> On 4/29/24 4:14 AM, Eugenio Perez Martin wrote:
>>> On Thu, Apr 25, 2024 at 7:44 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>
>>>>
>>>> On 4/24/2024 12:33 AM, Eugenio Perez Martin wrote:
>>>>> On Wed, Apr 24, 2024 at 12:21 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>>>
>>>>>> On 4/22/2024 1:49 AM, Eugenio Perez Martin wrote:
>>>>>>> On Sat, Apr 20, 2024 at 1:50 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>>>>> On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote:
>>>>>>>>> On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>>>>>>> On 4/10/2024 3:03 AM, Eugenio Pérez wrote:
>>>>>>>>>>> IOVA tree is also used to track the mappings of virtio-net shadow
>>>>>>>>>>> virtqueue.  This mappings may not match with the GPA->HVA ones.
>>>>>>>>>>>
>>>>>>>>>>> This causes a problem when overlapped regions (different GPA but same
>>>>>>>>>>> translated HVA) exists in the tree, as looking them by HVA will return
>>>>>>>>>>> them twice.  To solve this, create an id member so we can assign unique
>>>>>>>>>>> identifiers (GPA) to the maps.
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>>>>>>>>>> ---
>>>>>>>>>>>        include/qemu/iova-tree.h | 5 +++--
>>>>>>>>>>>        util/iova-tree.c         | 3 ++-
>>>>>>>>>>>        2 files changed, 5 insertions(+), 3 deletions(-)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
>>>>>>>>>>> index 2a10a7052e..34ee230e7d 100644
>>>>>>>>>>> --- a/include/qemu/iova-tree.h
>>>>>>>>>>> +++ b/include/qemu/iova-tree.h
>>>>>>>>>>> @@ -36,6 +36,7 @@ typedef struct DMAMap {
>>>>>>>>>>>            hwaddr iova;
>>>>>>>>>>>            hwaddr translated_addr;
>>>>>>>>>>>            hwaddr size;                /* Inclusive */
>>>>>>>>>>> +    uint64_t id;
>>>>>>>>>>>            IOMMUAccessFlags perm;
>>>>>>>>>>>        } QEMU_PACKED DMAMap;
>>>>>>>>>>>        typedef gboolean (*iova_tree_iterator)(DMAMap *map);
>>>>>>>>>>> @@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const DMAMap *map);
>>>>>>>>>>>         * @map: the mapping to search
>>>>>>>>>>>         *
>>>>>>>>>>>         * Search for a mapping in the iova tree that translated_addr overlaps with the
>>>>>>>>>>> - * mapping range specified.  Only the first found mapping will be
>>>>>>>>>>> - * returned.
>>>>>>>>>>> + * mapping range specified and map->id is equal.  Only the first found
>>>>>>>>>>> + * mapping will be returned.
>>>>>>>>>>>         *
>>>>>>>>>>>         * Return: DMAMap pointer if found, or NULL if not found.  Note that
>>>>>>>>>>>         * the returned DMAMap pointer is maintained internally.  User should
>>>>>>>>>>> diff --git a/util/iova-tree.c b/util/iova-tree.c
>>>>>>>>>>> index 536789797e..0863e0a3b8 100644
>>>>>>>>>>> --- a/util/iova-tree.c
>>>>>>>>>>> +++ b/util/iova-tree.c
>>>>>>>>>>> @@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, gpointer value,
>>>>>>>>>>>
>>>>>>>>>>>            needle = args->needle;
>>>>>>>>>>>            if (map->translated_addr + map->size < needle->translated_addr ||
>>>>>>>>>>> -        needle->translated_addr + needle->size < map->translated_addr) {
>>>>>>>>>>> +        needle->translated_addr + needle->size < map->translated_addr ||
>>>>>>>>>>> +        needle->id != map->id) {
>>>>>>>>>> It looks this iterator can also be invoked by SVQ from
>>>>>>>>>> vhost_svq_translate_addr() -> iova_tree_find_iova(), where guest GPA
>>>>>>>>>> space will be searched on without passing in the ID (GPA), and exact
>>>>>>>>>> match for the same GPA range is not actually needed unlike the mapping
>>>>>>>>>> removal case. Could we create an API variant, for the SVQ lookup case
>>>>>>>>>> specifically? Or alternatively, add a special flag, say skip_id_match to
>>>>>>>>>> DMAMap, and the id match check may look like below:
>>>>>>>>>>
>>>>>>>>>> (!needle->skip_id_match && needle->id != map->id)
>>>>>>>>>>
>>>>>>>>>> I think vhost_svq_translate_addr() could just call the API variant or
>>>>>>>>>> pass DMAmap with skip_id_match set to true to svq_iova_tree_find_iova().
>>>>>>>>>>
>>>>>>>>> I think you're totally right. But I'd really like to not complicate
>>>>>>>>> the API of the iova_tree more.
>>>>>>>>>
>>>>>>>>> I think we can look for the hwaddr using memory_region_from_host and
>>>>>>>>> then get the hwaddr. It is another lookup though...
>>>>>>>> Yeah, that will be another means of doing translation without having to
>>>>>>>> complicate the API around iova_tree. I wonder how the lookup through
>>>>>>>> memory_region_from_host() may perform compared to the iova tree one, the
>>>>>>>> former looks to be an O(N) linear search on a linked list while the
>>>>>>>> latter would be roughly O(log N) on an AVL tree?
>>>>>>> Even worse, as the reverse lookup (from QEMU vaddr to SVQ IOVA) is
>>>>>>> linear too. It is not even ordered.
>>>>>> Oh Sorry, I misread the code and I should look for g_tree_foreach ()
>>>>>> instead of g_tree_search_node(). So the former is indeed linear
>>>>>> iteration, but it looks to be ordered?
>>>>>>
>>>>>> https://github.com/GNOME/glib/blob/main/glib/gtree.c#L1115
>>>>> The GPA / IOVA are ordered but we're looking by QEMU's vaddr.
>>>>>
>>>>> If we have these translations:
>>>>> [0x1000, 0x2000] -> [0x10000, 0x11000]
>>>>> [0x2000, 0x3000] -> [0x6000, 0x7000]
>>>>>
>>>>> We will see them in this order, so we cannot stop the search at the first node.
>>>> Yeah, reverse lookup is unordered indeed, anyway.
>>>>
>>>>>>> But apart from this detail you're right, I have the same concerns with
>>>>>>> this solution too. If we see a hard performance regression we could go
>>>>>>> to more complicated solutions, like maintaining a reverse IOVATree in
>>>>>>> vhost-iova-tree too. First RFCs of SVQ did that actually.
>>>>>> Agreed, yeap we can use memory_region_from_host for now.  Any reason why
>>>>>> reverse IOVATree was dropped, lack of users? But now we have one!
>>>>>>
>>>>> No, it is just simplicity. We already have an user in the hot patch in
>>>>> the master branch, vhost_svq_vring_write_descs. But I never profiled
>>>>> enough to find if it is a bottleneck or not to be honest.
>>>> Right, without vIOMMU or a lot of virtqueues / mappings, it's hard to
>>>> profile and see the difference.
>>>>> I'll send the new series by today, thank you for finding these issues!
>>>> Thanks! In case you don't have bandwidth to add back reverse IOVA tree,
>>>> Jonah (cc'ed) may have interest in looking into it.
>>>>
>>> Actually, yes. I've tried to solve it using:
>>> memory_region_get_ram_ptr -> It's hard to get this pointer to work
>>> without messing a lot with IOVATree.
>>> memory_region_find -> I'm totally unable to make it return sections
>>> that make sense
>>> flatview_for_each_range -> It does not return the same
>>> MemoryRegionsection as the listener, not sure why.
>>>
>>> The only advance I have is that memory_region_from_host is able to
>>> tell if the vaddr is from the guest or not.
>>>
>>> So I'm convinced there must be a way to do it with the memory
>>> subsystem, but I think the best way to do it ATM is to store a
>>> parallel tree with GPA-> SVQ IOVA translations. At removal time, if we
>>> find the entry in this new tree, we can directly remove it by GPA. If
>>> not, assume it is a host-only address like SVQ vrings, and remove by
>>> iterating on vaddr as we do now. It is guaranteed the guest does not
>>> translate to that vaddr and that that vaddr is unique in the tree
>>> anyway.
>>>
>>> Does it sound reasonable? Jonah, would you be interested in moving this forward?
>>>
>>> Thanks!
>>>
>> Sure, I'd be more than happy to work on this stuff! I can probably get
>> started on this either today or tomorrow.
>>
>> Si-Wei mentioned something about these "reverse IOVATree" patches that
>> were dropped;
> The patches implementing the reverse IOVA tree were never created /
> posted, just in case you try to look for them.
>
>
>> is this relevant to what you're asking here? Is it
>> something I should base my work off of?
>>
> So these patches work ok for adding and removing maps. We assign ids,
> which is the gpa of the memory region that the listener receives. The
> bad news is that SVQ also needs this id to look for the right
> translation at vhost_svq_translate_addr, so this series is not
> complete.
I have a fundamental question to ask here. Are we sure SVQ really needs 
this id (GPA) to identify the right translation? Or are we just overly 
concerned about the aliased maps, where a single HVA is mapped to 
multiple IOVAs / GPAs, and where the overlap is an almost transient 
mapping that usually goes away very soon after the guest memory layout 
is stabilized? From what I can tell, the caller in the SVQ datapath 
code (vhost_svq_vring_write_descs) just calls into 
vhost_iova_tree_find_iova to look up an IOVA translation rather than to 
identify a specific section of a memory region; only the latter would 
need the id (GPA) to perform an exact match. The removal case would 
definitely need a perfect match on GPA with the additional id, but I 
don't find it necessary for the vhost_svq_vring_write_descs code to pass 
in the id / GPA. Am I missing something?

Thanks,
-Siwei

> You can find the
> vhost_iova_tree_find_iova()->iova_tree_find_iova() call there.
>
> The easiest solution is the reverse IOVA tree of HVA -> SVQ IOVA. It
> is also the less elegant and (potentially) the less performant, as it
> includes duplicating information that QEMU already has, and a
> potentially linear search.
>
> David Hildenbrand (CCed) proposed to try iterating through RAMBlocks.
> I guess qemu_ram_block_from_host() could return a block where
> block->offset is the id of the map?
>
> It would be great to try this approach. If you don't have the bandwith
> for this, going directly for the reverse iova tree is also a valid
> solution.
>
> Thanks!
>
>> If there's any other relevant information about this issue that you
>> think I should know, let me know. I'll start digging into this ASAP and
>> will reach out if I need any guidance. :)
>>
>> Jonah
>>
>>>> -Siwei
>>>>
>>>>
>>>>>> Thanks,
>>>>>> -Siwei
>>>>>>> Thanks!
>>>>>>>
>>>>>>>> Of course,
>>>>>>>> memory_region_from_host() won't search out of the guest memory space for
>>>>>>>> sure. As this could be on the hot data path I have a little bit
>>>>>>>> hesitance over the potential cost or performance regression this change
>>>>>>>> could bring in, but maybe I'm overthinking it too much...
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> -Siwei
>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> -Siwei
>>>>>>>>>>>                return false;
>>>>>>>>>>>            }
>>>>>>>>>>>



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC 1/2] iova_tree: add an id member to DMAMap
  2024-04-30 17:19                     ` Eugenio Perez Martin
@ 2024-05-01 23:13                       ` Si-Wei Liu
  2024-05-02  6:44                         ` Eugenio Perez Martin
  0 siblings, 1 reply; 37+ messages in thread
From: Si-Wei Liu @ 2024-05-01 23:13 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Jonah Palmer, qemu-devel, Michael S. Tsirkin, Lei Yang, Peter Xu,
	Dragos Tatulea, Jason Wang



On 4/30/2024 10:19 AM, Eugenio Perez Martin wrote:
> On Tue, Apr 30, 2024 at 7:55 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>
>>
>> On 4/29/2024 1:14 AM, Eugenio Perez Martin wrote:
>>> On Thu, Apr 25, 2024 at 7:44 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>
>>>> On 4/24/2024 12:33 AM, Eugenio Perez Martin wrote:
>>>>> On Wed, Apr 24, 2024 at 12:21 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>>> On 4/22/2024 1:49 AM, Eugenio Perez Martin wrote:
>>>>>>> On Sat, Apr 20, 2024 at 1:50 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>>>>> On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote:
>>>>>>>>> On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>>>>>>> On 4/10/2024 3:03 AM, Eugenio Pérez wrote:
>>>>>>>>>>> IOVA tree is also used to track the mappings of virtio-net shadow
>>>>>>>>>>> virtqueue.  This mappings may not match with the GPA->HVA ones.
>>>>>>>>>>>
>>>>>>>>>>> This causes a problem when overlapped regions (different GPA but same
>>>>>>>>>>> translated HVA) exists in the tree, as looking them by HVA will return
>>>>>>>>>>> them twice.  To solve this, create an id member so we can assign unique
>>>>>>>>>>> identifiers (GPA) to the maps.
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>>>>>>>>>> ---
>>>>>>>>>>>        include/qemu/iova-tree.h | 5 +++--
>>>>>>>>>>>        util/iova-tree.c         | 3 ++-
>>>>>>>>>>>        2 files changed, 5 insertions(+), 3 deletions(-)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
>>>>>>>>>>> index 2a10a7052e..34ee230e7d 100644
>>>>>>>>>>> --- a/include/qemu/iova-tree.h
>>>>>>>>>>> +++ b/include/qemu/iova-tree.h
>>>>>>>>>>> @@ -36,6 +36,7 @@ typedef struct DMAMap {
>>>>>>>>>>>            hwaddr iova;
>>>>>>>>>>>            hwaddr translated_addr;
>>>>>>>>>>>            hwaddr size;                /* Inclusive */
>>>>>>>>>>> +    uint64_t id;
>>>>>>>>>>>            IOMMUAccessFlags perm;
>>>>>>>>>>>        } QEMU_PACKED DMAMap;
>>>>>>>>>>>        typedef gboolean (*iova_tree_iterator)(DMAMap *map);
>>>>>>>>>>> @@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const DMAMap *map);
>>>>>>>>>>>         * @map: the mapping to search
>>>>>>>>>>>         *
>>>>>>>>>>>         * Search for a mapping in the iova tree that translated_addr overlaps with the
>>>>>>>>>>> - * mapping range specified.  Only the first found mapping will be
>>>>>>>>>>> - * returned.
>>>>>>>>>>> + * mapping range specified and map->id is equal.  Only the first found
>>>>>>>>>>> + * mapping will be returned.
>>>>>>>>>>>         *
>>>>>>>>>>>         * Return: DMAMap pointer if found, or NULL if not found.  Note that
>>>>>>>>>>>         * the returned DMAMap pointer is maintained internally.  User should
>>>>>>>>>>> diff --git a/util/iova-tree.c b/util/iova-tree.c
>>>>>>>>>>> index 536789797e..0863e0a3b8 100644
>>>>>>>>>>> --- a/util/iova-tree.c
>>>>>>>>>>> +++ b/util/iova-tree.c
>>>>>>>>>>> @@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, gpointer value,
>>>>>>>>>>>
>>>>>>>>>>>            needle = args->needle;
>>>>>>>>>>>            if (map->translated_addr + map->size < needle->translated_addr ||
>>>>>>>>>>> -        needle->translated_addr + needle->size < map->translated_addr) {
>>>>>>>>>>> +        needle->translated_addr + needle->size < map->translated_addr ||
>>>>>>>>>>> +        needle->id != map->id) {
>>>>>>>>>> It looks this iterator can also be invoked by SVQ from
>>>>>>>>>> vhost_svq_translate_addr() -> iova_tree_find_iova(), where guest GPA
>>>>>>>>>> space will be searched on without passing in the ID (GPA), and exact
>>>>>>>>>> match for the same GPA range is not actually needed unlike the mapping
>>>>>>>>>> removal case. Could we create an API variant, for the SVQ lookup case
>>>>>>>>>> specifically? Or alternatively, add a special flag, say skip_id_match to
>>>>>>>>>> DMAMap, and the id match check may look like below:
>>>>>>>>>>
>>>>>>>>>> (!needle->skip_id_match && needle->id != map->id)
>>>>>>>>>>
>>>>>>>>>> I think vhost_svq_translate_addr() could just call the API variant or
>>>>>>>>>> pass DMAmap with skip_id_match set to true to svq_iova_tree_find_iova().
>>>>>>>>>>
>>>>>>>>> I think you're totally right. But I'd really like to not complicate
>>>>>>>>> the API of the iova_tree more.
>>>>>>>>>
>>>>>>>>> I think we can look for the hwaddr using memory_region_from_host and
>>>>>>>>> then get the hwaddr. It is another lookup though...
>>>>>>>> Yeah, that will be another means of doing translation without having to
>>>>>>>> complicate the API around iova_tree. I wonder how the lookup through
>>>>>>>> memory_region_from_host() may perform compared to the iova tree one, the
>>>>>>>> former looks to be an O(N) linear search on a linked list while the
>>>>>>>> latter would be roughly O(log N) on an AVL tree?
>>>>>>> Even worse, as the reverse lookup (from QEMU vaddr to SVQ IOVA) is
>>>>>>> linear too. It is not even ordered.
>>>>>> Oh Sorry, I misread the code and I should look for g_tree_foreach ()
>>>>>> instead of g_tree_search_node(). So the former is indeed linear
>>>>>> iteration, but it looks to be ordered?
>>>>>>
>>>>>> https://github.com/GNOME/glib/blob/main/glib/gtree.c#L1115
>>>>> The GPA / IOVA are ordered but we're looking by QEMU's vaddr.
>>>>>
>>>>> If we have these translations:
>>>>> [0x1000, 0x2000] -> [0x10000, 0x11000]
>>>>> [0x2000, 0x3000] -> [0x6000, 0x7000]
>>>>>
>>>>> We will see them in this order, so we cannot stop the search at the first node.
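
A self-contained GLib sketch of that point: the tree is keyed and walked
in IOVA order, so a lookup by vaddr cannot prune the traversal and must
visit nodes until one actually matches. The Map and Args types are
stand-ins, not the QEMU ones:

    #include <glib.h>
    #include <stdio.h>

    typedef struct { guint64 iova, vaddr, size; } Map;
    typedef struct { guint64 vaddr; const Map *found; } Args;

    static gint iova_cmp(gconstpointer a, gconstpointer b, gpointer data)
    {
        const Map *ma = a, *mb = b;
        return ma->iova < mb->iova ? -1 : ma->iova > mb->iova;
    }

    /* Visited in IOVA order, so vaddr arrives unordered: no early cutoff. */
    static gboolean find_by_vaddr(gpointer key, gpointer value, gpointer data)
    {
        const Map *m = key;
        Args *args = data;

        if (args->vaddr >= m->vaddr && args->vaddr < m->vaddr + m->size) {
            args->found = m;
            return TRUE;        /* stop only on an actual match */
        }
        return FALSE;           /* keep going: a later node may be lower */
    }

    int main(void)
    {
        GTree *t = g_tree_new_full(iova_cmp, NULL, NULL, NULL);
        /* The two translations from the example above. */
        Map a = { .iova = 0x1000, .vaddr = 0x10000, .size = 0x1000 };
        Map b = { .iova = 0x2000, .vaddr = 0x6000,  .size = 0x1000 };
        Args args = { .vaddr = 0x6800 };

        g_tree_insert(t, &a, &a);
        g_tree_insert(t, &b, &b);
        g_tree_foreach(t, find_by_vaddr, &args);
        printf("vaddr 0x%" G_GINT64_MODIFIER "x -> iova 0x%" G_GINT64_MODIFIER "x\n",
               args.vaddr, args.found->iova);
        g_tree_destroy(t);
        return 0;
    }
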
>>>> Yeah, reverse lookup is unordered indeed, anyway.
>>>>
>>>>>>> But apart from this detail you're right, I have the same concerns with
>>>>>>> this solution too. If we see a hard performance regression we could go
>>>>>>> to more complicated solutions, like maintaining a reverse IOVATree in
>>>>>>> vhost-iova-tree too. First RFCs of SVQ did that actually.
>>>>>> Agreed, yeap we can use memory_region_from_host for now.  Any reason why
>>>>>> reverse IOVATree was dropped, lack of users? But now we have one!
>>>>>>
>>>>> No, it is just simplicity. We already have an user in the hot patch in
>>>>> the master branch, vhost_svq_vring_write_descs. But I never profiled
>>>>> enough to find if it is a bottleneck or not to be honest.
>>>> Right, without vIOMMU or a lot of virtqueues / mappings, it's hard to
>>>> profile and see the difference.
>>>>> I'll send the new series by today, thank you for finding these issues!
>>>> Thanks! In case you don't have bandwidth to add back reverse IOVA tree,
>>>> Jonah (cc'ed) may have interest in looking into it.
>>>>
>>> Actually, yes. I've tried to solve it using:
>>> memory_region_get_ram_ptr -> It's hard to get this pointer to work
>>> without messing a lot with IOVATree.
>>> memory_region_find -> I'm totally unable to make it return sections
>>> that make sense
>>> flatview_for_each_range -> It does not return the same
>>> MemoryRegionsection as the listener, not sure why.
>> Ouch, thank you for the summary of attempts that were done earlier.
>>> The only advance I have is that memory_region_from_host is able to
>>> tell if the vaddr is from the guest or not.
>> Hmmm, then it won't be too useful without a direct means to identifying
>> the exact memory region associated with the iova that is being mapped.
>> And, this additional indirection seems introduce a tiny bit of more
>> latency in the reverse lookup routine (should not be a scalability issue
>> though if it's a linear search)?
>>
> I didn't measure, but I guess yes it might. OTOH these structs may be
> cached because virtqueue_pop just looked for them.
Oh, right, that's a good point.
>
>>> So I'm convinced there must be a way to do it with the memory
>>> subsystem, but I think the best way to do it ATM is to store a
>>> parallel tree with GPA-> SVQ IOVA translations. At removal time, if we
>>> find the entry in this new tree, we can directly remove it by GPA. If
>>> not, assume it is a host-only address like SVQ vrings, and remove by
>>> iterating on vaddr as we do now.
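
A rough sketch of that removal flow; the gpa_tree member (keyed by GPA in
.iova, with the SVQ IOVA in .translated_addr) and the helper name are made
up here, while the iova_tree_*() calls mirror the existing
util/iova-tree.h API:

    /* Sketch only: remove a map, trying an exact GPA match first. */
    static void vhost_iova_tree_remove_sketch(VhostIOVATree *tree,
                                              hwaddr gpa, DMAMap map)
    {
        const DMAMap gpa_needle = { .iova = gpa, .size = map.size };
        const DMAMap *by_gpa = iova_tree_find(tree->gpa_tree, &gpa_needle);

        if (by_gpa) {
            /* Guest memory: GPA is unique, so this picks the right alias. */
            map.iova = by_gpa->translated_addr;     /* the SVQ IOVA */
            iova_tree_remove(tree->gpa_tree, *by_gpa);
            iova_tree_remove(tree->iova_taddr_map, map);
        } else {
            /* Host-only address (e.g. SVQ vrings): the vaddr is unique. */
            const DMAMap *by_hva = iova_tree_find_iova(tree->iova_taddr_map,
                                                       &map);
            if (by_hva) {
                iova_tree_remove(tree->iova_taddr_map, *by_hva);
            }
        }
    }
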
>> Yeah, this could work I think. On the other hand, given that we are now
>> trying to improve it, I wonder if possible to come up with a fast
>> version for the SVQ (host-only address) case without having to look up
>> twice? SVQ callers should be able to tell apart from the guest case
>> where GPA -> IOVA translation doesn't exist? Or just maintain a parallel
>> tree with HVA -> IOVA translations for SVQ reverse lookup only? I feel
>> SVQ mappings may be worth a separate fast lookup path - unlike guest
>> mappings, the insertion, lookup and removal for SVQ mappings seem
>> unavoidable during the migration downtime path.
>>
> I think the ideal order is the opposite actually. So:
> 1) Try for the NIC to support _F_VRING_ASID, no translation needed by QEMU
Right, that's the case for _F_VRING_ASID, which is simple and easy to 
deal with. Though I think this is an edge case across all vendor 
devices, as most likely only those no-chip IOMMU parents may support it. 
It's a luxury for a normal device to steal another VF for this ASID feature...

> 2) Try reverse lookup from HVA to GPA. Since dataplane should fit
> this, we should test this first
So instead of a direct lookup from HVA to IOVA, the purpose of the extra
reverse lookup from HVA to GPA is to verify the validity of the GPA (to
avoid it being mistakenly picked from an overlapped region)? But this
would seem to require scanning the entire GPA space to identify possible
GPA ranges that are potentially overlapped. I wonder if there is a
possibility to simplify this: could we go through this extra layer of
GPA-wide scan and validation *only* when an overlap is indeed detected
during the memory listener's region_add (say, when we try to insert a
duplicate / overlapped HVA into the HVA -> IOVA tree)? Or, simply put,
the first match on the reverse lookup would mostly suffice, since we
know the virtio driver can't use guest memory from these overlapped
regions? You may say this assumption is too bold, but do we have other
means to guarantee the first match will always hit under SVQ lookup?
Given that we didn't receive a single issue report until we moved the
memory listener registration up front to device initialization, I guess
there should be some point, or certain conditions, under which the
non-overlapped 1:1 translation and lookup can be satisfied. Don't you
agree?
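
One possible shape of that region_add idea, purely as a sketch (every name
below except iova_tree_find_iova() is hypothetical, and the slow validated
path is left undefined):

    /* Called from the listener's region_add path, before insertion. */
    static void note_hva_overlap(VhostIOVATree *tree, const DMAMap *new_map)
    {
        if (iova_tree_find_iova(tree->iova_taddr_map, new_map)) {
            tree->hva_overlap_seen = true;  /* first match no longer safe */
        }
    }

    /* SVQ lookup: trust the first match unless an overlap was ever seen. */
    static const DMAMap *svq_find_iova(VhostIOVATree *tree,
                                       const DMAMap *needle)
    {
        if (!tree->hva_overlap_seen) {
            return iova_tree_find_iova(tree->iova_taddr_map, needle);
        }
        return svq_find_iova_validated(tree, needle);   /* hypothetical */
    }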

Thanks,
-Siwei
> 3) Look in SVQ host-only entries (SVQ, current shadow CVQ). It is the
> control VQ, speed is not so important.
>
> Overlapping regions may return the wrong SVQ IOVA though. We should
> take extra care to make sure these are correctly handled. I mean,
> there are valid translations in the tree unless the driver is buggy,
> just may need to span many translations.
>
>>>    It is guaranteed the guest does not
>>> translate to that vaddr and that that vaddr is unique in the tree
>>> anyway.
>>>
>>> Does it sound reasonable? Jonah, would you be interested in moving this forward?
>> My thought would be that the reverse IOVA tree stuff can be added as a
>> follow-up optimization right after for extended scalability, but for now
>> as the interim, we may still need some form of simple fix, so as to
>> quickly unblock the other dependent work built on top of this one and
>> the early pinning series [1]. With it said, I'm completely fine if
>> performing the reverse lookup through linear tree walk e.g.
>> g_tree_foreach(), that should suffice small VM configs with just a
>> couple of queues and limited number of memory regions. Going forward, to
>> address the scalability bottleneck, Jonah could just replace the
>> corresponding API call with the one built on top of reverse IOVA tree (I
>> presume the use of these iova tree APIs is kind of internal that only
>> limits to SVQ and vhost-vdpa subsystems) once he gets there, and then
>> eliminate the other API variants that will no longer be in use. What do
>> you think about this idea / plan?
>>
> Yeah it makes sense to me. Hopefully we can even get rid of the id member.
>
>> Thanks,
>> -Siwei
>>
>> [1] https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html
>>
>>> Thanks!
>>>
>>>> -Siwei
>>>>
>>>>
>>>>>> Thanks,
>>>>>> -Siwei
>>>>>>> Thanks!
>>>>>>>
>>>>>>>> Of course,
>>>>>>>> memory_region_from_host() won't search out of the guest memory space for
>>>>>>>> sure. As this could be on the hot data path I have a little bit
>>>>>>>> hesitance over the potential cost or performance regression this change
>>>>>>>> could bring in, but maybe I'm overthinking it too much...
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> -Siwei
>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> -Siwei
>>>>>>>>>>>                return false;
>>>>>>>>>>>            }
>>>>>>>>>>>



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC 1/2] iova_tree: add an id member to DMAMap
  2024-05-01 22:08                       ` Si-Wei Liu
@ 2024-05-02  6:18                         ` Eugenio Perez Martin
  2024-05-07  9:12                           ` Si-Wei Liu
  0 siblings, 1 reply; 37+ messages in thread
From: Eugenio Perez Martin @ 2024-05-02  6:18 UTC (permalink / raw)
  To: Si-Wei Liu
  Cc: Jonah Palmer, qemu-devel, Michael S. Tsirkin, Lei Yang, Peter Xu,
	Dragos Tatulea, Jason Wang, David Hildenbrand

On Thu, May 2, 2024 at 12:09 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
>
>
> On 4/30/2024 11:11 AM, Eugenio Perez Martin wrote:
> > On Mon, Apr 29, 2024 at 1:19 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
> >>
> >>
> >> On 4/29/24 4:14 AM, Eugenio Perez Martin wrote:
> >>> On Thu, Apr 25, 2024 at 7:44 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>
> >>>>
> >>>> On 4/24/2024 12:33 AM, Eugenio Perez Martin wrote:
> >>>>> On Wed, Apr 24, 2024 at 12:21 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>>>
> >>>>>> On 4/22/2024 1:49 AM, Eugenio Perez Martin wrote:
> >>>>>>> On Sat, Apr 20, 2024 at 1:50 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>>>>> On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote:
> >>>>>>>>> On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>>>>>>> On 4/10/2024 3:03 AM, Eugenio Pérez wrote:
> >>>>>>>>>>> IOVA tree is also used to track the mappings of virtio-net shadow
> >>>>>>>>>>> virtqueue.  This mappings may not match with the GPA->HVA ones.
> >>>>>>>>>>>
> >>>>>>>>>>> This causes a problem when overlapped regions (different GPA but same
> >>>>>>>>>>> translated HVA) exists in the tree, as looking them by HVA will return
> >>>>>>>>>>> them twice.  To solve this, create an id member so we can assign unique
> >>>>>>>>>>> identifiers (GPA) to the maps.
> >>>>>>>>>>>
> >>>>>>>>>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> >>>>>>>>>>> ---
> >>>>>>>>>>>        include/qemu/iova-tree.h | 5 +++--
> >>>>>>>>>>>        util/iova-tree.c         | 3 ++-
> >>>>>>>>>>>        2 files changed, 5 insertions(+), 3 deletions(-)
> >>>>>>>>>>>
> >>>>>>>>>>> diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
> >>>>>>>>>>> index 2a10a7052e..34ee230e7d 100644
> >>>>>>>>>>> --- a/include/qemu/iova-tree.h
> >>>>>>>>>>> +++ b/include/qemu/iova-tree.h
> >>>>>>>>>>> @@ -36,6 +36,7 @@ typedef struct DMAMap {
> >>>>>>>>>>>            hwaddr iova;
> >>>>>>>>>>>            hwaddr translated_addr;
> >>>>>>>>>>>            hwaddr size;                /* Inclusive */
> >>>>>>>>>>> +    uint64_t id;
> >>>>>>>>>>>            IOMMUAccessFlags perm;
> >>>>>>>>>>>        } QEMU_PACKED DMAMap;
> >>>>>>>>>>>        typedef gboolean (*iova_tree_iterator)(DMAMap *map);
> >>>>>>>>>>> @@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const DMAMap *map);
> >>>>>>>>>>>         * @map: the mapping to search
> >>>>>>>>>>>         *
> >>>>>>>>>>>         * Search for a mapping in the iova tree that translated_addr overlaps with the
> >>>>>>>>>>> - * mapping range specified.  Only the first found mapping will be
> >>>>>>>>>>> - * returned.
> >>>>>>>>>>> + * mapping range specified and map->id is equal.  Only the first found
> >>>>>>>>>>> + * mapping will be returned.
> >>>>>>>>>>>         *
> >>>>>>>>>>>         * Return: DMAMap pointer if found, or NULL if not found.  Note that
> >>>>>>>>>>>         * the returned DMAMap pointer is maintained internally.  User should
> >>>>>>>>>>> diff --git a/util/iova-tree.c b/util/iova-tree.c
> >>>>>>>>>>> index 536789797e..0863e0a3b8 100644
> >>>>>>>>>>> --- a/util/iova-tree.c
> >>>>>>>>>>> +++ b/util/iova-tree.c
> >>>>>>>>>>> @@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, gpointer value,
> >>>>>>>>>>>
> >>>>>>>>>>>            needle = args->needle;
> >>>>>>>>>>>            if (map->translated_addr + map->size < needle->translated_addr ||
> >>>>>>>>>>> -        needle->translated_addr + needle->size < map->translated_addr) {
> >>>>>>>>>>> +        needle->translated_addr + needle->size < map->translated_addr ||
> >>>>>>>>>>> +        needle->id != map->id) {
> >>>>>>>>>> It looks this iterator can also be invoked by SVQ from
> >>>>>>>>>> vhost_svq_translate_addr() -> iova_tree_find_iova(), where guest GPA
> >>>>>>>>>> space will be searched on without passing in the ID (GPA), and exact
> >>>>>>>>>> match for the same GPA range is not actually needed unlike the mapping
> >>>>>>>>>> removal case. Could we create an API variant, for the SVQ lookup case
> >>>>>>>>>> specifically? Or alternatively, add a special flag, say skip_id_match to
> >>>>>>>>>> DMAMap, and the id match check may look like below:
> >>>>>>>>>>
> >>>>>>>>>> (!needle->skip_id_match && needle->id != map->id)
> >>>>>>>>>>
> >>>>>>>>>> I think vhost_svq_translate_addr() could just call the API variant or
> >>>>>>>>>> pass DMAmap with skip_id_match set to true to svq_iova_tree_find_iova().
> >>>>>>>>>>
> >>>>>>>>> I think you're totally right. But I'd really like to not complicate
> >>>>>>>>> the API of the iova_tree more.
> >>>>>>>>>
> >>>>>>>>> I think we can look for the hwaddr using memory_region_from_host and
> >>>>>>>>> then get the hwaddr. It is another lookup though...
> >>>>>>>> Yeah, that will be another means of doing translation without having to
> >>>>>>>> complicate the API around iova_tree. I wonder how the lookup through
> >>>>>>>> memory_region_from_host() may perform compared to the iova tree one, the
> >>>>>>>> former looks to be an O(N) linear search on a linked list while the
> >>>>>>>> latter would be roughly O(log N) on an AVL tree?
> >>>>>>> Even worse, as the reverse lookup (from QEMU vaddr to SVQ IOVA) is
> >>>>>>> linear too. It is not even ordered.
> >>>>>> Oh Sorry, I misread the code and I should look for g_tree_foreach ()
> >>>>>> instead of g_tree_search_node(). So the former is indeed linear
> >>>>>> iteration, but it looks to be ordered?
> >>>>>>
> >>>>>> https://github.com/GNOME/glib/blob/main/glib/gtree.c#L1115
> >>>>> The GPA / IOVA are ordered but we're looking by QEMU's vaddr.
> >>>>>
> >>>>> If we have these translations:
> >>>>> [0x1000, 0x2000] -> [0x10000, 0x11000]
> >>>>> [0x2000, 0x3000] -> [0x6000, 0x7000]
> >>>>>
> >>>>> We will see them in this order, so we cannot stop the search at the first node.
> >>>> Yeah, reverse lookup is unordered indeed, anyway.
> >>>>
> >>>>>>> But apart from this detail you're right, I have the same concerns with
> >>>>>>> this solution too. If we see a hard performance regression we could go
> >>>>>>> to more complicated solutions, like maintaining a reverse IOVATree in
> >>>>>>> vhost-iova-tree too. First RFCs of SVQ did that actually.
> >>>>>> Agreed, yeap we can use memory_region_from_host for now.  Any reason why
> >>>>>> reverse IOVATree was dropped, lack of users? But now we have one!
> >>>>>>
> >>>>> No, it is just simplicity. We already have an user in the hot patch in
> >>>>> the master branch, vhost_svq_vring_write_descs. But I never profiled
> >>>>> enough to find if it is a bottleneck or not to be honest.
> >>>> Right, without vIOMMU or a lot of virtqueues / mappings, it's hard to
> >>>> profile and see the difference.
> >>>>> I'll send the new series by today, thank you for finding these issues!
> >>>> Thanks! In case you don't have bandwidth to add back reverse IOVA tree,
> >>>> Jonah (cc'ed) may have interest in looking into it.
> >>>>
> >>> Actually, yes. I've tried to solve it using:
> >>> memory_region_get_ram_ptr -> It's hard to get this pointer to work
> >>> without messing a lot with IOVATree.
> >>> memory_region_find -> I'm totally unable to make it return sections
> >>> that make sense
> >>> flatview_for_each_range -> It does not return the same
> >>> MemoryRegionsection as the listener, not sure why.
> >>>
> >>> The only advance I have is that memory_region_from_host is able to
> >>> tell if the vaddr is from the guest or not.
> >>>
> >>> So I'm convinced there must be a way to do it with the memory
> >>> subsystem, but I think the best way to do it ATM is to store a
> >>> parallel tree with GPA-> SVQ IOVA translations. At removal time, if we
> >>> find the entry in this new tree, we can directly remove it by GPA. If
> >>> not, assume it is a host-only address like SVQ vrings, and remove by
> >>> iterating on vaddr as we do now. It is guaranteed the guest does not
> >>> translate to that vaddr and that that vaddr is unique in the tree
> >>> anyway.
> >>>
> >>> Does it sound reasonable? Jonah, would you be interested in moving this forward?
> >>>
> >>> Thanks!
> >>>
> >> Sure, I'd be more than happy to work on this stuff! I can probably get
> >> started on this either today or tomorrow.
> >>
> >> Si-Wei mentioned something about these "reverse IOVATree" patches that
> >> were dropped;
> > The patches implementing the reverse IOVA tree were never created /
> > posted, just in case you try to look for them.
> >
> >
> >> is this relevant to what you're asking here? Is it
> >> something I should base my work off of?
> >>
> > So these patches work ok for adding and removing maps. We assign ids,
> > which is the gpa of the memory region that the listener receives. The
> > bad news is that SVQ also needs this id to look for the right
> > translation at vhost_svq_translate_addr, so this series is not
> > complete.
> I have a fundamental question to ask here. Are we sure SVQ really needs
> this id (GPA) to identify the right translation? Or we're just
> concerning much about the aliased map where there could be one single
> HVA mapped to multiple IOVAs / GPAs, which (the overlapped) is almost
> transient mapping that usually goes away very soon after guest memory
> layout is stabilized?

Are we sure all of the overlaps go away after the memory layout is
stabilized in all conditions? I think it is worth not making two
different ways to ask the tree depending on what part of QEMU we are in.

> For what I can tell, the caller in SVQ datapath
> code (vhost_svq_vring_write_descs) just calls into
> vhost_iova_tree_find_iova to look for IOVA translation rather than
> identify a specific section on the memory region, the latter of which
> would need the id (GPA) to perform an exact match. The removal case
> would definitely need perfect match on GPA with the additional id, but I
> don't find it necessary for the vhost_svq_vring_write_descs code to pass
> in the id / GPA? Do I miss something?
>

Expanding on the other thread, as there are more concrete points
there. Please let me know if I missed something.

> Thanks,
> -Siwei
>
> > You can find the
> > vhost_iova_tree_find_iova()->iova_tree_find_iova() call there.
> >
> > The easiest solution is the reverse IOVA tree of HVA -> SVQ IOVA. It
> > is also the less elegant and (potentially) the less performant, as it
> > includes duplicating information that QEMU already has, and a
> > potentially linear search.
> >
> > David Hildenbrand (CCed) proposed to try iterating through RAMBlocks.
> > I guess qemu_ram_block_from_host() could return a block where
> > block->offset is the id of the map?
> >
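A sketch of what that could look like. qemu_ram_block_from_host() and
qemu_ram_get_offset() are existing QEMU helpers as far as I can tell;
using the block offset as DMAMap.id is the assumption to be tested:

    #include "qemu/osdep.h"
    #include "exec/cpu-common.h"

    /* Sketch: derive a map id for a vaddr from its RAMBlock, if any. */
    static uint64_t dma_map_id_from_hva(void *hva)
    {
        ram_addr_t offset_within_block;
        RAMBlock *block = qemu_ram_block_from_host(hva, false,
                                                   &offset_within_block);

        if (!block) {
            return 0;   /* host-only address, e.g. an SVQ vring */
        }
        /* The block's offset identifies the region: a candidate map id. */
        return qemu_ram_get_offset(block);
    }
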
> > It would be great to try this approach. If you don't have the bandwith
> > for this, going directly for the reverse iova tree is also a valid
> > solution.
> >
> > Thanks!
> >
> >> If there's any other relevant information about this issue that you
> >> think I should know, let me know. I'll start digging into this ASAP and
> >> will reach out if I need any guidance. :)
> >>
> >> Jonah
> >>
> >>>> -Siwei
> >>>>
> >>>>
> >>>>>> Thanks,
> >>>>>> -Siwei
> >>>>>>> Thanks!
> >>>>>>>
> >>>>>>>> Of course,
> >>>>>>>> memory_region_from_host() won't search out of the guest memory space for
> >>>>>>>> sure. As this could be on the hot data path I have a little bit
> >>>>>>>> hesitance over the potential cost or performance regression this change
> >>>>>>>> could bring in, but maybe I'm overthinking it too much...
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> -Siwei
> >>>>>>>>
> >>>>>>>>>> Thanks,
> >>>>>>>>>> -Siwei
> >>>>>>>>>>>                return false;
> >>>>>>>>>>>            }
> >>>>>>>>>>>
>



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC 1/2] iova_tree: add an id member to DMAMap
  2024-05-01 23:13                       ` Si-Wei Liu
@ 2024-05-02  6:44                         ` Eugenio Perez Martin
  2024-05-08  0:52                           ` Si-Wei Liu
  0 siblings, 1 reply; 37+ messages in thread
From: Eugenio Perez Martin @ 2024-05-02  6:44 UTC (permalink / raw)
  To: Si-Wei Liu
  Cc: Jonah Palmer, qemu-devel, Michael S. Tsirkin, Lei Yang, Peter Xu,
	Dragos Tatulea, Jason Wang

On Thu, May 2, 2024 at 1:16 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
>
>
> On 4/30/2024 10:19 AM, Eugenio Perez Martin wrote:
> > On Tue, Apr 30, 2024 at 7:55 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>
> >>
> >> On 4/29/2024 1:14 AM, Eugenio Perez Martin wrote:
> >>> On Thu, Apr 25, 2024 at 7:44 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>
> >>>> On 4/24/2024 12:33 AM, Eugenio Perez Martin wrote:
> >>>>> On Wed, Apr 24, 2024 at 12:21 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>>> On 4/22/2024 1:49 AM, Eugenio Perez Martin wrote:
> >>>>>>> On Sat, Apr 20, 2024 at 1:50 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>>>>> On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote:
> >>>>>>>>> On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>>>>>>> On 4/10/2024 3:03 AM, Eugenio Pérez wrote:
> >>>>>>>>>>> IOVA tree is also used to track the mappings of virtio-net shadow
> >>>>>>>>>>> virtqueue.  This mappings may not match with the GPA->HVA ones.
> >>>>>>>>>>>
> >>>>>>>>>>> This causes a problem when overlapped regions (different GPA but same
> >>>>>>>>>>> translated HVA) exists in the tree, as looking them by HVA will return
> >>>>>>>>>>> them twice.  To solve this, create an id member so we can assign unique
> >>>>>>>>>>> identifiers (GPA) to the maps.
> >>>>>>>>>>>
> >>>>>>>>>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> >>>>>>>>>>> ---
> >>>>>>>>>>>        include/qemu/iova-tree.h | 5 +++--
> >>>>>>>>>>>        util/iova-tree.c         | 3 ++-
> >>>>>>>>>>>        2 files changed, 5 insertions(+), 3 deletions(-)
> >>>>>>>>>>>
> >>>>>>>>>>> diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
> >>>>>>>>>>> index 2a10a7052e..34ee230e7d 100644
> >>>>>>>>>>> --- a/include/qemu/iova-tree.h
> >>>>>>>>>>> +++ b/include/qemu/iova-tree.h
> >>>>>>>>>>> @@ -36,6 +36,7 @@ typedef struct DMAMap {
> >>>>>>>>>>>            hwaddr iova;
> >>>>>>>>>>>            hwaddr translated_addr;
> >>>>>>>>>>>            hwaddr size;                /* Inclusive */
> >>>>>>>>>>> +    uint64_t id;
> >>>>>>>>>>>            IOMMUAccessFlags perm;
> >>>>>>>>>>>        } QEMU_PACKED DMAMap;
> >>>>>>>>>>>        typedef gboolean (*iova_tree_iterator)(DMAMap *map);
> >>>>>>>>>>> @@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const DMAMap *map);
> >>>>>>>>>>>         * @map: the mapping to search
> >>>>>>>>>>>         *
> >>>>>>>>>>>         * Search for a mapping in the iova tree that translated_addr overlaps with the
> >>>>>>>>>>> - * mapping range specified.  Only the first found mapping will be
> >>>>>>>>>>> - * returned.
> >>>>>>>>>>> + * mapping range specified and map->id is equal.  Only the first found
> >>>>>>>>>>> + * mapping will be returned.
> >>>>>>>>>>>         *
> >>>>>>>>>>>         * Return: DMAMap pointer if found, or NULL if not found.  Note that
> >>>>>>>>>>>         * the returned DMAMap pointer is maintained internally.  User should
> >>>>>>>>>>> diff --git a/util/iova-tree.c b/util/iova-tree.c
> >>>>>>>>>>> index 536789797e..0863e0a3b8 100644
> >>>>>>>>>>> --- a/util/iova-tree.c
> >>>>>>>>>>> +++ b/util/iova-tree.c
> >>>>>>>>>>> @@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, gpointer value,
> >>>>>>>>>>>
> >>>>>>>>>>>            needle = args->needle;
> >>>>>>>>>>>            if (map->translated_addr + map->size < needle->translated_addr ||
> >>>>>>>>>>> -        needle->translated_addr + needle->size < map->translated_addr) {
> >>>>>>>>>>> +        needle->translated_addr + needle->size < map->translated_addr ||
> >>>>>>>>>>> +        needle->id != map->id) {
> >>>>>>>>>> It looks this iterator can also be invoked by SVQ from
> >>>>>>>>>> vhost_svq_translate_addr() -> iova_tree_find_iova(), where guest GPA
> >>>>>>>>>> space will be searched on without passing in the ID (GPA), and exact
> >>>>>>>>>> match for the same GPA range is not actually needed unlike the mapping
> >>>>>>>>>> removal case. Could we create an API variant, for the SVQ lookup case
> >>>>>>>>>> specifically? Or alternatively, add a special flag, say skip_id_match to
> >>>>>>>>>> DMAMap, and the id match check may look like below:
> >>>>>>>>>>
> >>>>>>>>>> (!needle->skip_id_match && needle->id != map->id)
> >>>>>>>>>>
> >>>>>>>>>> I think vhost_svq_translate_addr() could just call the API variant or
> >>>>>>>>>> pass DMAmap with skip_id_match set to true to svq_iova_tree_find_iova().
> >>>>>>>>>>
> >>>>>>>>> I think you're totally right. But I'd really like to not complicate
> >>>>>>>>> the API of the iova_tree more.
> >>>>>>>>>
> >>>>>>>>> I think we can look for the hwaddr using memory_region_from_host and
> >>>>>>>>> then get the hwaddr. It is another lookup though...
> >>>>>>>> Yeah, that will be another means of doing translation without having to
> >>>>>>>> complicate the API around iova_tree. I wonder how the lookup through
> >>>>>>>> memory_region_from_host() may perform compared to the iova tree one, the
> >>>>>>>> former looks to be an O(N) linear search on a linked list while the
> >>>>>>>> latter would be roughly O(log N) on an AVL tree?
> >>>>>>> Even worse, as the reverse lookup (from QEMU vaddr to SVQ IOVA) is
> >>>>>>> linear too. It is not even ordered.
> >>>>>> Oh Sorry, I misread the code and I should look for g_tree_foreach ()
> >>>>>> instead of g_tree_search_node(). So the former is indeed linear
> >>>>>> iteration, but it looks to be ordered?
> >>>>>>
> >>>>>> https://github.com/GNOME/glib/blob/main/glib/gtree.c#L1115
> >>>>> The GPA / IOVA are ordered but we're looking by QEMU's vaddr.
> >>>>>
> >>>>> If we have these translations:
> >>>>> [0x1000, 0x2000] -> [0x10000, 0x11000]
> >>>>> [0x2000, 0x3000] -> [0x6000, 0x7000]
> >>>>>
> >>>>> We will see them in this order, so we cannot stop the search at the first node.
> >>>> Yeah, reverse lookup is unordered indeed, anyway.
> >>>>
> >>>>>>> But apart from this detail you're right, I have the same concerns with
> >>>>>>> this solution too. If we see a hard performance regression we could go
> >>>>>>> to more complicated solutions, like maintaining a reverse IOVATree in
> >>>>>>> vhost-iova-tree too. First RFCs of SVQ did that actually.
> >>>>>> Agreed, yeap we can use memory_region_from_host for now.  Any reason why
> >>>>>> reverse IOVATree was dropped, lack of users? But now we have one!
> >>>>>>
> >>>>> No, it is just simplicity. We already have an user in the hot patch in
> >>>>> the master branch, vhost_svq_vring_write_descs. But I never profiled
> >>>>> enough to find if it is a bottleneck or not to be honest.
> >>>> Right, without vIOMMU or a lot of virtqueues / mappings, it's hard to
> >>>> profile and see the difference.
> >>>>> I'll send the new series by today, thank you for finding these issues!
> >>>> Thanks! In case you don't have bandwidth to add back reverse IOVA tree,
> >>>> Jonah (cc'ed) may have interest in looking into it.
> >>>>
> >>> Actually, yes. I've tried to solve it using:
> >>> memory_region_get_ram_ptr -> It's hard to get this pointer to work
> >>> without messing a lot with IOVATree.
> >>> memory_region_find -> I'm totally unable to make it return sections
> >>> that make sense
> >>> flatview_for_each_range -> It does not return the same
> >>> MemoryRegionsection as the listener, not sure why.
> >> Ouch, thank you for the summary of attempts that were done earlier.
> >>> The only advance I have is that memory_region_from_host is able to
> >>> tell if the vaddr is from the guest or not.
> >> Hmmm, then it won't be too useful without a direct means to identifying
> >> the exact memory region associated with the iova that is being mapped.
> >> And, this additional indirection seems introduce a tiny bit of more
> >> latency in the reverse lookup routine (should not be a scalability issue
> >> though if it's a linear search)?
> >>
> > I didn't measure, but I guess yes it might. OTOH these structs may be
> > cached because virtqueue_pop just looked for them.
> Oh, right, that's a good point.
> >
> >>> So I'm convinced there must be a way to do it with the memory
> >>> subsystem, but I think the best way to do it ATM is to store a
> >>> parallel tree with GPA-> SVQ IOVA translations. At removal time, if we
> >>> find the entry in this new tree, we can directly remove it by GPA. If
> >>> not, assume it is a host-only address like SVQ vrings, and remove by
> >>> iterating on vaddr as we do now.
> >> Yeah, this could work I think. On the other hand, given that we are now
> >> trying to improve it, I wonder if possible to come up with a fast
> >> version for the SVQ (host-only address) case without having to look up
> >> twice? SVQ callers should be able to tell apart from the guest case
> >> where GPA -> IOVA translation doesn't exist? Or just maintain a parallel
> >> tree with HVA -> IOVA translations for SVQ reverse lookup only? I feel
> >> SVQ mappings may be worth a separate fast lookup path - unlike guest
> >> mappings, the insertion, lookup and removal for SVQ mappings seem
> >> unavoidable during the migration downtime path.
> >>
> > I think the ideal order is the opposite actually. So:
> > 1) Try for the NIC to support _F_VRING_ASID, no translation needed by QEMU
> Right, that's the case for _F_VRING_ASID, which is simple and easy to
> deal with. Though I think this is an edge case across all vendor
> devices, as most likely only those no-chip IOMMU parents may support it.
> It's a luxury for a normal device to steal another VF for this ASID feature...
>
> > 2) Try reverse lookup from HVA to GPA. Since dataplane should fit
> > this, we should test this first
> So instead of a direct lookup from HVA to IOVA, the purpose of the extra
> reverse lookup from HVA to GPA is to verify the validity of the GPA (to
> avoid it being mistakenly picked from an overlapped region)? But this
> would seem to require scanning the entire GPA space to identify possible
> GPA ranges that are potentially overlapped. I wonder if there is a
> possibility to simplify this: could we go through this extra layer of
> GPA-wide scan and validation *only* when an overlap is indeed detected
> during the memory listener's region_add (say, when we try to insert a
> duplicate / overlapped HVA into the HVA -> IOVA tree)? Or, simply put,
> the first match on the reverse lookup would mostly suffice, since we
> know the virtio driver can't use guest memory from these overlapped
> regions?

The first match should be enough, but maybe we need more than one
match. Let me give an example:

The buffer is (vaddr=0x1000, size=0x3000). Now the tree contains two
overlapped entries: (vaddr=0x1000, size=0x2000) and (vaddr=0x1000,
size=0x3000).

Assuming we go through the reverse IOVA tree and we had bad luck, we
stored the small entry plus the big entry. The first search then returns
the small entry, (vaddr=0x1000, size=0x2000). Calling code must detect
that, and then look for vaddr = 0x1000 + 0x2000. That gives us the next
entry.

You can see that virtqueue_map_desc translates this way if
dma_memory_map returns a translation shorter than the length of the
buffer, for example.
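
A sketch of that chunking loop on top of the existing
vhost_iova_tree_find_iova(); the descriptor emission is elided and the
function itself is illustrative:

    /* Sketch: translate [vaddr, vaddr + size) across several entries. */
    static bool translate_range_sketch(const VhostIOVATree *tree,
                                       hwaddr vaddr, hwaddr size)
    {
        while (size) {
            const DMAMap needle = { .translated_addr = vaddr };
            const DMAMap *map = vhost_iova_tree_find_iova(tree, &needle);

            if (!map) {
                return false;   /* hole in the tree: buggy driver? */
            }
            /* Bytes this entry covers from vaddr on (.size is inclusive). */
            hwaddr chunk = map->translated_addr + map->size - vaddr + 1;
            chunk = MIN(chunk, size);
            /* ... emit a descriptor for this chunk's SVQ IOVA ... */
            vaddr += chunk;
            size -= chunk;
        }
        return true;
    }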

> You may say this assumption is too bold, but do we have other means to
> guarantee the first match will always hit under SVQ lookup? Given that
> we didn't receive a single issue report until we moved the memory
> listener registration up front to device initialization, I guess there
> should be some point, or certain conditions, under which the
> non-overlapped 1:1 translation and lookup can be satisfied. Don't you
> agree?
>

Being able to build the shorter version is desirable, yes. Maybe it can be
done in this series, but I find it hard to solve some situations. For
example, is it possible to have three overlapping regions (A, B, C)
where regions A and B do not overlap but C overlaps both of them?

That's why I think it is better to delay that to a future series, but
we can do it with one shot if it is simple enough for sure.

Thanks!

> Thanks,
> -Siwei
> > 3) Look in SVQ host-only entries (SVQ, current shadow CVQ). It is the
> > control VQ, speed is not so important.
> >
> > Overlapping regions may return the wrong SVQ IOVA though. We should
> > take extra care to make sure these are correctly handled. I mean,
> > there are valid translations in the tree unless the driver is buggy,
> > just may need to span many translations.
> >
> >>>    It is guaranteed the guest does not
> >>> translate to that vaddr and that that vaddr is unique in the tree
> >>> anyway.
> >>>
> >>> Does it sound reasonable? Jonah, would you be interested in moving this forward?
> >> My thought would be that the reverse IOVA tree stuff can be added as a
> >> follow-up optimization right after for extended scalability, but for now
> >> as the interim, we may still need some form of simple fix, so as to
> >> quickly unblock the other dependent work built on top of this one and
> >> the early pinning series [1]. With it said, I'm completely fine if
> >> performing the reverse lookup through linear tree walk e.g.
> >> g_tree_foreach(), that should suffice small VM configs with just a
> >> couple of queues and limited number of memory regions. Going forward, to
> >> address the scalability bottleneck, Jonah could just replace the
> >> corresponding API call with the one built on top of reverse IOVA tree (I
> >> presume the use of these iova tree APIs is kind of internal that only
> >> limits to SVQ and vhost-vdpa subsystems) once he gets there, and then
> >> eliminate the other API variants that will no longer be in use. What do
> >> you think about this idea / plan?
> >>
> > Yeah it makes sense to me. Hopefully we can even get rid of the id member.
> >
> >> Thanks,
> >> -Siwei
> >>
> >> [1] https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html
> >>
> >>> Thanks!
> >>>
> >>>> -Siwei
> >>>>
> >>>>
> >>>>>> Thanks,
> >>>>>> -Siwei
> >>>>>>> Thanks!
> >>>>>>>
> >>>>>>>> Of course,
> >>>>>>>> memory_region_from_host() won't search out of the guest memory space for
> >>>>>>>> sure. As this could be on the hot data path I have a little bit
> >>>>>>>> hesitance over the potential cost or performance regression this change
> >>>>>>>> could bring in, but maybe I'm overthinking it too much...
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> -Siwei
> >>>>>>>>
> >>>>>>>>>> Thanks,
> >>>>>>>>>> -Siwei
> >>>>>>>>>>>                return false;
> >>>>>>>>>>>            }
> >>>>>>>>>>>
>



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC 0/2] Identify aliased maps in vdpa SVQ iova_tree
  2024-04-12  7:56   ` Eugenio Perez Martin
@ 2024-05-07  7:29     ` Jason Wang
  2024-05-07 10:56       ` Eugenio Perez Martin
  0 siblings, 1 reply; 37+ messages in thread
From: Jason Wang @ 2024-05-07  7:29 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: qemu-devel, Si-Wei Liu, Michael S. Tsirkin, Lei Yang, Peter Xu,
	Jonah Palmer, Dragos Tatulea

On Fri, Apr 12, 2024 at 3:56 PM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> On Fri, Apr 12, 2024 at 8:47 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> > >
> > > The guest may have overlapped memory regions, where different GPA leads
> > > to the same HVA.  This causes a problem when overlapped regions
> > > (different GPA but same translated HVA) exists in the tree, as looking
> > > them by HVA will return them twice.
> >
> > I think I don't understand if there's any side effect for shadow virtqueue?
> >
>
> My bad, I totally forgot to put a reference to where this comes from.
>
> > Si-Wei found that during initialization this sequence of maps /
> unmaps happens [1]:
>
> HVA                                  GPA                           IOVA
> ----------------------------------------------------------------------------------------------
> Map
> [0x7f7903e00000, 0x7f7983e00000)     [0x0, 0x80000000)             [0x1000, 0x80000000)
> [0x7f7983e00000, 0x7f9903e00000)     [0x100000000, 0x2080000000)   [0x80001000, 0x2000001000)
> [0x7f7903ea0000, 0x7f7903ec0000)     [0xfeda0000, 0xfedc0000)      [0x2000001000, 0x2000021000)
>
> Unmap
> [0x7f7903ea0000, 0x7f7903ec0000)     [0xfeda0000, 0xfedc0000)      [0x1000, 0x20000) ???
>
> The third HVA range is contained in the first one, but exposed under a
> different GPA (aliased). This is not "flattened" by QEMU, as GPA does
> not overlap, only HVA.
>
> At the third chunk unmap, the current algorithm finds the first chunk,
> not the second one. This series is the way to tell the difference at
> unmap time.
>
> [1] https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html
>
> Thanks!
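
To make the ambiguity concrete, here is a sketch of the needle that the
unmap builds from the HVA alone; with the maps above, the IOVA-ordered
search finds the first (big) chunk even though the third one was meant:

    static void demo_aliased_unmap(const VhostIOVATree *iova_tree)
    {
        /* Needle for the third range, built only from its HVA. */
        const DMAMap needle = {
            .translated_addr = 0x7f7903ea0000,
            .size = 0x20000 - 1,    /* .size is inclusive */
        };
        /*
         * [0x7f7903e00000, 0x7f7983e00000) also contains this HVA range
         * and sorts first by IOVA, so it is what gets found: the wrong
         * chunk would be unmapped.  Matching on DMAMap.id (the GPA)
         * breaks the tie.
         */
        const DMAMap *map = vhost_iova_tree_find_iova(iova_tree, &needle);
        (void)map;
    }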

Ok, I was wondering if we need to store GPA(GIOVA) to HVA mappings in
the iova tree to solve this issue completely. Then there won't be
aliasing issues.

Thanks

>
> > Thanks
> >
> > >
> > > To solve this, track GPA in the DMA entry that acs as unique identifiers
> > > to the maps.  When the map needs to be removed, iova tree is able to
> > > find the right one.
> > >
> > > Users that does not go to this extra layer of indirection can use the
> > > iova tree as usual, with id = 0.
> > >
> > > This was found by Si-Wei Liu <si-wei.liu@oracle.com>, but I'm having a hard
> > > time to reproduce the issue.  This has been tested only without overlapping
> > > maps.  If it works with overlapping maps, it will be intergrated in the main
> > > series.
> > >
> > > Comments are welcome.  Thanks!
> > >
> > > Eugenio Pérez (2):
> > >   iova_tree: add an id member to DMAMap
> > >   vdpa: identify aliased maps in iova_tree
> > >
> > >  hw/virtio/vhost-vdpa.c   | 2 ++
> > >  include/qemu/iova-tree.h | 5 +++--
> > >  util/iova-tree.c         | 3 ++-
> > >  3 files changed, 7 insertions(+), 3 deletions(-)
> > >
> > > --
> > > 2.44.0
> > >
> >
>



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC 1/2] iova_tree: add an id member to DMAMap
  2024-05-02  6:18                         ` Eugenio Perez Martin
@ 2024-05-07  9:12                           ` Si-Wei Liu
  0 siblings, 0 replies; 37+ messages in thread
From: Si-Wei Liu @ 2024-05-07  9:12 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Jonah Palmer, qemu-devel, Michael S. Tsirkin, Lei Yang, Peter Xu,
	Dragos Tatulea, Jason Wang, David Hildenbrand



On 5/1/2024 11:18 PM, Eugenio Perez Martin wrote:
> On Thu, May 2, 2024 at 12:09 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>
>>
>> On 4/30/2024 11:11 AM, Eugenio Perez Martin wrote:
>>> On Mon, Apr 29, 2024 at 1:19 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>>>>
>>>> On 4/29/24 4:14 AM, Eugenio Perez Martin wrote:
>>>>> On Thu, Apr 25, 2024 at 7:44 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>>>
>>>>>> On 4/24/2024 12:33 AM, Eugenio Perez Martin wrote:
>>>>>>> On Wed, Apr 24, 2024 at 12:21 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>>>>> On 4/22/2024 1:49 AM, Eugenio Perez Martin wrote:
>>>>>>>>> On Sat, Apr 20, 2024 at 1:50 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>>>>>>> On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote:
>>>>>>>>>>> On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>>>>>>>>> On 4/10/2024 3:03 AM, Eugenio Pérez wrote:
>>>>>>>>>>>>> IOVA tree is also used to track the mappings of virtio-net shadow
>>>>>>>>>>>>> virtqueue.  This mappings may not match with the GPA->HVA ones.
>>>>>>>>>>>>>
>>>>>>>>>>>>> This causes a problem when overlapped regions (different GPA but same
>>>>>>>>>>>>> translated HVA) exists in the tree, as looking them by HVA will return
>>>>>>>>>>>>> them twice.  To solve this, create an id member so we can assign unique
>>>>>>>>>>>>> identifiers (GPA) to the maps.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>>>>>>>>>>>> ---
>>>>>>>>>>>>>         include/qemu/iova-tree.h | 5 +++--
>>>>>>>>>>>>>         util/iova-tree.c         | 3 ++-
>>>>>>>>>>>>>         2 files changed, 5 insertions(+), 3 deletions(-)
>>>>>>>>>>>>>
>>>>>>>>>>>>> diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
>>>>>>>>>>>>> index 2a10a7052e..34ee230e7d 100644
>>>>>>>>>>>>> --- a/include/qemu/iova-tree.h
>>>>>>>>>>>>> +++ b/include/qemu/iova-tree.h
>>>>>>>>>>>>> @@ -36,6 +36,7 @@ typedef struct DMAMap {
>>>>>>>>>>>>>             hwaddr iova;
>>>>>>>>>>>>>             hwaddr translated_addr;
>>>>>>>>>>>>>             hwaddr size;                /* Inclusive */
>>>>>>>>>>>>> +    uint64_t id;
>>>>>>>>>>>>>             IOMMUAccessFlags perm;
>>>>>>>>>>>>>         } QEMU_PACKED DMAMap;
>>>>>>>>>>>>>         typedef gboolean (*iova_tree_iterator)(DMAMap *map);
>>>>>>>>>>>>> @@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const DMAMap *map);
>>>>>>>>>>>>>          * @map: the mapping to search
>>>>>>>>>>>>>          *
>>>>>>>>>>>>>          * Search for a mapping in the iova tree that translated_addr overlaps with the
>>>>>>>>>>>>> - * mapping range specified.  Only the first found mapping will be
>>>>>>>>>>>>> - * returned.
>>>>>>>>>>>>> + * mapping range specified and map->id is equal.  Only the first found
>>>>>>>>>>>>> + * mapping will be returned.
>>>>>>>>>>>>>          *
>>>>>>>>>>>>>          * Return: DMAMap pointer if found, or NULL if not found.  Note that
>>>>>>>>>>>>>          * the returned DMAMap pointer is maintained internally.  User should
>>>>>>>>>>>>> diff --git a/util/iova-tree.c b/util/iova-tree.c
>>>>>>>>>>>>> index 536789797e..0863e0a3b8 100644
>>>>>>>>>>>>> --- a/util/iova-tree.c
>>>>>>>>>>>>> +++ b/util/iova-tree.c
>>>>>>>>>>>>> @@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, gpointer value,
>>>>>>>>>>>>>
>>>>>>>>>>>>>             needle = args->needle;
>>>>>>>>>>>>>             if (map->translated_addr + map->size < needle->translated_addr ||
>>>>>>>>>>>>> -        needle->translated_addr + needle->size < map->translated_addr) {
>>>>>>>>>>>>> +        needle->translated_addr + needle->size < map->translated_addr ||
>>>>>>>>>>>>> +        needle->id != map->id) {
>>>>>>>>>>>> It looks this iterator can also be invoked by SVQ from
>>>>>>>>>>>> vhost_svq_translate_addr() -> iova_tree_find_iova(), where guest GPA
>>>>>>>>>>>> space will be searched on without passing in the ID (GPA), and exact
>>>>>>>>>>>> match for the same GPA range is not actually needed unlike the mapping
>>>>>>>>>>>> removal case. Could we create an API variant, for the SVQ lookup case
>>>>>>>>>>>> specifically? Or alternatively, add a special flag, say skip_id_match to
>>>>>>>>>>>> DMAMap, and the id match check may look like below:
>>>>>>>>>>>>
>>>>>>>>>>>> (!needle->skip_id_match && needle->id != map->id)
>>>>>>>>>>>>
>>>>>>>>>>>> I think vhost_svq_translate_addr() could just call the API variant or
>>>>>>>>>>>> pass DMAmap with skip_id_match set to true to svq_iova_tree_find_iova().
>>>>>>>>>>>>
>>>>>>>>>>> I think you're totally right. But I'd really like to not complicate
>>>>>>>>>>> the API of the iova_tree more.
>>>>>>>>>>>
>>>>>>>>>>> I think we can look for the hwaddr using memory_region_from_host and
>>>>>>>>>>> then get the hwaddr. It is another lookup though...
>>>>>>>>>> Yeah, that will be another means of doing translation without having to
>>>>>>>>>> complicate the API around iova_tree. I wonder how the lookup through
>>>>>>>>>> memory_region_from_host() may perform compared to the iova tree one, the
>>>>>>>>>> former looks to be an O(N) linear search on a linked list while the
>>>>>>>>>> latter would be roughly O(log N) on an AVL tree?
>>>>>>>>> Even worse, as the reverse lookup (from QEMU vaddr to SVQ IOVA) is
>>>>>>>>> linear too. It is not even ordered.
>>>>>>>> Oh Sorry, I misread the code and I should look for g_tree_foreach ()
>>>>>>>> instead of g_tree_search_node(). So the former is indeed linear
>>>>>>>> iteration, but it looks to be ordered?
>>>>>>>>
>>>>>>>> https://github.com/GNOME/glib/blob/main/glib/gtree.c#L1115
>>>>>>> The GPA / IOVA are ordered but we're looking by QEMU's vaddr.
>>>>>>>
>>>>>>> If we have these translations:
>>>>>>> [0x1000, 0x2000] -> [0x10000, 0x11000]
>>>>>>> [0x2000, 0x3000] -> [0x6000, 0x7000]
>>>>>>>
>>>>>>> We will see them in this order, so we cannot stop the search at the first node.
>>>>>> Yeah, reverse lookup is unordered indeed, anyway.
>>>>>>
>>>>>>>>> But apart from this detail you're right, I have the same concerns with
>>>>>>>>> this solution too. If we see a hard performance regression we could go
>>>>>>>>> to more complicated solutions, like maintaining a reverse IOVATree in
>>>>>>>>> vhost-iova-tree too. First RFCs of SVQ did that actually.
>>>>>>>> Agreed, yeap we can use memory_region_from_host for now.  Any reason why
>>>>>>>> reverse IOVATree was dropped, lack of users? But now we have one!
>>>>>>>>
>>>>>>> No, it is just simplicity. We already have an user in the hot patch in
>>>>>>> the master branch, vhost_svq_vring_write_descs. But I never profiled
>>>>>>> enough to find if it is a bottleneck or not to be honest.
>>>>>> Right, without vIOMMU or a lot of virtqueues / mappings, it's hard to
>>>>>> profile and see the difference.
>>>>>>> I'll send the new series by today, thank you for finding these issues!
>>>>>> Thanks! In case you don't have bandwidth to add back reverse IOVA tree,
>>>>>> Jonah (cc'ed) may have interest in looking into it.
>>>>>>
>>>>> Actually, yes. I've tried to solve it using:
>>>>> memory_region_get_ram_ptr -> It's hard to get this pointer to work
>>>>> without messing a lot with IOVATree.
>>>>> memory_region_find -> I'm totally unable to make it return sections
>>>>> that make sense
>>>>> flatview_for_each_range -> It does not return the same
>>>>> MemoryRegionsection as the listener, not sure why.
>>>>>
>>>>> The only advance I have is that memory_region_from_host is able to
>>>>> tell if the vaddr is from the guest or not.
>>>>>
>>>>> So I'm convinced there must be a way to do it with the memory
>>>>> subsystem, but I think the best way to do it ATM is to store a
>>>>> parallel tree with GPA-> SVQ IOVA translations. At removal time, if we
>>>>> find the entry in this new tree, we can directly remove it by GPA. If
>>>>> not, assume it is a host-only address like SVQ vrings, and remove by
>>>>> iterating on vaddr as we do now. It is guaranteed the guest does not
>>>>> translate to that vaddr and that that vaddr is unique in the tree
>>>>> anyway.
>>>>>
>>>>> Does it sound reasonable? Jonah, would you be interested in moving this forward?
>>>>>
>>>>> Thanks!
>>>>>
>>>> Sure, I'd be more than happy to work on this stuff! I can probably get
>>>> started on this either today or tomorrow.
>>>>
>>>> Si-Wei mentioned something about these "reverse IOVATree" patches that
>>>> were dropped;
>>> The patches implementing the reverse IOVA tree were never created /
>>> posted, just in case you try to look for them.
>>>
>>>
>>>> is this relevant to what you're asking here? Is it
>>>> something I should base my work off of?
>>>>
>>> So these patches work ok for adding and removing maps. We assign ids,
>>> which is the gpa of the memory region that the listener receives. The
>>> bad news is that SVQ also needs this id to look for the right
>>> translation at vhost_svq_translate_addr, so this series is not
>>> complete.
>> I have a fundamental question to ask here. Are we sure SVQ really needs
>> this id (GPA) to identify the right translation? Or we're just
>> concerning much about the aliased map where there could be one single
>> HVA mapped to multiple IOVAs / GPAs, which (the overlapped) is almost
>> transient mapping that usually goes away very soon after guest memory
>> layout is stabilized?
> Are we sure all of the overlaps go away after the memory layout is
> stabilized in all conditions?
In all of the scenarios in question, or that I've tried so far, it's
actually the case; but on the other hand I do understand this is rather a
bold assumption that might not be future proof, or even safe for now. Put
it another way: if there were a case that makes an overlap occur after
the memory layout is stabilized, then we probably should have gotten such
a bug report way earlier with the current upstream QEMU, without this
series' change of moving page pinning to device initialization, right?
Otherwise it would also be considered a bug in the upstream SVQ code
base, but practically I have no reproducer that triggers the problematic
code with overlapped / aliased maps.

>   I think it is worth not making two
> different ways to ask the tree depending on what part of QEMU we are in.
Yes, ideally we shouldn't do that. I just wondered if there's some simple
way to tackle the problem and make it work for certain limited scenarios
(e.g. virtio-net doesn't use overlapped regions, or we don't enable SVQ
until memory is stabilized). Given that we haven't heard of an issue
before despite the overlap case not being handled, I'm just saying maybe
there's a possibility to keep that assumption (overlaps are transient
only) around for a short while, until a full-blown fix can land?

Regards,
-Siwei
>
>> For what I can tell, the caller in SVQ datapath
>> code (vhost_svq_vring_write_descs) just calls into
>> vhost_iova_tree_find_iova to look for IOVA translation rather than
>> identify a specific section on the memory region, the latter of which
>> would need the id (GPA) to perform an exact match. The removal case
>> would definitely need perfect match on GPA with the additional id, but I
>> don't find it necessary for the vhost_svq_vring_write_descs code to pass
>> in the id / GPA? Do I miss something?
>>
> Expanding on the other thread, as there are more concrete points
> there. Please let me know if I missed something.
>
>> Thanks,
>> -Siwei
>>
>>> You can find the
>>> vhost_iova_tree_find_iova()->iova_tree_find_iova() call there.
>>>
>>> The easiest solution is the reverse IOVA tree of HVA -> SVQ IOVA. It
>>> is also the less elegant and (potentially) the less performant, as it
>>> includes duplicating information that QEMU already has, and a
>>> potentially linear search.
>>>
>>> David Hildenbrand (CCed) proposed to try iterating through RAMBlocks.
>>> I guess qemu_ram_block_from_host() could return a block where
>>> block->offset is the id of the map?
>>>
>>> It would be great to try this approach. If you don't have the bandwith
>>> for this, going directly for the reverse iova tree is also a valid
>>> solution.
>>>
>>> Thanks!
>>>
>>>> If there's any other relevant information about this issue that you
>>>> think I should know, let me know. I'll start digging into this ASAP and
>>>> will reach out if I need any guidance. :)
>>>>
>>>> Jonah
>>>>
>>>>>> -Siwei
>>>>>>
>>>>>>
>>>>>>>> Thanks,
>>>>>>>> -Siwei
>>>>>>>>> Thanks!
>>>>>>>>>
>>>>>>>>>> Of course,
>>>>>>>>>> memory_region_from_host() won't search out of the guest memory space for
>>>>>>>>>> sure. As this could be on the hot data path I have a little bit
>>>>>>>>>> hesitance over the potential cost or performance regression this change
>>>>>>>>>> could bring in, but maybe I'm overthinking it too much...
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> -Siwei
>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> -Siwei
>>>>>>>>>>>>>                 return false;
>>>>>>>>>>>>>             }
>>>>>>>>>>>>>



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC 0/2] Identify aliased maps in vdpa SVQ iova_tree
  2024-05-07  7:29     ` Jason Wang
@ 2024-05-07 10:56       ` Eugenio Perez Martin
  2024-05-08  2:29         ` Jason Wang
  0 siblings, 1 reply; 37+ messages in thread
From: Eugenio Perez Martin @ 2024-05-07 10:56 UTC (permalink / raw)
  To: Jason Wang
  Cc: qemu-devel, Si-Wei Liu, Michael S. Tsirkin, Lei Yang, Peter Xu,
	Jonah Palmer, Dragos Tatulea

On Tue, May 7, 2024 at 9:29 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Fri, Apr 12, 2024 at 3:56 PM Eugenio Perez Martin
> <eperezma@redhat.com> wrote:
> >
> > On Fri, Apr 12, 2024 at 8:47 AM Jason Wang <jasowang@redhat.com> wrote:
> > >
> > > On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> > > >
> > > > The guest may have overlapped memory regions, where different GPA leads
> > > > to the same HVA.  This causes a problem when overlapped regions
> > > > (different GPA but same translated HVA) exists in the tree, as looking
> > > > them by HVA will return them twice.
> > >
> > > I think I don't understand if there's any side effect for shadow virtqueue?
> > >
> >
> > My bad, I totally forgot to put a reference to where this comes from.
> >
> > Si-Wei found that during initialization this sequence of maps /
> > unmaps happens [1]:
> >
> > HVA                                  GPA                           IOVA
> > ----------------------------------------------------------------------------------------------
> > Map
> > [0x7f7903e00000, 0x7f7983e00000)     [0x0, 0x80000000)             [0x1000, 0x80000000)
> > [0x7f7983e00000, 0x7f9903e00000)     [0x100000000, 0x2080000000)   [0x80001000, 0x2000001000)
> > [0x7f7903ea0000, 0x7f7903ec0000)     [0xfeda0000, 0xfedc0000)      [0x2000001000, 0x2000021000)
> >
> > Unmap
> > [0x7f7903ea0000, 0x7f7903ec0000)     [0xfeda0000, 0xfedc0000)      [0x1000, 0x20000) ???
> >
> > The third HVA range is contained in the first one, but exposed under a
> > different GPA (aliased). This is not "flattened" by QEMU, as GPA does
> > not overlap, only HVA.
> >
> > At the third chunk unmap, the current algorithm finds the first chunk,
> > not the second one. This series is the way to tell the difference at
> > unmap time.
> >
> > [1] https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html
> >
> > Thanks!
>
> Ok, I was wondering if we need to store GPA(GIOVA) to HVA mappings in
> the iova tree to solve this issue completely. Then there won't be
> aliasing issues.
>

I'm OK with exploring that route, but this has another problem: both SVQ
vrings and CVQ buffers also need to be addressable by VhostIOVATree, and
they do not have a GPA.

At this moment vhost_svq_translate_addr is able to handle this
transparently, as we translate vaddr to SVQ IOVA. How can we store these
new entries? Maybe a (hwaddr)-1 GPA to signal that an entry has no GPA,
and then a list to go through the other entries (SVQ vaddr and CVQ
buffers).
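
A sketch of how that could be stored; IOVA_TREE_NO_GPA, gpa_tree and
host_only are all hypothetical names for the scheme above:

    #define IOVA_TREE_NO_GPA ((hwaddr)-1)

    /*
     * Sketch: insert a map, keying guest memory by GPA in a second tree
     * and keeping host-only entries (SVQ vrings, CVQ buffers) on a list.
     */
    static void vhost_iova_tree_insert_sketch(VhostIOVATree *tree,
                                              DMAMap map, hwaddr gpa)
    {
        if (gpa == IOVA_TREE_NO_GPA) {
            tree->host_only = g_list_prepend(tree->host_only,
                                             g_memdup2(&map, sizeof(map)));
        } else {
            const DMAMap gpa_map = {
                .iova = gpa,                    /* key by GPA */
                .translated_addr = map.iova,    /* remember the SVQ IOVA */
                .size = map.size,
            };
            iova_tree_insert(tree->gpa_tree, &gpa_map);
        }
        iova_tree_insert(tree->iova_taddr_map, &map);
    }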

Thanks!

> Thanks
>
> >
> > > Thanks
> > >
> > > >
> > > > To solve this, track GPA in the DMA entry that acs as unique identifiers
> > > > to the maps.  When the map needs to be removed, iova tree is able to
> > > > find the right one.
> > > >
> > > > Users that does not go to this extra layer of indirection can use the
> > > > iova tree as usual, with id = 0.
> > > >
> > > > This was found by Si-Wei Liu <si-wei.liu@oracle.com>, but I'm having a hard
> > > > time to reproduce the issue.  This has been tested only without overlapping
> > > > maps.  If it works with overlapping maps, it will be intergrated in the main
> > > > series.
> > > >
> > > > Comments are welcome.  Thanks!
> > > >
> > > > Eugenio Pérez (2):
> > > >   iova_tree: add an id member to DMAMap
> > > >   vdpa: identify aliased maps in iova_tree
> > > >
> > > >  hw/virtio/vhost-vdpa.c   | 2 ++
> > > >  include/qemu/iova-tree.h | 5 +++--
> > > >  util/iova-tree.c         | 3 ++-
> > > >  3 files changed, 7 insertions(+), 3 deletions(-)
> > > >
> > > > --
> > > > 2.44.0
> > > >
> > >
> >
>



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC 1/2] iova_tree: add an id member to DMAMap
  2024-05-02  6:44                         ` Eugenio Perez Martin
@ 2024-05-08  0:52                           ` Si-Wei Liu
  2024-05-08 15:25                             ` Eugenio Perez Martin
  0 siblings, 1 reply; 37+ messages in thread
From: Si-Wei Liu @ 2024-05-08  0:52 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Jonah Palmer, qemu-devel, Michael S. Tsirkin, Lei Yang, Peter Xu,
	Dragos Tatulea, Jason Wang



On 5/1/2024 11:44 PM, Eugenio Perez Martin wrote:
> On Thu, May 2, 2024 at 1:16 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>
>>
>> On 4/30/2024 10:19 AM, Eugenio Perez Martin wrote:
>>> On Tue, Apr 30, 2024 at 7:55 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>
>>>> On 4/29/2024 1:14 AM, Eugenio Perez Martin wrote:
>>>>> On Thu, Apr 25, 2024 at 7:44 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>>> On 4/24/2024 12:33 AM, Eugenio Perez Martin wrote:
>>>>>>> On Wed, Apr 24, 2024 at 12:21 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>>>>> On 4/22/2024 1:49 AM, Eugenio Perez Martin wrote:
>>>>>>>>> On Sat, Apr 20, 2024 at 1:50 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>>>>>>> On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote:
>>>>>>>>>>> On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>>>>>>>>>> On 4/10/2024 3:03 AM, Eugenio Pérez wrote:
>>>>>>>>>>>>> IOVA tree is also used to track the mappings of virtio-net shadow
>>>>>>>>>>>>> virtqueue.  These mappings may not match the GPA->HVA ones.
>>>>>>>>>>>>>
>>>>>>>>>>>>> This causes a problem when overlapped regions (different GPA but same
>>>>>>>>>>>>> translated HVA) exist in the tree, as looking them up by HVA will return
>>>>>>>>>>>>> them twice.  To solve this, create an id member so we can assign unique
>>>>>>>>>>>>> identifiers (GPA) to the maps.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>>>>>>>>>>>> ---
>>>>>>>>>>>>>         include/qemu/iova-tree.h | 5 +++--
>>>>>>>>>>>>>         util/iova-tree.c         | 3 ++-
>>>>>>>>>>>>>         2 files changed, 5 insertions(+), 3 deletions(-)
>>>>>>>>>>>>>
>>>>>>>>>>>>> diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
>>>>>>>>>>>>> index 2a10a7052e..34ee230e7d 100644
>>>>>>>>>>>>> --- a/include/qemu/iova-tree.h
>>>>>>>>>>>>> +++ b/include/qemu/iova-tree.h
>>>>>>>>>>>>> @@ -36,6 +36,7 @@ typedef struct DMAMap {
>>>>>>>>>>>>>             hwaddr iova;
>>>>>>>>>>>>>             hwaddr translated_addr;
>>>>>>>>>>>>>             hwaddr size;                /* Inclusive */
>>>>>>>>>>>>> +    uint64_t id;
>>>>>>>>>>>>>             IOMMUAccessFlags perm;
>>>>>>>>>>>>>         } QEMU_PACKED DMAMap;
>>>>>>>>>>>>>         typedef gboolean (*iova_tree_iterator)(DMAMap *map);
>>>>>>>>>>>>> @@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const DMAMap *map);
>>>>>>>>>>>>>          * @map: the mapping to search
>>>>>>>>>>>>>          *
>>>>>>>>>>>>>          * Search for a mapping in the iova tree that translated_addr overlaps with the
>>>>>>>>>>>>> - * mapping range specified.  Only the first found mapping will be
>>>>>>>>>>>>> - * returned.
>>>>>>>>>>>>> + * mapping range specified and map->id is equal.  Only the first found
>>>>>>>>>>>>> + * mapping will be returned.
>>>>>>>>>>>>>          *
>>>>>>>>>>>>>          * Return: DMAMap pointer if found, or NULL if not found.  Note that
>>>>>>>>>>>>>          * the returned DMAMap pointer is maintained internally.  User should
>>>>>>>>>>>>> diff --git a/util/iova-tree.c b/util/iova-tree.c
>>>>>>>>>>>>> index 536789797e..0863e0a3b8 100644
>>>>>>>>>>>>> --- a/util/iova-tree.c
>>>>>>>>>>>>> +++ b/util/iova-tree.c
>>>>>>>>>>>>> @@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, gpointer value,
>>>>>>>>>>>>>
>>>>>>>>>>>>>             needle = args->needle;
>>>>>>>>>>>>>             if (map->translated_addr + map->size < needle->translated_addr ||
>>>>>>>>>>>>> -        needle->translated_addr + needle->size < map->translated_addr) {
>>>>>>>>>>>>> +        needle->translated_addr + needle->size < map->translated_addr ||
>>>>>>>>>>>>> +        needle->id != map->id) {
>>>>>>>>>>>> It looks like this iterator can also be invoked by SVQ from
>>>>>>>>>>>> vhost_svq_translate_addr() -> iova_tree_find_iova(), where guest GPA
>>>>>>>>>>>> space will be searched on without passing in the ID (GPA), and exact
>>>>>>>>>>>> match for the same GPA range is not actually needed unlike the mapping
>>>>>>>>>>>> removal case. Could we create an API variant, for the SVQ lookup case
>>>>>>>>>>>> specifically? Or alternatively, add a special flag, say skip_id_match to
>>>>>>>>>>>> DMAMap, and the id match check may look like below:
>>>>>>>>>>>>
>>>>>>>>>>>> (!needle->skip_id_match && needle->id != map->id)
>>>>>>>>>>>>
>>>>>>>>>>>> I think vhost_svq_translate_addr() could just call the API variant or
>>>>>>>>>>>> pass DMAmap with skip_id_match set to true to svq_iova_tree_find_iova().
>>>>>>>>>>>>
>>>>>>>>>>> I think you're totally right. But I'd really like to not complicate
>>>>>>>>>>> the API of the iova_tree more.
>>>>>>>>>>>
>>>>>>>>>>> I think we can look for the hwaddr using memory_region_from_host and
>>>>>>>>>>> then get the hwaddr. It is another lookup though...
>>>>>>>>>> Yeah, that will be another means of doing translation without having to
>>>>>>>>>> complicate the API around iova_tree. I wonder how the lookup through
>>>>>>>>>> memory_region_from_host() may perform compared to the iova tree one, the
>>>>>>>>>> former looks to be an O(N) linear search on a linked list while the
>>>>>>>>>> latter would be roughly O(log N) on an AVL tree?
>>>>>>>>> Even worse, as the reverse lookup (from QEMU vaddr to SVQ IOVA) is
>>>>>>>>> linear too. It is not even ordered.
>>>>>>>> Oh Sorry, I misread the code and I should look for g_tree_foreach ()
>>>>>>>> instead of g_tree_search_node(). So the former is indeed linear
>>>>>>>> iteration, but it looks to be ordered?
>>>>>>>>
>>>>>>>> https://github.com/GNOME/glib/blob/main/glib/gtree.c#L1115
>>>>>>> The GPA / IOVA are ordered but we're looking by QEMU's vaddr.
>>>>>>>
>>>>>>> If we have these translations:
>>>>>>> [0x1000, 0x2000] -> [0x10000, 0x11000]
>>>>>>> [0x2000, 0x3000] -> [0x6000, 0x7000]
>>>>>>>
>>>>>>> We will see them in this order, so we cannot stop the search at the first node.
>>>>>> Yeah, reverse lookup is unordered indeed, anyway.
>>>>>>
>>>>>>>>> But apart from this detail you're right, I have the same concerns with
>>>>>>>>> this solution too. If we see a hard performance regression we could go
>>>>>>>>> to more complicated solutions, like maintaining a reverse IOVATree in
>>>>>>>>> vhost-iova-tree too. First RFCs of SVQ did that actually.
>>>>>>>> Agreed, yeap we can use memory_region_from_host for now.  Any reason why
>>>>>>>> reverse IOVATree was dropped, lack of users? But now we have one!
>>>>>>>>
>>>>>>> No, it is just simplicity. We already have a user in the hot patch in
>>>>>>> the master branch, vhost_svq_vring_write_descs. But I never profiled
>>>>>>> enough to find if it is a bottleneck or not to be honest.
>>>>>> Right, without vIOMMU or a lot of virtqueues / mappings, it's hard to
>>>>>> profile and see the difference.
>>>>>>> I'll send the new series by today, thank you for finding these issues!
>>>>>> Thanks! In case you don't have bandwidth to add back reverse IOVA tree,
>>>>>> Jonah (cc'ed) may have interest in looking into it.
>>>>>>
>>>>> Actually, yes. I've tried to solve it using:
>>>>> memory_region_get_ram_ptr -> It's hard to get this pointer to work
>>>>> without messing a lot with IOVATree.
>>>>> memory_region_find -> I'm totally unable to make it return sections
>>>>> that make sense
>>>>> flatview_for_each_range -> It does not return the same
>>>>> MemoryRegionSection as the listener, not sure why.
>>>> Ouch, thank you for the summary of attempts that were done earlier.
>>>>> The only advance I have is that memory_region_from_host is able to
>>>>> tell if the vaddr is from the guest or not.
>>>> Hmmm, then it won't be too useful without a direct means of identifying
>>>> the exact memory region associated with the iova that is being mapped.
>>>> And, this additional indirection seems to introduce a tiny bit more
>>>> latency in the reverse lookup routine (should not be a scalability issue
>>>> though if it's a linear search)?
>>>>
>>> I didn't measure, but I guess yes it might. OTOH these structs may be
>>> cached because virtqueue_pop just looked for them.
>> Oh, right, that's a good point.
>>>>> So I'm convinced there must be a way to do it with the memory
>>>>> subsystem, but I think the best way to do it ATM is to store a
>>>>> parallel tree with GPA-> SVQ IOVA translations. At removal time, if we
>>>>> find the entry in this new tree, we can directly remove it by GPA. If
>>>>> not, assume it is a host-only address like SVQ vrings, and remove by
>>>>> iterating on vaddr as we do now.
>>>> Yeah, this could work I think. On the other hand, given that we are now
>>>> trying to improve it, I wonder if it's possible to come up with a fast
>>>> version for the SVQ (host-only address) case without having to look up
>>>> twice? SVQ callers should be able to tell apart from the guest case
>>>> where GPA -> IOVA translation doesn't exist? Or just maintain a parallel
>>>> tree with HVA -> IOVA translations for SVQ reverse lookup only? I feel
>>>> SVQ mappings may be worth a separate fast lookup path - unlike guest
>>>> mappings, the insertion, lookup and removal for SVQ mappings seem
>>>> unavoidable during the migration downtime path.
>>>>
>>> I think the ideal order is the opposite actually. So:
>>> 1) Try for the NIC to support _F_VRING_ASID, no translation needed by QEMU
>> Right, that's the case for _F_VRING_ASID, which is simple and easy to
>> deal with. Though I think this is an edge case across all vendor
>> devices, as most likely only those no-chip IOMMU parents may support it.
>> It's a luxury for normal device to steal another VF for this ASID feature...
>>
>>> 2) Try reverse lookup from HVA to GPA. Since dataplane should fit
>>> this, we should test this first
>> So instead of a direct lookup from HVA to IOVA, the purpose of the extra
>> reverse lookup from HVA to GPA is to verify the validity of GPA (avoid
>> from being mistakenly picked from the overlapped region)? But this would
>> seem to require scanning the entire GPA space to identify possible GPA
>> ranges that are potentially overlapped? I wonder if there exists a
>> possibility to simplify this assumption: could we do this extra layer of
>> GPA-wide scan and validation *only* when overlap is indeed detected
>> during the memory listener's region_add (say, during which we try to insert
>> a duplicate / overlapped HVA into the HVA -> IOVA tree)? Or simply put,
>> the first match on the reverse lookup would mostly suffice, since we
>> know virtio driver can't use guest memory from these overlapped regions?
> The first match should be enough, but maybe we need more than one
> match. Let me put an example:
>
> The buffer is (vaddr = 0x1000, size=0x3000). Now the tree contains two
> overlapped entries: (vaddr=0x1000, size=0x2000), and (vaddr=0x1000,
> size=0x3000).
In this case, assume the overlap can be detected via certain structs,
e.g. an HVA->IOVA reverse tree; then a full and slow lookup needs to
be performed. Here we can try to match using the size, but I feel it's
best to identify the exact IOVA range by the GPA. This can be done
through a tree storing the GPA->HVA mappings, and the reverse lookup
from HVA->GPA will help identify whether the HVA falls into a certain
GPA range.
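
For illustration, a self-contained sketch (GLib only, all names
invented) of that reverse walk over a tree storing GPA->HVA mappings:

    #include <glib.h>
    #include <stdint.h>

    typedef struct {
        uint64_t gpa;    /* start of the guest-physical range */
        uint64_t hva;    /* host virtual address it maps to */
        uint64_t size;
    } GpaHvaEntry;

    typedef struct {
        uint64_t needle_hva;
        const GpaHvaEntry *match;
    } HvaLookup;

    /* O(N) walk: the tree is ordered by GPA, not by HVA, so we cannot
     * prune by key; stop at the first range containing the HVA. */
    static gboolean hva_to_gpa_iter(gpointer key, gpointer value,
                                    gpointer data)
    {
        const GpaHvaEntry *e = value;
        HvaLookup *l = data;

        if (l->needle_hva >= e->hva && l->needle_hva < e->hva + e->size) {
            l->match = e;
            return TRUE;    /* returning TRUE stops g_tree_foreach() */
        }
        return FALSE;
    }

    /* Usage:
     *   HvaLookup l = { .needle_hva = hva };
     *   g_tree_foreach(gpa_tree, hva_to_gpa_iter, &l);
     *   if (l.match) { gpa = l.match->gpa + (hva - l.match->hva); }
     */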

>
> Assuming we go through the reverse IOVA tree, we had bad luck and we
> stored the small entry plus the big entry. The first search returns
> the small entry then, (vaddr=0x1000, size=0x2000),. Calling code must
> detect it, and then look for vaddr = 0x1000 + 0x2000. That gives us
> the next entry.
Is there any reason why the first search can't pass in the GPA to 
further help identify? Suppose it's verified that the specific GPA range 
does exist via the HVA->GPA lookup.
>
> You can see that virtqueue_map_desc translates this way if
> dma_memory_map returns a translation shorter than the length of the
> buffer, for example.
>
>> You may say this assumption is too bold, but do we have other means to
>> guarantee the first match will always hit under SVQ lookup? Given that
>> we don't receive an instance of issue report until we move the memory
>> listener registration upfront to device initialization, I guess there
>> should be some point or under certain condition that the non-overlapped
>> 1:1 translation and lookup can be satisfied. Don't you agree?
>>
> To be able to build the shorter is desirable, yes. Maybe it can be
> done in this series, but I find it hard to solve some situations. For
> example, is it possible to have three overlapping regions (A, B, C)
> where regions A and B do not overlap but C overlaps both of them?
Does C map to a different GPA range than where regions A and B reside
originally? The flattened guest view should guarantee that, right? Then it
shouldn't be a problem by passing in the GPA as the secondary ID for the 
reverse HVA->IOVA lookup.

Regards,
-Siwei
>
> That's why I think it is better to delay that to a future series, but
> we can do it with one shot if it is simple enough for sure.
>
> Thanks!
>
>> Thanks,
>> -Siwei
>>> 3) Look in SVQ host-only entries (SVQ, current shadow CVQ). It is the
>>> control VQ, speed is not so important.
>>>
>>> Overlapping regions may return the wrong SVQ IOVA though. We should
>>> take extra care to make sure these are correctly handled. I mean,
>>> there are valid translations in the tree unless the driver is buggy,
>>> just may need to span many translations.
>>>
>>>>>     It is guaranteed the guest does not
>>>>> translate to that vaddr and that that vaddr is unique in the tree
>>>>> anyway.
>>>>>
>>>>> Does it sound reasonable? Jonah, would you be interested in moving this forward?
>>>> My thought would be that the reverse IOVA tree stuff can be added as a
>>>> follow-up optimization right after for extended scalability, but for now
>>>> as the interim, we may still need some form of simple fix, so as to
>>>> quickly unblock the other dependent work built on top of this one and
>>>> the early pinning series [1]. With that said, I'm completely fine with
>>>> performing the reverse lookup through a linear tree walk, e.g.
>>>> g_tree_foreach(); that should suffice for small VM configs with just a
>>>> couple of queues and limited number of memory regions. Going forward, to
>>>> address the scalability bottleneck, Jonah could just replace the
>>>> corresponding API call with the one built on top of reverse IOVA tree (I
>>>> presume the use of these iova tree APIs is kind of internal that only
>>>> limits to SVQ and vhost-vdpa subsystems) once he gets there, and then
>>>> eliminate the other API variants that will no longer be in use. What do
>>>> you think about this idea / plan?
>>>>
>>> Yeah it makes sense to me. Hopefully we can even get rid of the id member.
>>>
>>>> Thanks,
>>>> -Siwei
>>>>
>>>> [1] https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html
>>>>
>>>>> Thanks!
>>>>>
>>>>>> -Siwei
>>>>>>
>>>>>>
>>>>>>>> Thanks,
>>>>>>>> -Siwei
>>>>>>>>> Thanks!
>>>>>>>>>
>>>>>>>>>> Of course,
>>>>>>>>>> memory_region_from_host() won't search out of the guest memory space for
>>>>>>>>>> sure. As this could be on the hot data path I have a little bit
>>>>>>>>>> hesitance over the potential cost or performance regression this change
>>>>>>>>>> could bring in, but maybe I'm overthinking it too much...
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> -Siwei
>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> -Siwei
>>>>>>>>>>>>>                 return false;
>>>>>>>>>>>>>             }
>>>>>>>>>>>>>



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC 0/2] Identify aliased maps in vdpa SVQ iova_tree
  2024-05-07 10:56       ` Eugenio Perez Martin
@ 2024-05-08  2:29         ` Jason Wang
  2024-05-08 17:15           ` Eugenio Perez Martin
  0 siblings, 1 reply; 37+ messages in thread
From: Jason Wang @ 2024-05-08  2:29 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: qemu-devel, Si-Wei Liu, Michael S. Tsirkin, Lei Yang, Peter Xu,
	Jonah Palmer, Dragos Tatulea

On Tue, May 7, 2024 at 6:57 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
>
> On Tue, May 7, 2024 at 9:29 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Fri, Apr 12, 2024 at 3:56 PM Eugenio Perez Martin
> > <eperezma@redhat.com> wrote:
> > >
> > > On Fri, Apr 12, 2024 at 8:47 AM Jason Wang <jasowang@redhat.com> wrote:
> > > >
> > > > On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> > > > >
> > > > > The guest may have overlapped memory regions, where different GPAs lead
> > > > > to the same HVA.  This causes a problem when overlapped regions
> > > > > (different GPA but same translated HVA) exist in the tree, as looking
> > > > > them up by HVA will return them twice.
> > > >
> > > > I think I don't understand if there's any side effect for shadow virtqueue?
> > > >
> > >
> > > My bad, I totally forgot to put a reference to where this comes from.
> > >
> > > Si-Wei found that during initialization this sequence of maps /
> > > unmaps happens [1]:
> > >
> > > HVA                                 GPA                           IOVA
> > > --------------------------------------------------------------------------------------------
> > > Map
> > > [0x7f7903e00000, 0x7f7983e00000)    [0x0, 0x80000000)             [0x1000, 0x80000000)
> > > [0x7f7983e00000, 0x7f9903e00000)    [0x100000000, 0x2080000000)   [0x80001000, 0x2000001000)
> > > [0x7f7903ea0000, 0x7f7903ec0000)    [0xfeda0000, 0xfedc0000)      [0x2000001000, 0x2000021000)
> > >
> > > Unmap
> > > [0x7f7903ea0000, 0x7f7903ec0000)    [0xfeda0000, 0xfedc0000)      [0x1000, 0x20000) ???
> > >
> > > The third HVA range is contained in the first one, but exposed under a
> > > different GPA (aliased). This is not "flattened" by QEMU, as GPA does
> > > not overlap, only HVA.
> > >
> > > At the third chunk unmap, the current algorithm finds the first chunk,
> > > not the second one. This series is the way to tell the difference at
> > > unmap time.
> > >
> > > [1] https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html
> > >
> > > Thanks!
> >
> > Ok, I was wondering if we need to store GPA(GIOVA) to HVA mappings in
> > the iova tree to solve this issue completely. Then there won't be
> > aliasing issues.
> >
>
> I'm ok to explore that route but this has another problem. Both SVQ
> vrings and CVQ buffers also need to be addressable by VhostIOVATree,
> and they do not have GPA.
>
> At this moment vhost_svq_translate_addr is able to handle this
> transparently as we translate vaddr to SVQ IOVA. How can we store
> these new entries? Maybe a (hwaddr)-1 GPA to signal it has no GPA and
> then a list to go through other entries (SVQ vaddr and CVQ buffers).

This seems to be tricky.

As discussed, it could be another iova tree.

Thanks

>
> Thanks!
>
> > Thanks
> >
> > >
> > > > Thanks
> > > >
> > > > >
> > > > > To solve this, track the GPA in the DMA entry so that it acts as a unique
> > > > > identifier for the map.  When the map needs to be removed, the iova tree is
> > > > > able to find the right one.
> > > > >
> > > > > Users that do not go through this extra layer of indirection can use the
> > > > > iova tree as usual, with id = 0.
> > > > >
> > > > > This was found by Si-Wei Liu <si-wei.liu@oracle.com>, but I'm having a hard
> > > > > time reproducing the issue.  This has been tested only without overlapping
> > > > > maps.  If it works with overlapping maps, it will be integrated into the main
> > > > > series.
> > > > >
> > > > > Comments are welcome.  Thanks!
> > > > >
> > > > > Eugenio Pérez (2):
> > > > >   iova_tree: add an id member to DMAMap
> > > > >   vdpa: identify aliased maps in iova_tree
> > > > >
> > > > >  hw/virtio/vhost-vdpa.c   | 2 ++
> > > > >  include/qemu/iova-tree.h | 5 +++--
> > > > >  util/iova-tree.c         | 3 ++-
> > > > >  3 files changed, 7 insertions(+), 3 deletions(-)
> > > > >
> > > > > --
> > > > > 2.44.0
> > > > >
> > > >
> > >
> >
>



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC 1/2] iova_tree: add an id member to DMAMap
  2024-05-08  0:52                           ` Si-Wei Liu
@ 2024-05-08 15:25                             ` Eugenio Perez Martin
  0 siblings, 0 replies; 37+ messages in thread
From: Eugenio Perez Martin @ 2024-05-08 15:25 UTC (permalink / raw)
  To: Si-Wei Liu
  Cc: Jonah Palmer, qemu-devel, Michael S. Tsirkin, Lei Yang, Peter Xu,
	Dragos Tatulea, Jason Wang

On Wed, May 8, 2024 at 2:52 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
>
>
> On 5/1/2024 11:44 PM, Eugenio Perez Martin wrote:
> > On Thu, May 2, 2024 at 1:16 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>
> >>
> >> On 4/30/2024 10:19 AM, Eugenio Perez Martin wrote:
> >>> On Tue, Apr 30, 2024 at 7:55 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>
> >>>> On 4/29/2024 1:14 AM, Eugenio Perez Martin wrote:
> >>>>> On Thu, Apr 25, 2024 at 7:44 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>>> On 4/24/2024 12:33 AM, Eugenio Perez Martin wrote:
> >>>>>>> On Wed, Apr 24, 2024 at 12:21 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>>>>> On 4/22/2024 1:49 AM, Eugenio Perez Martin wrote:
> >>>>>>>>> On Sat, Apr 20, 2024 at 1:50 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>>>>>>> On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote:
> >>>>>>>>>>> On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>>>>>>>>> On 4/10/2024 3:03 AM, Eugenio Pérez wrote:
> >>>>>>>>>>>>> IOVA tree is also used to track the mappings of virtio-net shadow
> >>>>>>>>>>>>> virtqueue.  These mappings may not match the GPA->HVA ones.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> This causes a problem when overlapped regions (different GPA but same
> >>>>>>>>>>>>> translated HVA) exist in the tree, as looking them up by HVA will return
> >>>>>>>>>>>>> them twice.  To solve this, create an id member so we can assign unique
> >>>>>>>>>>>>> identifiers (GPA) to the maps.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> >>>>>>>>>>>>> ---
> >>>>>>>>>>>>>         include/qemu/iova-tree.h | 5 +++--
> >>>>>>>>>>>>>         util/iova-tree.c         | 3 ++-
> >>>>>>>>>>>>>         2 files changed, 5 insertions(+), 3 deletions(-)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
> >>>>>>>>>>>>> index 2a10a7052e..34ee230e7d 100644
> >>>>>>>>>>>>> --- a/include/qemu/iova-tree.h
> >>>>>>>>>>>>> +++ b/include/qemu/iova-tree.h
> >>>>>>>>>>>>> @@ -36,6 +36,7 @@ typedef struct DMAMap {
> >>>>>>>>>>>>>             hwaddr iova;
> >>>>>>>>>>>>>             hwaddr translated_addr;
> >>>>>>>>>>>>>             hwaddr size;                /* Inclusive */
> >>>>>>>>>>>>> +    uint64_t id;
> >>>>>>>>>>>>>             IOMMUAccessFlags perm;
> >>>>>>>>>>>>>         } QEMU_PACKED DMAMap;
> >>>>>>>>>>>>>         typedef gboolean (*iova_tree_iterator)(DMAMap *map);
> >>>>>>>>>>>>> @@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const DMAMap *map);
> >>>>>>>>>>>>>          * @map: the mapping to search
> >>>>>>>>>>>>>          *
> >>>>>>>>>>>>>          * Search for a mapping in the iova tree that translated_addr overlaps with the
> >>>>>>>>>>>>> - * mapping range specified.  Only the first found mapping will be
> >>>>>>>>>>>>> - * returned.
> >>>>>>>>>>>>> + * mapping range specified and map->id is equal.  Only the first found
> >>>>>>>>>>>>> + * mapping will be returned.
> >>>>>>>>>>>>>          *
> >>>>>>>>>>>>>          * Return: DMAMap pointer if found, or NULL if not found.  Note that
> >>>>>>>>>>>>>          * the returned DMAMap pointer is maintained internally.  User should
> >>>>>>>>>>>>> diff --git a/util/iova-tree.c b/util/iova-tree.c
> >>>>>>>>>>>>> index 536789797e..0863e0a3b8 100644
> >>>>>>>>>>>>> --- a/util/iova-tree.c
> >>>>>>>>>>>>> +++ b/util/iova-tree.c
> >>>>>>>>>>>>> @@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, gpointer value,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>             needle = args->needle;
> >>>>>>>>>>>>>             if (map->translated_addr + map->size < needle->translated_addr ||
> >>>>>>>>>>>>> -        needle->translated_addr + needle->size < map->translated_addr) {
> >>>>>>>>>>>>> +        needle->translated_addr + needle->size < map->translated_addr ||
> >>>>>>>>>>>>> +        needle->id != map->id) {
> >>>>>>>>>>>> It looks like this iterator can also be invoked by SVQ from
> >>>>>>>>>>>> vhost_svq_translate_addr() -> iova_tree_find_iova(), where guest GPA
> >>>>>>>>>>>> space will be searched on without passing in the ID (GPA), and exact
> >>>>>>>>>>>> match for the same GPA range is not actually needed unlike the mapping
> >>>>>>>>>>>> removal case. Could we create an API variant, for the SVQ lookup case
> >>>>>>>>>>>> specifically? Or alternatively, add a special flag, say skip_id_match to
> >>>>>>>>>>>> DMAMap, and the id match check may look like below:
> >>>>>>>>>>>>
> >>>>>>>>>>>> (!needle->skip_id_match && needle->id != map->id)
> >>>>>>>>>>>>
> >>>>>>>>>>>> I think vhost_svq_translate_addr() could just call the API variant or
> >>>>>>>>>>>> pass DMAmap with skip_id_match set to true to svq_iova_tree_find_iova().
> >>>>>>>>>>>>
> >>>>>>>>>>> I think you're totally right. But I'd really like to not complicate
> >>>>>>>>>>> the API of the iova_tree more.
> >>>>>>>>>>>
> >>>>>>>>>>> I think we can look for the hwaddr using memory_region_from_host and
> >>>>>>>>>>> then get the hwaddr. It is another lookup though...
> >>>>>>>>>> Yeah, that will be another means of doing translation without having to
> >>>>>>>>>> complicate the API around iova_tree. I wonder how the lookup through
> >>>>>>>>>> memory_region_from_host() may perform compared to the iova tree one, the
> >>>>>>>>>> former looks to be an O(N) linear search on a linked list while the
> >>>>>>>>>> latter would be roughly O(log N) on an AVL tree?
> >>>>>>>>> Even worse, as the reverse lookup (from QEMU vaddr to SVQ IOVA) is
> >>>>>>>>> linear too. It is not even ordered.
> >>>>>>>> Oh Sorry, I misread the code and I should look for g_tree_foreach ()
> >>>>>>>> instead of g_tree_search_node(). So the former is indeed linear
> >>>>>>>> iteration, but it looks to be ordered?
> >>>>>>>>
> >>>>>>>> https://github.com/GNOME/glib/blob/main/glib/gtree.c#L1115
> >>>>>>> The GPA / IOVA are ordered but we're looking by QEMU's vaddr.
> >>>>>>>
> >>>>>>> If we have these translations:
> >>>>>>> [0x1000, 0x2000] -> [0x10000, 0x11000]
> >>>>>>> [0x2000, 0x3000] -> [0x6000, 0x7000]
> >>>>>>>
> >>>>>>> We will see them in this order, so we cannot stop the search at the first node.
> >>>>>> Yeah, reverse lookup is unordered indeed, anyway.
> >>>>>>
> >>>>>>>>> But apart from this detail you're right, I have the same concerns with
> >>>>>>>>> this solution too. If we see a hard performance regression we could go
> >>>>>>>>> to more complicated solutions, like maintaining a reverse IOVATree in
> >>>>>>>>> vhost-iova-tree too. First RFCs of SVQ did that actually.
> >>>>>>>> Agreed, yeap we can use memory_region_from_host for now.  Any reason why
> >>>>>>>> reverse IOVATree was dropped, lack of users? But now we have one!
> >>>>>>>>
> >>>>>>> No, it is just simplicity. We already have a user in the hot patch in
> >>>>>>> the master branch, vhost_svq_vring_write_descs. But I never profiled
> >>>>>>> enough to find if it is a bottleneck or not to be honest.
> >>>>>> Right, without vIOMMU or a lot of virtqueues / mappings, it's hard to
> >>>>>> profile and see the difference.
> >>>>>>> I'll send the new series by today, thank you for finding these issues!
> >>>>>> Thanks! In case you don't have bandwidth to add back reverse IOVA tree,
> >>>>>> Jonah (cc'ed) may have interest in looking into it.
> >>>>>>
> >>>>> Actually, yes. I've tried to solve it using:
> >>>>> memory_region_get_ram_ptr -> It's hard to get this pointer to work
> >>>>> without messing a lot with IOVATree.
> >>>>> memory_region_find -> I'm totally unable to make it return sections
> >>>>> that make sense
> >>>>> flatview_for_each_range -> It does not return the same
> >>>>> MemoryRegionSection as the listener, not sure why.
> >>>> Ouch, thank you for the summary of attempts that were done earlier.
> >>>>> The only advance I have is that memory_region_from_host is able to
> >>>>> tell if the vaddr is from the guest or not.
> >>>> Hmmm, then it won't be too useful without a direct means of identifying
> >>>> the exact memory region associated with the iova that is being mapped.
> >>>> And, this additional indirection seems to introduce a tiny bit more
> >>>> latency in the reverse lookup routine (should not be a scalability issue
> >>>> though if it's a linear search)?
> >>>>
> >>> I didn't measure, but I guess yes it might. OTOH these structs may be
> >>> cached because virtqueue_pop just looked for them.
> >> Oh, right, that's a good point.
> >>>>> So I'm convinced there must be a way to do it with the memory
> >>>>> subsystem, but I think the best way to do it ATM is to store a
> >>>>> parallel tree with GPA-> SVQ IOVA translations. At removal time, if we
> >>>>> find the entry in this new tree, we can directly remove it by GPA. If
> >>>>> not, assume it is a host-only address like SVQ vrings, and remove by
> >>>>> iterating on vaddr as we do now.
> >>>> Yeah, this could work I think. On the other hand, given that we are now
> >>>> trying to improve it, I wonder if it's possible to come up with a fast
> >>>> version for the SVQ (host-only address) case without having to look up
> >>>> twice? SVQ callers should be able to tell apart from the guest case
> >>>> where GPA -> IOVA translation doesn't exist? Or just maintain a parallel
> >>>> tree with HVA -> IOVA translations for SVQ reverse lookup only? I feel
> >>>> SVQ mappings may be worth a separate fast lookup path - unlike guest
> >>>> mappings, the insertion, lookup and removal for SVQ mappings seem
> >>>> unavoidable during the migration downtime path.
> >>>>
> >>> I think the ideal order is the opposite actually. So:
> >>> 1) Try for the NIC to support _F_VRING_ASID, no translation needed by QEMU
> >> Right, that's the case for _F_VRING_ASID, which is simple and easy to
> >> deal with. Though I think this is an edge case across all vendor
> >> devices, as most likely only those no-chip IOMMU parents may support it.
> >> It's a luxury for normal device to steal another VF for this ASID feature...
> >>
> >>> 2) Try reverse lookup from HVA to GPA. Since dataplane should fit
> >>> this, we should test this first
> >> So instead of a direct lookup from HVA to IOVA, the purpose of the extra
> >> reverse lookup from HVA to GPA is to verify the validity of GPA (avoid
> >> from being mistakenly picked from the overlapped region)? But this would
> >> seem to require scanning the entire GPA space to identify possible GPA
> >> ranges that are potentially overlapped? I wonder if there exists a
> >> possibility to simplify this assumption: could we do this extra layer of
> >> GPA-wide scan and validation *only* when overlap is indeed detected
> >> during the memory listener's region_add (say, during which we try to insert
> >> a duplicate / overlapped HVA into the HVA -> IOVA tree)? Or simply put,
> >> the first match on the reverse lookup would mostly suffice, since we
> >> know virtio driver can't use guest memory from these overlapped regions?
> > The first match should be enough, but maybe we need more than one
> > match. Let me put an example:
> >
> > The buffer is (vaddr = 0x1000, size=0x3000). Now the tree contains two
> > overlapped entries: (vaddr=0x1000, size=0x2000), and (vaddr=0x1000,
> > size=0x3000).
> In this case, assume the overlap can be detected via certain structs,
> e.g. an HVA->IOVA reverse tree; then a full and slow lookup needs to
> be performed. Here we can try to match using the size, but I feel it's
> best to identify the exact IOVA range by the GPA. This can be done
> through a tree storing the GPA->HVA mappings, and the reverse lookup
> from HVA->GPA will help identify whether the HVA falls into a certain
> GPA range.
>

It is possible somehow, but multiple searches are already used in
other areas where the full range is not found in the first attempt.
The first one may return a partial result, but the second one can look for
the missing part of the key (vaddr=0x2000, size=0x1000). Isn't that
simpler?
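
Roughly like this, against the current iova_tree API (an untested
sketch; tree, hva and size are placeholders):

    /* Translate (hva, size), possibly spanning several entries */
    size_t translated = 0;
    while (translated < size) {
        DMAMap needle = {
            .translated_addr = hva + translated,
            .size = size - translated - 1,   /* .size is inclusive */
        };
        const DMAMap *map = iova_tree_find_iova(tree, &needle);
        if (unlikely(!map)) {
            return false;                    /* hole in the tree */
        }
        hwaddr off = needle.translated_addr - map->translated_addr;
        size_t chunk = MIN(size - translated, map->size + 1 - off);
        /* emit the IOVA range [map->iova + off, map->iova + off + chunk) */
        translated += chunk;
    }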

> >
> > Assuming we go through the reverse IOVA tree, we had bad luck and we
> > stored the small entry plus the big entry. The first search returns
> > the small entry then, (vaddr=0x1000, size=0x2000),. Calling code must
> > detect it, and then look for vaddr = 0x1000 + 0x2000. That gives us
> > the next entry.
> Is there any reason why the first search can't pass in the GPA to
> further help identify? Suppose it's verified that the specific GPA range
> does exist via the HVA->GPA lookup.

The only problem is that IOVATree is shared with intel_iommu, so
modifying it without affecting intel_iommu's usage of it might be
tricky.

> >
> > You can see that virtqueue_map_desc translates this way if
> > dma_memory_map returns a translation shorter than the length of the
> > buffer, for example.
> >
> >> You may say this assumption is too bold, but do we have other means to
> >> guarantee the first match will always hit under SVQ lookup? Given that
> >> we don't receive an instance of issue report until we move the memory
> >> listener registration upfront to device initialization, I guess there
> >> should be some point or under certain condition that the non-overlapped
> >> 1:1 translation and lookup can be satisfied. Don't you agree?
> >>
> > To be able to build the shorter is desirable, yes. Maybe it can be
> > done in this series, but I find it hard to solve some situations. For
> > example, is it possible to have three overlapping regions (A, B, C)
> > where regions A and B do not overlap but C overlaps both of them?
> Does C map to a different GPA range than where regions A and B reside
> originally? The flattened guest view should guarantee that, right? Then it
> shouldn't be a problem by passing in the GPA as the secondary ID for the
> reverse HVA->IOVA lookup.
>

Right. But in this RFC the id is not searched across the full range;
only the first GPA of each region is used.

> Regards,
> -Siwei
> >
> > That's why I think it is better to delay that to a future series, but
> > we can do it with one shot if it is simple enough for sure.
> >
> > Thanks!
> >
> >> Thanks,
> >> -Siwei
> >>> 3) Look in SVQ host-only entries (SVQ, current shadow CVQ). It is the
> >>> control VQ, speed is not so important.
> >>>
> >>> Overlapping regions may return the wrong SVQ IOVA though. We should
> >>> take extra care to make sure these are correctly handled. I mean,
> >>> there are valid translations in the tree unless the driver is buggy,
> >>> just may need to span many translations.
> >>>
> >>>>>     It is guaranteed the guest does not
> >>>>> translate to that vaddr and that that vaddr is unique in the tree
> >>>>> anyway.
> >>>>>
> >>>>> Does it sound reasonable? Jonah, would you be interested in moving this forward?
> >>>> My thought would be that the reverse IOVA tree stuff can be added as a
> >>>> follow-up optimization right after for extended scalability, but for now
> >>>> as the interim, we may still need some form of simple fix, so as to
> >>>> quickly unblock the other dependent work built on top of this one and
> >>>> the early pinning series [1]. With that said, I'm completely fine with
> >>>> performing the reverse lookup through a linear tree walk, e.g.
> >>>> g_tree_foreach(); that should suffice for small VM configs with just a
> >>>> couple of queues and limited number of memory regions. Going forward, to
> >>>> address the scalability bottleneck, Jonah could just replace the
> >>>> corresponding API call with the one built on top of reverse IOVA tree (I
> >>>> presume the use of these iova tree APIs is kind of internal that only
> >>>> limits to SVQ and vhost-vdpa subsystems) once he gets there, and then
> >>>> eliminate the other API variants that will no longer be in use. What do
> >>>> you think about this idea / plan?
> >>>>
> >>> Yeah it makes sense to me. Hopefully we can even get rid of the id member.
> >>>
> >>>> Thanks,
> >>>> -Siwei
> >>>>
> >>>> [1] https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html
> >>>>
> >>>>> Thanks!
> >>>>>
> >>>>>> -Siwei
> >>>>>>
> >>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> -Siwei
> >>>>>>>>> Thanks!
> >>>>>>>>>
> >>>>>>>>>> Of course,
> >>>>>>>>>> memory_region_from_host() won't search out of the guest memory space for
> >>>>>>>>>> sure. As this could be on the hot data path I have a little bit
> >>>>>>>>>> hesitance over the potential cost or performance regression this change
> >>>>>>>>>> could bring in, but maybe I'm overthinking it too much...
> >>>>>>>>>>
> >>>>>>>>>> Thanks,
> >>>>>>>>>> -Siwei
> >>>>>>>>>>
> >>>>>>>>>>>> Thanks,
> >>>>>>>>>>>> -Siwei
> >>>>>>>>>>>>>                 return false;
> >>>>>>>>>>>>>             }
> >>>>>>>>>>>>>
>



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC 0/2] Identify aliased maps in vdpa SVQ iova_tree
  2024-05-08  2:29         ` Jason Wang
@ 2024-05-08 17:15           ` Eugenio Perez Martin
  2024-05-09  6:27             ` Jason Wang
  0 siblings, 1 reply; 37+ messages in thread
From: Eugenio Perez Martin @ 2024-05-08 17:15 UTC (permalink / raw)
  To: Jason Wang
  Cc: qemu-devel, Si-Wei Liu, Michael S. Tsirkin, Lei Yang, Peter Xu,
	Jonah Palmer, Dragos Tatulea

On Wed, May 8, 2024 at 4:29 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Tue, May 7, 2024 at 6:57 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> >
> > On Tue, May 7, 2024 at 9:29 AM Jason Wang <jasowang@redhat.com> wrote:
> > >
> > > On Fri, Apr 12, 2024 at 3:56 PM Eugenio Perez Martin
> > > <eperezma@redhat.com> wrote:
> > > >
> > > > On Fri, Apr 12, 2024 at 8:47 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > >
> > > > > On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> > > > > >
> > > > > > The guest may have overlapped memory regions, where different GPAs lead
> > > > > > to the same HVA.  This causes a problem when overlapped regions
> > > > > > (different GPA but same translated HVA) exist in the tree, as looking
> > > > > > them up by HVA will return them twice.
> > > > >
> > > > > I think I don't understand if there's any side effect for shadow virtqueue?
> > > > >
> > > >
> > > > My bad, I totally forgot to put a reference to where this comes from.
> > > >
> > > > Si-Wei found that during initialization this sequence of maps /
> > > > unmaps happens [1]:
> > > >
> > > > HVA                                 GPA                           IOVA
> > > > --------------------------------------------------------------------------------------------
> > > > Map
> > > > [0x7f7903e00000, 0x7f7983e00000)    [0x0, 0x80000000)             [0x1000, 0x80000000)
> > > > [0x7f7983e00000, 0x7f9903e00000)    [0x100000000, 0x2080000000)   [0x80001000, 0x2000001000)
> > > > [0x7f7903ea0000, 0x7f7903ec0000)    [0xfeda0000, 0xfedc0000)      [0x2000001000, 0x2000021000)
> > > >
> > > > Unmap
> > > > [0x7f7903ea0000, 0x7f7903ec0000)    [0xfeda0000, 0xfedc0000)      [0x1000, 0x20000) ???
> > > >
> > > > The third HVA range is contained in the first one, but exposed under a
> > > > different GPA (aliased). This is not "flattened" by QEMU, as GPA does
> > > > not overlap, only HVA.
> > > >
> > > > At the third chunk unmap, the current algorithm finds the first chunk,
> > > > not the second one. This series is the way to tell the difference at
> > > > unmap time.
> > > >
> > > > [1] https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html
> > > >
> > > > Thanks!
> > >
> > > Ok, I was wondering if we need to store GPA(GIOVA) to HVA mappings in
> > > the iova tree to solve this issue completely. Then there won't be
> > > aliasing issues.
> > >
> >
> > I'm ok to explore that route but this has another problem. Both SVQ
> > vrings and CVQ buffers also need to be addressable by VhostIOVATree,
> > and they do not have GPA.
> >
> > At this moment vhost_svq_translate_addr is able to handle this
> > transparently as we translate vaddr to SVQ IOVA. How can we store
> > these new entries? Maybe a (hwaddr)-1 GPA to signal it has no GPA and
> > then a list to go through other entries (SVQ vaddr and CVQ buffers).
>
> This seems to be tricky.
>
> As discussed, it could be another iova tree.
>

Yes but there are many ways to add another IOVATree. Let me expand & recap.

Option 1 is to simply add another iova tree to VhostShadowVirtqueue.
Let's call it gpa_iova_tree, as opposed to the current iova_tree that
translates from vaddr to SVQ IOVA. Knowing which one to use is easy when
adding or removing, like in the memory listener, but how do we know at
vhost_svq_translate_addr?

The easiest way for me is to rely on memory_region_from_host(). When
vaddr is from the guest, it returns a valid MemoryRegion. When it is
not, it returns NULL. I'm not sure if this is a valid use case, it
just worked in my tests so far.
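
In code, the discriminator could be as simple as this sketch, with the
option 1 tree names:

    ram_addr_t offset;
    MemoryRegion *mr = memory_region_from_host(vaddr, &offset);

    if (mr) {
        /* Guest memory: go through the new GPA -> SVQ IOVA tree.
         * Getting from (mr, offset) back to the GPA is the part I
         * have not solved yet. */
    } else {
        /* Host-only vaddr (SVQ vrings, CVQ buffers): keep using the
         * current vaddr -> SVQ IOVA tree. */
    }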

Now we have the second problem: The GPA values of the regions of the
two IOVA trees must be unique. We need to be able to find unallocated
regions in SVQ IOVA. At this moment there is only one IOVATree, so
this is done easily by vhost_iova_tree_map_alloc. But it is very
complicated with two trees.

Option 2a is to add another IOVATree in VhostIOVATree. I think the
easiest way is to keep the GPA -> SVQ IOVA in one tree, let's call it
iova_gpa_map, and the current vaddr -> SVQ IOVA tree in
iova_taddr_map. This second tree should contain both vaddr memory that
belongs to the guest and host-only vaddr like vrings and CVQ buffers.

How to pass the GPA to VhostIOVATree API? To add it to DMAMap is
complicated, as it is shared with intel_iommu. We can add new
functions to VhostIOVATree that accepts vaddr plus GPA, or maybe it is
enough with GPA only. It should be functions to add, remove, and
allocate new entries. But vaddr ones must be kept, because buffers
might be host-only.

Then the caller can choose which version to call: for adding and
removing guest memory from the memory listener, the GPA variant.
Adding SVQ vrings and CVQ buffers should use the current vaddr
versions. vhost_svq_translate_addr still needs to use
memory_region_from_host() to know which one to call.
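
Spelled out as a header sketch, option 2a could look like this (the
*_gpa names are invented; the rest mirrors the current VhostIOVATree):

    typedef struct VhostIOVATree {
        hwaddr iova_first, iova_last;
        IOVATree *iova_taddr_map; /* vaddr -> SVQ IOVA, guest + host-only */
        IOVATree *iova_gpa_map;   /* GPA -> SVQ IOVA, guest memory only */
    } VhostIOVATree;

    /* Guest memory, called from the memory listener */
    int vhost_iova_tree_map_alloc_gpa(VhostIOVATree *tree, DMAMap *map,
                                      hwaddr gpa);
    void vhost_iova_tree_remove_gpa(VhostIOVATree *tree, DMAMap map);

    /* Host-only memory (SVQ vrings, CVQ buffers), the current API */
    int vhost_iova_tree_map_alloc(VhostIOVATree *tree, DMAMap *map);
    void vhost_iova_tree_remove(VhostIOVATree *tree, DMAMap map);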

Although I didn't like this approach because it complicates
VhostIOVATree, I think it is the easier way except for option 4, I'll
explain later.

This has an extra advantage: currently, the lookup in
vhost_svq_translate_addr is linear, O(1). This would allow us to use
the tree properly.

Option 2b could be to keep them totally separated. So current
VhostIOVATree->iova_taddr_map only contains host-only entries, and the
new iova_gpa_map contains the guest entries. I don't think this case
adds anything except less memory usage, as the gpa map (which should
be the fastest) will be the same size. Also, it makes it difficult to
implement vhost_iova_tree_map_alloc.

Option 3 is to not add new functions but extend the current ones. That
would require special GPA values to indicate no GPA, like
SVQ vrings. I think option 2a is better, but this may help to keep the
interface simpler.

Option 4 is what I'm proposing in this RFC: store the GPA as the map id
so we can tell whether the vaddr corresponds to one SVQ IOVA or another.
Now I'm having trouble retrieving the memory section I see in the
memory listener. It should not be so difficult, though. The main advantage
is not to duplicate data structs that are already in QEMU, but maybe
it is not worth the effort.

Going further with this option, we could add a flag to ignore the .id
member added. But it adds more and more complexity to the API so I
would prefer option 2a for this.

> Thanks
>
> >
> > Thanks!
> >
> > > Thanks
> > >
> > > >
> > > > > Thanks
> > > > >
> > > > > >
> > > > > > To solve this, track the GPA in the DMA entry so that it acts as a unique
> > > > > > identifier for the map.  When the map needs to be removed, the iova tree is
> > > > > > able to find the right one.
> > > > > >
> > > > > > Users that do not go through this extra layer of indirection can use the
> > > > > > iova tree as usual, with id = 0.
> > > > > >
> > > > > > This was found by Si-Wei Liu <si-wei.liu@oracle.com>, but I'm having a hard
> > > > > > time reproducing the issue.  This has been tested only without overlapping
> > > > > > maps.  If it works with overlapping maps, it will be integrated into the main
> > > > > > series.
> > > > > >
> > > > > > Comments are welcome.  Thanks!
> > > > > >
> > > > > > Eugenio Pérez (2):
> > > > > >   iova_tree: add an id member to DMAMap
> > > > > >   vdpa: identify aliased maps in iova_tree
> > > > > >
> > > > > >  hw/virtio/vhost-vdpa.c   | 2 ++
> > > > > >  include/qemu/iova-tree.h | 5 +++--
> > > > > >  util/iova-tree.c         | 3 ++-
> > > > > >  3 files changed, 7 insertions(+), 3 deletions(-)
> > > > > >
> > > > > > --
> > > > > > 2.44.0
> > > > > >
> > > > >
> > > >
> > >
> >
>



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC 0/2] Identify aliased maps in vdpa SVQ iova_tree
  2024-05-08 17:15           ` Eugenio Perez Martin
@ 2024-05-09  6:27             ` Jason Wang
  2024-05-09  7:10               ` Eugenio Perez Martin
  0 siblings, 1 reply; 37+ messages in thread
From: Jason Wang @ 2024-05-09  6:27 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: qemu-devel, Si-Wei Liu, Michael S. Tsirkin, Lei Yang, Peter Xu,
	Jonah Palmer, Dragos Tatulea

On Thu, May 9, 2024 at 1:16 AM Eugenio Perez Martin <eperezma@redhat.com> wrote:
>
> On Wed, May 8, 2024 at 4:29 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Tue, May 7, 2024 at 6:57 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > >
> > > On Tue, May 7, 2024 at 9:29 AM Jason Wang <jasowang@redhat.com> wrote:
> > > >
> > > > On Fri, Apr 12, 2024 at 3:56 PM Eugenio Perez Martin
> > > > <eperezma@redhat.com> wrote:
> > > > >
> > > > > On Fri, Apr 12, 2024 at 8:47 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > >
> > > > > > On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> > > > > > >
> > > > > > > The guest may have overlapped memory regions, where different GPAs lead
> > > > > > > to the same HVA.  This causes a problem when overlapped regions
> > > > > > > (different GPA but same translated HVA) exist in the tree, as looking
> > > > > > > them up by HVA will return them twice.
> > > > > >
> > > > > > I think I don't understand if there's any side effect for shadow virtqueue?
> > > > > >
> > > > >
> > > > > My bad, I totally forgot to put a reference to where this comes from.
> > > > >
> > > > > Si-Wei found that during initialization this sequence of maps /
> > > > > unmaps happens [1]:
> > > > >
> > > > > HVA                                 GPA                           IOVA
> > > > > --------------------------------------------------------------------------------------------
> > > > > Map
> > > > > [0x7f7903e00000, 0x7f7983e00000)    [0x0, 0x80000000)             [0x1000, 0x80000000)
> > > > > [0x7f7983e00000, 0x7f9903e00000)    [0x100000000, 0x2080000000)   [0x80001000, 0x2000001000)
> > > > > [0x7f7903ea0000, 0x7f7903ec0000)    [0xfeda0000, 0xfedc0000)      [0x2000001000, 0x2000021000)
> > > > >
> > > > > Unmap
> > > > > [0x7f7903ea0000, 0x7f7903ec0000)    [0xfeda0000, 0xfedc0000)      [0x1000, 0x20000) ???
> > > > >
> > > > > The third HVA range is contained in the first one, but exposed under a
> > > > > different GPA (aliased). This is not "flattened" by QEMU, as GPA does
> > > > > not overlap, only HVA.
> > > > >
> > > > > At the third chunk unmap, the current algorithm finds the first chunk,
> > > > > not the second one. This series is the way to tell the difference at
> > > > > unmap time.
> > > > >
> > > > > [1] https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html
> > > > >
> > > > > Thanks!
> > > >
> > > > Ok, I was wondering if we need to store GPA(GIOVA) to HVA mappings in
> > > > the iova tree to solve this issue completely. Then there won't be
> > > > aliasing issues.
> > > >
> > >
> > > I'm ok to explore that route but this has another problem. Both SVQ
> > > vrings and CVQ buffers also need to be addressable by VhostIOVATree,
> > > and they do not have GPA.
> > >
> > > At this moment vhost_svq_translate_addr is able to handle this
> > > transparently as we translate vaddr to SVQ IOVA. How can we store
> > > these new entries? Maybe a (hwaddr)-1 GPA to signal it has no GPA and
> > > then a list to go through other entries (SVQ vaddr and CVQ buffers).
> >
> > This seems to be tricky.
> >
> > As discussed, it could be another iova tree.
> >
>
> Yes but there are many ways to add another IOVATree. Let me expand & recap.
>
> Option 1 is to simply add another iova tree to VhostShadowVirtqueue.
> Let's call it gpa_iova_tree, as opposed to the current iova_tree that
> translates from vaddr to SVQ IOVA. Knowing which one to use is easy when
> adding or removing, like in the memory listener, but how do we know at
> vhost_svq_translate_addr?

Then we won't use virtqueue_pop() at all, we need a SVQ version of
virtqueue_pop() to translate GPA to SVQ IOVA directly?

>
> The easiest way for me is to rely on memory_region_from_host(). When
> vaddr is from the guest, it returns a valid MemoryRegion. When it is
> not, it returns NULL. I'm not sure if this is a valid use case, it
> just worked in my tests so far.
>
> Now we have the second problem: The GPA values of the regions of the
> two IOVA trees must be unique. We need to be able to find unallocated
> regions in SVQ IOVA. At this moment there is only one IOVATree, so
> this is done easily by vhost_iova_tree_map_alloc. But it is very
> complicated with two trees.

Would it be simpler if we decouple the IOVA allocator? For example, we
can have a dedicated gtree to track the allocated IOVA ranges. It is
shared by both

1) Guest memory (GPA)
2) SVQ virtqueue and buffers

And another gtree to track the GPA to IOVA.

The SVQ code could use either

1) one linear mappings that contains both SVQ virtqueue and buffers

or

2) dynamic IOVA allocation/deallocation helpers

So we don't actually need the third gtree for SVQ HVA -> SVQ IOVA?
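
For illustration, a self-contained sketch (GLib only, names invented,
assuming all tracked ranges live in [iova_first, iova_last)) of such a
decoupled first-fit allocator; both the memory listener and the
SVQ/CVQ setup code would allocate from the same tree, so ranges never
collide:

    #include <glib.h>
    #include <stdint.h>

    typedef struct {
        uint64_t iova;
        uint64_t size;      /* exclusive end at iova + size */
    } IovaRange;

    typedef struct {
        uint64_t cursor;    /* first IOVA not known to be allocated */
        uint64_t needed;
        uint64_t result;
        gboolean found;
    } AllocState;

    static gint iova_cmp(gconstpointer a, gconstpointer b, gpointer d)
    {
        const IovaRange *ra = a, *rb = b;
        return ra->iova < rb->iova ? -1 : (ra->iova > rb->iova);
    }

    /* Entries come out in IOVA order; gaps between them are holes */
    static gboolean first_fit(gpointer key, gpointer value, gpointer data)
    {
        const IovaRange *r = key;
        AllocState *s = data;

        if (r->iova - s->cursor >= s->needed) {
            s->found = TRUE;
            s->result = s->cursor;
            return TRUE;    /* stop at the first hole that fits */
        }
        s->cursor = r->iova + r->size;
        return FALSE;
    }

    /* allocated = g_tree_new_full(iova_cmp, NULL, g_free, NULL) */
    static gboolean iova_alloc(GTree *allocated, uint64_t iova_first,
                               uint64_t iova_last, uint64_t size,
                               uint64_t *iova)
    {
        AllocState s = { .cursor = iova_first, .needed = size };

        g_tree_foreach(allocated, first_fit, &s);
        if (!s.found && iova_last - s.cursor >= size) {
            s.found = TRUE;
            s.result = s.cursor;
        }
        if (s.found) {
            IovaRange *r = g_new(IovaRange, 1);
            *r = (IovaRange){ .iova = s.result, .size = size };
            g_tree_insert(allocated, r, r);
            *iova = s.result;
        }
        return s.found;
    }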

>
> Option 2a is to add another IOVATree in VhostIOVATree. I think the
> easiest way is to keep the GPA -> SVQ IOVA in one tree, let's call it
> iova_gpa_map, and the current vaddr -> SVQ IOVA tree in
> iova_taddr_map. This second tree should contain both vaddr memory that
> belongs to the guest and host-only vaddr like vrings and CVQ buffers.
>
> How to pass the GPA to VhostIOVATree API? To add it to DMAMap is
> complicated, as it is shared with intel_iommu. We can add new
> functions to VhostIOVATree that accepts vaddr plus GPA, or maybe it is
> enough with GPA only. It should be functions to add, remove, and
> allocate new entries. But vaddr ones must be kept, because buffers
> might be host-only.
>
> Then the caller can choose which version to call: for adding and
> removing guest memory from the memory listener, the GPA variant.
> Adding SVQ vrings and CVQ buffers should use the current vaddr
> versions. vhost_svq_translate_addr still needs to use
> memory_region_from_host() to know which one to call.

So the idea is, for region_del/region_add use iova_gpa_map? For the
SVQ translation, use the iova_taddr_map?

I suspect we still need to synchronize those two trees, so it
might still be problematic as iova_taddr_map contains the overlapped
regions.

>
> Although I didn't like this approach because it complicates
> VhostIOVATree, I think it is the easier way except for option 4, I'll
> explain later.
>
> This has an extra advantage: currently, the lookup in
> vhost_svq_translate_addr is linear, O(1). This would allow us to use
> the tree properly.

It uses g_tree_foreach() which I guess is not O(1).

>
> Option 2b could be to keep them totally separated. So current
> VhostIOVATree->iova_taddr_map only contains host-only entries, and the
> new iova_gpa_map contains the guest entries. I don't think this case
> adds anything except less memory usage, as the gpa map (which should
> be the fastest) will be the same size. Also, it makes it difficult to
> implement vhost_iova_tree_map_alloc.
>
> Option 3 is to not add new functions but extend the current ones. That
> would require special GPA values to indicate no GPA, like
> SVQ vrings. I think option 2a is better, but this may help to keep the
> interface simpler.
>
> Option 4 is what I'm proposing in this RFC: store the GPA as the map id
> so we can tell whether the vaddr corresponds to one SVQ IOVA or another.
> Now I'm having trouble retrieving the memory section I see in the
> memory listener. It should not be so difficult, though. The main advantage
> is not to duplicate data structs that are already in QEMU, but maybe
> it is not worth the effort.
>
> Going further with this option, we could add a flag to ignore the .id
> member added. But it adds more and more complexity to the API so I
> would prefer option 2a for this.

Thanks

>
> > Thanks
> >
> > >
> > > Thanks!
> > >
> > > > Thanks
> > > >
> > > > >
> > > > > > Thanks
> > > > > >
> > > > > > >
> > > > > > > To solve this, track the GPA in the DMA entry so that it acts as a unique
> > > > > > > identifier for the map.  When the map needs to be removed, the iova tree is
> > > > > > > able to find the right one.
> > > > > > >
> > > > > > > Users that do not go through this extra layer of indirection can use the
> > > > > > > iova tree as usual, with id = 0.
> > > > > > >
> > > > > > > This was found by Si-Wei Liu <si-wei.liu@oracle.com>, but I'm having a hard
> > > > > > > time reproducing the issue.  This has been tested only without overlapping
> > > > > > > maps.  If it works with overlapping maps, it will be integrated into the main
> > > > > > > series.
> > > > > > >
> > > > > > > Comments are welcome.  Thanks!
> > > > > > >
> > > > > > > Eugenio Pérez (2):
> > > > > > >   iova_tree: add an id member to DMAMap
> > > > > > >   vdpa: identify aliased maps in iova_tree
> > > > > > >
> > > > > > >  hw/virtio/vhost-vdpa.c   | 2 ++
> > > > > > >  include/qemu/iova-tree.h | 5 +++--
> > > > > > >  util/iova-tree.c         | 3 ++-
> > > > > > >  3 files changed, 7 insertions(+), 3 deletions(-)
> > > > > > >
> > > > > > > --
> > > > > > > 2.44.0
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC 0/2] Identify aliased maps in vdpa SVQ iova_tree
  2024-05-09  6:27             ` Jason Wang
@ 2024-05-09  7:10               ` Eugenio Perez Martin
  2024-05-10  4:28                 ` Jason Wang
  0 siblings, 1 reply; 37+ messages in thread
From: Eugenio Perez Martin @ 2024-05-09  7:10 UTC (permalink / raw)
  To: Jason Wang
  Cc: qemu-devel, Si-Wei Liu, Michael S. Tsirkin, Lei Yang, Peter Xu,
	Jonah Palmer, Dragos Tatulea

On Thu, May 9, 2024 at 8:27 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Thu, May 9, 2024 at 1:16 AM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> >
> > On Wed, May 8, 2024 at 4:29 AM Jason Wang <jasowang@redhat.com> wrote:
> > >
> > > On Tue, May 7, 2024 at 6:57 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > > >
> > > > On Tue, May 7, 2024 at 9:29 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > >
> > > > > On Fri, Apr 12, 2024 at 3:56 PM Eugenio Perez Martin
> > > > > <eperezma@redhat.com> wrote:
> > > > > >
> > > > > > On Fri, Apr 12, 2024 at 8:47 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > >
> > > > > > > On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> > > > > > > >
> > > > > > > > The guest may have overlapped memory regions, where different GPAs lead
> > > > > > > > to the same HVA.  This causes a problem when overlapped regions
> > > > > > > > (different GPA but same translated HVA) exist in the tree, as looking
> > > > > > > > them up by HVA will return them twice.
> > > > > > >
> > > > > > > I think I don't understand if there's any side effect for shadow virtqueue?
> > > > > > >
> > > > > >
> > > > > > My bad, I totally forgot to put a reference to where this comes from.
> > > > > >
> > > > > > Si-Wei found that during initialization this sequence of maps /
> > > > > > unmaps happens [1]:
> > > > > >
> > > > > > HVA                                 GPA                           IOVA
> > > > > > --------------------------------------------------------------------------------------------
> > > > > > Map
> > > > > > [0x7f7903e00000, 0x7f7983e00000)    [0x0, 0x80000000)             [0x1000, 0x80000000)
> > > > > > [0x7f7983e00000, 0x7f9903e00000)    [0x100000000, 0x2080000000)   [0x80001000, 0x2000001000)
> > > > > > [0x7f7903ea0000, 0x7f7903ec0000)    [0xfeda0000, 0xfedc0000)      [0x2000001000, 0x2000021000)
> > > > > >
> > > > > > Unmap
> > > > > > [0x7f7903ea0000, 0x7f7903ec0000)    [0xfeda0000, 0xfedc0000)      [0x1000, 0x20000) ???
> > > > > >
> > > > > > The third HVA range is contained in the first one, but exposed under a
> > > > > > different GVA (aliased). This is not "flattened" by QEMU, as GPA does
> > > > > > not overlap, only HVA.
> > > > > >
> > > > > > At the third chunk unmap, the current algorithm finds the first chunk,
> > > > > > not the second one. This series is the way to tell the difference at
> > > > > > unmap time.
> > > > > >
> > > > > > [1] https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html
> > > > > >
> > > > > > Thanks!
> > > > >
> > > > > Ok, I was wondering if we need to store GPA(GIOVA) to HVA mappings in
> > > > > the iova tree to solve this issue completely. Then there won't be
> > > > > aliasing issues.
> > > > >
> > > >
> > > > I'm ok to explore that route but this has another problem. Both SVQ
> > > > vrings and CVQ buffers also need to be addressable by VhostIOVATree,
> > > > and they do not have GPA.
> > > >
> > > > At this moment vhost_svq_translate_addr is able to handle this
> > > > transparently as we translate vaddr to SVQ IOVA. How can we store
> > > > these new entries? Maybe a (hwaddr)-1 GPA to signal it has no GPA and
> > > > then a list to go through other entries (SVQ vaddr and CVQ buffers).
> > >
> > > This seems to be tricky.
> > >
> > > As discussed, it could be another iova tree.
> > >
> >
> > Yes but there are many ways to add another IOVATree. Let me expand & recap.
> >
> > Option 1 is to simply add another iova tree to VhostShadowVirtqueue.
> > Let's call it gpa_iova_tree, as opposed to the current iova_tree that
> > translates from vaddr to SVQ IOVA. To know which one to use is easy at
> > adding or removing, like in the memory listener, but how to know at
> > vhost_svq_translate_addr?
>
> Then we won't use virtqueue_pop() at all, we need a SVQ version of
> virtqueue_pop() to translate GPA to SVQ IOVA directly?
>

The problem is not virtqueue_pop, which is outside of
vhost_svq_translate_addr. The problem is the need to add
conditionals / complexity in all the callers of
vhost_svq_translate_addr.
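
Something like this at every call site. This is a toy sketch with
invented types and helpers, not QEMU code; the real check could be
backed by memory_region_from_host():

#include <stdbool.h>
#include <stdint.h>

typedef struct IOVATree IOVATree;

/* Stand-ins assumed to exist elsewhere. */
extern uint64_t iova_tree_translate(IOVATree *tree, const void *vaddr);
extern bool vaddr_is_guest_memory(const void *vaddr);

static uint64_t svq_translate(IOVATree *gpa_tree, IOVATree *host_tree,
                              const void *vaddr)
{
    /* This conditional is the complexity that would spread around. */
    IOVATree *t = vaddr_is_guest_memory(vaddr) ? gpa_tree : host_tree;

    return iova_tree_translate(t, vaddr);
}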

> >
> > The easiest way for me is to rely on memory_region_from_host(). When
> > vaddr is from the guest, it returns a valid MemoryRegion. When it is
> > not, it returns NULL. I'm not sure if this is a valid use case, it
> > just worked in my tests so far.
> >
> > Now we have the second problem: The GPA values of the regions of the
> > two IOVA tree must be unique. We need to be able to find unallocated
> > regions in SVQ IOVA. At this moment there is only one IOVATree, so
> > this is done easily by vhost_iova_tree_map_alloc. But it is very
> > complicated with two trees.
>
> Would it be simpler if we decouple the IOVA allocator? For example, we
> can have a dedicated gtree to track the allocated IOVA ranges. It is
> shared by both
>
> 1) Guest memory (GPA)
> 2) SVQ virtqueue and buffers
>
> And another gtree to track the GPA to IOVA.
>
> The SVQ code could use either
>
> 1) one linear mappings that contains both SVQ virtqueue and buffers
>
> or
>
> 2) dynamic IOVA allocation/deallocation helpers
>
> So we don't actually need the third gtree for SVQ HVA -> SVQ IOVA?
>

That's possible, but that scatters the IOVA handling code instead of
keeping it self-contained in VhostIOVATree.

> >
> > Option 2a is to add another IOVATree in VhostIOVATree. I think the
> > easiest way is to keep the GPA -> SVQ IOVA in one tree, let's call it
> > iova_gpa_map, and the current vaddr -> SVQ IOVA tree in
> > iova_taddr_map. This second tree should contain both vaddr memory that
> > belongs to the guest and host-only vaddr like vrings and CVQ buffers.
> >
> > How to pass the GPA to VhostIOVATree API? To add it to DMAMap is
> > complicated, as it is shared with intel_iommu. We can add new
> > functions to VhostIOVATree that accepts vaddr plus GPA, or maybe it is
> > enough with GPA only. It should be functions to add, remove, and
> > allocate new entries. But vaddr ones must be kept, because buffers
> > might be host-only.
> >
> > Then the caller can choose which version to call: for adding and
> > removing guest memory from the memory listener, the GPA variant.
> > Adding SVQ vrings and CVQ buffers should use the current vaddr
> > versions. vhost_svq_translate_addr still needs to use
> > memory_region_from_host() to know which one to call.
>
> So the idea is, for region_del/region_add use iova_gpa_map? For the
> SVQ translation, use the iova_taddr_map?
>
> I suspect we still need to synchronize with those two trees so it
> might be still problematic as iova_taddr_map contains the overlapped
> regions.
>

Right. All the functions that add / remove entries with a GPA must
also update the current iova_taddr_map.

> >
> > Although I didn't like this approach because it complicates
> > VhostIOVATree, I think it is the easier way except for option 4, I'll
> > explain later.
> >
> > This has an extra advantage: currently, the lookup in
> > vhost_svq_translate_addr is linear, O(1). This would allow us to use
> > the tree properly.
>
> It uses g_tree_foreach() which I guess is not O(1).
>

I'm sorry, I meant O(N).

> >
> > Option 2b could be to keep them totally separated. So current
> > VhostIOVATree->iova_taddr_map only contains host-only entries, and the
> > new iova_gpa_map containst the guest entries. I don't think this case
> > adds anything except less memory usage, as the gpa map (which should
> > be the fastest) will be the same size. Also, it makes it difficult to
> > implement vhost_iova_tree_map_alloc.
> >
> > Option 3 is to not add new functions but extend the current ones. That
> > would require special values of GPA values to indicate no GPA, like
> > SVQ vrings. I think option 2a is better, but this may help to keep the
> > interface simpler.
> >
> > Option 4 is what I'm proposing in this RFC. To store the GPA as map id
> > so we can tell if the vaddr corresponds to one SVQ IOVA or another.
> > Now I'm having trouble retrieving the memory section I see in the
> > memory listener. It should not be so difficult but. The main advantage
> > is not to duplicate data structs that are already in QEMU, but maybe
> > it is not worth the effort.
> >
> > Going further with this option, we could add a flag to ignore the .id
> > member added. But it adds more and more complexity to the API so I
> > would prefer option 2a for this.
>
> Thanks
>
> >
> > > Thanks
> > >
> > > >
> > > > Thanks!
> > > >
> > > > > Thanks
> > > > >
> > > > > >
> > > > > > > Thanks
> > > > > > >
> > > > > > > >
> > > > > > > > To solve this, track GPA in the DMA entry that acs as unique identifiers
> > > > > > > > to the maps.  When the map needs to be removed, iova tree is able to
> > > > > > > > find the right one.
> > > > > > > >
> > > > > > > > Users that does not go to this extra layer of indirection can use the
> > > > > > > > iova tree as usual, with id = 0.
> > > > > > > >
> > > > > > > > This was found by Si-Wei Liu <si-wei.liu@oracle.com>, but I'm having a hard
> > > > > > > > time to reproduce the issue.  This has been tested only without overlapping
> > > > > > > > maps.  If it works with overlapping maps, it will be intergrated in the main
> > > > > > > > series.
> > > > > > > >
> > > > > > > > Comments are welcome.  Thanks!
> > > > > > > >
> > > > > > > > Eugenio Pérez (2):
> > > > > > > >   iova_tree: add an id member to DMAMap
> > > > > > > >   vdpa: identify aliased maps in iova_tree
> > > > > > > >
> > > > > > > >  hw/virtio/vhost-vdpa.c   | 2 ++
> > > > > > > >  include/qemu/iova-tree.h | 5 +++--
> > > > > > > >  util/iova-tree.c         | 3 ++-
> > > > > > > >  3 files changed, 7 insertions(+), 3 deletions(-)
> > > > > > > >
> > > > > > > > --
> > > > > > > > 2.44.0
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC 0/2] Identify aliased maps in vdpa SVQ iova_tree
  2024-05-09  7:10               ` Eugenio Perez Martin
@ 2024-05-10  4:28                 ` Jason Wang
  2024-05-10  7:16                   ` Eugenio Perez Martin
  0 siblings, 1 reply; 37+ messages in thread
From: Jason Wang @ 2024-05-10  4:28 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: qemu-devel, Si-Wei Liu, Michael S. Tsirkin, Lei Yang, Peter Xu,
	Jonah Palmer, Dragos Tatulea

On Thu, May 9, 2024 at 3:10 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
>
> On Thu, May 9, 2024 at 8:27 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Thu, May 9, 2024 at 1:16 AM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > >
> > > On Wed, May 8, 2024 at 4:29 AM Jason Wang <jasowang@redhat.com> wrote:
> > > >
> > > > On Tue, May 7, 2024 at 6:57 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > > > >
> > > > > On Tue, May 7, 2024 at 9:29 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > >
> > > > > > On Fri, Apr 12, 2024 at 3:56 PM Eugenio Perez Martin
> > > > > > <eperezma@redhat.com> wrote:
> > > > > > >
> > > > > > > On Fri, Apr 12, 2024 at 8:47 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > >
> > > > > > > > On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> > > > > > > > >
> > > > > > > > > The guest may have overlapped memory regions, where different GPA leads
> > > > > > > > > to the same HVA.  This causes a problem when overlapped regions
> > > > > > > > > (different GPA but same translated HVA) exists in the tree, as looking
> > > > > > > > > them by HVA will return them twice.
> > > > > > > >
> > > > > > > > I think I don't understand if there's any side effect for shadow virtqueue?
> > > > > > > >
> > > > > > >
> > > > > > > My bad, I totally forgot to put a reference to where this comes from.
> > > > > > >
> > > > > > > Si-Wei found that during initialization this sequences of maps /
> > > > > > > unmaps happens [1]:
> > > > > > >
> > > > > > > HVA                    GPA                IOVA
> > > > > > > -------------------------------------------------------------------------------------------------------------------------
> > > > > > > Map
> > > > > > > [0x7f7903e00000, 0x7f7983e00000)    [0x0, 0x80000000) [0x1000, 0x80000000)
> > > > > > > [0x7f7983e00000, 0x7f9903e00000)    [0x100000000, 0x2080000000)
> > > > > > > [0x80001000, 0x2000001000)
> > > > > > > [0x7f7903ea0000, 0x7f7903ec0000)    [0xfeda0000, 0xfedc0000)
> > > > > > > [0x2000001000, 0x2000021000)
> > > > > > >
> > > > > > > Unmap
> > > > > > > [0x7f7903ea0000, 0x7f7903ec0000)    [0xfeda0000, 0xfedc0000) [0x1000,
> > > > > > > 0x20000) ???
> > > > > > >
> > > > > > > The third HVA range is contained in the first one, but exposed under a
> > > > > > > different GVA (aliased). This is not "flattened" by QEMU, as GPA does
> > > > > > > not overlap, only HVA.
> > > > > > >
> > > > > > > At the third chunk unmap, the current algorithm finds the first chunk,
> > > > > > > not the second one. This series is the way to tell the difference at
> > > > > > > unmap time.
> > > > > > >
> > > > > > > [1] https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html
> > > > > > >
> > > > > > > Thanks!
> > > > > >
> > > > > > Ok, I was wondering if we need to store GPA(GIOVA) to HVA mappings in
> > > > > > the iova tree to solve this issue completely. Then there won't be
> > > > > > aliasing issues.
> > > > > >
> > > > >
> > > > > I'm ok to explore that route but this has another problem. Both SVQ
> > > > > vrings and CVQ buffers also need to be addressable by VhostIOVATree,
> > > > > and they do not have GPA.
> > > > >
> > > > > At this moment vhost_svq_translate_addr is able to handle this
> > > > > transparently as we translate vaddr to SVQ IOVA. How can we store
> > > > > these new entries? Maybe a (hwaddr)-1 GPA to signal it has no GPA and
> > > > > then a list to go through other entries (SVQ vaddr and CVQ buffers).
> > > >
> > > > This seems to be tricky.
> > > >
> > > > As discussed, it could be another iova tree.
> > > >
> > >
> > > Yes but there are many ways to add another IOVATree. Let me expand & recap.
> > >
> > > Option 1 is to simply add another iova tree to VhostShadowVirtqueue.
> > > Let's call it gpa_iova_tree, as opposed to the current iova_tree that
> > > translates from vaddr to SVQ IOVA. To know which one to use is easy at
> > > adding or removing, like in the memory listener, but how to know at
> > > vhost_svq_translate_addr?
> >
> > Then we won't use virtqueue_pop() at all, we need a SVQ version of
> > virtqueue_pop() to translate GPA to SVQ IOVA directly?
> >
>
> The problem is not virtqueue_pop, that's out of the
> vhost_svq_translate_addr. The problem is the need of adding
> conditionals / complexity in all the callers of
>
> > >
> > > The easiest way for me is to rely on memory_region_from_host(). When
> > > vaddr is from the guest, it returns a valid MemoryRegion. When it is
> > > not, it returns NULL. I'm not sure if this is a valid use case, it
> > > just worked in my tests so far.
> > >
> > > Now we have the second problem: The GPA values of the regions of the
> > > two IOVA tree must be unique. We need to be able to find unallocated
> > > regions in SVQ IOVA. At this moment there is only one IOVATree, so
> > > this is done easily by vhost_iova_tree_map_alloc. But it is very
> > > complicated with two trees.
> >
> > Would it be simpler if we decouple the IOVA allocator? For example, we
> > can have a dedicated gtree to track the allocated IOVA ranges. It is
> > shared by both
> >
> > 1) Guest memory (GPA)
> > 2) SVQ virtqueue and buffers
> >
> > And another gtree to track the GPA to IOVA.
> >
> > The SVQ code could use either
> >
> > 1) one linear mappings that contains both SVQ virtqueue and buffers
> >
> > or
> >
> > 2) dynamic IOVA allocation/deallocation helpers
> >
> > So we don't actually need the third gtree for SVQ HVA -> SVQ IOVA?
> >
>
> That's possible, but that scatters the IOVA handling code instead of
> keeping it self-contained in VhostIOVATree.

To me, the IOVA range/allocation is orthogonal to how IOVA is used.

An example is the iova allocator in the kernel.

Note that there's an even simpler IOVA "allocator" in NVME passthrough
code, not sure it is useful here though (haven't had a deep look at
that).
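
Just to make the comparison concrete, an allocator of that simpler
kind can be a few lines. This is only a sketch of the idea, assuming
long-lived mappings, and not the actual NVMe passthrough code:

#include <stdint.h>

typedef struct {
    uint64_t next;  /* first unallocated IOVA */
    uint64_t limit; /* end of the usable IOVA range (exclusive) */
} BumpAlloc;

/* Returns the allocated IOVA, or UINT64_MAX on exhaustion.
 * 'align' must be a power of two.  Freed ranges are never reused,
 * which is the price of the simplicity. */
static uint64_t bump_alloc(BumpAlloc *a, uint64_t size, uint64_t align)
{
    uint64_t iova = (a->next + align - 1) & ~(align - 1);

    if (iova < a->next || iova >= a->limit || size > a->limit - iova) {
        return UINT64_MAX;
    }
    a->next = iova + size;
    return iova;
}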

Thanks



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC 0/2] Identify aliased maps in vdpa SVQ iova_tree
  2024-05-10  4:28                 ` Jason Wang
@ 2024-05-10  7:16                   ` Eugenio Perez Martin
  2024-05-11  4:00                     ` Jason Wang
  0 siblings, 1 reply; 37+ messages in thread
From: Eugenio Perez Martin @ 2024-05-10  7:16 UTC (permalink / raw)
  To: Jason Wang
  Cc: qemu-devel, Si-Wei Liu, Michael S. Tsirkin, Lei Yang, Peter Xu,
	Jonah Palmer, Dragos Tatulea

On Fri, May 10, 2024 at 6:29 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Thu, May 9, 2024 at 3:10 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> >
> > On Thu, May 9, 2024 at 8:27 AM Jason Wang <jasowang@redhat.com> wrote:
> > >
> > > On Thu, May 9, 2024 at 1:16 AM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > > >
> > > > On Wed, May 8, 2024 at 4:29 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > >
> > > > > On Tue, May 7, 2024 at 6:57 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > > > > >
> > > > > > On Tue, May 7, 2024 at 9:29 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > >
> > > > > > > On Fri, Apr 12, 2024 at 3:56 PM Eugenio Perez Martin
> > > > > > > <eperezma@redhat.com> wrote:
> > > > > > > >
> > > > > > > > On Fri, Apr 12, 2024 at 8:47 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > >
> > > > > > > > > On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> > > > > > > > > >
> > > > > > > > > > The guest may have overlapped memory regions, where different GPA leads
> > > > > > > > > > to the same HVA.  This causes a problem when overlapped regions
> > > > > > > > > > (different GPA but same translated HVA) exists in the tree, as looking
> > > > > > > > > > them by HVA will return them twice.
> > > > > > > > >
> > > > > > > > > I think I don't understand if there's any side effect for shadow virtqueue?
> > > > > > > > >
> > > > > > > >
> > > > > > > > My bad, I totally forgot to put a reference to where this comes from.
> > > > > > > >
> > > > > > > > Si-Wei found that during initialization this sequences of maps /
> > > > > > > > unmaps happens [1]:
> > > > > > > >
> > > > > > > > HVA                    GPA                IOVA
> > > > > > > > -------------------------------------------------------------------------------------------------------------------------
> > > > > > > > Map
> > > > > > > > [0x7f7903e00000, 0x7f7983e00000)    [0x0, 0x80000000) [0x1000, 0x80000000)
> > > > > > > > [0x7f7983e00000, 0x7f9903e00000)    [0x100000000, 0x2080000000)
> > > > > > > > [0x80001000, 0x2000001000)
> > > > > > > > [0x7f7903ea0000, 0x7f7903ec0000)    [0xfeda0000, 0xfedc0000)
> > > > > > > > [0x2000001000, 0x2000021000)
> > > > > > > >
> > > > > > > > Unmap
> > > > > > > > [0x7f7903ea0000, 0x7f7903ec0000)    [0xfeda0000, 0xfedc0000) [0x1000,
> > > > > > > > 0x20000) ???
> > > > > > > >
> > > > > > > > The third HVA range is contained in the first one, but exposed under a
> > > > > > > > different GVA (aliased). This is not "flattened" by QEMU, as GPA does
> > > > > > > > not overlap, only HVA.
> > > > > > > >
> > > > > > > > At the third chunk unmap, the current algorithm finds the first chunk,
> > > > > > > > not the second one. This series is the way to tell the difference at
> > > > > > > > unmap time.
> > > > > > > >
> > > > > > > > [1] https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html
> > > > > > > >
> > > > > > > > Thanks!
> > > > > > >
> > > > > > > Ok, I was wondering if we need to store GPA(GIOVA) to HVA mappings in
> > > > > > > the iova tree to solve this issue completely. Then there won't be
> > > > > > > aliasing issues.
> > > > > > >
> > > > > >
> > > > > > I'm ok to explore that route but this has another problem. Both SVQ
> > > > > > vrings and CVQ buffers also need to be addressable by VhostIOVATree,
> > > > > > and they do not have GPA.
> > > > > >
> > > > > > At this moment vhost_svq_translate_addr is able to handle this
> > > > > > transparently as we translate vaddr to SVQ IOVA. How can we store
> > > > > > these new entries? Maybe a (hwaddr)-1 GPA to signal it has no GPA and
> > > > > > then a list to go through other entries (SVQ vaddr and CVQ buffers).
> > > > >
> > > > > This seems to be tricky.
> > > > >
> > > > > As discussed, it could be another iova tree.
> > > > >
> > > >
> > > > Yes but there are many ways to add another IOVATree. Let me expand & recap.
> > > >
> > > > Option 1 is to simply add another iova tree to VhostShadowVirtqueue.
> > > > Let's call it gpa_iova_tree, as opposed to the current iova_tree that
> > > > translates from vaddr to SVQ IOVA. To know which one to use is easy at
> > > > adding or removing, like in the memory listener, but how to know at
> > > > vhost_svq_translate_addr?
> > >
> > > Then we won't use virtqueue_pop() at all, we need a SVQ version of
> > > virtqueue_pop() to translate GPA to SVQ IOVA directly?
> > >
> >
> > The problem is not virtqueue_pop, that's out of the
> > vhost_svq_translate_addr. The problem is the need of adding
> > conditionals / complexity in all the callers of
> >
> > > >
> > > > The easiest way for me is to rely on memory_region_from_host(). When
> > > > vaddr is from the guest, it returns a valid MemoryRegion. When it is
> > > > not, it returns NULL. I'm not sure if this is a valid use case, it
> > > > just worked in my tests so far.
> > > >
> > > > Now we have the second problem: The GPA values of the regions of the
> > > > two IOVA tree must be unique. We need to be able to find unallocated
> > > > regions in SVQ IOVA. At this moment there is only one IOVATree, so
> > > > this is done easily by vhost_iova_tree_map_alloc. But it is very
> > > > complicated with two trees.
> > >
> > > Would it be simpler if we decouple the IOVA allocator? For example, we
> > > can have a dedicated gtree to track the allocated IOVA ranges. It is
> > > shared by both
> > >
> > > 1) Guest memory (GPA)
> > > 2) SVQ virtqueue and buffers
> > >
> > > And another gtree to track the GPA to IOVA.
> > >
> > > The SVQ code could use either
> > >
> > > 1) one linear mappings that contains both SVQ virtqueue and buffers
> > >
> > > or
> > >
> > > 2) dynamic IOVA allocation/deallocation helpers
> > >
> > > So we don't actually need the third gtree for SVQ HVA -> SVQ IOVA?
> > >
> >
> > That's possible, but that scatters the IOVA handling code instead of
> > keeping it self-contained in VhostIOVATree.
>
> To me, the IOVA range/allocation is orthogonal to how IOVA is used.
>
> An example is the iova allocator in the kernel.
>
> Note that there's an even simpler IOVA "allocator" in NVME passthrough
> code, not sure it is useful here though (haven't had a deep look at
> that).
>

I don't know enough about them to have an opinion. I keep seeing the
drawback of needing to synchronize both allocation & adding in all the
places we want to modify the IOVATree. At this moment, these are the
vhost-vdpa memory listener, the SVQ vring creation and removal, and
net CVQ buffers. But there may be more in the future.

What are the advantages of keeping these separated that justify
needing to synchronize in all these places, compared with keeping them
synchronized in VhostIOVATree?

Thanks!



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC 0/2] Identify aliased maps in vdpa SVQ iova_tree
  2024-05-10  7:16                   ` Eugenio Perez Martin
@ 2024-05-11  4:00                     ` Jason Wang
  2024-05-13  6:27                       ` Eugenio Perez Martin
  0 siblings, 1 reply; 37+ messages in thread
From: Jason Wang @ 2024-05-11  4:00 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: qemu-devel, Si-Wei Liu, Michael S. Tsirkin, Lei Yang, Peter Xu,
	Jonah Palmer, Dragos Tatulea

On Fri, May 10, 2024 at 3:16 PM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> On Fri, May 10, 2024 at 6:29 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Thu, May 9, 2024 at 3:10 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > >
> > > On Thu, May 9, 2024 at 8:27 AM Jason Wang <jasowang@redhat.com> wrote:
> > > >
> > > > On Thu, May 9, 2024 at 1:16 AM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > > > >
> > > > > On Wed, May 8, 2024 at 4:29 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > >
> > > > > > On Tue, May 7, 2024 at 6:57 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > > > > > >
> > > > > > > On Tue, May 7, 2024 at 9:29 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > >
> > > > > > > > On Fri, Apr 12, 2024 at 3:56 PM Eugenio Perez Martin
> > > > > > > > <eperezma@redhat.com> wrote:
> > > > > > > > >
> > > > > > > > > On Fri, Apr 12, 2024 at 8:47 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > The guest may have overlapped memory regions, where different GPA leads
> > > > > > > > > > > to the same HVA.  This causes a problem when overlapped regions
> > > > > > > > > > > (different GPA but same translated HVA) exists in the tree, as looking
> > > > > > > > > > > them by HVA will return them twice.
> > > > > > > > > >
> > > > > > > > > > I think I don't understand if there's any side effect for shadow virtqueue?
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > My bad, I totally forgot to put a reference to where this comes from.
> > > > > > > > >
> > > > > > > > > Si-Wei found that during initialization this sequences of maps /
> > > > > > > > > unmaps happens [1]:
> > > > > > > > >
> > > > > > > > > HVA                    GPA                IOVA
> > > > > > > > > -------------------------------------------------------------------------------------------------------------------------
> > > > > > > > > Map
> > > > > > > > > [0x7f7903e00000, 0x7f7983e00000)    [0x0, 0x80000000) [0x1000, 0x80000000)
> > > > > > > > > [0x7f7983e00000, 0x7f9903e00000)    [0x100000000, 0x2080000000)
> > > > > > > > > [0x80001000, 0x2000001000)
> > > > > > > > > [0x7f7903ea0000, 0x7f7903ec0000)    [0xfeda0000, 0xfedc0000)
> > > > > > > > > [0x2000001000, 0x2000021000)
> > > > > > > > >
> > > > > > > > > Unmap
> > > > > > > > > [0x7f7903ea0000, 0x7f7903ec0000)    [0xfeda0000, 0xfedc0000) [0x1000,
> > > > > > > > > 0x20000) ???
> > > > > > > > >
> > > > > > > > > The third HVA range is contained in the first one, but exposed under a
> > > > > > > > > different GVA (aliased). This is not "flattened" by QEMU, as GPA does
> > > > > > > > > not overlap, only HVA.
> > > > > > > > >
> > > > > > > > > At the third chunk unmap, the current algorithm finds the first chunk,
> > > > > > > > > not the second one. This series is the way to tell the difference at
> > > > > > > > > unmap time.
> > > > > > > > >
> > > > > > > > > [1] https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html
> > > > > > > > >
> > > > > > > > > Thanks!
> > > > > > > >
> > > > > > > > Ok, I was wondering if we need to store GPA(GIOVA) to HVA mappings in
> > > > > > > > the iova tree to solve this issue completely. Then there won't be
> > > > > > > > aliasing issues.
> > > > > > > >
> > > > > > >
> > > > > > > I'm ok to explore that route but this has another problem. Both SVQ
> > > > > > > vrings and CVQ buffers also need to be addressable by VhostIOVATree,
> > > > > > > and they do not have GPA.
> > > > > > >
> > > > > > > At this moment vhost_svq_translate_addr is able to handle this
> > > > > > > transparently as we translate vaddr to SVQ IOVA. How can we store
> > > > > > > these new entries? Maybe a (hwaddr)-1 GPA to signal it has no GPA and
> > > > > > > then a list to go through other entries (SVQ vaddr and CVQ buffers).
> > > > > >
> > > > > > This seems to be tricky.
> > > > > >
> > > > > > As discussed, it could be another iova tree.
> > > > > >
> > > > >
> > > > > Yes but there are many ways to add another IOVATree. Let me expand & recap.
> > > > >
> > > > > Option 1 is to simply add another iova tree to VhostShadowVirtqueue.
> > > > > Let's call it gpa_iova_tree, as opposed to the current iova_tree that
> > > > > translates from vaddr to SVQ IOVA. To know which one to use is easy at
> > > > > adding or removing, like in the memory listener, but how to know at
> > > > > vhost_svq_translate_addr?
> > > >
> > > > Then we won't use virtqueue_pop() at all, we need a SVQ version of
> > > > virtqueue_pop() to translate GPA to SVQ IOVA directly?
> > > >
> > >
> > > The problem is not virtqueue_pop, that's out of the
> > > vhost_svq_translate_addr. The problem is the need of adding
> > > conditionals / complexity in all the callers of
> > >
> > > > >
> > > > > The easiest way for me is to rely on memory_region_from_host(). When
> > > > > vaddr is from the guest, it returns a valid MemoryRegion. When it is
> > > > > not, it returns NULL. I'm not sure if this is a valid use case, it
> > > > > just worked in my tests so far.
> > > > >
> > > > > Now we have the second problem: The GPA values of the regions of the
> > > > > two IOVA tree must be unique. We need to be able to find unallocated
> > > > > regions in SVQ IOVA. At this moment there is only one IOVATree, so
> > > > > this is done easily by vhost_iova_tree_map_alloc. But it is very
> > > > > complicated with two trees.
> > > >
> > > > Would it be simpler if we decouple the IOVA allocator? For example, we
> > > > can have a dedicated gtree to track the allocated IOVA ranges. It is
> > > > shared by both
> > > >
> > > > 1) Guest memory (GPA)
> > > > 2) SVQ virtqueue and buffers
> > > >
> > > > And another gtree to track the GPA to IOVA.
> > > >
> > > > The SVQ code could use either
> > > >
> > > > 1) one linear mappings that contains both SVQ virtqueue and buffers
> > > >
> > > > or
> > > >
> > > > 2) dynamic IOVA allocation/deallocation helpers
> > > >
> > > > So we don't actually need the third gtree for SVQ HVA -> SVQ IOVA?
> > > >
> > >
> > > That's possible, but that scatters the IOVA handling code instead of
> > > keeping it self-contained in VhostIOVATree.
> >
> > To me, the IOVA range/allocation is orthogonal to how IOVA is used.
> >
> > An example is the iova allocator in the kernel.
> >
> > Note that there's an even simpler IOVA "allocator" in NVME passthrough
> > code, not sure it is useful here though (haven't had a deep look at
> > that).
> >
>
> I don't know enough about them to have an opinion. I keep seeing the
> drawback of needing to synchronize both allocation & adding in all the
> places we want to modify the IOVATree. At this moment, these are the
> vhost-vdpa memory listener, the SVQ vring creation and removal, and
> net CVQ buffers. But it may be more in the future.
>
> What are the advantages of keeping these separated that justifies
> needing to synchronize in all these places, compared with keeping them
> synchronized in VhostIOVATree?

It doesn't need to be synchronized.

Assuming guest and SVQ share the IOVA range, the allocator only needs
to track which part of the range has been used.

This simplifies things: we can store GPA -> IOVA mappings and SVQ ->
IOVA mappings separately.

Thanks

>
> Thanks!
>



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC 0/2] Identify aliased maps in vdpa SVQ iova_tree
  2024-05-11  4:00                     ` Jason Wang
@ 2024-05-13  6:27                       ` Eugenio Perez Martin
  2024-05-13  8:28                         ` Jason Wang
  0 siblings, 1 reply; 37+ messages in thread
From: Eugenio Perez Martin @ 2024-05-13  6:27 UTC (permalink / raw)
  To: Jason Wang
  Cc: qemu-devel, Si-Wei Liu, Michael S. Tsirkin, Lei Yang, Peter Xu,
	Jonah Palmer, Dragos Tatulea

On Sat, May 11, 2024 at 6:07 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Fri, May 10, 2024 at 3:16 PM Eugenio Perez Martin
> <eperezma@redhat.com> wrote:
> >
> > On Fri, May 10, 2024 at 6:29 AM Jason Wang <jasowang@redhat.com> wrote:
> > >
> > > On Thu, May 9, 2024 at 3:10 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > > >
> > > > On Thu, May 9, 2024 at 8:27 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > >
> > > > > On Thu, May 9, 2024 at 1:16 AM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > > > > >
> > > > > > On Wed, May 8, 2024 at 4:29 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > >
> > > > > > > On Tue, May 7, 2024 at 6:57 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > > > > > > >
> > > > > > > > On Tue, May 7, 2024 at 9:29 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > >
> > > > > > > > > On Fri, Apr 12, 2024 at 3:56 PM Eugenio Perez Martin
> > > > > > > > > <eperezma@redhat.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Fri, Apr 12, 2024 at 8:47 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > The guest may have overlapped memory regions, where different GPA leads
> > > > > > > > > > > > to the same HVA.  This causes a problem when overlapped regions
> > > > > > > > > > > > (different GPA but same translated HVA) exists in the tree, as looking
> > > > > > > > > > > > them by HVA will return them twice.
> > > > > > > > > > >
> > > > > > > > > > > I think I don't understand if there's any side effect for shadow virtqueue?
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > My bad, I totally forgot to put a reference to where this comes from.
> > > > > > > > > >
> > > > > > > > > > Si-Wei found that during initialization this sequences of maps /
> > > > > > > > > > unmaps happens [1]:
> > > > > > > > > >
> > > > > > > > > > HVA                    GPA                IOVA
> > > > > > > > > > -------------------------------------------------------------------------------------------------------------------------
> > > > > > > > > > Map
> > > > > > > > > > [0x7f7903e00000, 0x7f7983e00000)    [0x0, 0x80000000) [0x1000, 0x80000000)
> > > > > > > > > > [0x7f7983e00000, 0x7f9903e00000)    [0x100000000, 0x2080000000)
> > > > > > > > > > [0x80001000, 0x2000001000)
> > > > > > > > > > [0x7f7903ea0000, 0x7f7903ec0000)    [0xfeda0000, 0xfedc0000)
> > > > > > > > > > [0x2000001000, 0x2000021000)
> > > > > > > > > >
> > > > > > > > > > Unmap
> > > > > > > > > > [0x7f7903ea0000, 0x7f7903ec0000)    [0xfeda0000, 0xfedc0000) [0x1000,
> > > > > > > > > > 0x20000) ???
> > > > > > > > > >
> > > > > > > > > > The third HVA range is contained in the first one, but exposed under a
> > > > > > > > > > different GVA (aliased). This is not "flattened" by QEMU, as GPA does
> > > > > > > > > > not overlap, only HVA.
> > > > > > > > > >
> > > > > > > > > > At the third chunk unmap, the current algorithm finds the first chunk,
> > > > > > > > > > not the second one. This series is the way to tell the difference at
> > > > > > > > > > unmap time.
> > > > > > > > > >
> > > > > > > > > > [1] https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html
> > > > > > > > > >
> > > > > > > > > > Thanks!
> > > > > > > > >
> > > > > > > > > Ok, I was wondering if we need to store GPA(GIOVA) to HVA mappings in
> > > > > > > > > the iova tree to solve this issue completely. Then there won't be
> > > > > > > > > aliasing issues.
> > > > > > > > >
> > > > > > > >
> > > > > > > > I'm ok to explore that route but this has another problem. Both SVQ
> > > > > > > > vrings and CVQ buffers also need to be addressable by VhostIOVATree,
> > > > > > > > and they do not have GPA.
> > > > > > > >
> > > > > > > > At this moment vhost_svq_translate_addr is able to handle this
> > > > > > > > transparently as we translate vaddr to SVQ IOVA. How can we store
> > > > > > > > these new entries? Maybe a (hwaddr)-1 GPA to signal it has no GPA and
> > > > > > > > then a list to go through other entries (SVQ vaddr and CVQ buffers).
> > > > > > >
> > > > > > > This seems to be tricky.
> > > > > > >
> > > > > > > As discussed, it could be another iova tree.
> > > > > > >
> > > > > >
> > > > > > Yes but there are many ways to add another IOVATree. Let me expand & recap.
> > > > > >
> > > > > > Option 1 is to simply add another iova tree to VhostShadowVirtqueue.
> > > > > > Let's call it gpa_iova_tree, as opposed to the current iova_tree that
> > > > > > translates from vaddr to SVQ IOVA. To know which one to use is easy at
> > > > > > adding or removing, like in the memory listener, but how to know at
> > > > > > vhost_svq_translate_addr?
> > > > >
> > > > > Then we won't use virtqueue_pop() at all, we need a SVQ version of
> > > > > virtqueue_pop() to translate GPA to SVQ IOVA directly?
> > > > >
> > > >
> > > > The problem is not virtqueue_pop, that's out of the
> > > > vhost_svq_translate_addr. The problem is the need of adding
> > > > conditionals / complexity in all the callers of
> > > >
> > > > > >
> > > > > > The easiest way for me is to rely on memory_region_from_host(). When
> > > > > > vaddr is from the guest, it returns a valid MemoryRegion. When it is
> > > > > > not, it returns NULL. I'm not sure if this is a valid use case, it
> > > > > > just worked in my tests so far.
> > > > > >
> > > > > > Now we have the second problem: The GPA values of the regions of the
> > > > > > two IOVA tree must be unique. We need to be able to find unallocated
> > > > > > regions in SVQ IOVA. At this moment there is only one IOVATree, so
> > > > > > this is done easily by vhost_iova_tree_map_alloc. But it is very
> > > > > > complicated with two trees.
> > > > >
> > > > > Would it be simpler if we decouple the IOVA allocator? For example, we
> > > > > can have a dedicated gtree to track the allocated IOVA ranges. It is
> > > > > shared by both
> > > > >
> > > > > 1) Guest memory (GPA)
> > > > > 2) SVQ virtqueue and buffers
> > > > >
> > > > > And another gtree to track the GPA to IOVA.
> > > > >
> > > > > The SVQ code could use either
> > > > >
> > > > > 1) one linear mappings that contains both SVQ virtqueue and buffers
> > > > >
> > > > > or
> > > > >
> > > > > 2) dynamic IOVA allocation/deallocation helpers
> > > > >
> > > > > So we don't actually need the third gtree for SVQ HVA -> SVQ IOVA?
> > > > >
> > > >
> > > > That's possible, but that scatters the IOVA handling code instead of
> > > > keeping it self-contained in VhostIOVATree.
> > >
> > > To me, the IOVA range/allocation is orthogonal to how IOVA is used.
> > >
> > > An example is the iova allocator in the kernel.
> > >
> > > Note that there's an even simpler IOVA "allocator" in NVME passthrough
> > > code, not sure it is useful here though (haven't had a deep look at
> > > that).
> > >
> >
> > I don't know enough about them to have an opinion. I keep seeing the
> > drawback of needing to synchronize both allocation & adding in all the
> > places we want to modify the IOVATree. At this moment, these are the
> > vhost-vdpa memory listener, the SVQ vring creation and removal, and
> > net CVQ buffers. But it may be more in the future.
> >
> > What are the advantages of keeping these separated that justifies
> > needing to synchronize in all these places, compared with keeping them
> > synchronized in VhostIOVATree?
>
> It doesn't need to be synchronized.
>
> Assuming guest and SVQ shares IOVA range. IOVA only needs to track
> which part of the range has been used.
>

Not sure if I follow; that is what I mean by "synchronized".

> This simplifies things, we can store GPA->IOVA mappings and SVQ ->
> IOVA mappings separately.
>

Sorry, I still cannot see the whole picture :).

Let's assume we have all the GPAs mapped to specific IOVA regions, so
we have the first IOVA tree (GPA -> IOVA) filled. Now we enable SVQ
because of the migration. How can we know where we can place SVQ
vrings without having them synchronized?

At this moment we're using a tree. The tree nature of the current SVQ
IOVA -> VA makes all nodes ordered so it is more or less easy to look
for holes.

Now your proposal uses the SVQ IOVA as tree values. Should we iterate
over all of them in the two trees, order them, and then look for
holes there?
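
For reference, this is the kind of walk I mean: with the in-use
ranges kept ordered, finding a hole is a single pass. A self-contained
sketch, not the QEMU implementation:

#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t start, end; /* in-use range [start, end) */
} Range;

/* 'used' must be sorted by start and non-overlapping; returns the
 * first IOVA in [lo, hi) with 'size' free bytes, or UINT64_MAX. */
static uint64_t find_hole(const Range *used, size_t n,
                          uint64_t lo, uint64_t hi, uint64_t size)
{
    uint64_t candidate = lo;

    for (size_t i = 0; i < n; i++) {
        if (used[i].start >= candidate &&
            used[i].start - candidate >= size) {
            return candidate;
        }
        if (used[i].end > candidate) {
            candidate = used[i].end;
        }
    }
    return (candidate <= hi && hi - candidate >= size) ? candidate
                                                       : UINT64_MAX;
}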

> Thanks
>
> >
> > Thanks!
> >
>



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC 0/2] Identify aliased maps in vdpa SVQ iova_tree
  2024-05-13  6:27                       ` Eugenio Perez Martin
@ 2024-05-13  8:28                         ` Jason Wang
  2024-05-13  9:56                           ` Eugenio Perez Martin
  0 siblings, 1 reply; 37+ messages in thread
From: Jason Wang @ 2024-05-13  8:28 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: qemu-devel, Si-Wei Liu, Michael S. Tsirkin, Lei Yang, Peter Xu,
	Jonah Palmer, Dragos Tatulea

On Mon, May 13, 2024 at 2:28 PM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> On Sat, May 11, 2024 at 6:07 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Fri, May 10, 2024 at 3:16 PM Eugenio Perez Martin
> > <eperezma@redhat.com> wrote:
> > >
> > > On Fri, May 10, 2024 at 6:29 AM Jason Wang <jasowang@redhat.com> wrote:
> > > >
> > > > On Thu, May 9, 2024 at 3:10 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > > > >
> > > > > On Thu, May 9, 2024 at 8:27 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > >
> > > > > > On Thu, May 9, 2024 at 1:16 AM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > > > > > >
> > > > > > > On Wed, May 8, 2024 at 4:29 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > >
> > > > > > > > On Tue, May 7, 2024 at 6:57 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > > > > > > > >
> > > > > > > > > On Tue, May 7, 2024 at 9:29 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Fri, Apr 12, 2024 at 3:56 PM Eugenio Perez Martin
> > > > > > > > > > <eperezma@redhat.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Apr 12, 2024 at 8:47 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > The guest may have overlapped memory regions, where different GPA leads
> > > > > > > > > > > > > to the same HVA.  This causes a problem when overlapped regions
> > > > > > > > > > > > > (different GPA but same translated HVA) exists in the tree, as looking
> > > > > > > > > > > > > them by HVA will return them twice.
> > > > > > > > > > > >
> > > > > > > > > > > > I think I don't understand if there's any side effect for shadow virtqueue?
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > My bad, I totally forgot to put a reference to where this comes from.
> > > > > > > > > > >
> > > > > > > > > > > Si-Wei found that during initialization this sequences of maps /
> > > > > > > > > > > unmaps happens [1]:
> > > > > > > > > > >
> > > > > > > > > > > HVA                    GPA                IOVA
> > > > > > > > > > > -------------------------------------------------------------------------------------------------------------------------
> > > > > > > > > > > Map
> > > > > > > > > > > [0x7f7903e00000, 0x7f7983e00000)    [0x0, 0x80000000) [0x1000, 0x80000000)
> > > > > > > > > > > [0x7f7983e00000, 0x7f9903e00000)    [0x100000000, 0x2080000000)
> > > > > > > > > > > [0x80001000, 0x2000001000)
> > > > > > > > > > > [0x7f7903ea0000, 0x7f7903ec0000)    [0xfeda0000, 0xfedc0000)
> > > > > > > > > > > [0x2000001000, 0x2000021000)
> > > > > > > > > > >
> > > > > > > > > > > Unmap
> > > > > > > > > > > [0x7f7903ea0000, 0x7f7903ec0000)    [0xfeda0000, 0xfedc0000) [0x1000,
> > > > > > > > > > > 0x20000) ???
> > > > > > > > > > >
> > > > > > > > > > > The third HVA range is contained in the first one, but exposed under a
> > > > > > > > > > > different GVA (aliased). This is not "flattened" by QEMU, as GPA does
> > > > > > > > > > > not overlap, only HVA.
> > > > > > > > > > >
> > > > > > > > > > > At the third chunk unmap, the current algorithm finds the first chunk,
> > > > > > > > > > > not the second one. This series is the way to tell the difference at
> > > > > > > > > > > unmap time.
> > > > > > > > > > >
> > > > > > > > > > > [1] https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html
> > > > > > > > > > >
> > > > > > > > > > > Thanks!
> > > > > > > > > >
> > > > > > > > > > Ok, I was wondering if we need to store GPA(GIOVA) to HVA mappings in
> > > > > > > > > > the iova tree to solve this issue completely. Then there won't be
> > > > > > > > > > aliasing issues.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > I'm ok to explore that route but this has another problem. Both SVQ
> > > > > > > > > vrings and CVQ buffers also need to be addressable by VhostIOVATree,
> > > > > > > > > and they do not have GPA.
> > > > > > > > >
> > > > > > > > > At this moment vhost_svq_translate_addr is able to handle this
> > > > > > > > > transparently as we translate vaddr to SVQ IOVA. How can we store
> > > > > > > > > these new entries? Maybe a (hwaddr)-1 GPA to signal it has no GPA and
> > > > > > > > > then a list to go through other entries (SVQ vaddr and CVQ buffers).
> > > > > > > >
> > > > > > > > This seems to be tricky.
> > > > > > > >
> > > > > > > > As discussed, it could be another iova tree.
> > > > > > > >
> > > > > > >
> > > > > > > Yes but there are many ways to add another IOVATree. Let me expand & recap.
> > > > > > >
> > > > > > > Option 1 is to simply add another iova tree to VhostShadowVirtqueue.
> > > > > > > Let's call it gpa_iova_tree, as opposed to the current iova_tree that
> > > > > > > translates from vaddr to SVQ IOVA. To know which one to use is easy at
> > > > > > > adding or removing, like in the memory listener, but how to know at
> > > > > > > vhost_svq_translate_addr?
> > > > > >
> > > > > > Then we won't use virtqueue_pop() at all, we need a SVQ version of
> > > > > > virtqueue_pop() to translate GPA to SVQ IOVA directly?
> > > > > >
> > > > >
> > > > > The problem is not virtqueue_pop, that's out of the
> > > > > vhost_svq_translate_addr. The problem is the need of adding
> > > > > conditionals / complexity in all the callers of
> > > > >
> > > > > > >
> > > > > > > The easiest way for me is to rely on memory_region_from_host(). When
> > > > > > > vaddr is from the guest, it returns a valid MemoryRegion. When it is
> > > > > > > not, it returns NULL. I'm not sure if this is a valid use case, it
> > > > > > > just worked in my tests so far.
> > > > > > >
> > > > > > > Now we have the second problem: The GPA values of the regions of the
> > > > > > > two IOVA tree must be unique. We need to be able to find unallocated
> > > > > > > regions in SVQ IOVA. At this moment there is only one IOVATree, so
> > > > > > > this is done easily by vhost_iova_tree_map_alloc. But it is very
> > > > > > > complicated with two trees.
> > > > > >
> > > > > > Would it be simpler if we decouple the IOVA allocator? For example, we
> > > > > > can have a dedicated gtree to track the allocated IOVA ranges. It is
> > > > > > shared by both
> > > > > >
> > > > > > 1) Guest memory (GPA)
> > > > > > 2) SVQ virtqueue and buffers
> > > > > >
> > > > > > And another gtree to track the GPA to IOVA.
> > > > > >
> > > > > > The SVQ code could use either
> > > > > >
> > > > > > 1) one linear mappings that contains both SVQ virtqueue and buffers
> > > > > >
> > > > > > or
> > > > > >
> > > > > > 2) dynamic IOVA allocation/deallocation helpers
> > > > > >
> > > > > > So we don't actually need the third gtree for SVQ HVA -> SVQ IOVA?
> > > > > >
> > > > >
> > > > > That's possible, but that scatters the IOVA handling code instead of
> > > > > keeping it self-contained in VhostIOVATree.
> > > >
> > > > To me, the IOVA range/allocation is orthogonal to how IOVA is used.
> > > >
> > > > An example is the iova allocator in the kernel.
> > > >
> > > > Note that there's an even simpler IOVA "allocator" in NVME passthrough
> > > > code, not sure it is useful here though (haven't had a deep look at
> > > > that).
> > > >
> > >
> > > I don't know enough about them to have an opinion. I keep seeing the
> > > drawback of needing to synchronize both allocation & adding in all the
> > > places we want to modify the IOVATree. At this moment, these are the
> > > vhost-vdpa memory listener, the SVQ vring creation and removal, and
> > > net CVQ buffers. But it may be more in the future.
> > >
> > > What are the advantages of keeping these separated that justifies
> > > needing to synchronize in all these places, compared with keeping them
> > > synchronized in VhostIOVATree?
> >
> > It doesn't need to be synchronized.
> >
> > Assuming guest and SVQ shares IOVA range. IOVA only needs to track
> > which part of the range has been used.
> >
>
> Not sure if I follow, that is what I mean with "synchronized".

Oh right.

>
> > This simplifies things, we can store GPA->IOVA mappings and SVQ ->
> > IOVA mappings separately.
> >
>
> Sorry, I still cannot see the whole picture :).
>
> Let's assume we have all the GPA mapped to specific IOVA regions, so
> we have the first IOVA tree (GPA -> IOVA) filled. Now we enable SVQ
> because of the migration. How can we know where we can place SVQ
> vrings without having them synchronized?

Just allocating a new IOVA range for SVQ?

>
> At this moment we're using a tree. The tree nature of the current SVQ
> IOVA -> VA makes all nodes ordered so it is more or less easy to look
> for holes.

Yes, IOVA allocation could still be implemented via a tree.

>
> Now your proposal uses the SVQ IOVA as tree values. Should we iterate
> over all of them, order them, of the two trees, and then look for
> holes there?

Let me clarify, correct me if I'm wrong (a rough sketch follows):

1) The IOVA allocator is still implemented via a tree; we just don't
need to store how the IOVA is used.
2) A dedicated GPA -> IOVA tree, updated via the listeners and used in
the datapath SVQ translation.
3) A linear mapping or another SVQ -> IOVA tree used for SVQ.
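
In data-structure terms, something like this; the names are invented
here just to illustrate, it is not QEMU code:

#include <glib.h>

typedef struct {
    /* 1) allocator: tracks which IOVA ranges are taken, nothing else */
    GTree *iova_allocator;

    /* 2) guest memory, kept in sync by the memory listener and
     *    consulted by the datapath SVQ translation */
    GTree *gpa_to_iova;

    /* 3) SVQ vrings / CVQ buffers; could also be one linear range
     *    reserved at setup time instead of a tree */
    GTree *svq_hva_to_iova;
} SplitIOVAState;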

Thanks

>
> > Thanks
> >
> > >
> > > Thanks!
> > >
> >
>



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC 0/2] Identify aliased maps in vdpa SVQ iova_tree
  2024-05-13  8:28                         ` Jason Wang
@ 2024-05-13  9:56                           ` Eugenio Perez Martin
  2024-05-14  3:56                             ` Jason Wang
  0 siblings, 1 reply; 37+ messages in thread
From: Eugenio Perez Martin @ 2024-05-13  9:56 UTC (permalink / raw)
  To: Jason Wang
  Cc: qemu-devel, Si-Wei Liu, Michael S. Tsirkin, Lei Yang, Peter Xu,
	Jonah Palmer, Dragos Tatulea

On Mon, May 13, 2024 at 10:28 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Mon, May 13, 2024 at 2:28 PM Eugenio Perez Martin
> <eperezma@redhat.com> wrote:
> >
> > On Sat, May 11, 2024 at 6:07 AM Jason Wang <jasowang@redhat.com> wrote:
> > >
> > > On Fri, May 10, 2024 at 3:16 PM Eugenio Perez Martin
> > > <eperezma@redhat.com> wrote:
> > > >
> > > > On Fri, May 10, 2024 at 6:29 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > >
> > > > > On Thu, May 9, 2024 at 3:10 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > > > > >
> > > > > > On Thu, May 9, 2024 at 8:27 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > >
> > > > > > > On Thu, May 9, 2024 at 1:16 AM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > > > > > > >
> > > > > > > > On Wed, May 8, 2024 at 4:29 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > >
> > > > > > > > > On Tue, May 7, 2024 at 6:57 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Tue, May 7, 2024 at 9:29 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Apr 12, 2024 at 3:56 PM Eugenio Perez Martin
> > > > > > > > > > > <eperezma@redhat.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > On Fri, Apr 12, 2024 at 8:47 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > The guest may have overlapped memory regions, where different GPA leads
> > > > > > > > > > > > > > to the same HVA.  This causes a problem when overlapped regions
> > > > > > > > > > > > > > (different GPA but same translated HVA) exists in the tree, as looking
> > > > > > > > > > > > > > them by HVA will return them twice.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I think I don't understand if there's any side effect for shadow virtqueue?
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > My bad, I totally forgot to put a reference to where this comes from.
> > > > > > > > > > > >
> > > > > > > > > > > > Si-Wei found that during initialization this sequences of maps /
> > > > > > > > > > > > unmaps happens [1]:
> > > > > > > > > > > >
> > > > > > > > > > > > HVA                    GPA                IOVA
> > > > > > > > > > > > -------------------------------------------------------------------------------------------------------------------------
> > > > > > > > > > > > Map
> > > > > > > > > > > > [0x7f7903e00000, 0x7f7983e00000)    [0x0, 0x80000000) [0x1000, 0x80000000)
> > > > > > > > > > > > [0x7f7983e00000, 0x7f9903e00000)    [0x100000000, 0x2080000000)
> > > > > > > > > > > > [0x80001000, 0x2000001000)
> > > > > > > > > > > > [0x7f7903ea0000, 0x7f7903ec0000)    [0xfeda0000, 0xfedc0000)
> > > > > > > > > > > > [0x2000001000, 0x2000021000)
> > > > > > > > > > > >
> > > > > > > > > > > > Unmap
> > > > > > > > > > > > [0x7f7903ea0000, 0x7f7903ec0000)    [0xfeda0000, 0xfedc0000) [0x1000,
> > > > > > > > > > > > 0x20000) ???
> > > > > > > > > > > >
> > > > > > > > > > > > The third HVA range is contained in the first one, but exposed under a
> > > > > > > > > > > > different GVA (aliased). This is not "flattened" by QEMU, as GPA does
> > > > > > > > > > > > not overlap, only HVA.
> > > > > > > > > > > >
> > > > > > > > > > > > At the third chunk unmap, the current algorithm finds the first chunk,
> > > > > > > > > > > > not the second one. This series is the way to tell the difference at
> > > > > > > > > > > > unmap time.
> > > > > > > > > > > >
> > > > > > > > > > > > [1] https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks!
> > > > > > > > > > >
> > > > > > > > > > > Ok, I was wondering if we need to store GPA(GIOVA) to HVA mappings in
> > > > > > > > > > > the iova tree to solve this issue completely. Then there won't be
> > > > > > > > > > > aliasing issues.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > I'm ok to explore that route but this has another problem. Both SVQ
> > > > > > > > > > vrings and CVQ buffers also need to be addressable by VhostIOVATree,
> > > > > > > > > > and they do not have GPA.
> > > > > > > > > >
> > > > > > > > > > At this moment vhost_svq_translate_addr is able to handle this
> > > > > > > > > > transparently as we translate vaddr to SVQ IOVA. How can we store
> > > > > > > > > > these new entries? Maybe a (hwaddr)-1 GPA to signal it has no GPA and
> > > > > > > > > > then a list to go through other entries (SVQ vaddr and CVQ buffers).
> > > > > > > > >
> > > > > > > > > This seems to be tricky.
> > > > > > > > >
> > > > > > > > > As discussed, it could be another iova tree.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Yes but there are many ways to add another IOVATree. Let me expand & recap.
> > > > > > > >
> > > > > > > > Option 1 is to simply add another iova tree to VhostShadowVirtqueue.
> > > > > > > > Let's call it gpa_iova_tree, as opposed to the current iova_tree that
> > > > > > > > translates from vaddr to SVQ IOVA. To know which one to use is easy at
> > > > > > > > adding or removing, like in the memory listener, but how to know at
> > > > > > > > vhost_svq_translate_addr?
> > > > > > >
> > > > > > > Then we won't use virtqueue_pop() at all, we need a SVQ version of
> > > > > > > virtqueue_pop() to translate GPA to SVQ IOVA directly?
> > > > > > >
> > > > > >
> > > > > > The problem is not virtqueue_pop, that's out of the
> > > > > > vhost_svq_translate_addr. The problem is the need of adding
> > > > > > conditionals / complexity in all the callers of
> > > > > >
> > > > > > > >
> > > > > > > > The easiest way for me is to rely on memory_region_from_host(). When
> > > > > > > > vaddr is from the guest, it returns a valid MemoryRegion. When it is
> > > > > > > > not, it returns NULL. I'm not sure if this is a valid use case, it
> > > > > > > > just worked in my tests so far.
> > > > > > > >
> > > > > > > > Now we have the second problem: The GPA values of the regions of the
> > > > > > > > two IOVA tree must be unique. We need to be able to find unallocated
> > > > > > > > regions in SVQ IOVA. At this moment there is only one IOVATree, so
> > > > > > > > this is done easily by vhost_iova_tree_map_alloc. But it is very
> > > > > > > > complicated with two trees.
> > > > > > >
> > > > > > > Would it be simpler if we decouple the IOVA allocator? For example, we
> > > > > > > can have a dedicated gtree to track the allocated IOVA ranges. It is
> > > > > > > shared by both
> > > > > > >
> > > > > > > 1) Guest memory (GPA)
> > > > > > > 2) SVQ virtqueue and buffers
> > > > > > >
> > > > > > > And another gtree to track the GPA to IOVA.
> > > > > > >
> > > > > > > The SVQ code could use either
> > > > > > >
> > > > > > > 1) one linear mappings that contains both SVQ virtqueue and buffers
> > > > > > >
> > > > > > > or
> > > > > > >
> > > > > > > 2) dynamic IOVA allocation/deallocation helpers
> > > > > > >
> > > > > > > So we don't actually need the third gtree for SVQ HVA -> SVQ IOVA?
> > > > > > >
> > > > > >
> > > > > > That's possible, but that scatters the IOVA handling code instead of
> > > > > > keeping it self-contained in VhostIOVATree.
> > > > >
> > > > > To me, the IOVA range/allocation is orthogonal to how IOVA is used.
> > > > >
> > > > > An example is the iova allocator in the kernel.
> > > > >
> > > > > Note that there's an even simpler IOVA "allocator" in NVME passthrough
> > > > > code, not sure it is useful here though (haven't had a deep look at
> > > > > that).
> > > > >
> > > >
> > > > I don't know enough about them to have an opinion. I keep seeing the
> > > > drawback of needing to synchronize both allocation & adding in all the
> > > > places we want to modify the IOVATree. At this moment, these are the
> > > > vhost-vdpa memory listener, the SVQ vring creation and removal, and
> > > > net CVQ buffers. But it may be more in the future.
> > > >
> > > > What are the advantages of keeping these separated that justifies
> > > > needing to synchronize in all these places, compared with keeping them
> > > > synchronized in VhostIOVATree?
> > >
> > > It doesn't need to be synchronized.
> > >
> > > Assuming the guest and SVQ share the IOVA range, the allocator only needs
> > > to track which part of the range has been used.
> > >
> >
> > I'm not sure I follow; that is what I mean by "synchronized".
>
> Oh right.
>
> >
> > > This simplifies things: we can store GPA -> IOVA mappings and SVQ ->
> > > IOVA mappings separately.
> > >
> >
> > Sorry, I still cannot see the whole picture :).
> >
> > Let's assume we have all the GPAs mapped to specific IOVA regions, so
> > we have the first IOVA tree (GPA -> IOVA) filled. Now we enable SVQ
> > because of the migration. How can we know where we can place SVQ
> > vrings without having them synchronized?
>
> Just allocating a new IOVA range for SVQ?
>
> >
> > At this moment we're using a tree. The tree nature of the current SVQ
> > IOVA -> VA map keeps all nodes ordered, so it is more or less easy to
> > look for holes.
>
> Yes, iova allocate could still be implemented via a tree.
>
> >
> > Now your proposal uses the SVQ IOVA as tree values. Should we iterate
> > over all the nodes of the two trees, order them, and then look for
> > holes there?
>
> Let me clarify, correct me if I was wrong:
>
> 1) The IOVA allocator is still implemented via a tree; we just don't need
> to store how the IOVA is used
> 2) A dedicated GPA -> IOVA tree, updated via the memory listener and used
> in the datapath for SVQ translation
> 3) A linear mapping or another SVQ -> IOVA tree used for SVQ
>

Ok, so the part I was missing is that now we have 3 whole trees, with
somewhat redundant information :).

In some sense this is simpler than trying to get all the information
from only two trees. On the bad side, all SVQ calls that allocate some
region need to remember to add to one of the two other trees. That is
what I mean by synchronized. But sure, we can go that way.
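
To make that concrete, a sketch of how the three structures could live
together (all names here are illustrative assumptions, not code from
this series):

/* Sketch: VhostIOVATree keeping the three structures synchronized. */
typedef struct VhostIOVATree {
    hwaddr iova_first, iova_last;

    /* 1) Allocator: tracks only which IOVA ranges are in use */
    IOVATree *iova_taken;

    /* 2) Guest memory: GPA -> IOVA, filled by the memory listener */
    IOVATree *gpa_map;

    /* 3) SVQ vrings and CVQ buffers: HVA -> IOVA */
    IOVATree *svq_map;
} VhostIOVATree;

Every map call would first take a free range from (1) and then record
it in (2) or (3); keeping that in one place is what I would call
synchronized.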

> Thanks
>
> >
> > > Thanks
> > >
> > > >
> > > > Thanks!
> > > >
> > >
> >
>




* Re: [RFC 0/2] Identify aliased maps in vdpa SVQ iova_tree
  2024-05-13  9:56                           ` Eugenio Perez Martin
@ 2024-05-14  3:56                             ` Jason Wang
  0 siblings, 0 replies; 37+ messages in thread
From: Jason Wang @ 2024-05-14  3:56 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: qemu-devel, Si-Wei Liu, Michael S. Tsirkin, Lei Yang, Peter Xu,
	Jonah Palmer, Dragos Tatulea

On Mon, May 13, 2024 at 5:58 PM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> On Mon, May 13, 2024 at 10:28 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Mon, May 13, 2024 at 2:28 PM Eugenio Perez Martin
> > <eperezma@redhat.com> wrote:
> > >
> > > On Sat, May 11, 2024 at 6:07 AM Jason Wang <jasowang@redhat.com> wrote:
> > > >
> > > > On Fri, May 10, 2024 at 3:16 PM Eugenio Perez Martin
> > > > <eperezma@redhat.com> wrote:
> > > > >
> > > > > On Fri, May 10, 2024 at 6:29 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > >
> > > > > > On Thu, May 9, 2024 at 3:10 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > > > > > >
> > > > > > > On Thu, May 9, 2024 at 8:27 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > >
> > > > > > > > On Thu, May 9, 2024 at 1:16 AM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > > > > > > > >
> > > > > > > > > On Wed, May 8, 2024 at 4:29 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Tue, May 7, 2024 at 6:57 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Tue, May 7, 2024 at 9:29 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > On Fri, Apr 12, 2024 at 3:56 PM Eugenio Perez Martin
> > > > > > > > > > > > <eperezma@redhat.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Fri, Apr 12, 2024 at 8:47 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez <eperezma@redhat.com> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > The guest may have overlapped memory regions, where different GPA leads
> > > > > > > > > > > > > > > to the same HVA.  This causes a problem when overlapped regions
> > > > > > > > > > > > > > > (different GPA but same translated HVA) exist in the tree, as looking
> > > > > > > > > > > > > > > them up by HVA will return them twice.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I think I don't understand if there's any side effect for shadow virtqueue?
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > My bad, I totally forgot to put a reference to where this comes from.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Si-Wei found that during initialization this sequences of maps /
> > > > > > > > > > > > > Si-Wei found that during initialization this sequence of maps /
> > > > > > > > > > > > >
> > > > > > > > > > > > > HVA                                 GPA                          IOVA
> > > > > > > > > > > > > --------------------------------------------------------------------------------------------
> > > > > > > > > > > > > Map
> > > > > > > > > > > > > [0x7f7903e00000, 0x7f7983e00000)    [0x0, 0x80000000)            [0x1000, 0x80000000)
> > > > > > > > > > > > > [0x7f7983e00000, 0x7f9903e00000)    [0x100000000, 0x2080000000)  [0x80001000, 0x2000001000)
> > > > > > > > > > > > > [0x7f7903ea0000, 0x7f7903ec0000)    [0xfeda0000, 0xfedc0000)     [0x2000001000, 0x2000021000)
> > > > > > > > > > > > >
> > > > > > > > > > > > > Unmap
> > > > > > > > > > > > > [0x7f7903ea0000, 0x7f7903ec0000)    [0xfeda0000, 0xfedc0000)     [0x1000, 0x20000) ???
> > > > > > > > > > > > >
> > > > > > > > > > > > > The third HVA range is contained in the first one, but exposed under a
> > > > > > > > > > > > > different GPA (aliased). This is not "flattened" by QEMU, as the GPAs
> > > > > > > > > > > > > do not overlap, only the HVAs.
> > > > > > > > > > > > >
> > > > > > > > > > > > > At the unmap of the third chunk, the current algorithm finds the first
> > > > > > > > > > > > > aliased chunk instead of the right one. This series is the way to tell
> > > > > > > > > > > > > them apart at unmap time.
> > > > > > > > > > > > >
> > > > > > > > > > > > > [1] https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks!
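
(For concreteness, a rough sketch of the approach in this series:
tagging each DMAMap with the GPA as its id lets the unmap search pick
the right aliased entry. The variables below are illustrative:)

/* Map side, in the vhost-vdpa memory listener: id = GPA of the section */
DMAMap map = {
    .translated_addr = (hwaddr)(uintptr_t)vaddr,
    .size = size - 1,               /* DMAMap sizes are inclusive */
    .id = gpa,                      /* disambiguates aliased HVA ranges */
    .perm = IOMMU_ACCESS_FLAG(true, !readonly),
};

/* Unmap side: searching with the same GPA finds the third mapping
 * [0x7f7903ea0000, 0x7f7903ec0000) instead of the first HVA match. */
const DMAMap needle = {
    .translated_addr = (hwaddr)(uintptr_t)vaddr,
    .size = size - 1,
    .id = gpa,
};
const DMAMap *entry = vhost_iova_tree_find_iova(iova_tree, &needle);
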
> > > > > > > > > > > >
> > > > > > > > > > > > Ok, I was wondering if we need to store GPA(GIOVA) to HVA mappings in
> > > > > > > > > > > > the iova tree to solve this issue completely. Then there won't be
> > > > > > > > > > > > aliasing issues.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > I'm ok to explore that route but this has another problem. Both SVQ
> > > > > > > > > > > vrings and CVQ buffers also need to be addressable by VhostIOVATree,
> > > > > > > > > > > and they do not have GPA.
> > > > > > > > > > >
> > > > > > > > > > > At this moment vhost_svq_translate_addr is able to handle this
> > > > > > > > > > > transparently as we translate vaddr to SVQ IOVA. How can we store
> > > > > > > > > > > these new entries? Maybe a (hwaddr)-1 GPA to signal it has no GPA and
> > > > > > > > > > > then a list to go through other entries (SVQ vaddr and CVQ buffers).
> > > > > > > > > >
> > > > > > > > > > This seems to be tricky.
> > > > > > > > > >
> > > > > > > > > > As discussed, it could be another iova tree.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Yes but there are many ways to add another IOVATree. Let me expand & recap.
> > > > > > > > >
> > > > > > > > > Option 1 is to simply add another iova tree to VhostShadowVirtqueue.
> > > > > > > > > Let's call it gpa_iova_tree, as opposed to the current iova_tree that
> > > > > > > > > translates from vaddr to SVQ IOVA. Knowing which one to use is easy
> > > > > > > > > when adding or removing entries, like in the memory listener, but how
> > > > > > > > > do we know at vhost_svq_translate_addr?
> > > > > > > >
> > > > > > > > Then we won't use virtqueue_pop() at all; we need an SVQ version of
> > > > > > > > virtqueue_pop() to translate GPA to SVQ IOVA directly?
> > > > > > > >
> > > > > > >
> > > > > > > The problem is not virtqueue_pop; that is outside of
> > > > > > > vhost_svq_translate_addr. The problem is the need to add
> > > > > > > conditionals / complexity in all the callers of
> > > > > > > vhost_svq_translate_addr.
> > > > > > >
> > > > > > > > >
> > > > > > > > > The easiest way for me is to rely on memory_region_from_host(). When
> > > > > > > > > vaddr is from the guest, it returns a valid MemoryRegion. When it is
> > > > > > > > > not, it returns NULL. I'm not sure if this is a valid use case; it has
> > > > > > > > > just worked in my tests so far.
> > > > > > > > >
> > > > > > > > > Now we have the second problem: the IOVA values of the regions of the
> > > > > > > > > two IOVA trees must be unique. We need to be able to find unallocated
> > > > > > > > > regions in SVQ IOVA. At this moment there is only one IOVATree, so
> > > > > > > > > this is done easily by vhost_iova_tree_map_alloc. But it is very
> > > > > > > > > complicated with two trees.
> > > > > > > >
> > > > > > > > Would it be simpler if we decouple the IOVA allocator? For example, we
> > > > > > > > can have a dedicated gtree to track the allocated IOVA ranges. It is
> > > > > > > > shared by both
> > > > > > > >
> > > > > > > > 1) Guest memory (GPA)
> > > > > > > > 2) SVQ virtqueue and buffers
> > > > > > > >
> > > > > > > > And another gtree to track the GPA to IOVA.
> > > > > > > >
> > > > > > > > The SVQ code could use either
> > > > > > > >
> > > > > > > > 1) one linear mapping that contains both the SVQ virtqueue and buffers
> > > > > > > >
> > > > > > > > or
> > > > > > > >
> > > > > > > > 2) dynamic IOVA allocation/deallocation helpers
> > > > > > > >
> > > > > > > > So we don't actually need the third gtree for SVQ HVA -> SVQ IOVA?
> > > > > > > >
> > > > > > >
> > > > > > > That's possible, but that scatters the IOVA handling code instead of
> > > > > > > keeping it self-contained in VhostIOVATree.
> > > > > >
> > > > > > To me, the IOVA range/allocation is orthogonal to how IOVA is used.
> > > > > >
> > > > > > An example is the iova allocator in the kernel.
> > > > > >
> > > > > > Note that there's an even simpler IOVA "allocator" in the NVMe passthrough
> > > > > > code; I'm not sure it is useful here though (I haven't had a deep look at
> > > > > > that).
> > > > > >
> > > > >
> > > > > I don't know enough about them to have an opinion. I keep seeing the
> > > > > drawback of needing to synchronize both allocation & adding in all the
> > > > > places we want to modify the IOVATree. At this moment, these are the
> > > > > vhost-vdpa memory listener, the SVQ vring creation and removal, and
> > > > > net CVQ buffers. But there may be more in the future.
> > > > >
> > > > > What are the advantages of keeping these separate that justify
> > > > > needing to synchronize in all these places, compared with keeping them
> > > > > synchronized in VhostIOVATree?
> > > >
> > > > It doesn't need to be synchronized.
> > > >
> > > > Assuming the guest and SVQ share the IOVA range, the allocator only needs
> > > > to track which part of the range has been used.
> > > >
> > >
> > > I'm not sure I follow; that is what I mean by "synchronized".
> >
> > Oh right.
> >
> > >
> > > > This simplifies things: we can store GPA -> IOVA mappings and SVQ ->
> > > > IOVA mappings separately.
> > > >
> > >
> > > Sorry, I still cannot see the whole picture :).
> > >
> > > Let's assume we have all the GPAs mapped to specific IOVA regions, so
> > > we have the first IOVA tree (GPA -> IOVA) filled. Now we enable SVQ
> > > because of the migration. How can we know where we can place SVQ
> > > vrings without having them synchronized?
> >
> > Just allocating a new IOVA range for SVQ?
> >
> > >
> > > At this moment we're using a tree. The tree nature of the current SVQ
> > > IOVA -> VA map keeps all nodes ordered, so it is more or less easy to
> > > look for holes.
> >
> > Yes, iova allocate could still be implemented via a tree.
> >
> > >
> > > Now your proposal uses the SVQ IOVA as tree values. Should we iterate
> > > over all the nodes of the two trees, order them, and then look for
> > > holes there?
> >
> > Let me clarify, correct me if I was wrong:
> >
> > 1) The IOVA allocator is still implemented via a tree; we just don't need
> > to store how the IOVA is used
> > 2) A dedicated GPA -> IOVA tree, updated via the memory listener and used
> > in the datapath for SVQ translation
> > 3) A linear mapping or another SVQ -> IOVA tree used for SVQ
> >
>
> Ok, so the part I was missing is that now we have 3 whole trees, with
> somewhat redundant information :).

In a way, it decouples the IOVA usage from the IOVA allocator. This
might be simpler, as the guest and SVQ may try to share a single IOVA
address space.
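
(A minimal sketch of such a decoupled allocator, reusing the existing
iova_tree_alloc_map() helper; the struct and function names below are
assumptions for illustration:)

/* The allocator only tracks which IOVA ranges are taken, not their use. */
typedef struct IOVAAllocator {
    IOVATree *taken;
    hwaddr iova_first, iova_last;
} IOVAAllocator;

static int iova_alloc(IOVAAllocator *a, DMAMap *map)
{
    /* iova_tree_alloc_map() finds a free hole of map->size and inserts
     * the resulting range into the tree. */
    return iova_tree_alloc_map(a->taken, map, a->iova_first, a->iova_last);
}

The GPA -> IOVA and SVQ -> IOVA structures would then only store the
ranges the allocator hands out.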

>
> In some sense this is simpler than trying to get all the information
> from only two trees. On the bad side, all SVQ calls that allocate some
> region need to remember to add to one of the two other trees. That is
> what I mean by synchronized. But sure, we can go that way.

Just a suggestion; if it turns out to complicate the issue, I'm fine
going the other way.

Thanks

>
> > Thanks
> >
> > >
> > > > Thanks
> > > >
> > > > >
> > > > > Thanks!
> > > > >
> > > >
> > >
> >
>
>



