* [RFC v2 0/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
@ 2020-06-26  6:41 Eugenio Pérez
  2020-06-26  6:41 ` [RFC v2 1/1] " Eugenio Pérez
                   ` (2 more replies)
  0 siblings, 3 replies; 68+ messages in thread
From: Eugenio Pérez @ 2020-06-26  6:41 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Maydell, Juan Quintela, Jason Wang, Peter Xu, Avi Kivity,
	Paolo Bonzini

I am able to hit this assertion when a RHEL 7 guest's virtio_net device
issues an invalidation of all the TLB entries. This happens during the
guest's startup if the 'intel_iommu=on' argument is passed to the guest
kernel and the right IOMMU/ATS devices are declared on QEMU's command line.

Command line:
/home/qemu/x86_64-softmmu/qemu-system-x86_64 -name \
guest=rhel7-test,debug-threads=on -machine \
pc-q35-5.1,accel=kvm,usb=off,dump-guest-core=off,kernel_irqchip=split \
-cpu \
Broadwell,vme=on,ss=on,vmx=on,f16c=on,rdrand=on,hypervisor=on,arat=on,tsc-adjust=on,umip=on,arch-capabilities=on,xsaveopt=on,pdpe1gb=on,abm=on,skip-l1dfl-vmentry=on,rtm=on,hle=on \
-m 8096 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1 -uuid \
d022ecbf-679e-4755-87ce-eb87fc5bbc5d -display none -no-user-config \
-nodefaults -rtc base=utc,driftfix=slew -global \
kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -global \
ICH9-LPC.disable_s3=1 -global ICH9-LPC.disable_s4=1 -boot strict=on \
-device intel-iommu,intremap=on,device-iotlb=on -device \
pcie-root-port,port=0x8,chassis=1,id=pci.1,bus=pcie.0,multifunction=on,addr=0x1 \
-device \
pcie-root-port,port=0x9,chassis=2,id=pci.2,bus=pcie.0,addr=0x1.0x1 \
-device \
pcie-root-port,port=0xa,chassis=3,id=pci.3,bus=pcie.0,addr=0x1.0x2 \
-device \
pcie-root-port,port=0xb,chassis=4,id=pci.4,bus=pcie.0,addr=0x1.0x3 \
-device \
pcie-root-port,port=0xc,chassis=5,id=pci.5,bus=pcie.0,addr=0x1.0x4 \
-device \
pcie-root-port,port=0xd,chassis=6,id=pci.6,bus=pcie.0,addr=0x1.0x5 \
-device \
pcie-root-port,port=0xe,chassis=7,id=pci.7,bus=pcie.0,addr=0x1.0x6 \
-device qemu-xhci,p2=15,p3=15,id=usb,bus=pci.2,addr=0x0 -device \
virtio-serial-pci,id=virtio-serial0,bus=pci.3,addr=0x0 -drive \
file=/home/virtio-test2.qcow2,format=qcow2,if=none,id=drive-virtio-disk0 \
-device \
virtio-blk-pci,scsi=off,bus=pci.4,addr=0x0,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
-netdev tap,id=hostnet0,vhost=on,vhostforce=on -device \
virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:0d:1d:f2,bus=pci.1,addr=0x0,iommu_platform=on,ats=on \
-device virtio-balloon-pci,id=balloon0,bus=pci.5,addr=0x0 -object \
rng-random,id=objrng0,filename=/dev/urandom -device \
virtio-rng-pci,rng=objrng0,id=rng0,bus=pci.6,addr=0x0 -s -msg \
timestamp=on

Full backtrace:

    at /home/qemu/hw/i386/intel_iommu.c:2468
    (mr=0x555557609330, addr=136, value=0x7ffde5dfe478, size=4, shift=0, mask=4294967295, attrs=...) at /home/qemu/memory.c:483
    (addr=136, value=0x7ffde5dfe478, size=4, access_size_min=4, access_size_max=8, access_fn=
    0x555555883d38 <memory_region_write_accessor>, mr=0x555557609330, attrs=...) at /home/qemu/memory.c:544
    at /home/qemu/memory.c:1476
    (fv=0x7ffde00935d0, addr=4275634312, attrs=..., ptr=0x7ffff7ff0028, len=4, addr1=136, l=4, mr=0x555557609330) at /home/qemu/exec.c:3146
    at /home/qemu/exec.c:3186
    (as=0x5555567ca640 <address_space_memory>, addr=4275634312, attrs=..., buf=0x7ffff7ff0028, len=4) at /home/qemu/exec.c:3277
    (as=0x5555567ca640 <address_space_memory>, addr=4275634312, attrs=..., buf=0x7ffff7ff0028, len=4, is_write=true)
    at /home/qemu/exec.c:3287

--

If we examine *entry in frame 4 of the backtrace:
*entry = {target_as = 0x555556f6c050, iova = 0x0, translated_addr = 0x0,
addr_mask = 0xffffffffffffffff, perm = 0x0}

This (I think) tries to invalidate all the TLB entries of the device.

Just deleting that assert is enough for the VM to start and communicate
using the IOMMU, but maybe a better alternative is possible. We could move
it to the caller functions for cases other than IOMMU invalidation, or
make it conditional so that it only fires when not invalidating.

Any comment would be appreciated. Thanks!

Guest kernel version: kernel-3.10.0-1151.el7.x86_64

Bug reference: https://bugs.launchpad.net/qemu/+bug/1885175

v2: Actually delete the assertion instead of just commenting it out with a C99 comment

Eugenio Pérez (1):
  memory: Delete assertion in memory_region_unregister_iommu_notifier

 memory.c | 2 --
 1 file changed, 2 deletions(-)

-- 
2.18.1



^ permalink raw reply	[flat|nested] 68+ messages in thread

* [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
  2020-06-26  6:41 [RFC v2 0/1] memory: Delete assertion in memory_region_unregister_iommu_notifier Eugenio Pérez
@ 2020-06-26  6:41 ` Eugenio Pérez
  2020-06-26 21:29   ` Peter Xu
  2020-06-29 15:05 ` [RFC v2 0/1] " Paolo Bonzini
  2020-08-11 17:55 ` [RFC v3 " Eugenio Pérez
  2 siblings, 1 reply; 68+ messages in thread
From: Eugenio Pérez @ 2020-06-26  6:41 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Maydell, Juan Quintela, Jason Wang, Peter Xu, Avi Kivity,
	Paolo Bonzini

Bug reference: https://bugs.launchpad.net/qemu/+bug/1885175

It is possible to hit this assertion on RHEL 7 guests if the IOMMU is
properly enabled.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 memory.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/memory.c b/memory.c
index 2f15a4b250..7f789710d2 100644
--- a/memory.c
+++ b/memory.c
@@ -1915,8 +1915,6 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
         return;
     }
 
-    assert(entry->iova >= notifier->start && entry_end <= notifier->end);
-
     if (entry->perm & IOMMU_RW) {
         request_flags = IOMMU_NOTIFIER_MAP;
     } else {
-- 
2.18.1




* Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
  2020-06-26  6:41 ` [RFC v2 1/1] " Eugenio Pérez
@ 2020-06-26 21:29   ` Peter Xu
  2020-06-27  7:26     ` Yan Zhao
                       ` (2 more replies)
  0 siblings, 3 replies; 68+ messages in thread
From: Peter Xu @ 2020-06-26 21:29 UTC (permalink / raw)
  To: Eugenio Pérez
  Cc: Peter Maydell, Yan Zhao, Juan Quintela, Jason Wang,
	Michael S. Tsirkin, qemu-devel, Eric Auger, Paolo Bonzini

Hi, Eugenio,

(CCing Eric, Yan and Michael too)

On Fri, Jun 26, 2020 at 08:41:22AM +0200, Eugenio Pérez wrote:
> diff --git a/memory.c b/memory.c
> index 2f15a4b250..7f789710d2 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1915,8 +1915,6 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
>          return;
>      }
>  
> -    assert(entry->iova >= notifier->start && entry_end <= notifier->end);

I can understand removing the assertion should solve the issue, however imho
the major issue is not about this single assertion but the whole addr_mask
issue behind with virtio...

For normal IOTLB invalidations, we were trying our best to always make
IOMMUTLBEntry contain a valid addr_mask to be 2**N-1.  E.g., that's what we're
doing with the loop in vtd_address_space_unmap().

But this is not the first time that we may want to break this assumption for
virtio so that we make the IOTLB a tuple of (start, len), where that len does
not have to be an address mask any more.  That seems to be more efficient for things like
vhost because iotlbs there are not page based, so it'll be inefficient if we
always guarantee the addr_mask because it'll be quite a lot more roundtrips of
the same range of invalidation.  Here we've encountered another issue of
triggering the assertion with virtio-net, but only with the old RHEL7 guest.

I'm thinking whether we can make the IOTLB invalidation configurable by
specifying whether the backend of the notifier can handle arbitrary address
range in some way.  So we still have the guaranteed addr_masks by default
(since I still don't think totally breaking the addr_mask restriction is wise...),
however we can allow the special backends to take advantage of using arbitrary
(start, len) ranges for reasons like performance.

To do that, a quick idea is to introduce a flag IOMMU_NOTIFIER_ARBITRARY_MASK
to IOMMUNotifierFlag, to declare that the iommu notifier (and its backend) can
take arbitrary address mask, then it can be any value and finally becomes a
length rather than an addr_mask.  Then for every iommu notify() we can directly
deliver whatever we've got from the upper layer to this notifier.  With the new
flag, vhost can do iommu_notifier_init() with UNMAP|ARBITRARY_MASK so it
declares this capability.  Then no matter for device iotlb or normal iotlb, we
skip the complicated procedure to split a big range into small ranges that are
with strict addr_mask, but directly deliver the message to the iommu notifier.
E.g., we can skip the loop in vtd_address_space_unmap() if the notifier is with
ARBITRARY flag set.

Then, the assert() is not accurate either, and may become something like:

diff --git a/memory.c b/memory.c
index 2f15a4b250..99d0492509 100644
--- a/memory.c
+++ b/memory.c
@@ -1906,6 +1906,7 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
 {
     IOMMUNotifierFlag request_flags;
     hwaddr entry_end = entry->iova + entry->addr_mask;
+    IOMMUTLBEntry tmp = *entry;

     /*
      * Skip the notification if the notification does not overlap
@@ -1915,7 +1916,13 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
         return;
     }

-    assert(entry->iova >= notifier->start && entry_end <= notifier->end);
+    if (notifier->notifier_flags & IOMMU_NOTIFIER_ARBITRARY_MASK) {
+        tmp.iova = MAX(tmp.iova, notifier->start);
+        tmp.addr_mask = MIN(tmp.addr_mask, notifier->end);
+        assert(tmp.iova <= tmp.addr_mask);
+    } else {
+        assert(entry->iova >= notifier->start && entry_end <= notifier->end);
+    }

     if (entry->perm & IOMMU_RW) {
         request_flags = IOMMU_NOTIFIER_MAP;
@@ -1924,7 +1931,7 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
     }

     if (notifier->notifier_flags & request_flags) {
-        notifier->notify(notifier, entry);
+        notifier->notify(notifier, &tmp);
     }
 }

Then we can keep the assert() for e.g. vfio, however vhost can skip it and even
get some further performance boosts..  Does that make sense?

Thanks,

-- 
Peter Xu




* Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
  2020-06-26 21:29   ` Peter Xu
@ 2020-06-27  7:26     ` Yan Zhao
  2020-06-27 12:57       ` Peter Xu
  2020-06-28  7:03     ` Jason Wang
  2020-08-11 17:01     ` Eugenio Perez Martin
  2 siblings, 1 reply; 68+ messages in thread
From: Yan Zhao @ 2020-06-27  7:26 UTC (permalink / raw)
  To: Peter Xu
  Cc: Peter Maydell, Yan Zhao, Juan Quintela, Jason Wang,
	Michael S. Tsirkin, qemu-devel, Eugenio Pérez, Eric Auger,
	Paolo Bonzini

On Fri, Jun 26, 2020 at 05:29:17PM -0400, Peter Xu wrote:
> Hi, Eugenio,
> 
> (CCing Eric, Yan and Michael too)

> On Fri, Jun 26, 2020 at 08:41:22AM +0200, Eugenio Pérez wrote:
> > diff --git a/memory.c b/memory.c
> > index 2f15a4b250..7f789710d2 100644
> > --- a/memory.c
> > +++ b/memory.c
> > @@ -1915,8 +1915,6 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
> >          return;
> >      }
> >  
> > -    assert(entry->iova >= notifier->start && entry_end <= notifier->end);
> 
> I can understand removing the assertion should solve the issue, however imho
> the major issue is not about this single assertion but the whole addr_mask
> issue behind with virtio...
Yes, the background for this assertion is
https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg04218.html


> 
> For normal IOTLB invalidations, we were trying our best to always make
> IOMMUTLBEntry contain a valid addr_mask to be 2**N-1.  E.g., that's what we're
> doing with the loop in vtd_address_space_unmap().
> 
> But this is not the first time that we may want to break this assumption for
> virtio so that we make the IOTLB a tuple of (start, len), then that len can be
> not a address mask any more.  That seems to be more efficient for things like
> vhost because iotlbs there are not page based, so it'll be inefficient if we
> always guarantee the addr_mask because it'll be quite a lot more roundtrips of
> the same range of invalidation.  Here we've encountered another issue of
> triggering the assertion with virtio-net, but only with the old RHEL7 guest.
> 
> I'm thinking whether we can make the IOTLB invalidation configurable by
> specifying whether the backend of the notifier can handle arbitrary address
> range in some way.  So we still have the guaranteed addr_masks by default
> (since I still don't think totally break the addr_mask restriction is wise...),
> however we can allow the special backends to take advantage of using arbitrary
> (start, len) ranges for reasons like performance.
> 
> To do that, a quick idea is to introduce a flag IOMMU_NOTIFIER_ARBITRARY_MASK
> to IOMMUNotifierFlag, to declare that the iommu notifier (and its backend) can
> take arbitrary address mask, then it can be any value and finally becomes a
> length rather than an addr_mask.  Then for every iommu notify() we can directly
> deliver whatever we've got from the upper layer to this notifier.  With the new
> flag, vhost can do iommu_notifier_init() with UNMAP|ARBITRARY_MASK so it
> declares this capability.  Then no matter for device iotlb or normal iotlb, we
> skip the complicated procedure to split a big range into small ranges that are
> with strict addr_mask, but directly deliver the message to the iommu notifier.
> E.g., we can skip the loop in vtd_address_space_unmap() if the notifier is with
> ARBITRARY flag set.
> 
> Then, the assert() is not accurate either, and may become something like:
> 
> diff --git a/memory.c b/memory.c
> index 2f15a4b250..99d0492509 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1906,6 +1906,7 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
>  {
>      IOMMUNotifierFlag request_flags;
>      hwaddr entry_end = entry->iova + entry->addr_mask;
> +    IOMMUTLBEntry tmp = *entry;
> 
>      /*
>       * Skip the notification if the notification does not overlap
> @@ -1915,7 +1916,13 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
>          return;
>      }
> 
> -    assert(entry->iova >= notifier->start && entry_end <= notifier->end);
> +    if (notifier->notifier_flags & IOMMU_NOTIFIER_ARBITRARY_MASK) {
> +        tmp.iova = MAX(tmp.iova, notifier->start);
> +        tmp.addr_mask = MIN(tmp.addr_mask, notifier->end);
NIT:
       tmp.addr_mask = MIN(entry_end, notifier->end) - tmp.iova;
> +        assert(tmp.iova <= tmp.addr_mask);
No need for this assertion then.

Thanks
Yan
       
> +    } else {
> +        assert(entry->iova >= notifier->start && entry_end <= notifier->end);
> +    }
> 
>      if (entry->perm & IOMMU_RW) {
>          request_flags = IOMMU_NOTIFIER_MAP;
> @@ -1924,7 +1931,7 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
>      }
> 
>      if (notifier->notifier_flags & request_flags) {
> -        notifier->notify(notifier, entry);
> +        notifier->notify(notifier, &tmp);
>      }
>  }
> 
> Then we can keep the assert() for e.g. vfio, however vhost can skip it and even
> get some further performance boosts..  Does that make sense?
> 
> Thanks,
> 
> -- 
> Peter Xu
> 
> 



* Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
  2020-06-27  7:26     ` Yan Zhao
@ 2020-06-27 12:57       ` Peter Xu
  2020-06-28  1:36         ` Yan Zhao
  0 siblings, 1 reply; 68+ messages in thread
From: Peter Xu @ 2020-06-27 12:57 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Peter Maydell, Yan Zhao, Michael S. Tsirkin, Jason Wang,
	Juan Quintela, qemu-devel, Eugenio Pérez, Eric Auger,
	Paolo Bonzini

On Sat, Jun 27, 2020 at 03:26:45AM -0400, Yan Zhao wrote:
> > -    assert(entry->iova >= notifier->start && entry_end <= notifier->end);
> > +    if (notifier->notifier_flags & IOMMU_NOTIFIER_ARBITRARY_MASK) {
> > +        tmp.iova = MAX(tmp.iova, notifier->start);
> > +        tmp.addr_mask = MIN(tmp.addr_mask, notifier->end);
> NIT:
>        tmp.addr_mask = MIN(entry_end, notifier->end) - tmp.iova;

Right.  Thanks. :)

> > +        assert(tmp.iova <= tmp.addr_mask);
> No need for this assertion then.

Or change it into:

  assert(MIN(entry_end, notifier->end) >= tmp.iova);

To double-check that there is no overflow.

-- 
Peter Xu




* Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
  2020-06-27 12:57       ` Peter Xu
@ 2020-06-28  1:36         ` Yan Zhao
  0 siblings, 0 replies; 68+ messages in thread
From: Yan Zhao @ 2020-06-28  1:36 UTC (permalink / raw)
  To: Peter Xu
  Cc: Peter Maydell, Michael S. Tsirkin, Jason Wang, Juan Quintela,
	qemu-devel, Eugenio Pérez, Yan Zhao, Paolo Bonzini,
	Eric Auger

On Sat, Jun 27, 2020 at 08:57:14AM -0400, Peter Xu wrote:
> On Sat, Jun 27, 2020 at 03:26:45AM -0400, Yan Zhao wrote:
> > > -    assert(entry->iova >= notifier->start && entry_end <= notifier->end);
> > > +    if (notifier->notifier_flags & IOMMU_NOTIFIER_ARBITRARY_MASK) {
> > > +        tmp.iova = MAX(tmp.iova, notifier->start);
> > > +        tmp.addr_mask = MIN(tmp.addr_mask, notifier->end);
> > NIT:
> >        tmp.addr_mask = MIN(entry_end, notifier->end) - tmp.iova;
> 
> Right.  Thanks. :)
> 
> > > +        assert(tmp.iova <= tmp.addr_mask);
> > No need for this assertion then.
> 
> Or change it into:
> 
>   assert(MIN(entry_end, notifier->end) >= tmp.iova);
> 
> To double-check that there is no overflow.
>
What about asserting in this way, so that it's also useful to check for
overflow in the other condition:

hwaddr entry_end = entry->iova + entry->addr_mask;
+
+ assert(notifier->end >= notifier->start && entry_end >= entry->iova);


Then, as there's the following filter
    if (notifier->start > entry_end || notifier->end < entry->iova) {
        return;
    }

we can conclude that

entry_end >= entry->iova (tmp.iova)
entry_end >= notifier->start
--> entry_end >= MAX(tmp.iova, notifier->start)
--> entry_end >= tmp.iova


notifier->end >= entry->iova (tmp.iova)
notifier->end >= notifier->start
--> notifier->end >= MAX(tmp.iova, notifier->start)
--> notifier->end >= tmp.iova

==> MIN(entry_end, notifier->end) >= tmp.iova

Thanks
Yan



* Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
  2020-06-26 21:29   ` Peter Xu
  2020-06-27  7:26     ` Yan Zhao
@ 2020-06-28  7:03     ` Jason Wang
  2020-06-28 14:47       ` Peter Xu
  2020-08-11 17:01     ` Eugenio Perez Martin
  2 siblings, 1 reply; 68+ messages in thread
From: Jason Wang @ 2020-06-28  7:03 UTC (permalink / raw)
  To: Peter Xu, Eugenio Pérez
  Cc: Peter Maydell, Yan Zhao, Michael S. Tsirkin, Juan Quintela,
	qemu-devel, Eric Auger, Paolo Bonzini


On 2020/6/27 上午5:29, Peter Xu wrote:
> Hi, Eugenio,
>
> (CCing Eric, Yan and Michael too)
>
> On Fri, Jun 26, 2020 at 08:41:22AM +0200, Eugenio Pérez wrote:
>> diff --git a/memory.c b/memory.c
>> index 2f15a4b250..7f789710d2 100644
>> --- a/memory.c
>> +++ b/memory.c
>> @@ -1915,8 +1915,6 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
>>           return;
>>       }
>>   
>> -    assert(entry->iova >= notifier->start && entry_end <= notifier->end);
> I can understand removing the assertion should solve the issue, however imho
> the major issue is not about this single assertion but the whole addr_mask
> issue behind with virtio...


I don't get it here; it looks to me like the range comes from the guest IOMMU drivers.


>
> For normal IOTLB invalidations, we were trying our best to always make
> IOMMUTLBEntry contain a valid addr_mask to be 2**N-1.  E.g., that's what we're
> doing with the loop in vtd_address_space_unmap().


I'm not sure such an assumption can work for any type of IOMMU.


>
> But this is not the first time that we may want to break this assumption for
> virtio so that we make the IOTLB a tuple of (start, len), then that len can be
> not a address mask any more.  That seems to be more efficient for things like
> vhost because iotlbs there are not page based, so it'll be inefficient if we
> always guarantee the addr_mask because it'll be quite a lot more roundtrips of
> the same range of invalidation.  Here we've encountered another issue of
> triggering the assertion with virtio-net, but only with the old RHEL7 guest.
>
> I'm thinking whether we can make the IOTLB invalidation configurable by
> specifying whether the backend of the notifier can handle arbitrary address
> range in some way.  So we still have the guaranteed addr_masks by default
> (since I still don't think totally break the addr_mask restriction is wise...),
> however we can allow the special backends to take advantage of using arbitrary
> (start, len) ranges for reasons like performance.
>
> To do that, a quick idea is to introduce a flag IOMMU_NOTIFIER_ARBITRARY_MASK
> to IOMMUNotifierFlag, to declare that the iommu notifier (and its backend) can
> take arbitrary address mask, then it can be any value and finally becomes a
> length rather than an addr_mask.  Then for every iommu notify() we can directly
> deliver whatever we've got from the upper layer to this notifier.  With the new
> flag, vhost can do iommu_notifier_init() with UNMAP|ARBITRARY_MASK so it
> declares this capability.  Then no matter for device iotlb or normal iotlb, we
> skip the complicated procedure to split a big range into small ranges that are
> with strict addr_mask, but directly deliver the message to the iommu notifier.
> E.g., we can skip the loop in vtd_address_space_unmap() if the notifier is with
> ARBITRARY flag set.


I'm not sure coupling IOMMU capability to notifier is the best choice.

How about just converting to use a range [start, end] for any notifier and
moving the checks (e.g. the assert) into the actual notifier implementation
(vhost or vfio)?

Thanks


>
> Then, the assert() is not accurate either, and may become something like:
>
> diff --git a/memory.c b/memory.c
> index 2f15a4b250..99d0492509 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1906,6 +1906,7 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
>   {
>       IOMMUNotifierFlag request_flags;
>       hwaddr entry_end = entry->iova + entry->addr_mask;
> +    IOMMUTLBEntry tmp = *entry;
>
>       /*
>        * Skip the notification if the notification does not overlap
> @@ -1915,7 +1916,13 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
>           return;
>       }
>
> -    assert(entry->iova >= notifier->start && entry_end <= notifier->end);
> +    if (notifier->notifier_flags & IOMMU_NOTIFIER_ARBITRARY_MASK) {
> +        tmp.iova = MAX(tmp.iova, notifier->start);
> +        tmp.addr_mask = MIN(tmp.addr_mask, notifier->end);
> +        assert(tmp.iova <= tmp.addr_mask);
> +    } else {
> +        assert(entry->iova >= notifier->start && entry_end <= notifier->end);
> +    }
>
>       if (entry->perm & IOMMU_RW) {
>           request_flags = IOMMU_NOTIFIER_MAP;
> @@ -1924,7 +1931,7 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
>       }
>
>       if (notifier->notifier_flags & request_flags) {
> -        notifier->notify(notifier, entry);
> +        notifier->notify(notifier, &tmp);
>       }
>   }
>
> Then we can keep the assert() for e.g. vfio, however vhost can skip it and even
> get some further performance boosts..  Does that make sense?
>
> Thanks,
>




* Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
  2020-06-28  7:03     ` Jason Wang
@ 2020-06-28 14:47       ` Peter Xu
  2020-06-29  5:51         ` Jason Wang
  0 siblings, 1 reply; 68+ messages in thread
From: Peter Xu @ 2020-06-28 14:47 UTC (permalink / raw)
  To: Jason Wang
  Cc: Peter Maydell, Yan Zhao, Juan Quintela, Michael S. Tsirkin,
	qemu-devel, Eugenio Pérez, Eric Auger, Paolo Bonzini

On Sun, Jun 28, 2020 at 03:03:41PM +0800, Jason Wang wrote:
> 
> On 2020/6/27 上午5:29, Peter Xu wrote:
> > Hi, Eugenio,
> > 
> > (CCing Eric, Yan and Michael too)
> > 
> > On Fri, Jun 26, 2020 at 08:41:22AM +0200, Eugenio Pérez wrote:
> > > diff --git a/memory.c b/memory.c
> > > index 2f15a4b250..7f789710d2 100644
> > > --- a/memory.c
> > > +++ b/memory.c
> > > @@ -1915,8 +1915,6 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
> > >           return;
> > >       }
> > > -    assert(entry->iova >= notifier->start && entry_end <= notifier->end);
> > I can understand removing the assertion should solve the issue, however imho
> > the major issue is not about this single assertion but the whole addr_mask
> > issue behind with virtio...
> 
> 
> I don't get it here; it looks to me like the range comes from the guest IOMMU drivers.

Yes.  Note that I didn't mean that it's a problem in virtio, it's just the fact
that virtio is the only one I know that would like to support arbitrary address
range for the translated region.  I don't know about tcg, but vfio should still
need some kind of page alignment in both the address and the addr_mask.  We
have that assumption too across the memory core when we do translations.

A further cause of the issue is the MSI region when vIOMMU enabled - currently
we implemented the interrupt region using another memory region so it split the
whole DMA region into two parts.  That's really a clean approach to IR
implementation, however that's also a burden to the invalidation part because
then we'll need to handle things like this when the listened range is not page
aligned at all (neither 0-0xfedfffff nor 0xfef00000-MAX).  If there were no IR
region (so the whole iommu address range would be a single FlatRange), I think
we probably wouldn't need most of the logic in vtd_address_space_unmap() at all,
then we could directly deliver all the IOTLB invalidations without splitting into
small page-aligned ranges to all the iommu notifiers.  Sadly, so far I still
don't have an ideal solution for it, because we definitely need IR.

> 
> 
> > 
> > For normal IOTLB invalidations, we were trying our best to always make
> > IOMMUTLBEntry contain a valid addr_mask to be 2**N-1.  E.g., that's what we're
> > doing with the loop in vtd_address_space_unmap().
> 
> 
> I'm not sure such an assumption can work for any type of IOMMU.
> 
> 
> > 
> > But this is not the first time that we may want to break this assumption for
> > virtio so that we make the IOTLB a tuple of (start, len), then that len can be
> > not a address mask any more.  That seems to be more efficient for things like
> > vhost because iotlbs there are not page based, so it'll be inefficient if we
> > always guarantee the addr_mask because it'll be quite a lot more roundtrips of
> > the same range of invalidation.  Here we've encountered another issue of
> > triggering the assertion with virtio-net, but only with the old RHEL7 guest.
> > 
> > I'm thinking whether we can make the IOTLB invalidation configurable by
> > specifying whether the backend of the notifier can handle arbitrary address
> > range in some way.  So we still have the guaranteed addr_masks by default
> > (since I still don't think totally break the addr_mask restriction is wise...),
> > however we can allow the special backends to take advantage of using arbitrary
> > (start, len) ranges for reasons like performance.
> > 
> > To do that, a quick idea is to introduce a flag IOMMU_NOTIFIER_ARBITRARY_MASK
> > to IOMMUNotifierFlag, to declare that the iommu notifier (and its backend) can
> > take arbitrary address mask, then it can be any value and finally becomes a
> > length rather than an addr_mask.  Then for every iommu notify() we can directly
> > deliver whatever we've got from the upper layer to this notifier.  With the new
> > flag, vhost can do iommu_notifier_init() with UNMAP|ARBITRARY_MASK so it
> > declares this capability.  Then no matter for device iotlb or normal iotlb, we
> > skip the complicated procedure to split a big range into small ranges that are
> > with strict addr_mask, but directly deliver the message to the iommu notifier.
> > E.g., we can skip the loop in vtd_address_space_unmap() if the notifier is with
> > ARBITRARY flag set.
> 
> 
> I'm not sure coupling IOMMU capability to notifier is the best choice.

IMHO it's not an IOMMU capability.  The flag I wanted to introduce is a
capability of the one who listens to the IOMMU TLB updates.  For our case, it's
virtio/vhost's capability to allow arbitrary length. The IOMMU itself
definitely has some limitation on the address range to be bound to an IOTLB
invalidation, e.g., the device-iotlb we're talking about here only accepts
iova addresses and addr_masks aligned to 2**N-1.

> 
> How about just converting to use a range [start, end] for any notifier and
> moving the checks (e.g. the assert) into the actual notifier implementation
> (vhost or vfio)?

IOMMUTLBEntry itself is the abstraction layer of TLB entry.  Hardware TLB entry
is definitely not arbitrary range either (because AFAICT the hardware should
only cache PFN rather than address, so at least PAGE_SIZE aligned).
Introducing this flag will already make this trickier just to avoid introducing
another similar struct to IOMMUTLBEntry, but I really don't want to make it a
default option...  Not to mention I probably have no reason to urge the rest
of the iommu notifier users (tcg, vfio) to change their existing good code to
suit any of the backends that can cooperate with arbitrary address ranges...

Thanks,

-- 
Peter Xu




* Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
  2020-06-28 14:47       ` Peter Xu
@ 2020-06-29  5:51         ` Jason Wang
  2020-06-29 13:34           ` Peter Xu
  0 siblings, 1 reply; 68+ messages in thread
From: Jason Wang @ 2020-06-29  5:51 UTC (permalink / raw)
  To: Peter Xu
  Cc: Peter Maydell, Yan Zhao, Juan Quintela, Michael S. Tsirkin,
	qemu-devel, Eugenio Pérez, Eric Auger, Paolo Bonzini


On 2020/6/28 下午10:47, Peter Xu wrote:
> On Sun, Jun 28, 2020 at 03:03:41PM +0800, Jason Wang wrote:
>> On 2020/6/27 上午5:29, Peter Xu wrote:
>>> Hi, Eugenio,
>>>
>>> (CCing Eric, Yan and Michael too)
>>>
>>> On Fri, Jun 26, 2020 at 08:41:22AM +0200, Eugenio Pérez wrote:
>>>> diff --git a/memory.c b/memory.c
>>>> index 2f15a4b250..7f789710d2 100644
>>>> --- a/memory.c
>>>> +++ b/memory.c
>>>> @@ -1915,8 +1915,6 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
>>>>            return;
>>>>        }
>>>> -    assert(entry->iova >= notifier->start && entry_end <= notifier->end);
>>> I can understand removing the assertion should solve the issue, however imho
>>> the major issue is not about this single assertion but the whole addr_mask
>>> issue behind with virtio...
>>
>> I don't get it here; it looks to me like the range comes from the guest IOMMU drivers.
> Yes.  Note that I didn't mean that it's a problem in virtio, it's just the fact
> that virtio is the only one I know that would like to support arbitrary address
> range for the translated region.  I don't know about tcg, but vfio should still
> need some kind of page alignment in both the address and the addr_mask.  We
> have that assumption too across the memory core when we do translations.


Right but it looks to me the issue is not the alignment.


>
> A further cause of the issue is the MSI region when vIOMMU enabled - currently
> we implemented the interrupt region using another memory region so it split the
> whole DMA region into two parts.  That's really a clean approach to IR
> implementation, however that's also a burden to the invalidation part because
> then we'll need to handle things like this when the listened range is not page
> aligned at all (neither 0-0xfedfffff nor 0xfef00000-MAX).  If without the IR
> region (so the whole iommu address range will be a single FlatRange),


Is this a bug? I remember that at least for vtd, it won't do any DMAR on 
the interrupt address range.


>   I think
> we probably don't need most of the logic in vtd_address_space_unmap() at all,
> then we can directly deliver all the IOTLB invalidations without splitting into
> small page aligned ranges to all the iommu notifiers.  Sadly, so far I still
> don't have ideal solution for it, because we definitely need IR.


Another possible (theoretical) issue (for vhost) is that it can't 
trigger an interrupt through the interrupt range.


>
>>
>>> For normal IOTLB invalidations, we were trying our best to always make
>>> IOMMUTLBEntry contain a valid addr_mask to be 2**N-1.  E.g., that's what we're
>>> doing with the loop in vtd_address_space_unmap().
>>
>> I'm not sure such an assumption can work for any type of IOMMU.
>>
>>
>>> But this is not the first time that we may want to break this assumption for
>>> virtio so that we make the IOTLB a tuple of (start, len), then that len can be
>>> not a address mask any more.  That seems to be more efficient for things like
>>> vhost because iotlbs there are not page based, so it'll be inefficient if we
>>> always guarantee the addr_mask because it'll be quite a lot more roundtrips of
>>> the same range of invalidation.  Here we've encountered another issue of
>>> triggering the assertion with virtio-net, but only with the old RHEL7 guest.
>>>
>>> I'm thinking whether we can make the IOTLB invalidation configurable by
>>> specifying whether the backend of the notifier can handle arbitrary address
>>> range in some way.  So we still have the guaranteed addr_masks by default
>>> (since I still don't think totally break the addr_mask restriction is wise...),
>>> however we can allow the special backends to take advantage of using arbitrary
>>> (start, len) ranges for reasons like performance.
>>>
>>> To do that, a quick idea is to introduce a flag IOMMU_NOTIFIER_ARBITRARY_MASK
>>> to IOMMUNotifierFlag, to declare that the iommu notifier (and its backend) can
>>> take arbitrary address mask, then it can be any value and finally becomes a
>>> length rather than an addr_mask.  Then for every iommu notify() we can directly
>>> deliver whatever we've got from the upper layer to this notifier.  With the new
>>> flag, vhost can do iommu_notifier_init() with UNMAP|ARBITRARY_MASK so it
>>> declares this capability.  Then no matter for device iotlb or normal iotlb, we
>>> skip the complicated procedure to split a big range into small ranges that are
>>> with strict addr_mask, but directly deliver the message to the iommu notifier.
>>> E.g., we can skip the loop in vtd_address_space_unmap() if the notifier is with
>>> ARBITRARY flag set.
>>
>> I'm not sure coupling IOMMU capability to notifier is the best choice.
> IMHO it's not an IOMMU capability.  The flag I wanted to introduce is a
> capability of the one who listens to the IOMMU TLB updates.  For our case, it's
> virtio/vhost's capability to allow arbitrary length. The IOMMU itself
> definitely has some limitation on the address range to be bound to an IOTLB
> invalidation, e.g., the device-iotlb we're talking here only accept both the
> iova address and addr_mask to be aligned to 2**N-1.


I think this goes back to one of our previous discussions of whether to 
introduce a dedicated notifier for device IOTLB.

For the IOMMU, it might have limitations like the GAW, but the device 
IOTLB probably doesn't. That's the reason we hit the assert here.
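As a rough illustration of the tradeoff being discussed, here is a standalone sketch (not QEMU code; the function and its shape are made up for this example) of delivering an invalidation either split into naturally aligned power-of-two chunks, as mask-based notifiers require today, or as a single (start, len) call for a backend that accepts arbitrary ranges:

```c
#include <stdbool.h>
#include <stdint.h>

/* Standalone sketch, not QEMU code: deliver an invalidation for
 * [start, start + size) to a notifier.  A mask-based notifier needs
 * each chunk to be a naturally aligned power of two, so the range is
 * split; a backend that declared it accepts arbitrary (start, len)
 * ranges gets a single call.  Returns the number of notifier calls. */
static int deliver_invalidation(uint64_t start, uint64_t size,
                                bool arbitrary_len,
                                void (*notify)(uint64_t iova, uint64_t mask))
{
    int calls = 0;

    if (arbitrary_len) {
        if (notify) {
            notify(start, size - 1);    /* "mask" is really len - 1 here */
        }
        return 1;
    }

    while (size) {
        /* Largest power-of-two chunk that is aligned at 'start'... */
        uint64_t chunk = start ? (start & -start) : 1ULL << 63;
        /* ...and that still fits in the remaining size. */
        while (chunk > size) {
            chunk >>= 1;
        }
        if (notify) {
            notify(start, chunk - 1);   /* addr_mask = 2**N - 1 */
        }
        start += chunk;
        size -= chunk;
        calls++;
    }
    return calls;
}
```

For instance, a 12K range at 0x1000 takes two calls in mask mode (a 4K chunk then an 8K chunk) but only one with the arbitrary-length behavior, which is the roundtrip saving mentioned above for vhost.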


>
>> How about just convert to use a range [start, end] for any notifier and move
>> the checks (e.g the assert) into the actual notifier implemented (vhost or
>> vfio)?
> IOMMUTLBEntry itself is the abstraction layer of TLB entry.  Hardware TLB entry
> is definitely not arbitrary range either (because AFAICT the hardware should
> only cache PFN rather than address, so at least PAGE_SIZE aligned).
> Introducing this flag will already make this trickier just to avoid introducing
> another similar struct to IOMMUTLBEntry, but I really don't want to make it a
> default option...  Not to mention I probably have no reason to urge the rest
> iommu notifier users (tcg, vfio) to change their existing good code to suite
> any of the backend who can cooperate with arbitrary address ranges...


Ok, so it looks like we need a dedicated notifier for device IOTLB.

Thanks


>
> Thanks,
>



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
  2020-06-29  5:51         ` Jason Wang
@ 2020-06-29 13:34           ` Peter Xu
  2020-06-30  2:41             ` Jason Wang
  0 siblings, 1 reply; 68+ messages in thread
From: Peter Xu @ 2020-06-29 13:34 UTC (permalink / raw)
  To: Jason Wang
  Cc: Peter Maydell, Yan Zhao, Juan Quintela, Michael S. Tsirkin,
	qemu-devel, Eugenio Pérez, Eric Auger, Paolo Bonzini

On Mon, Jun 29, 2020 at 01:51:47PM +0800, Jason Wang wrote:
> 
> On 2020/6/28 下午10:47, Peter Xu wrote:
> > On Sun, Jun 28, 2020 at 03:03:41PM +0800, Jason Wang wrote:
> > > On 2020/6/27 上午5:29, Peter Xu wrote:
> > > > Hi, Eugenio,
> > > > 
> > > > (CCing Eric, Yan and Michael too)
> > > > 
> > > > On Fri, Jun 26, 2020 at 08:41:22AM +0200, Eugenio Pérez wrote:
> > > > > diff --git a/memory.c b/memory.c
> > > > > index 2f15a4b250..7f789710d2 100644
> > > > > --- a/memory.c
> > > > > +++ b/memory.c
> > > > > @@ -1915,8 +1915,6 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
> > > > >            return;
> > > > >        }
> > > > > -    assert(entry->iova >= notifier->start && entry_end <= notifier->end);
> > > > I can understand removing the assertion should solve the issue, however imho
> > > > the major issue is not about this single assertion but the whole addr_mask
> > > > issue behind with virtio...
> > > 
> > > I don't get here, it looks to the the range was from guest IOMMU drivers.
> > Yes.  Note that I didn't mean that it's a problem in virtio, it's just the fact
> > that virtio is the only one I know that would like to support arbitrary address
> > range for the translated region.  I don't know about tcg, but vfio should still
> > need some kind of page alignment in both the address and the addr_mask.  We
> > have that assumption too across the memory core when we do translations.
> 
> 
> Right but it looks to me the issue is not the alignment.
> 
> 
> > 
> > A further cause of the issue is the MSI region when vIOMMU enabled - currently
> > we implemented the interrupt region using another memory region so it split the
> > whole DMA region into two parts.  That's really a clean approach to IR
> > implementation, however that's also a burden to the invalidation part because
> > then we'll need to handle things like this when the listened range is not page
> > alighed at all (neither 0-0xfedffff, nor 0xfef0000-MAX).  If without the IR
> > region (so the whole iommu address range will be a single FlatRange),
> 
> 
> Is this a bug? I remember that at least for vtd, it won't do any DMAR on the
> intrrupt address range

I don't think it's a bug; at least it's working as I understand it...  that
interrupt range is using an IR region, which is why I said the IR region
splits the DMAR region into two pieces, so we have two FlatRanges for the
same IOMMUMemoryRegion.

> 
> 
> >   I think
> > we probably don't need most of the logic in vtd_address_space_unmap() at all,
> > then we can directly deliver all the IOTLB invalidations without splitting into
> > small page aligned ranges to all the iommu notifiers.  Sadly, so far I still
> > don't have ideal solution for it, because we definitely need IR.
> 
> 
> Another possible (theoretical) issue (for vhost) is that it can't trigger
> interrupt through the interrupt range.

Hmm.. Could you explain?  When IR is enabled, all devices (including
virtio) that send interrupts to 0xfeeXXXXX should be trapped by IR.

> 
> 
> > 
> > > 
> > > > For normal IOTLB invalidations, we were trying our best to always make
> > > > IOMMUTLBEntry contain a valid addr_mask to be 2**N-1.  E.g., that's what we're
> > > > doing with the loop in vtd_address_space_unmap().
> > > 
> > > I'm sure such such assumption can work for any type of IOMMU.
> > > 
> > > 
> > > > But this is not the first time that we may want to break this assumption for
> > > > virtio so that we make the IOTLB a tuple of (start, len), then that len can be
> > > > not a address mask any more.  That seems to be more efficient for things like
> > > > vhost because iotlbs there are not page based, so it'll be inefficient if we
> > > > always guarantee the addr_mask because it'll be quite a lot more roundtrips of
> > > > the same range of invalidation.  Here we've encountered another issue of
> > > > triggering the assertion with virtio-net, but only with the old RHEL7 guest.
> > > > 
> > > > I'm thinking whether we can make the IOTLB invalidation configurable by
> > > > specifying whether the backend of the notifier can handle arbitary address
> > > > range in some way.  So we still have the guaranteed addr_masks by default
> > > > (since I still don't think totally break the addr_mask restriction is wise...),
> > > > however we can allow the special backends to take adavantage of using arbitary
> > > > (start, len) ranges for reasons like performance.
> > > > 
> > > > To do that, a quick idea is to introduce a flag IOMMU_NOTIFIER_ARBITRARY_MASK
> > > > to IOMMUNotifierFlag, to declare that the iommu notifier (and its backend) can
> > > > take arbitrary address mask, then it can be any value and finally becomes a
> > > > length rather than an addr_mask.  Then for every iommu notify() we can directly
> > > > deliver whatever we've got from the upper layer to this notifier.  With the new
> > > > flag, vhost can do iommu_notifier_init() with UNMAP|ARBITRARY_MASK so it
> > > > declares this capability.  Then no matter for device iotlb or normal iotlb, we
> > > > skip the complicated procedure to split a big range into small ranges that are
> > > > with strict addr_mask, but directly deliver the message to the iommu notifier.
> > > > E.g., we can skip the loop in vtd_address_space_unmap() if the notifier is with
> > > > ARBITRARY flag set.
> > > 
> > > I'm not sure coupling IOMMU capability to notifier is the best choice.
> > IMHO it's not an IOMMU capability.  The flag I wanted to introduce is a
> > capability of the one who listens to the IOMMU TLB updates.  For our case, it's
> > virtio/vhost's capability to allow arbitrary length. The IOMMU itself
> > definitely has some limitation on the address range to be bound to an IOTLB
> > invalidation, e.g., the device-iotlb we're talking here only accept both the
> > iova address and addr_mask to be aligned to 2**N-1.
> 
> 
> I think this go back to one of our previous discussion of whether to
> introduce a dedicated notifiers for device IOTLB.
> 
> For IOMMU, it might have limitation like GAW, but for device IOTLB it
> probably doesn't. That's the reason we hit the assert here.

I feel like even for hardware it shouldn't be arbitrary either, because the
device-iotlb invalidations sent from at least the vt-d driver are very
restricted too (borrowing the comment you wrote :):

    /* According to ATS spec table 2.4:
     * S = 0, bits 15:12 = xxxx     range size: 4K
     * S = 1, bits 15:12 = xxx0     range size: 8K
     * S = 1, bits 15:12 = xx01     range size: 16K
     * S = 1, bits 15:12 = x011     range size: 32K
     * S = 1, bits 15:12 = 0111     range size: 64K
     * ...
     */
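That encoding can be decoded mechanically; a small standalone sketch of the table above (illustrative only, not QEMU code):

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative decoder for the ATS size encoding quoted above: with
 * S = 1, the position of the least significant clear bit in the
 * address field (at or above bit 12) determines the range size.
 * Not QEMU code, just the table expressed as a function. */
static uint64_t ats_range_size(uint64_t addr, bool s)
{
    int n = 12;

    if (!s) {
        return 1ULL << 12;              /* S = 0: single 4K page */
    }
    while ((addr >> n) & 1) {           /* find the lowest clear bit */
        if (++n == 63) {
            return 1ULL << 63;          /* saturated: bits 62:12 all set */
        }
    }
    return 1ULL << (n + 1);             /* xxx0 -> 8K, xx01 -> 16K, ... */
}
```

So bits 15:12 = 0111 with S = 1 decodes to 64K, matching the last row shown.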

> 
> 
> > 
> > > How about just convert to use a range [start, end] for any notifier and move
> > > the checks (e.g the assert) into the actual notifier implemented (vhost or
> > > vfio)?
> > IOMMUTLBEntry itself is the abstraction layer of TLB entry.  Hardware TLB entry
> > is definitely not arbitrary range either (because AFAICT the hardware should
> > only cache PFN rather than address, so at least PAGE_SIZE aligned).
> > Introducing this flag will already make this trickier just to avoid introducing
> > another similar struct to IOMMUTLBEntry, but I really don't want to make it a
> > default option...  Not to mention I probably have no reason to urge the rest
> > iommu notifier users (tcg, vfio) to change their existing good code to suite
> > any of the backend who can cooperate with arbitrary address ranges...
> 
> 
> Ok, so it looks like we need a dedicated notifiers to device IOTLB.

Or we can also make a new flag for device iotlb just like the current UNMAP?
Then we replace the vhost type from UNMAP to DEVICE_IOTLB.  But IMHO using the
ARBITRARY_LENGTH flag would work in a similar way.  The DEVICE_IOTLB flag could
also allow virtio/vhost to only receive one invalidation (now IIUC it'll
receive both iotlb and device-iotlb for unmapping a page when ats=on), but then
ats=on will be a must, and it could break some old (misconfigured) qemu because
afaict previously virtio/vhost could even work with vIOMMU (accidentally) even
without ats=on.

Thanks,

-- 
Peter Xu




* Re: [RFC v2 0/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
  2020-06-26  6:41 [RFC v2 0/1] memory: Delete assertion in memory_region_unregister_iommu_notifier Eugenio Pérez
  2020-06-26  6:41 ` [RFC v2 1/1] " Eugenio Pérez
@ 2020-06-29 15:05 ` Paolo Bonzini
  2020-07-03  7:39   ` Eugenio Perez Martin
  2020-08-11 17:55 ` [RFC v3 " Eugenio Pérez
  2 siblings, 1 reply; 68+ messages in thread
From: Paolo Bonzini @ 2020-06-29 15:05 UTC (permalink / raw)
  To: Eugenio Pérez, qemu-devel
  Cc: Peter Maydell, Jason Wang, Avi Kivity, Peter Xu, Juan Quintela

On 26/06/20 08:41, Eugenio Pérez wrote:
> If we examine *entry in frame 4 of the backtrace:
> *entry = {target_as = 0x555556f6c050, iova = 0x0, translated_addr = 0x0,
> addr_mask = 0xffffffffffffffff, perm = 0x0}
> 
> Which (I think) tries to invalidate all the TLB registers of the device.
> 
> Just deleting that assert is enough for the VM to start and communicate
> using IOMMU, but maybe a better alternative is possible. We could move
> it to the caller functions in other cases than IOMMU invalidation, or
> make it conditional only if not invalidating.

Yes, I think moving it up in the call stack is better.  I cannot say
where because the backtrace was destroyed by git (due to lines starting
with "#").

Paolo

> Any comment would be appreciated. Thanks!
> 
> Guest kernel version: kernel-3.10.0-1151.el7.x86_64
> 
> Bug reference: https://bugs.launchpad.net/qemu/+bug/1885175
> 
> v2: Actually delete assertion instead of just commenting out using C99
> 
> Eugenio Pérez (1):
>   memory: Delete assertion in memory_region_unregister_iommu_notifier
> 
>  memory.c | 2 --
>  1 file changed, 2 deletions(-)
> 




* Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
  2020-06-29 13:34           ` Peter Xu
@ 2020-06-30  2:41             ` Jason Wang
  2020-06-30  8:29               ` Jason Wang
  2020-06-30 15:39               ` Peter Xu
  0 siblings, 2 replies; 68+ messages in thread
From: Jason Wang @ 2020-06-30  2:41 UTC (permalink / raw)
  To: Peter Xu
  Cc: Peter Maydell, Yan Zhao, Juan Quintela, Michael S. Tsirkin,
	qemu-devel, Eugenio Pérez, Eric Auger, Paolo Bonzini


On 2020/6/29 at 9:34 PM, Peter Xu wrote:
> On Mon, Jun 29, 2020 at 01:51:47PM +0800, Jason Wang wrote:
>> On 2020/6/28 下午10:47, Peter Xu wrote:
>>> On Sun, Jun 28, 2020 at 03:03:41PM +0800, Jason Wang wrote:
>>>> On 2020/6/27 上午5:29, Peter Xu wrote:
>>>>> Hi, Eugenio,
>>>>>
>>>>> (CCing Eric, Yan and Michael too)
>>>>>
>>>>> On Fri, Jun 26, 2020 at 08:41:22AM +0200, Eugenio Pérez wrote:
>>>>>> diff --git a/memory.c b/memory.c
>>>>>> index 2f15a4b250..7f789710d2 100644
>>>>>> --- a/memory.c
>>>>>> +++ b/memory.c
>>>>>> @@ -1915,8 +1915,6 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
>>>>>>             return;
>>>>>>         }
>>>>>> -    assert(entry->iova >= notifier->start && entry_end <= notifier->end);
>>>>> I can understand removing the assertion should solve the issue, however imho
>>>>> the major issue is not about this single assertion but the whole addr_mask
>>>>> issue behind with virtio...
>>>> I don't get here, it looks to the the range was from guest IOMMU drivers.
>>> Yes.  Note that I didn't mean that it's a problem in virtio, it's just the fact
>>> that virtio is the only one I know that would like to support arbitrary address
>>> range for the translated region.  I don't know about tcg, but vfio should still
>>> need some kind of page alignment in both the address and the addr_mask.  We
>>> have that assumption too across the memory core when we do translations.
>>
>> Right but it looks to me the issue is not the alignment.
>>
>>
>>> A further cause of the issue is the MSI region when vIOMMU enabled - currently
>>> we implemented the interrupt region using another memory region so it split the
>>> whole DMA region into two parts.  That's really a clean approach to IR
>>> implementation, however that's also a burden to the invalidation part because
>>> then we'll need to handle things like this when the listened range is not page
>>> alighed at all (neither 0-0xfedffff, nor 0xfef0000-MAX).  If without the IR
>>> region (so the whole iommu address range will be a single FlatRange),
>>
>> Is this a bug? I remember that at least for vtd, it won't do any DMAR on the
>> intrrupt address range
> I don't think it's a bug, at least it's working as how I understand...  that
> interrupt range is using an IR region, that's why I said the IR region splits
> the DMAR region into two pieces, so we have two FlatRange for the same
> IOMMUMemoryRegion.


I didn't check the qemu code, but if "a single FlatRange" means 
0xFEEx_xxxx is subject to DMA remapping, the OS needs to set up a 
passthrough mapping for that range in order to get MSI to work. This is 
not what the vtd spec says:

"""

3.14 Handling Requests to Interrupt Address Range

Requests without PASID to address range 0xFEEx_xxxx are treated as
potential interrupt requests and are not subjected to DMA remapping
(even if translation structures specify a mapping for this
range). Instead, remapping hardware can be enabled to subject such
interrupt requests to interrupt remapping.

"""

My understanding is vtd won't do any DMA translation on 0xFEEx_xxxx even 
if IR is not enabled.


>
>>
>>>    I think
>>> we probably don't need most of the logic in vtd_address_space_unmap() at all,
>>> then we can directly deliver all the IOTLB invalidations without splitting into
>>> small page aligned ranges to all the iommu notifiers.  Sadly, so far I still
>>> don't have ideal solution for it, because we definitely need IR.
>>
>> Another possible (theoretical) issue (for vhost) is that it can't trigger
>> interrupt through the interrupt range.
> Hmm.. Could you explain?  When IR is enabled, all devices including virtio
> who send interrupt to 0xfeeXXXXX should be trapped by IR.


I meant vhost, not virtio: if you teach vhost to DMA to 0xFEEx_xxxx, it 
can't generate any interrupts as expected.


>
>>
>>>>> For normal IOTLB invalidations, we were trying our best to always make
>>>>> IOMMUTLBEntry contain a valid addr_mask to be 2**N-1.  E.g., that's what we're
>>>>> doing with the loop in vtd_address_space_unmap().
>>>> I'm sure such such assumption can work for any type of IOMMU.
>>>>
>>>>
>>>>> But this is not the first time that we may want to break this assumption for
>>>>> virtio so that we make the IOTLB a tuple of (start, len), then that len can be
>>>>> not a address mask any more.  That seems to be more efficient for things like
>>>>> vhost because iotlbs there are not page based, so it'll be inefficient if we
>>>>> always guarantee the addr_mask because it'll be quite a lot more roundtrips of
>>>>> the same range of invalidation.  Here we've encountered another issue of
>>>>> triggering the assertion with virtio-net, but only with the old RHEL7 guest.
>>>>>
>>>>> I'm thinking whether we can make the IOTLB invalidation configurable by
>>>>> specifying whether the backend of the notifier can handle arbitary address
>>>>> range in some way.  So we still have the guaranteed addr_masks by default
>>>>> (since I still don't think totally break the addr_mask restriction is wise...),
>>>>> however we can allow the special backends to take adavantage of using arbitary
>>>>> (start, len) ranges for reasons like performance.
>>>>>
>>>>> To do that, a quick idea is to introduce a flag IOMMU_NOTIFIER_ARBITRARY_MASK
>>>>> to IOMMUNotifierFlag, to declare that the iommu notifier (and its backend) can
>>>>> take arbitrary address mask, then it can be any value and finally becomes a
>>>>> length rather than an addr_mask.  Then for every iommu notify() we can directly
>>>>> deliver whatever we've got from the upper layer to this notifier.  With the new
>>>>> flag, vhost can do iommu_notifier_init() with UNMAP|ARBITRARY_MASK so it
>>>>> declares this capability.  Then no matter for device iotlb or normal iotlb, we
>>>>> skip the complicated procedure to split a big range into small ranges that are
>>>>> with strict addr_mask, but directly deliver the message to the iommu notifier.
>>>>> E.g., we can skip the loop in vtd_address_space_unmap() if the notifier is with
>>>>> ARBITRARY flag set.
>>>> I'm not sure coupling IOMMU capability to notifier is the best choice.
>>> IMHO it's not an IOMMU capability.  The flag I wanted to introduce is a
>>> capability of the one who listens to the IOMMU TLB updates.  For our case, it's
>>> virtio/vhost's capability to allow arbitrary length. The IOMMU itself
>>> definitely has some limitation on the address range to be bound to an IOTLB
>>> invalidation, e.g., the device-iotlb we're talking here only accept both the
>>> iova address and addr_mask to be aligned to 2**N-1.
>>
>> I think this go back to one of our previous discussion of whether to
>> introduce a dedicated notifiers for device IOTLB.
>>
>> For IOMMU, it might have limitation like GAW, but for device IOTLB it
>> probably doesn't. That's the reason we hit the assert here.
> I feel like even for hardware it shouldn't be arbitrary either,


Yes, but from the view of the IOMMU, it's hard to know about that. 
Allowing [0, ~0ULL] looks sane.


>   because the
> device iotlb sent from at least vt-d driver is very restricted too (borrowing
> the comment you wrote :):
>
>      /* According to ATS spec table 2.4:
>       * S = 0, bits 15:12 = xxxx     range size: 4K
>       * S = 1, bits 15:12 = xxx0     range size: 8K
>       * S = 1, bits 15:12 = xx01     range size: 16K
>       * S = 1, bits 15:12 = x011     range size: 32K
>       * S = 1, bits 15:12 = 0111     range size: 64K
>       * ...
>       */


Right, but the comment is probably misleading here, since it's for the 
PCI-E transaction between the IOMMU and the device, not for the device 
IOTLB invalidation descriptor.

For device IOTLB invalidation descriptor, spec allows a [0, ~0ULL] 
invalidation:

"

6.5.2.5 Device-TLB Invalidate Descriptor

...

Size (S): The size field indicates the number of consecutive pages 
targeted by this invalidation
request. If S field is zero, a single page at page address specified by 
Address [63:12] is requested
to be invalidated. If S field is Set, the least significant bit in the 
Address field with value 0b
indicates the invalidation address range. For example, if S field is Set 
and Address[12] is Clear, it
indicates an 8KB invalidation address range with base address in Address 
[63:13]. If S field and
Address[12] is Set and bit 13 is Clear, it indicates a 16KB invalidation 
address range with base
address in Address [63:14], etc.

"

So if we receive an address whose [63] is 0 and the rest is all 1, it's 
then a [0, ~0ULL] invalidation.
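Following that description, the (iova, addr_mask) pair a descriptor encodes could be recovered along these lines (a sketch written against the spec text above, not the actual QEMU implementation):

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch of decoding a Device-TLB Invalidate Descriptor address field
 * into an IOMMUTLBEntry-style (iova, addr_mask) pair, per the spec
 * text quoted above.  Not the actual QEMU code. */
static void dev_iotlb_decode(uint64_t addr, bool s,
                             uint64_t *iova, uint64_t *addr_mask)
{
    if (!s) {
        *addr_mask = 0xfffULL;          /* S = 0: single 4K page */
    } else {
        int n = 12;
        /* The lowest clear bit at/above bit 12 marks the range size. */
        while (n < 63 && ((addr >> n) & 1)) {
            n++;
        }
        /* n == 63 means bits 62:12 were all set: with Address[63]
         * clear this is the full-range case described above, so the
         * mask becomes ~0ULL, i.e. a [0, ~0ULL] invalidation. */
        *addr_mask = (n == 63) ? ~0ULL : (1ULL << (n + 1)) - 1;
    }
    *iova = addr & ~*addr_mask;
}
```

With bit 63 clear and bits 62:12 all set, this yields iova = 0 and addr_mask = ~0ULL, which is exactly the entry seen in the backtrace that triggered the assert.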


>
>>
>>>> How about just convert to use a range [start, end] for any notifier and move
>>>> the checks (e.g the assert) into the actual notifier implemented (vhost or
>>>> vfio)?
>>> IOMMUTLBEntry itself is the abstraction layer of TLB entry.  Hardware TLB entry
>>> is definitely not arbitrary range either (because AFAICT the hardware should
>>> only cache PFN rather than address, so at least PAGE_SIZE aligned).
>>> Introducing this flag will already make this trickier just to avoid introducing
>>> another similar struct to IOMMUTLBEntry, but I really don't want to make it a
>>> default option...  Not to mention I probably have no reason to urge the rest
>>> iommu notifier users (tcg, vfio) to change their existing good code to suite
>>> any of the backend who can cooperate with arbitrary address ranges...
>>
>> Ok, so it looks like we need a dedicated notifiers to device IOTLB.
> Or we can also make a new flag for device iotlb just like current UNMAP? Then
> we replace the vhost type from UNMAP to DEVICE_IOTLB.  But IMHO using the
> ARBITRARY_LENGTH flag would work in a similar way.  DEVICE_IOTLB flag could
> also allow virtio/vhost to only receive one invalidation (now IIUC it'll
> receive both iotlb and device-iotlb for unmapping a page when ats=on), but then
> ats=on will be a must and it could break some old (misconfiged) qemu because
> afaict previously virtio/vhost could even work with vIOMMU (accidentally) even
> without ats=on.


That's a bug, and I don't think we need to work around misconfigured 
qemu :)

Thanks


>
> Thanks,
>




* Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
  2020-06-30  2:41             ` Jason Wang
@ 2020-06-30  8:29               ` Jason Wang
  2020-06-30  9:21                 ` Michael S. Tsirkin
  2020-06-30 15:39               ` Peter Xu
  1 sibling, 1 reply; 68+ messages in thread
From: Jason Wang @ 2020-06-30  8:29 UTC (permalink / raw)
  To: Peter Xu
  Cc: Peter Maydell, Yan Zhao, Juan Quintela, Michael S. Tsirkin,
	qemu-devel, Eugenio Pérez, Eric Auger, Paolo Bonzini


On 2020/6/30 at 10:41 AM, Jason Wang wrote:
>
> On 2020/6/29 下午9:34, Peter Xu wrote:
>> On Mon, Jun 29, 2020 at 01:51:47PM +0800, Jason Wang wrote:
>>> On 2020/6/28 下午10:47, Peter Xu wrote:
>>>> On Sun, Jun 28, 2020 at 03:03:41PM +0800, Jason Wang wrote:
>>>>> On 2020/6/27 上午5:29, Peter Xu wrote:
>>>>>> Hi, Eugenio,
>>>>>>
>>>>>> (CCing Eric, Yan and Michael too)
>>>>>>
>>>>>> On Fri, Jun 26, 2020 at 08:41:22AM +0200, Eugenio Pérez wrote:
>>>>>>> diff --git a/memory.c b/memory.c
>>>>>>> index 2f15a4b250..7f789710d2 100644
>>>>>>> --- a/memory.c
>>>>>>> +++ b/memory.c
>>>>>>> @@ -1915,8 +1915,6 @@ void 
>>>>>>> memory_region_notify_one(IOMMUNotifier *notifier,
>>>>>>>             return;
>>>>>>>         }
>>>>>>> -    assert(entry->iova >= notifier->start && entry_end <= 
>>>>>>> notifier->end);
>>>>>> I can understand removing the assertion should solve the issue, 
>>>>>> however imho
>>>>>> the major issue is not about this single assertion but the whole 
>>>>>> addr_mask
>>>>>> issue behind with virtio...
>>>>> I don't get here, it looks to the the range was from guest IOMMU 
>>>>> drivers.
>>>> Yes.  Note that I didn't mean that it's a problem in virtio, it's 
>>>> just the fact
>>>> that virtio is the only one I know that would like to support 
>>>> arbitrary address
>>>> range for the translated region.  I don't know about tcg, but vfio 
>>>> should still
>>>> need some kind of page alignment in both the address and the 
>>>> addr_mask.  We
>>>> have that assumption too across the memory core when we do 
>>>> translations.
>>>
>>> Right but it looks to me the issue is not the alignment.
>>>
>>>
>>>> A further cause of the issue is the MSI region when vIOMMU enabled 
>>>> - currently
>>>> we implemented the interrupt region using another memory region so 
>>>> it split the
>>>> whole DMA region into two parts.  That's really a clean approach to IR
>>>> implementation, however that's also a burden to the invalidation 
>>>> part because
>>>> then we'll need to handle things like this when the listened range 
>>>> is not page
>>>> alighed at all (neither 0-0xfedffff, nor 0xfef0000-MAX).  If 
>>>> without the IR
>>>> region (so the whole iommu address range will be a single FlatRange),
>>>
>>> Is this a bug? I remember that at least for vtd, it won't do any 
>>> DMAR on the
>>> intrrupt address range
>> I don't think it's a bug, at least it's working as how I 
>> understand...  that
>> interrupt range is using an IR region, that's why I said the IR 
>> region splits
>> the DMAR region into two pieces, so we have two FlatRange for the same
>> IOMMUMemoryRegion.
>
>
> I don't check the qemu code but if "a single FlatRange" means 
> 0xFEEx_xxxx is subject to DMA remapping, OS need to setup passthrough 
> mapping for that range in order to get MSI to work. This is not what 
> vtd spec said:
>
> """
>
> 3.14 Handling Requests to Interrupt Address Range
>
> Requests without PASID to address range 0xFEEx_xxxx are treated as
> potential interrupt requests and are not subjected to DMA remapping
> (even if translation structures specify a mapping for this
> range). Instead, remapping hardware can be enabled to subject such
> interrupt requests to interrupt remapping.
>
> """
>
> My understanding is vtd won't do any DMA translation on 0xFEEx_xxxx 
> even if IR is not enabled.


Ok, we had a dedicated mr for interrupt:

memory_region_add_subregion_overlap(MEMORY_REGION(&vtd_dev_as->iommu),
                                    VTD_INTERRUPT_ADDR_FIRST,
                                    &vtd_dev_as->iommu_ir, 1);

So it should be fine. I guess the reason I'm asking is that I thought 
"IR" meant "Interrupt Remapping", but in fact it means "Interrupt 
Region"?

But I'm still not clear about the invalidation part for the interrupt 
region; maybe you can elaborate a little more on this.

Btw, I think the guest can trigger the assert in vtd_do_iommu_translate() 
if we teach vhost to DMA to that region:

     /*
      * We have standalone memory region for interrupt addresses, we
      * should never receive translation requests in this region.
      */
     assert(!vtd_is_interrupt_addr(addr));

Would it be better to return false here? (We can work on the fix for 
vhost, but it would not be trivial.)
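A minimal sketch of that fail-soft behavior (the range constants follow the VT-d 0xFEEx_xxxx definition; the surrounding translate function is made up for illustration and is not the real vtd_do_iommu_translate()):

```c
#include <stdbool.h>
#include <stdint.h>

#define VTD_INTERRUPT_ADDR_FIRST 0xfee00000ULL
#define VTD_INTERRUPT_ADDR_LAST  0xfeefffffULL

static bool vtd_is_interrupt_addr(uint64_t addr)
{
    return addr >= VTD_INTERRUPT_ADDR_FIRST &&
           addr <= VTD_INTERRUPT_ADDR_LAST;
}

/* Sketch of the suggested change: instead of asserting that a
 * translation request never targets the interrupt range, fail the
 * translation, so a device model taught to DMA there (e.g. vhost)
 * gets an error rather than aborting qemu.  The identity translation
 * below is only a placeholder for the real page-table walk. */
static bool vtd_translate_sketch(uint64_t addr, uint64_t *out)
{
    if (vtd_is_interrupt_addr(addr)) {
        return false;   /* was: assert(!vtd_is_interrupt_addr(addr)) */
    }
    *out = addr;        /* placeholder for the actual lookup */
    return true;
}
```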

Thanks





* Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
  2020-06-30  8:29               ` Jason Wang
@ 2020-06-30  9:21                 ` Michael S. Tsirkin
  2020-06-30  9:23                   ` Jason Wang
  0 siblings, 1 reply; 68+ messages in thread
From: Michael S. Tsirkin @ 2020-06-30  9:21 UTC (permalink / raw)
  To: Jason Wang
  Cc: Peter Maydell, Yan Zhao, Juan Quintela, qemu-devel, Peter Xu,
	Eugenio Pérez, Eric Auger, Paolo Bonzini

On Tue, Jun 30, 2020 at 04:29:19PM +0800, Jason Wang wrote:
> 
> On 2020/6/30 上午10:41, Jason Wang wrote:
> > 
> > On 2020/6/29 下午9:34, Peter Xu wrote:
> > > On Mon, Jun 29, 2020 at 01:51:47PM +0800, Jason Wang wrote:
> > > > On 2020/6/28 下午10:47, Peter Xu wrote:
> > > > > On Sun, Jun 28, 2020 at 03:03:41PM +0800, Jason Wang wrote:
> > > > > > On 2020/6/27 上午5:29, Peter Xu wrote:
> > > > > > > Hi, Eugenio,
> > > > > > > 
> > > > > > > (CCing Eric, Yan and Michael too)
> > > > > > > 
> > > > > > > On Fri, Jun 26, 2020 at 08:41:22AM +0200, Eugenio Pérez wrote:
> > > > > > > > diff --git a/memory.c b/memory.c
> > > > > > > > index 2f15a4b250..7f789710d2 100644
> > > > > > > > --- a/memory.c
> > > > > > > > +++ b/memory.c
> > > > > > > > @@ -1915,8 +1915,6 @@ void
> > > > > > > > memory_region_notify_one(IOMMUNotifier
> > > > > > > > *notifier,
> > > > > > > >             return;
> > > > > > > >         }
> > > > > > > > -    assert(entry->iova >= notifier->start &&
> > > > > > > > entry_end <= notifier->end);
> > > > > > > I can understand removing the assertion should solve
> > > > > > > the issue, however imho
> > > > > > > the major issue is not about this single assertion
> > > > > > > but the whole addr_mask
> > > > > > > issue behind with virtio...
> > > > > > I don't get it here; it looks to me like the range came from
> > > > > > the guest IOMMU drivers.
> > > > > Yes.  Note that I didn't mean that it's a problem in virtio,
> > > > > it's just the fact
> > > > > that virtio is the only one I know that would like to
> > > > > support arbitrary address
> > > > > range for the translated region.  I don't know about tcg,
> > > > > but vfio should still
> > > > > need some kind of page alignment in both the address and the
> > > > > addr_mask.  We
> > > > > have that assumption too across the memory core when we do
> > > > > translations.
> > > > 
> > > > Right but it looks to me the issue is not the alignment.
> > > > 
> > > > 
> > > > > A further cause of the issue is the MSI region when vIOMMU
> > > > > enabled - currently
> > > > > we implemented the interrupt region using another memory
> > > > > region so it split the
> > > > > whole DMA region into two parts.  That's really a clean approach to IR
> > > > > implementation, however that's also a burden to the
> > > > > invalidation part because
> > > > > then we'll need to handle things like this when the listened
> > > > > range is not page
> > > > > aligned at all (neither 0-0xfedfffff, nor 0xfef00000-MAX).  If
> > > > > without the IR
> > > > > region (so the whole iommu address range will be a single FlatRange),
> > > > 
> > > > Is this a bug? I remember that at least for vtd, it won't do any
> > > > DMAR on the
> > > > interrupt address range
> > > I don't think it's a bug, at least it's working as how I
> > > understand...  that
> > > interrupt range is using an IR region, that's why I said the IR
> > > region splits
> > > the DMAR region into two pieces, so we have two FlatRange for the same
> > > IOMMUMemoryRegion.
> > 
> > 
> > I didn't check the qemu code, but if "a single FlatRange" means
> > 0xFEEx_xxxx is subject to DMA remapping, the OS needs to set up a
> > passthrough mapping for that range in order to get MSI to work. This is
> > not what the vtd spec says:
> > 
> > """
> > 
> > 3.14 Handling Requests to Interrupt Address Range
> > 
> > Requests without PASID to address range 0xFEEx_xxxx are treated as
> > potential interrupt requests and are not subjected to DMA remapping
> > (even if translation structures specify a mapping for this
> > range). Instead, remapping hardware can be enabled to subject such
> > interrupt requests to interrupt remapping.
> > 
> > """
> > 
> > My understanding is vtd won't do any DMA translation on 0xFEEx_xxxx even
> > if IR is not enabled.
> 
> 
> Ok, we had a dedicated mr for interrupt:
> 
> memory_region_add_subregion_overlap(MEMORY_REGION(&vtd_dev_as->iommu),
> VTD_INTERRUPT_ADDR_FIRST,
> &vtd_dev_as->iommu_ir, 1);
> 
> So it should be fine. I guess the reason that I'm asking is that I thought
> "IR" means "Interrupt remapping" but in fact it means "Interrupt Region"?
> 
> But I'm still not clear about the invalidation part for interrupt region,
> maybe you can elaborate a little more on this.
> 
> Btw, I think guest can trigger the assert in vtd_do_iommu_translate() if we
> teach vhost to DMA to that region:


Why would we want to?

> 
>     /*
>      * We have standalone memory region for interrupt addresses, we
>      * should never receive translation requests in this region.
>      */
>     assert(!vtd_is_interrupt_addr(addr));
> 
> Would it be better to return false here? (We can work on the fix for vhost,
> but it would not be trivial.)
> 
> Thanks
> 




* Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
  2020-06-30  9:21                 ` Michael S. Tsirkin
@ 2020-06-30  9:23                   ` Jason Wang
  2020-06-30 15:20                     ` Peter Xu
  0 siblings, 1 reply; 68+ messages in thread
From: Jason Wang @ 2020-06-30  9:23 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Peter Maydell, Yan Zhao, Juan Quintela, qemu-devel, Peter Xu,
	Eugenio Pérez, Eric Auger, Paolo Bonzini


On 2020/6/30 5:21 PM, Michael S. Tsirkin wrote:
> On Tue, Jun 30, 2020 at 04:29:19PM +0800, Jason Wang wrote:
>> On 2020/6/30 10:41 AM, Jason Wang wrote:
>>> On 2020/6/29 9:34 PM, Peter Xu wrote:
>>>> On Mon, Jun 29, 2020 at 01:51:47PM +0800, Jason Wang wrote:
>>>>> On 2020/6/28 10:47 PM, Peter Xu wrote:
>>>>>> On Sun, Jun 28, 2020 at 03:03:41PM +0800, Jason Wang wrote:
>>>>>>> On 2020/6/27 5:29 AM, Peter Xu wrote:
>>>>>>>> Hi, Eugenio,
>>>>>>>>
>>>>>>>> (CCing Eric, Yan and Michael too)
>>>>>>>>
>>>>>>>> On Fri, Jun 26, 2020 at 08:41:22AM +0200, Eugenio Pérez wrote:
>>>>>>>>> diff --git a/memory.c b/memory.c
>>>>>>>>> index 2f15a4b250..7f789710d2 100644
>>>>>>>>> --- a/memory.c
>>>>>>>>> +++ b/memory.c
>>>>>>>>> @@ -1915,8 +1915,6 @@ void
>>>>>>>>> memory_region_notify_one(IOMMUNotifier
>>>>>>>>> *notifier,
>>>>>>>>>              return;
>>>>>>>>>          }
>>>>>>>>> -    assert(entry->iova >= notifier->start &&
>>>>>>>>> entry_end <= notifier->end);
>>>>>>>> I can understand removing the assertion should solve
>>>>>>>> the issue, however imho
>>>>>>>> the major issue is not about this single assertion
>>>>>>>> but the whole addr_mask
>>>>>>>> issue behind with virtio...
>>>>>>> I don't get it here; it looks to me like the range came from
>>>>>>> the guest IOMMU drivers.
>>>>>> Yes.  Note that I didn't mean that it's a problem in virtio,
>>>>>> it's just the fact
>>>>>> that virtio is the only one I know that would like to
>>>>>> support arbitrary address
>>>>>> range for the translated region.  I don't know about tcg,
>>>>>> but vfio should still
>>>>>> need some kind of page alignment in both the address and the
>>>>>> addr_mask.  We
>>>>>> have that assumption too across the memory core when we do
>>>>>> translations.
>>>>> Right but it looks to me the issue is not the alignment.
>>>>>
>>>>>
>>>>>> A further cause of the issue is the MSI region when vIOMMU
>>>>>> enabled - currently
>>>>>> we implemented the interrupt region using another memory
>>>>>> region so it split the
>>>>>> whole DMA region into two parts.  That's really a clean approach to IR
>>>>>> implementation, however that's also a burden to the
>>>>>> invalidation part because
>>>>>> then we'll need to handle things like this when the listened
>>>>>> range is not page
>>>>>> aligned at all (neither 0-0xfedfffff, nor 0xfef00000-MAX).  If
>>>>>> without the IR
>>>>>> region (so the whole iommu address range will be a single FlatRange),
>>>>> Is this a bug? I remember that at least for vtd, it won't do any
>>>>> DMAR on the
>>>>> interrupt address range
>>>> I don't think it's a bug, at least it's working as how I
>>>> understand...  that
>>>> interrupt range is using an IR region, that's why I said the IR
>>>> region splits
>>>> the DMAR region into two pieces, so we have two FlatRange for the same
>>>> IOMMUMemoryRegion.
>>>
>>> I didn't check the qemu code, but if "a single FlatRange" means
>>> 0xFEEx_xxxx is subject to DMA remapping, the OS needs to set up a
>>> passthrough mapping for that range in order to get MSI to work. This is
>>> not what the vtd spec says:
>>>
>>> """
>>>
>>> 3.14 Handling Requests to Interrupt Address Range
>>>
>>> Requests without PASID to address range 0xFEEx_xxxx are treated as
>>> potential interrupt requests and are not subjected to DMA remapping
>>> (even if translation structures specify a mapping for this
>>> range). Instead, remapping hardware can be enabled to subject such
>>> interrupt requests to interrupt remapping.
>>>
>>> """
>>>
>>> My understanding is vtd won't do any DMA translation on 0xFEEx_xxxx even
>>> if IR is not enabled.
>>
>> Ok, we had a dedicated mr for interrupt:
>>
>> memory_region_add_subregion_overlap(MEMORY_REGION(&vtd_dev_as->iommu),
>> VTD_INTERRUPT_ADDR_FIRST,
>> &vtd_dev_as->iommu_ir, 1);
>>
>> So it should be fine. I guess the reason that I'm asking is that I thought
>> "IR" means "Interrupt remapping" but in fact it means "Interrupt Region"?
>>
>> But I'm still not clear about the invalidation part for interrupt region,
>> maybe you can elaborate a little more on this.
>>
>> Btw, I think guest can trigger the assert in vtd_do_iommu_translate() if we
>> teach vhost to DMA to that region:
>
> Why would we want to?


I meant a buggy (malicious) guest driver.

Thanks


>
>>      /*
>>       * We have standalone memory region for interrupt addresses, we
>>       * should never receive translation requests in this region.
>>       */
>>      assert(!vtd_is_interrupt_addr(addr));
>>
>> Would it be better to return false here? (We can work on the fix for vhost,
>> but it would not be trivial.)
>>
>> Thanks
>>




* Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
  2020-06-30  9:23                   ` Jason Wang
@ 2020-06-30 15:20                     ` Peter Xu
  2020-07-01  8:11                       ` Jason Wang
  0 siblings, 1 reply; 68+ messages in thread
From: Peter Xu @ 2020-06-30 15:20 UTC (permalink / raw)
  To: Jason Wang
  Cc: Peter Maydell, Yan Zhao, Juan Quintela, Michael S. Tsirkin,
	qemu-devel, Eugenio Pérez, Eric Auger, Paolo Bonzini

On Tue, Jun 30, 2020 at 05:23:31PM +0800, Jason Wang wrote:
> > > Ok, we had a dedicated mr for interrupt:
> > > 
> > > memory_region_add_subregion_overlap(MEMORY_REGION(&vtd_dev_as->iommu),
> > > VTD_INTERRUPT_ADDR_FIRST,
> > > &vtd_dev_as->iommu_ir, 1);
> > > 
> > > So it should be fine. I guess the reason that I'm asking is that I thought
> > > "IR" means "Interrupt remapping" but in fact it means "Interrupt Region"?

I was meaning "interrupt remapping", and of course it's the interrupt region
too when IR enabled...

> > > 
> > > But I'm still not clear about the invalidation part for interrupt region,
> > > maybe you can elaborate a little more on this.
> > > 
> > > Btw, I think guest can trigger the assert in vtd_do_iommu_translate() if we
> > > teach vhost to DMA to that region:
> > 
> > Why would we want to?
> 
> 
> I meant a buggy(malicious) guest driver.

Yes seems possible.  Do you want to post a patch?  Let me know if you want me
to...  Thanks,

-- 
Peter Xu




* Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
  2020-06-30  2:41             ` Jason Wang
  2020-06-30  8:29               ` Jason Wang
@ 2020-06-30 15:39               ` Peter Xu
  2020-07-01  8:09                 ` Jason Wang
  1 sibling, 1 reply; 68+ messages in thread
From: Peter Xu @ 2020-06-30 15:39 UTC (permalink / raw)
  To: Jason Wang
  Cc: Peter Maydell, Yan Zhao, Juan Quintela, Michael S. Tsirkin,
	qemu-devel, Eugenio Pérez, Eric Auger, Paolo Bonzini

On Tue, Jun 30, 2020 at 10:41:10AM +0800, Jason Wang wrote:
> >      /* According to ATS spec table 2.4:
> >       * S = 0, bits 15:12 = xxxx     range size: 4K
> >       * S = 1, bits 15:12 = xxx0     range size: 8K
> >       * S = 1, bits 15:12 = xx01     range size: 16K
> >       * S = 1, bits 15:12 = x011     range size: 32K
> >       * S = 1, bits 15:12 = 0111     range size: 64K
> >       * ...
> >       */
> 
> 
> Right, but the comment is probably misleading here, since it's for the PCI-E
> transaction between IOMMU and device not for the device IOTLB invalidation
> descriptor.
> 
> For device IOTLB invalidation descriptor, spec allows a [0, ~0ULL]
> invalidation:
> 
> "
> 
> 6.5.2.5 Device-TLB Invalidate Descriptor
> 
> ...
> 
> Size (S): The size field indicates the number of consecutive pages
> targeted by this invalidation request. If S field is zero, a single page
> at page address specified by Address [63:12] is requested to be
> invalidated. If S field is Set, the least significant bit in the Address
> field with value 0b indicates the invalidation address range. For
> example, if S field is Set and Address[12] is Clear, it indicates an 8KB
> invalidation address range with base address in Address [63:13]. If S
> field and Address[12] is Set and bit 13 is Clear, it indicates a 16KB
> invalidation address range with base address in Address [63:14], etc.
> 
> "
> 
> So if we receive an address whose [63] is 0 and the rest is all 1, it's then
> a [0, ~0ULL] invalidation.

Yes.  I think invalidating the whole range is always fine.  It's still not
arbitrary, right?  E.g., we can't even invalidate (0x1000, 0x3000) with
device-iotlb because of the address mask, not to say sub-pages.

> 
> 
> > 
> > > 
> > > > > How about just convert to use a range [start, end] for any notifier and move
> > > > > the checks (e.g the assert) into the actual notifier implemented (vhost or
> > > > > vfio)?
> > > > IOMMUTLBEntry itself is the abstraction layer of TLB entry.  Hardware TLB entry
> > > > is definitely not arbitrary range either (because AFAICT the hardware should
> > > > only cache PFN rather than address, so at least PAGE_SIZE aligned).
> > > > Introducing this flag will already make this trickier just to avoid introducing
> > > > another similar struct to IOMMUTLBEntry, but I really don't want to make it a
> > > > default option...  Not to mention I probably have no reason to urge the rest
> > > > iommu notifier users (tcg, vfio) to change their existing good code to suite
> > > > any of the backend who can cooperate with arbitrary address ranges...
> > > 
> > > Ok, so it looks like we need a dedicated notifier for device IOTLB.
> > Or we can also make a new flag for device iotlb just like current UNMAP? Then
> > we replace the vhost type from UNMAP to DEVICE_IOTLB.  But IMHO using the
> > ARBITRARY_LENGTH flag would work in a similar way.  DEVICE_IOTLB flag could
> > also allow virtio/vhost to only receive one invalidation (now IIUC it'll
> > receive both iotlb and device-iotlb for unmapping a page when ats=on), but then
ats=on will be a must and it could break some old (misconfigured) qemu because
> > afaict previously virtio/vhost could even work with vIOMMU (accidentally) even
> > without ats=on.
> 
> 
> That's a bug, and I don't think we need to work around misconfigured qemu
> :)

IMHO it depends on the strictness we want on the qemu cmdline API. :)

We should at least check libvirt to make sure it's using ats=on always, then I
agree maybe we can avoid considering the rest...

Thanks,

-- 
Peter Xu




* Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
  2020-06-30 15:39               ` Peter Xu
@ 2020-07-01  8:09                 ` Jason Wang
  2020-07-02  3:01                   ` Jason Wang
  0 siblings, 1 reply; 68+ messages in thread
From: Jason Wang @ 2020-07-01  8:09 UTC (permalink / raw)
  To: Peter Xu
  Cc: Peter Maydell, Yan Zhao, Juan Quintela, libvir-list,
	Michael S. Tsirkin, qemu-devel, Eugenio Pérez, Eric Auger,
	Paolo Bonzini


On 2020/6/30 11:39 PM, Peter Xu wrote:
> On Tue, Jun 30, 2020 at 10:41:10AM +0800, Jason Wang wrote:
>>>       /* According to ATS spec table 2.4:
>>>        * S = 0, bits 15:12 = xxxx     range size: 4K
>>>        * S = 1, bits 15:12 = xxx0     range size: 8K
>>>        * S = 1, bits 15:12 = xx01     range size: 16K
>>>        * S = 1, bits 15:12 = x011     range size: 32K
>>>        * S = 1, bits 15:12 = 0111     range size: 64K
>>>        * ...
>>>        */
>>
>> Right, but the comment is probably misleading here, since it's for the PCI-E
>> transaction between IOMMU and device not for the device IOTLB invalidation
>> descriptor.
>>
>> For device IOTLB invalidation descriptor, spec allows a [0, ~0ULL]
>> invalidation:
>>
>> "
>>
>> 6.5.2.5 Device-TLB Invalidate Descriptor
>>
>> ...
>>
>> Size (S): The size field indicates the number of consecutive pages
>> targeted by this invalidation request. If S field is zero, a single page
>> at page address specified by Address [63:12] is requested to be
>> invalidated. If S field is Set, the least significant bit in the Address
>> field with value 0b indicates the invalidation address range. For
>> example, if S field is Set and Address[12] is Clear, it indicates an 8KB
>> invalidation address range with base address in Address [63:13]. If S
>> field and Address[12] is Set and bit 13 is Clear, it indicates a 16KB
>> invalidation address range with base address in Address [63:14], etc.
>>
>> "
>>
>> So if we receive an address whose [63] is 0 and the rest is all 1, it's then
>> a [0, ~0ULL] invalidation.
> Yes.  I think invalidating the whole range is always fine.  It's still not
> arbitrary, right?  E.g., we can't even invalidate (0x1000, 0x3000) with
> device-iotlb because of the address mask, not to say sub-pages.


Yes.


>
>>
>>>>>> How about just convert to use a range [start, end] for any notifier and move
>>>>>> the checks (e.g the assert) into the actual notifier implemented (vhost or
>>>>>> vfio)?
>>>>> IOMMUTLBEntry itself is the abstraction layer of TLB entry.  Hardware TLB entry
>>>>> is definitely not arbitrary range either (because AFAICT the hardware should
>>>>> only cache PFN rather than address, so at least PAGE_SIZE aligned).
>>>>> Introducing this flag will already make this trickier just to avoid introducing
>>>>> another similar struct to IOMMUTLBEntry, but I really don't want to make it a
>>>>> default option...  Not to mention I probably have no reason to urge the rest
>>>>> iommu notifier users (tcg, vfio) to change their existing good code to suite
>>>>> any of the backend who can cooperate with arbitrary address ranges...
>>>> Ok, so it looks like we need a dedicated notifier for device IOTLB.
>>> Or we can also make a new flag for device iotlb just like current UNMAP? Then
>>> we replace the vhost type from UNMAP to DEVICE_IOTLB.  But IMHO using the
>>> ARBITRARY_LENGTH flag would work in a similar way.  DEVICE_IOTLB flag could
>>> also allow virtio/vhost to only receive one invalidation (now IIUC it'll
>>> receive both iotlb and device-iotlb for unmapping a page when ats=on), but then
>>> ats=on will be a must and it could break some old (misconfigured) qemu because
>>> afaict previously virtio/vhost could even work with vIOMMU (accidentally) even
>>> without ats=on.
>>
>> That's a bug, and I don't think we need to work around misconfigured qemu
>> :)
> IMHO it depends on the strictness we want on the qemu cmdline API. :)
>
> We should at least check libvirt to make sure it's using ats=on always, then I
> agree maybe we can avoid considering the rest...
>
> Thanks,


Cc'ing the libvirt list, but I think we should fix libvirt if they don't
provide "ats=on".

Thanks





* Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
  2020-06-30 15:20                     ` Peter Xu
@ 2020-07-01  8:11                       ` Jason Wang
  2020-07-01 12:16                         ` Peter Xu
  0 siblings, 1 reply; 68+ messages in thread
From: Jason Wang @ 2020-07-01  8:11 UTC (permalink / raw)
  To: Peter Xu
  Cc: Peter Maydell, Yan Zhao, Michael S. Tsirkin, Juan Quintela,
	qemu-devel, Eugenio Pérez, Eric Auger, Paolo Bonzini


On 2020/6/30 11:20 PM, Peter Xu wrote:
> On Tue, Jun 30, 2020 at 05:23:31PM +0800, Jason Wang wrote:
>>>> Ok, we had a dedicated mr for interrupt:
>>>>
>>>> memory_region_add_subregion_overlap(MEMORY_REGION(&vtd_dev_as->iommu),
>>>> VTD_INTERRUPT_ADDR_FIRST,
>>>> &vtd_dev_as->iommu_ir, 1);
>>>>
>>>> So it should be fine. I guess the reason that I'm asking is that I thought
>>>> "IR" means "Interrupt remapping" but in fact it means "Interrupt Region"?
> I was meaning "interrupt remapping", and of course it's the interrupt region
> too when IR enabled...


Right.


>
>>>> But I'm still not clear about the invalidation part for interrupt region,
>>>> maybe you can elaborate a little more on this.
>>>>
>>>> Btw, I think guest can trigger the assert in vtd_do_iommu_translate() if we
>>>> teach vhost to DMA to that region:
>>> Why would we want to?
>>
>> I meant a buggy (malicious) guest driver.
> Yes seems possible.  Do you want to post a patch?  Let me know if you want me
> to...  Thanks,


Yes please.

Thanks


>




* Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
  2020-07-01  8:11                       ` Jason Wang
@ 2020-07-01 12:16                         ` Peter Xu
  2020-07-01 12:30                           ` Jason Wang
  0 siblings, 1 reply; 68+ messages in thread
From: Peter Xu @ 2020-07-01 12:16 UTC (permalink / raw)
  To: Jason Wang
  Cc: Peter Maydell, Yan Zhao, Michael S. Tsirkin, Juan Quintela,
	qemu-devel, Eugenio Pérez, Eric Auger, Paolo Bonzini

On Wed, Jul 01, 2020 at 04:11:49PM +0800, Jason Wang wrote:
> 
> On 2020/6/30 11:20 PM, Peter Xu wrote:
> > On Tue, Jun 30, 2020 at 05:23:31PM +0800, Jason Wang wrote:
> > > > > Ok, we had a dedicated mr for interrupt:
> > > > > 
> > > > > memory_region_add_subregion_overlap(MEMORY_REGION(&vtd_dev_as->iommu),
> > > > > VTD_INTERRUPT_ADDR_FIRST,
> > > > > &vtd_dev_as->iommu_ir, 1);
> > > > > 
> > > > > So it should be fine. I guess the reason that I'm asking is that I thought
> > > > > "IR" means "Interrupt remapping" but in fact it means "Interrupt Region"?
> > I was meaning "interrupt remapping", and of course it's the interrupt region
> > too when IR enabled...
> 
> 
> Right.
> 
> 
> > 
> > > > > But I'm still not clear about the invalidation part for interrupt region,
> > > > > maybe you can elaborate a little more on this.
> > > > > 
> > > > > Btw, I think guest can trigger the assert in vtd_do_iommu_translate() if we
> > > > > teach vhost to DMA to that region:
> > > > Why would we want to?
> > > 
> > > I meant a buggy (malicious) guest driver.
> > Yes seems possible.  Do you want to post a patch?  Let me know if you want me
> > to...  Thanks,
> 
> 
> Yes please.

Oh wait...  Actually the comment above explains...

    /*
     * We have standalone memory region for interrupt addresses, we
     * should never receive translation requests in this region.
     */
    assert(!vtd_is_interrupt_addr(addr));

I myself overlooked that the IR region will be there even if ir=off.  So I
think the assert should stand.

-- 
Peter Xu




* Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
  2020-07-01 12:16                         ` Peter Xu
@ 2020-07-01 12:30                           ` Jason Wang
  2020-07-01 12:41                             ` Peter Xu
  0 siblings, 1 reply; 68+ messages in thread
From: Jason Wang @ 2020-07-01 12:30 UTC (permalink / raw)
  To: Peter Xu
  Cc: Peter Maydell, Yan Zhao, Michael S. Tsirkin, Juan Quintela,
	qemu-devel, Eugenio Pérez, Eric Auger, Paolo Bonzini


On 2020/7/1 8:16 PM, Peter Xu wrote:
> On Wed, Jul 01, 2020 at 04:11:49PM +0800, Jason Wang wrote:
>>> On 2020/6/30 11:20 PM, Peter Xu wrote:
>>> On Tue, Jun 30, 2020 at 05:23:31PM +0800, Jason Wang wrote:
>>>>>> Ok, we had a dedicated mr for interrupt:
>>>>>>
>>>>>> memory_region_add_subregion_overlap(MEMORY_REGION(&vtd_dev_as->iommu),
>>>>>> VTD_INTERRUPT_ADDR_FIRST,
>>>>>> &vtd_dev_as->iommu_ir, 1);
>>>>>>
>>>>>> So it should be fine. I guess the reason that I'm asking is that I thought
>>>>>> "IR" means "Interrupt remapping" but in fact it means "Interrupt Region"?
>>> I was meaning "interrupt remapping", and of course it's the interrupt region
>>> too when IR enabled...
>>
>> Right.
>>
>>
>>>>>> But I'm still not clear about the invalidation part for interrupt region,
>>>>>> maybe you can elaborate a little more on this.
>>>>>>
>>>>>> Btw, I think guest can trigger the assert in vtd_do_iommu_translate() if we
>>>>>> teach vhost to DMA to that region:
>>>>> Why would we want to?
>>>> I meant a buggy(malicious) guest driver.
>>> Yes seems possible.  Do you want to post a patch?  Let me know if you want me
>>> to...  Thanks,
>>
>> Yes please.
> Oh wait...  Actually the comment above explains...
>
>      /*
>       * We have standalone memory region for interrupt addresses, we
>       * should never receive translation requests in this region.
>       */
>      assert(!vtd_is_interrupt_addr(addr));
>
> I myself overlooked that the IR region will be there even if ir=off.


Yes, but the point still stands; the issue is still there even if ir=off.


>    So I
> think the assert should stand.


Do you mean vhost can't trigger the assert()? If yes, I don't get how it 
can't.

Thanks


>




* Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
  2020-07-01 12:30                           ` Jason Wang
@ 2020-07-01 12:41                             ` Peter Xu
  2020-07-02  3:00                               ` Jason Wang
  0 siblings, 1 reply; 68+ messages in thread
From: Peter Xu @ 2020-07-01 12:41 UTC (permalink / raw)
  To: Jason Wang
  Cc: Peter Maydell, Yan Zhao, Michael S. Tsirkin, Juan Quintela,
	qemu-devel, Eugenio Pérez, Eric Auger, Paolo Bonzini

On Wed, Jul 01, 2020 at 08:30:07PM +0800, Jason Wang wrote:
> > I myself overlooked that the IR region will be there even if ir=off.
> 
> 
> Yes, but the point still stands; the issue is still there even if ir=off.
> 
> 
> >    So I
> > think the assert should stand.
> 
> 
> Do you mean vhost can't trigger the assert()? If yes, I don't get how it
> can't.

Yes.  vhost can only trigger the translate() via the memory API.  No matter
whether ir is on/off, the memory region is always enabled, so any access to
0xfeexxxxx from the memory API should still land at s->mr_ir, afaict, rather
than the dmar region.

-- 
Peter Xu




* Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
  2020-07-01 12:41                             ` Peter Xu
@ 2020-07-02  3:00                               ` Jason Wang
  0 siblings, 0 replies; 68+ messages in thread
From: Jason Wang @ 2020-07-02  3:00 UTC (permalink / raw)
  To: Peter Xu
  Cc: Peter Maydell, Yan Zhao, Michael S. Tsirkin, Juan Quintela,
	qemu-devel, Eugenio Pérez, Eric Auger, Paolo Bonzini


On 2020/7/1 8:41 PM, Peter Xu wrote:
> On Wed, Jul 01, 2020 at 08:30:07PM +0800, Jason Wang wrote:
>>> I myself overlooked that the IR region will be there even if ir=off.
>>
>> Yes, but the point still stands; the issue is still there even if ir=off.
>>
>>
>>>     So I
>>> think the assert should stand.
>>
>> Do you mean vhost can't trigger the assert()? If yes, I don't get how it
>> can't.
> Yes.  vhost can only trigger the translate() via the memory API.  No matter whether
> ir is on/off, the memory region is always enabled, so any access to 0xfeexxxxx
> from the memory API should still land at s->mr_ir, afaict, rather than the dmar region.


Right.

Thanks

>




* Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
  2020-07-01  8:09                 ` Jason Wang
@ 2020-07-02  3:01                   ` Jason Wang
  2020-07-02 15:45                     ` Peter Xu
  0 siblings, 1 reply; 68+ messages in thread
From: Jason Wang @ 2020-07-02  3:01 UTC (permalink / raw)
  To: Peter Xu
  Cc: Peter Maydell, Yan Zhao, Michael S. Tsirkin, libvir-list,
	Juan Quintela, qemu-devel, Eugenio Pérez, Eric Auger,
	Paolo Bonzini


> On 2020/7/1 4:09 PM, Jason Wang wrote:
>
>> On 2020/6/30 11:39 PM, Peter Xu wrote:
>> On Tue, Jun 30, 2020 at 10:41:10AM +0800, Jason Wang wrote:
>>>>       /* According to ATS spec table 2.4:
>>>>        * S = 0, bits 15:12 = xxxx     range size: 4K
>>>>        * S = 1, bits 15:12 = xxx0     range size: 8K
>>>>        * S = 1, bits 15:12 = xx01     range size: 16K
>>>>        * S = 1, bits 15:12 = x011     range size: 32K
>>>>        * S = 1, bits 15:12 = 0111     range size: 64K
>>>>        * ...
>>>>        */
>>>
>>> Right, but the comment is probably misleading here, since it's for 
>>> the PCI-E
>>> transaction between IOMMU and device not for the device IOTLB 
>>> invalidation
>>> descriptor.
>>>
>>> For device IOTLB invalidation descriptor, spec allows a [0, ~0ULL]
>>> invalidation:
>>>
>>> "
>>>
>>> 6.5.2.5 Device-TLB Invalidate Descriptor
>>>
>>> ...
>>>
>>> Size (S): The size field indicates the number of consecutive pages
>>> targeted by this invalidation request. If S field is zero, a single page
>>> at page address specified by Address [63:12] is requested to be
>>> invalidated. If S field is Set, the least significant bit in the Address
>>> field with value 0b indicates the invalidation address range. For
>>> example, if S field is Set and Address[12] is Clear, it indicates an 8KB
>>> invalidation address range with base address in Address [63:13]. If S
>>> field and Address[12] is Set and bit 13 is Clear, it indicates a 16KB
>>> invalidation address range with base address in Address [63:14], etc.
>>>
>>> "
>>>
>>> So if we receive an address whose [63] is 0 and the rest is all 1, 
>>> it's then
>>> a [0, ~0ULL] invalidation.
>> Yes.  I think invalidating the whole range is always fine.  It's 
>> still not
>> arbitrary, right?  E.g., we can't even invalidate (0x1000, 0x3000) with
>> device-iotlb because of the address mask, not to say sub-pages.
>
>
> Yes.
>
>
>>
>>>
>>>>>>> How about just convert to use a range [start, end] for any 
>>>>>>> notifier and move
>>>>>>> the checks (e.g the assert) into the actual notifier implemented 
>>>>>>> (vhost or
>>>>>>> vfio)?
>>>>>> IOMMUTLBEntry itself is the abstraction layer of TLB entry.  
>>>>>> Hardware TLB entry
>>>>>> is definitely not arbitrary range either (because AFAICT the 
>>>>>> hardware should
>>>>>> only cache PFN rather than address, so at least PAGE_SIZE aligned).
>>>>>> Introducing this flag will already make this trickier just to 
>>>>>> avoid introducing
>>>>>> another similar struct to IOMMUTLBEntry, but I really don't want 
>>>>>> to make it a
>>>>>> default option...  Not to mention I probably have no reason to 
>>>>>> urge the rest
>>>>>> iommu notifier users (tcg, vfio) to change their existing good 
>>>>>> code to suite
>>>>>> any of the backend who can cooperate with arbitrary address 
>>>>>> ranges...
>>>>> Ok, so it looks like we need a dedicated notifier for device IOTLB.
>>>> Or we can also make a new flag for device iotlb just like current 
>>>> UNMAP? Then
>>>> we replace the vhost type from UNMAP to DEVICE_IOTLB.  But IMHO 
>>>> using the
>>>> ARBITRARY_LENGTH flag would work in a similar way. DEVICE_IOTLB 
>>>> flag could
>>>> also allow virtio/vhost to only receive one invalidation (now IIUC 
>>>> it'll
>>>> receive both iotlb and device-iotlb for unmapping a page when 
>>>> ats=on), but then
>>>> ats=on will be a must and it could break some old (misconfigured)
>>>> qemu because
>>>> afaict previously virtio/vhost could even work with vIOMMU 
>>>> (accidentally) even
>>>> without ats=on.
>>>
>>> That's a bug, and I don't think we need to work around misconfigured
>>> qemu :)
>> IMHO it depends on the strictness we want on the qemu cmdline API. :)
>>
>> We should at least check libvirt to make sure it's using ats=on 
>> always, then I
>> agree maybe we can avoid considering the rest...
>>
>> Thanks,
>
>
> Cc'ing the libvirt list, but I think we should fix libvirt if they don't
> provide "ats=on".
>
> Thanks


Libvirt looks fine, according to the domain XML documentation[1]:

  QEMU's virtio devices have some attributes related to the virtio
  transport under the driver element: The iommu attribute enables the use
  of emulated IOMMU by the device. The attribute ats controls the Address
  Translation Service support for PCIe devices. This is needed to make use
  of IOTLB support (see IOMMU device). Possible values are on or off.
  Since 3.5.0

So I think we agree that a new notifier is needed?

Thanks

[1] https://libvirt.org/formatdomain.html


>
>
>




* Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
  2020-07-02  3:01                   ` Jason Wang
@ 2020-07-02 15:45                     ` Peter Xu
  2020-07-03  7:24                       ` Jason Wang
  0 siblings, 1 reply; 68+ messages in thread
From: Peter Xu @ 2020-07-02 15:45 UTC (permalink / raw)
  To: Jason Wang
  Cc: Peter Maydell, Yan Zhao, Michael S. Tsirkin, libvir-list,
	Juan Quintela, qemu-devel, Eugenio Pérez, Eric Auger,
	Paolo Bonzini

On Thu, Jul 02, 2020 at 11:01:54AM +0800, Jason Wang wrote:
> So I think we agree that a new notifier is needed?

Good to me, or a new flag should be easier (IOMMU_NOTIFIER_DEV_IOTLB)?

-- 
Peter Xu




* Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
  2020-07-02 15:45                     ` Peter Xu
@ 2020-07-03  7:24                       ` Jason Wang
  2020-07-03 13:03                         ` Peter Xu
  2020-08-03 16:00                         ` Eugenio Pérez
  0 siblings, 2 replies; 68+ messages in thread
From: Jason Wang @ 2020-07-03  7:24 UTC (permalink / raw)
  To: Peter Xu
  Cc: Peter Maydell, Yan Zhao, Michael S. Tsirkin, libvir-list,
	Juan Quintela, qemu-devel, Eugenio Pérez, Eric Auger,
	Paolo Bonzini


On 2020/7/2 下午11:45, Peter Xu wrote:
> On Thu, Jul 02, 2020 at 11:01:54AM +0800, Jason Wang wrote:
>> So I think we agree that a new notifier is needed?
> Good to me, or a new flag should be easier (IOMMU_NOTIFIER_DEV_IOTLB)?


That should work, but I wonder whether something like the following is 
better.

Instead of introducing new flags, how about carrying the type of the 
event in the notifier, so that the device (vhost) can choose the 
messages it wants to process, like:

static void vhost_iommu_event(IOMMUNotifier *n, IOMMUTLBEvent *event)
{
    switch (event->type) {
    case IOMMU_MAP:
    case IOMMU_UNMAP:
    case IOMMU_DEV_IOTLB_UNMAP:
        ...
    }
}

Thanks


>




* Re: [RFC v2 0/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
  2020-06-29 15:05 ` [RFC v2 0/1] " Paolo Bonzini
@ 2020-07-03  7:39   ` Eugenio Perez Martin
  2020-07-03 10:10     ` Paolo Bonzini
  0 siblings, 1 reply; 68+ messages in thread
From: Eugenio Perez Martin @ 2020-07-03  7:39 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Peter Maydell, Juan Quintela, Jason Wang, qemu-devel, Peter Xu,
	Avi Kivity

On Mon, Jun 29, 2020 at 5:05 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 26/06/20 08:41, Eugenio Pérez wrote:
> > If we examinate *entry in frame 4 of backtrace:
> > *entry = {target_as = 0x555556f6c050, iova = 0x0, translated_addr = 0x0,
> > addr_mask = 0xffffffffffffffff, perm = 0x0}
> >
> > Which (I think) tries to invalidate all the TLB registers of the device.
> >
> > Just deleting that assert is enough for the VM to start and communicate
> > using IOMMU, but maybe a better alternative is possible. We could move
> > it to the caller functions in other cases than IOMMU invalidation, or
> > make it conditional only if not invalidating.
>
> Yes, I think moving it up in the call stack is better. I cannot say
> where because the backtrace was destroyed by git (due to lines starting
> with "#").
>

Ouch, what a failure!

Pasting it here for completeness, sorry!

(gdb) bt
#0  0x00007ffff521370f in raise () at /lib64/libc.so.6
#1  0x00007ffff51fdb25 in abort () at /lib64/libc.so.6
#2  0x00007ffff51fd9f9 in _nl_load_domain.cold.0 () at /lib64/libc.so.6
#3  0x00007ffff520bcc6 in .annobin_assert.c_end () at /lib64/libc.so.6
#4  0x0000555555888171 in memory_region_notify_one
(notifier=0x7ffde0487fa8, entry=0x7ffde5dfe200) at
/home/qemu/memory.c:1918
#5  0x0000555555888247 in memory_region_notify_iommu
(iommu_mr=0x555556f6c0b0, iommu_idx=0, entry=...) at
/home/qemu/memory.c:1941
#6  0x0000555555951c8d in vtd_process_device_iotlb_desc
(s=0x555557609000, inv_desc=0x7ffde5dfe2d0)
    at /home/qemu/hw/i386/intel_iommu.c:2468
#7  0x0000555555951e6a in vtd_process_inv_desc (s=0x555557609000) at
/home/qemu/hw/i386/intel_iommu.c:2531
#8  0x0000555555951fa5 in vtd_fetch_inv_desc (s=0x555557609000) at
/home/qemu/hw/i386/intel_iommu.c:2563
#9  0x00005555559520e5 in vtd_handle_iqt_write (s=0x555557609000) at
/home/qemu/hw/i386/intel_iommu.c:2590
#10 0x0000555555952b45 in vtd_mem_write (opaque=0x555557609000,
addr=136, val=2688, size=4) at /home/qemu/hw/i386/intel_iommu.c:2837
#11 0x0000555555883e17 in memory_region_write_accessor
    (mr=0x555557609330, addr=136, value=0x7ffde5dfe478, size=4,
shift=0, mask=4294967295, attrs=...) at /home/qemu/memory.c:483
#12 0x000055555588401d in access_with_adjusted_size
    (addr=136, value=0x7ffde5dfe478, size=4, access_size_min=4,
access_size_max=8, access_fn=
    0x555555883d38 <memory_region_write_accessor>, mr=0x555557609330,
attrs=...) at /home/qemu/memory.c:544
#13 0x0000555555886f37 in memory_region_dispatch_write
(mr=0x555557609330, addr=136, data=2688, op=MO_32, attrs=...)
    at /home/qemu/memory.c:1476
#14 0x0000555555827a03 in flatview_write_continue
    (fv=0x7ffdd8503150, addr=4275634312, attrs=...,
ptr=0x7ffff7ff0028, len=4, addr1=136, l=4, mr=0x555557609330) at
/home/qemu/exec.c:3146
#15 0x0000555555827b48 in flatview_write (fv=0x7ffdd8503150,
addr=4275634312, attrs=..., buf=0x7ffff7ff0028, len=4)
    at /home/qemu/exec.c:3186
#16 0x0000555555827e9d in address_space_write
    (as=0x5555567ca640 <address_space_memory>, addr=4275634312,
attrs=..., buf=0x7ffff7ff0028, len=4) at /home/qemu/exec.c:3277
#17 0x0000555555827f0a in address_space_rw
    (as=0x5555567ca640 <address_space_memory>, addr=4275634312,
attrs=..., buf=0x7ffff7ff0028, len=4, is_write=true)
    at /home/qemu/exec.c:3287
#18 0x000055555589b633 in kvm_cpu_exec (cpu=0x555556b65640) at
/home/qemu/accel/kvm/kvm-all.c:2511
#19 0x0000555555876ba8 in qemu_kvm_cpu_thread_fn (arg=0x555556b65640)
at /home/qemu/cpus.c:1284
#20 0x0000555555dafff1 in qemu_thread_start (args=0x555556b8c3b0) at
util/qemu-thread-posix.c:521
#21 0x00007ffff55a62de in start_thread () at /lib64/libpthread.so.0
#22 0x00007ffff52d7e83 in clone () at /lib64/libc.so.6

(gdb) frame 4
#4  0x0000555555888171 in memory_region_notify_one
(notifier=0x7ffde0487fa8, entry=0x7ffde5dfe200) at
/home/qemu/memory.c:1918
1918        assert(entry->iova >= notifier->start && entry_end <=
notifier->end);
(gdb) p *entry
$1 = {target_as = 0x555556f6c050, iova = 0, translated_addr = 0,
addr_mask = 18446744073709551615, perm = IOMMU_NONE}

Thanks!

> Paolo
>
> > Any comment would be appreciated. Thanks!
> >
> > Guest kernel version: kernel-3.10.0-1151.el7.x86_64
> >
> > Bug reference: https://bugs.launchpad.net/qemu/+bug/1885175
> >
> > v2: Actually delete assertion instead of just commenting out using C99
> >
> > Eugenio Pérez (1):
> >   memory: Delete assertion in memory_region_unregister_iommu_notifier
> >
> >  memory.c | 2 --
> >  1 file changed, 2 deletions(-)
> >
>




* Re: [RFC v2 0/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
  2020-07-03  7:39   ` Eugenio Perez Martin
@ 2020-07-03 10:10     ` Paolo Bonzini
  0 siblings, 0 replies; 68+ messages in thread
From: Paolo Bonzini @ 2020-07-03 10:10 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Peter Maydell, Juan Quintela, Jason Wang, qemu-devel, Peter Xu,
	Avi Kivity

On 03/07/20 09:39, Eugenio Perez Martin wrote:
> #4  0x0000555555888171 in memory_region_notify_one
> (notifier=0x7ffde0487fa8, entry=0x7ffde5dfe200) at
> /home/qemu/memory.c:1918
> 1918        assert(entry->iova >= notifier->start && entry_end <=
> notifier->end);
> (gdb) p *entry
> $1 = {target_as = 0x555556f6c050, iova = 0, translated_addr = 0,
> addr_mask = 18446744073709551615, perm = IOMMU_NONE}

Oh, I see now.  I am worried that an IOMMU notifier could interpret the
IOMMUTLBEntry incorrectly if there is only partial overlap.  There are
various possibilities:

1) create another IOMMUTLBEntry like

	hwaddr offset = notifier->start > entry->iova ?
	                notifier->start - entry->iova : 0;
	IOMMUTLBEntry partial = {
	    .target_as = entry->target_as,
	    .iova = entry->iova + offset,
	    .translated_addr = entry->translated_addr + offset,
	    .addr_mask = MIN(entry->addr_mask, notifier->end - notifier->start),
	    .perm = entry->perm
	};

The addr_mask however would not be a mask if the notifier is not
naturally aligned

2) pass the offset/size pair (computed as above) as extra arguments to
the IOMMUNotify function

3) add a function to compute the offset/size and call it in the notifier

You choose. :)

Paolo




* Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
  2020-07-03  7:24                       ` Jason Wang
@ 2020-07-03 13:03                         ` Peter Xu
  2020-07-07  8:03                           ` Jason Wang
  2020-08-03 16:00                         ` Eugenio Pérez
  1 sibling, 1 reply; 68+ messages in thread
From: Peter Xu @ 2020-07-03 13:03 UTC (permalink / raw)
  To: Jason Wang
  Cc: Peter Maydell, Yan Zhao, Michael S. Tsirkin, libvir-list,
	Juan Quintela, qemu-devel, Eugenio Pérez, Eric Auger,
	Paolo Bonzini

On Fri, Jul 03, 2020 at 03:24:19PM +0800, Jason Wang wrote:
> 
> On 2020/7/2 下午11:45, Peter Xu wrote:
> > On Thu, Jul 02, 2020 at 11:01:54AM +0800, Jason Wang wrote:
> > > So I think we agree that a new notifier is needed?
> > Good to me, or a new flag should be easier (IOMMU_NOTIFIER_DEV_IOTLB)?
> 
> 
> That should work but I wonder something as following is better.
> 
> Instead of introducing new flags, how about carry the type of event in the
> notifier then the device (vhost) can choose the message it want to process
> like:
> 
> static vhost_iommu_event(IOMMUNotifier *n, IOMMUTLBEvent *event)
> 
> {
> 
> switch (event->type) {
> 
> case IOMMU_MAP:
> case IOMMU_UNMAP:
> case IOMMU_DEV_IOTLB_UNMAP:
> ...
> 
> }

Looks ok to me, though imo we should still keep the registration information,
so VT-d knows which notifiers are interested in which events.  E.g., we can
still do something like vtd_as_has_map_notifier().

So these are probably two different things: the new IOMMU_NOTIFIER_DEV_IOTLB
flag is one, as discussed; the other is whether we would like to introduce
IOMMUTLBEvent to include the type, so that we can avoid introducing two
notifiers for one device, majorly to identify dev-iotlb from unmaps.

Thanks,

-- 
Peter Xu




* Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
  2020-07-03 13:03                         ` Peter Xu
@ 2020-07-07  8:03                           ` Jason Wang
  2020-07-07 19:54                             ` Peter Xu
  0 siblings, 1 reply; 68+ messages in thread
From: Jason Wang @ 2020-07-07  8:03 UTC (permalink / raw)
  To: Peter Xu
  Cc: Peter Maydell, Yan Zhao, Juan Quintela, libvir-list,
	Michael S. Tsirkin, qemu-devel, Eugenio Pérez, Eric Auger,
	Paolo Bonzini


On 2020/7/3 下午9:03, Peter Xu wrote:
> On Fri, Jul 03, 2020 at 03:24:19PM +0800, Jason Wang wrote:
>> On 2020/7/2 下午11:45, Peter Xu wrote:
>>> On Thu, Jul 02, 2020 at 11:01:54AM +0800, Jason Wang wrote:
>>>> So I think we agree that a new notifier is needed?
>>> Good to me, or a new flag should be easier (IOMMU_NOTIFIER_DEV_IOTLB)?
>>
>> That should work but I wonder something as following is better.
>>
>> Instead of introducing new flags, how about carry the type of event in the
>> notifier then the device (vhost) can choose the message it want to process
>> like:
>>
>> static vhost_iommu_event(IOMMUNotifier *n, IOMMUTLBEvent *event)
>>
>> {
>>
>> switch (event->type) {
>>
>> case IOMMU_MAP:
>> case IOMMU_UNMAP:
>> case IOMMU_DEV_IOTLB_UNMAP:
>> ...
>>
>> }
> Looks ok to me, though imo we should still keep the registration information,
> so VT-d knows which notifiers is interested in which events.  E.g., we can
> still do something like vtd_as_has_map_notifier().


Is this for a better performance?

I wonder whether it's worth since we can't not have both vhost and vtd 
to be registered into the same as.

So it should be functional equivalent to vtd_as_has_notifier().

Thanks


>
> So these are probably two different things: the new IOMMU_NOTIFIER_DEV_IOTLB
> flag is one as discussed; the other one is whether we would like to introduce
> IOMMUTLBEvent to include the type, so that we can avoid introduce two notifiers
> for one device majorly to identify dev-iotlb from unmaps.
>
> Thanks,
>




* Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
  2020-07-07  8:03                           ` Jason Wang
@ 2020-07-07 19:54                             ` Peter Xu
  2020-07-08  5:42                               ` Jason Wang
  0 siblings, 1 reply; 68+ messages in thread
From: Peter Xu @ 2020-07-07 19:54 UTC (permalink / raw)
  To: Jason Wang
  Cc: Peter Maydell, Yan Zhao, Juan Quintela, libvir-list,
	Michael S. Tsirkin, qemu-devel, Eugenio Pérez, Eric Auger,
	Paolo Bonzini

On Tue, Jul 07, 2020 at 04:03:10PM +0800, Jason Wang wrote:
> 
> On 2020/7/3 下午9:03, Peter Xu wrote:
> > On Fri, Jul 03, 2020 at 03:24:19PM +0800, Jason Wang wrote:
> > > On 2020/7/2 下午11:45, Peter Xu wrote:
> > > > On Thu, Jul 02, 2020 at 11:01:54AM +0800, Jason Wang wrote:
> > > > > So I think we agree that a new notifier is needed?
> > > > Good to me, or a new flag should be easier (IOMMU_NOTIFIER_DEV_IOTLB)?
> > > 
> > > That should work but I wonder something as following is better.
> > > 
> > > Instead of introducing new flags, how about carry the type of event in the
> > > notifier then the device (vhost) can choose the message it want to process
> > > like:
> > > 
> > > static vhost_iommu_event(IOMMUNotifier *n, IOMMUTLBEvent *event)
> > > 
> > > {
> > > 
> > > switch (event->type) {
> > > 
> > > case IOMMU_MAP:
> > > case IOMMU_UNMAP:
> > > case IOMMU_DEV_IOTLB_UNMAP:
> > > ...
> > > 
> > > }
> > Looks ok to me, though imo we should still keep the registration information,
> > so VT-d knows which notifiers is interested in which events.  E.g., we can
> > still do something like vtd_as_has_map_notifier().
> 
> 
> Is this for a better performance?
> 
> I wonder whether it's worth since we can't not have both vhost and vtd to be
> registered into the same as.

/me failed to follow this sentence.. :(

> 
> So it should be functional equivalent to vtd_as_has_notifier().

For example: in vtd_iommu_replay() we'll skip the replay if vhost has
registered the iommu notifier because vtd_as_has_map_notifier() will return
false.  It'll only return true if the device is a vfio-pci device.

Without vtd_as_has_map_notifier(), how could we do that?

-- 
Peter Xu




* Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
  2020-07-07 19:54                             ` Peter Xu
@ 2020-07-08  5:42                               ` Jason Wang
  2020-07-08 14:16                                 ` Peter Xu
  0 siblings, 1 reply; 68+ messages in thread
From: Jason Wang @ 2020-07-08  5:42 UTC (permalink / raw)
  To: Peter Xu
  Cc: Peter Maydell, Yan Zhao, Michael S. Tsirkin, libvir-list,
	Juan Quintela, qemu-devel, Eugenio Pérez, Eric Auger,
	Paolo Bonzini


On 2020/7/8 上午3:54, Peter Xu wrote:
> On Tue, Jul 07, 2020 at 04:03:10PM +0800, Jason Wang wrote:
>> On 2020/7/3 下午9:03, Peter Xu wrote:
>>> On Fri, Jul 03, 2020 at 03:24:19PM +0800, Jason Wang wrote:
>>>> On 2020/7/2 下午11:45, Peter Xu wrote:
>>>>> On Thu, Jul 02, 2020 at 11:01:54AM +0800, Jason Wang wrote:
>>>>>> So I think we agree that a new notifier is needed?
>>>>> Good to me, or a new flag should be easier (IOMMU_NOTIFIER_DEV_IOTLB)?
>>>> That should work but I wonder something as following is better.
>>>>
>>>> Instead of introducing new flags, how about carry the type of event in the
>>>> notifier then the device (vhost) can choose the message it want to process
>>>> like:
>>>>
>>>> static vhost_iommu_event(IOMMUNotifier *n, IOMMUTLBEvent *event)
>>>>
>>>> {
>>>>
>>>> switch (event->type) {
>>>>
>>>> case IOMMU_MAP:
>>>> case IOMMU_UNMAP:
>>>> case IOMMU_DEV_IOTLB_UNMAP:
>>>> ...
>>>>
>>>> }
>>> Looks ok to me, though imo we should still keep the registration information,
>>> so VT-d knows which notifiers is interested in which events.  E.g., we can
>>> still do something like vtd_as_has_map_notifier().
>>
>> Is this for a better performance?
>>
>> I wonder whether it's worth since we can't not have both vhost and vtd to be
>> registered into the same as.
> /me failed to follow this sentence.. :(


Sorry, I meant "vfio" instead "vtd".


>
>> So it should be functional equivalent to vtd_as_has_notifier().
> For example: in vtd_iommu_replay() we'll skip the replay if vhost has
> registered the iommu notifier because vtd_as_has_map_notifier() will return
> false.


Two questions:

- Do we care about the performance here? If not, vhost may just ignore 
the MAP event?
- If we do care about the performance, it's better to implement the MAP 
event for vhost; otherwise there could be a lot of IOTLB misses.

Thanks


>    It'll only return true if the device is a vfio-pci device.
>
> Without vtd_as_has_map_notifier(), how could we do that?
>




* Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
  2020-07-08  5:42                               ` Jason Wang
@ 2020-07-08 14:16                                 ` Peter Xu
  2020-07-09  5:58                                   ` Jason Wang
  0 siblings, 1 reply; 68+ messages in thread
From: Peter Xu @ 2020-07-08 14:16 UTC (permalink / raw)
  To: Jason Wang
  Cc: Peter Maydell, Yan Zhao, Michael S. Tsirkin, libvir-list,
	Juan Quintela, qemu-devel, Eugenio Pérez, Eric Auger,
	Paolo Bonzini

On Wed, Jul 08, 2020 at 01:42:30PM +0800, Jason Wang wrote:
> > > So it should be functional equivalent to vtd_as_has_notifier().
> > For example: in vtd_iommu_replay() we'll skip the replay if vhost has
> > registered the iommu notifier because vtd_as_has_map_notifier() will return
> > false.
> 
> 
> Two questions:
> 
> - Do we care the performance here? If not, vhost may just ignore the MAP
> event?

I think we care, because vtd_page_walk() can be expensive.

> - If we care the performance, it's better to implement the MAP event for
> vhost, otherwise it could be a lot of IOTLB miss

I feel like these are two things.

So far what we are talking about is whether vt-d should have knowledge about
what kind of events one iommu notifier is interested in.  I still think we
should keep this as answered in question 1.

The other question is whether we want to switch vhost from UNMAP to MAP/UNMAP
events even without vDMA, so that vhost can establish the mapping even before
IO starts.  IMHO it's doable, but only if the guest runs DPDK workloads.  When
the guest is using dynamic iommu page mappings, I feel like that can be even
slower, because then the worst case is for each IO we'll need to vmexit twice:

  - The first vmexit caused by an invalidation to MAP the page tables, so vhost
    will setup the page table before IO starts

  - IO/DMA triggers and completes

  - The second vmexit caused by another invalidation to UNMAP the page tables

So it seems to be worse than when vhost only uses UNMAP like right now.  At
least we only have one vmexit (when UNMAP).  We'll have a vhost translate()
request from kernel to userspace, but IMHO that's cheaper than the vmexit.

-- 
Peter Xu




* Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
  2020-07-08 14:16                                 ` Peter Xu
@ 2020-07-09  5:58                                   ` Jason Wang
  2020-07-09 14:10                                     ` Peter Xu
  0 siblings, 1 reply; 68+ messages in thread
From: Jason Wang @ 2020-07-09  5:58 UTC (permalink / raw)
  To: Peter Xu
  Cc: Peter Maydell, Yan Zhao, Juan Quintela, libvir-list,
	Michael S. Tsirkin, qemu-devel, Eugenio Pérez, Eric Auger,
	Paolo Bonzini


On 2020/7/8 下午10:16, Peter Xu wrote:
> On Wed, Jul 08, 2020 at 01:42:30PM +0800, Jason Wang wrote:
>>>> So it should be functional equivalent to vtd_as_has_notifier().
>>> For example: in vtd_iommu_replay() we'll skip the replay if vhost has
>>> registered the iommu notifier because vtd_as_has_map_notifier() will return
>>> false.
>>
>> Two questions:
>>
>> - Do we care the performance here? If not, vhost may just ignore the MAP
>> event?
> I think we care, because vtd_page_walk() can be expensive.


Ok.


>
>> - If we care the performance, it's better to implement the MAP event for
>> vhost, otherwise it could be a lot of IOTLB miss
> I feel like these are two things.
>
> So far what we are talking about is whether vt-d should have knowledge about
> what kind of events one iommu notifier is interested in.  I still think we
> should keep this as answered in question 1.
>
> The other question is whether we want to switch vhost from UNMAP to MAP/UNMAP
> events even without vDMA, so that vhost can establish the mapping even before
> IO starts.  IMHO it's doable, but only if the guest runs DPDK workloads.  When
> the guest is using dynamic iommu page mappings, I feel like that can be even
> slower, because then the worst case is for each IO we'll need to vmexit twice:
>
>    - The first vmexit caused by an invalidation to MAP the page tables, so vhost
>      will setup the page table before IO starts
>
>    - IO/DMA triggers and completes
>
>    - The second vmexit caused by another invalidation to UNMAP the page tables
>
> So it seems to be worse than when vhost only uses UNMAP like right now.  At
> least we only have one vmexit (when UNMAP).  We'll have a vhost translate()
> request from kernel to userspace, but IMHO that's cheaper than the vmexit.


Right but then I would still prefer to have another notifier.

Since vtd_page_walk() has nothing to do with device IOTLB: the IOMMU has 
a dedicated command for flushing the device IOTLB. But the check for 
vtd_as_has_map_notifier() is used to skip the devices that can do demand 
paging via ATS or a device-specific way. If we have two different 
notifiers, vhost will be on the device-iotlb notifier, so we don't need 
it at all?

Thanks


>




* Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
  2020-07-09  5:58                                   ` Jason Wang
@ 2020-07-09 14:10                                     ` Peter Xu
  2020-07-10  6:34                                       ` Jason Wang
  0 siblings, 1 reply; 68+ messages in thread
From: Peter Xu @ 2020-07-09 14:10 UTC (permalink / raw)
  To: Jason Wang
  Cc: Peter Maydell, Yan Zhao, Juan Quintela, libvir-list,
	Michael S. Tsirkin, qemu-devel, Eugenio Pérez, Eric Auger,
	Paolo Bonzini

On Thu, Jul 09, 2020 at 01:58:33PM +0800, Jason Wang wrote:
> > > - If we care the performance, it's better to implement the MAP event for
> > > vhost, otherwise it could be a lot of IOTLB miss
> > I feel like these are two things.
> > 
> > So far what we are talking about is whether vt-d should have knowledge about
> > what kind of events one iommu notifier is interested in.  I still think we
> > should keep this as answered in question 1.
> > 
> > The other question is whether we want to switch vhost from UNMAP to MAP/UNMAP
> > events even without vDMA, so that vhost can establish the mapping even before
> > IO starts.  IMHO it's doable, but only if the guest runs DPDK workloads.  When
> > the guest is using dynamic iommu page mappings, I feel like that can be even
> > slower, because then the worst case is for each IO we'll need to vmexit twice:
> > 
> >    - The first vmexit caused by an invalidation to MAP the page tables, so vhost
> >      will setup the page table before IO starts
> > 
> >    - IO/DMA triggers and completes
> > 
> >    - The second vmexit caused by another invalidation to UNMAP the page tables
> > 
> > So it seems to be worse than when vhost only uses UNMAP like right now.  At
> > least we only have one vmexit (when UNMAP).  We'll have a vhost translate()
> > request from kernel to userspace, but IMHO that's cheaper than the vmexit.
> 
> 
> Right but then I would still prefer to have another notifier.
> 
> Since vtd_page_walk has nothing to do with device IOTLB. IOMMU have a
> dedicated command for flushing device IOTLB. But the check for
> vtd_as_has_map_notifier is used to skip the device which can do demand
> paging via ATS or device specific way. If we have two different notifiers,
> vhost will be on the device iotlb notifier so we don't need it at all?

But we can still have an iommu notifier that only registers to UNMAP even after
we introduce the dev-iotlb notifier?  We don't want to do the page walk for
them as well.  TCG should be the only one so far, but I don't know.. maybe
there can still be new ones?

Thanks,

-- 
Peter Xu




* Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
  2020-07-09 14:10                                     ` Peter Xu
@ 2020-07-10  6:34                                       ` Jason Wang
  2020-07-10 13:30                                         ` Peter Xu
  0 siblings, 1 reply; 68+ messages in thread
From: Jason Wang @ 2020-07-10  6:34 UTC (permalink / raw)
  To: Peter Xu
  Cc: Peter Maydell, Yan Zhao, Michael S. Tsirkin, libvir-list,
	Juan Quintela, qemu-devel, Eugenio Pérez, Eric Auger,
	Paolo Bonzini


On 2020/7/9 下午10:10, Peter Xu wrote:
> On Thu, Jul 09, 2020 at 01:58:33PM +0800, Jason Wang wrote:
>>>> - If we care the performance, it's better to implement the MAP event for
>>>> vhost, otherwise it could be a lot of IOTLB miss
>>> I feel like these are two things.
>>>
>>> So far what we are talking about is whether vt-d should have knowledge about
>>> what kind of events one iommu notifier is interested in.  I still think we
>>> should keep this as answered in question 1.
>>>
>>> The other question is whether we want to switch vhost from UNMAP to MAP/UNMAP
>>> events even without vDMA, so that vhost can establish the mapping even before
>>> IO starts.  IMHO it's doable, but only if the guest runs DPDK workloads.  When
>>> the guest is using dynamic iommu page mappings, I feel like that can be even
>>> slower, because then the worst case is for each IO we'll need to vmexit twice:
>>>
>>>     - The first vmexit caused by an invalidation to MAP the page tables, so vhost
>>>       will setup the page table before IO starts
>>>
>>>     - IO/DMA triggers and completes
>>>
>>>     - The second vmexit caused by another invalidation to UNMAP the page tables
>>>
>>> So it seems to be worse than when vhost only uses UNMAP like right now.  At
>>> least we only have one vmexit (when UNMAP).  We'll have a vhost translate()
>>> request from kernel to userspace, but IMHO that's cheaper than the vmexit.
>>
>> Right but then I would still prefer to have another notifier.
>>
>> Since vtd_page_walk has nothing to do with device IOTLB. IOMMU have a
>> dedicated command for flushing device IOTLB. But the check for
>> vtd_as_has_map_notifier is used to skip the device which can do demand
>> paging via ATS or device specific way. If we have two different notifiers,
>> vhost will be on the device iotlb notifier so we don't need it at all?
> But we can still have iommu notifier that only registers to UNMAP even after we
> introduce dev-iotlb notifier?  We don't want to do page walk for them as well.
> TCG should be the only one so far, but I don't know.. maybe there can still be
> new ones?


I think you're right. But looking at the code, it looks like the check 
of vtd_as_has_map_notifier() is only used in:

1) vtd_iommu_replay()
2) vtd_iotlb_page_invalidate_notify() (PSI)

For the replay, it's expensive anyhow. For PSI, I think it's just about 
one or a few mappings, so I'm not sure it will have an obvious 
performance impact.

And I had two questions:

1) The code doesn't check map for DSI or GI; does this match what the 
spec says? (It looks to me like the spec is unclear in this part)
2) For the replay(), I don't see other implementations (either spapr or 
the generic one) that do unmap (actually they skip unmap explicitly); 
any reason for doing this in the Intel IOMMU?

Thanks


>
> Thanks,
>




* Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
  2020-07-10  6:34                                       ` Jason Wang
@ 2020-07-10 13:30                                         ` Peter Xu
  2020-07-13  4:04                                           ` Jason Wang
  0 siblings, 1 reply; 68+ messages in thread
From: Peter Xu @ 2020-07-10 13:30 UTC (permalink / raw)
  To: Jason Wang
  Cc: Peter Maydell, Yan Zhao, Michael S. Tsirkin, libvir-list,
	Juan Quintela, qemu-devel, Eugenio Pérez, Eric Auger,
	Paolo Bonzini

On Fri, Jul 10, 2020 at 02:34:11PM +0800, Jason Wang wrote:
> 
> On 2020/7/9 下午10:10, Peter Xu wrote:
> > On Thu, Jul 09, 2020 at 01:58:33PM +0800, Jason Wang wrote:
> > > > > - If we care the performance, it's better to implement the MAP event for
> > > > > vhost, otherwise it could be a lot of IOTLB miss
> > > > I feel like these are two things.
> > > > 
> > > > So far what we are talking about is whether vt-d should have knowledge about
> > > > what kind of events one iommu notifier is interested in.  I still think we
> > > > should keep this as answered in question 1.
> > > > 
> > > > The other question is whether we want to switch vhost from UNMAP to MAP/UNMAP
> > > > events even without vDMA, so that vhost can establish the mapping even before
> > > > IO starts.  IMHO it's doable, but only if the guest runs DPDK workloads.  When
> > > > the guest is using dynamic iommu page mappings, I feel like that can be even
> > > > slower, because then the worst case is for each IO we'll need to vmexit twice:
> > > > 
> > > >     - The first vmexit caused by an invalidation to MAP the page tables, so vhost
> > > >       will setup the page table before IO starts
> > > > 
> > > >     - IO/DMA triggers and completes
> > > > 
> > > >     - The second vmexit caused by another invalidation to UNMAP the page tables
> > > > 
> > > > So it seems to be worse than when vhost only uses UNMAP like right now.  At
> > > > least we only have one vmexit (when UNMAP).  We'll have a vhost translate()
> > > > request from kernel to userspace, but IMHO that's cheaper than the vmexit.
> > > 
> > > Right but then I would still prefer to have another notifier.
> > > 
> > > Since vtd_page_walk has nothing to do with device IOTLB. IOMMU have a
> > > dedicated command for flushing device IOTLB. But the check for
> > > vtd_as_has_map_notifier is used to skip the device which can do demand
> > > paging via ATS or device specific way. If we have two different notifiers,
> > > vhost will be on the device iotlb notifier so we don't need it at all?
> > But we can still have iommu notifier that only registers to UNMAP even after we
> > introduce dev-iotlb notifier?  We don't want to do page walk for them as well.
> > TCG should be the only one so far, but I don't know.. maybe there can still be
> > new ones?
> 
> 
> I think you're right. But looking at the codes, it looks like the check of
> vtd_as_has_map_notifier() was only used in:
> 
> 1) vtd_iommu_replay()
> 2) vtd_iotlb_page_invalidate_notify() (PSI)
> 
> For the replay, it's expensive anyhow. For PSI, I think it's just about one
> or few mappings, not sure it will have obvious performance impact.
> 
> And I had two questions:
> 
> 1) The codes doesn't check map for DSI or GI, does this match what spec
> said? (It looks to me the spec is unclear in this part)

Both DSI/GI should cover maps too?  E.g. vtd_sync_shadow_page_table() in
vtd_iotlb_domain_invalidate().

> 2) for the replay() I don't see other implementations (either spapr or
> generic one) that did unmap (actually they skip unmap explicitly), any
> reason for doing this in intel IOMMU?

I could be wrong, but I'd guess it's because vt-d implemented the caching mode
by leveraging the same invalidation structure, so it's harder to make all
things right (IOW, we can't clearly distinguish MAP from UNMAP when we receive an
invalidation request, because MAP/UNMAP requests look the same).

I didn't check others, but I believe spapr is doing it differently by using
some hypercalls to deliver IOMMU map/unmap requests, which seems a bit close to
what virtio-iommu is doing.  Anyway, the point is if we have explicit MAP/UNMAP
from the guest, logically the replay indeed does not need to do any unmap
because we don't need to call replay() on an already existing device but only
for e.g. hot plug.  VT-d does not have that clear interface, so VT-d needs to
maintain its own mapping structures, and also vt-d is using the same replay &
page_walk operations to sync all these structures, which complicated the vt-d
replay a bit.  With that, we assume replay() can be called anytime on a device,
and we won't notify duplicated MAPs to lower layer like vfio if it is mapped
before.  In the meantime, since we'll compare the latest mapping with the one
we cached in the iova tree, UNMAP becomes possible too.
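A rough sketch of that dedup logic (a toy model, not the actual QEMU code: the fixed array below stands in for QEMU's interval-based iova tree, and all names are illustrative):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for the cached iova tree: a fixed array of mappings. */
typedef struct {
    uint64_t iova, size, pa;
    bool valid;
} Mapping;

#define MAX_MAPS 16
static Mapping cache[MAX_MAPS];

static Mapping *cache_find(uint64_t iova)
{
    for (size_t i = 0; i < MAX_MAPS; i++) {
        if (cache[i].valid && cache[i].iova == iova) {
            return &cache[i];
        }
    }
    return NULL;
}

typedef enum { EV_NONE, EV_MAP, EV_REMAP } Event;

/*
 * On each page-walk hit, compare the latest guest mapping with the
 * cached one: suppress duplicated MAPs, and turn a changed entry into
 * an UNMAP of the old range plus a MAP of the new one (EV_REMAP).
 */
static Event sync_one(uint64_t iova, uint64_t size, uint64_t pa)
{
    Mapping *old = cache_find(iova);

    if (old) {
        if (old->pa == pa && old->size == size) {
            return EV_NONE;     /* already notified: skip */
        }
        old->pa = pa;
        old->size = size;
        return EV_REMAP;        /* notify UNMAP(old) + MAP(new) */
    }
    for (size_t i = 0; i < MAX_MAPS; i++) {
        if (!cache[i].valid) {
            cache[i] = (Mapping){ iova, size, pa, true };
            return EV_MAP;      /* brand new entry: notify MAP */
        }
    }
    return EV_NONE;             /* toy cache full: ignore */
}
```

With something like this, replay() can be called at any time without pushing duplicated MAPs down to vfio, and a changed entry naturally yields an UNMAP as well.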

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
  2020-07-10 13:30                                         ` Peter Xu
@ 2020-07-13  4:04                                           ` Jason Wang
  2020-07-16  1:00                                             ` Peter Xu
  0 siblings, 1 reply; 68+ messages in thread
From: Jason Wang @ 2020-07-13  4:04 UTC (permalink / raw)
  To: Peter Xu
  Cc: Peter Maydell, Yan Zhao, Michael S. Tsirkin, libvir-list,
	Juan Quintela, qemu-devel, Eugenio Pérez, Eric Auger,
	Paolo Bonzini


On 2020/7/10 9:30 PM, Peter Xu wrote:
> On Fri, Jul 10, 2020 at 02:34:11PM +0800, Jason Wang wrote:
>> On 2020/7/9 10:10 PM, Peter Xu wrote:
>>> On Thu, Jul 09, 2020 at 01:58:33PM +0800, Jason Wang wrote:
>>>>>> - If we care the performance, it's better to implement the MAP event for
>>>>>> vhost, otherwise it could be a lot of IOTLB miss
>>>>> I feel like these are two things.
>>>>>
>>>>> So far what we are talking about is whether vt-d should have knowledge about
>>>>> what kind of events one iommu notifier is interested in.  I still think we
>>>>> should keep this as answered in question 1.
>>>>>
>>>>> The other question is whether we want to switch vhost from UNMAP to MAP/UNMAP
>>>>> events even without vDMA, so that vhost can establish the mapping even before
>>>>> IO starts.  IMHO it's doable, but only if the guest runs DPDK workloads.  When
>>>>> the guest is using dynamic iommu page mappings, I feel like that can be even
>>>>> slower, because then the worst case is for each IO we'll need to vmexit twice:
>>>>>
>>>>>      - The first vmexit caused by an invalidation to MAP the page tables, so vhost
>>>>>        will setup the page table before IO starts
>>>>>
>>>>>      - IO/DMA triggers and completes
>>>>>
>>>>>      - The second vmexit caused by another invalidation to UNMAP the page tables
>>>>>
>>>>> So it seems to be worse than when vhost only uses UNMAP like right now.  At
>>>>> least we only have one vmexit (when UNMAP).  We'll have a vhost translate()
>>>>> request from kernel to userspace, but IMHO that's cheaper than the vmexit.
>>>> Right but then I would still prefer to have another notifier.
>>>>
>>>> Since vtd_page_walk has nothing to do with device IOTLB. IOMMU have a
>>>> dedicated command for flushing device IOTLB. But the check for
>>>> vtd_as_has_map_notifier is used to skip the device which can do demand
>>>> paging via ATS or device specific way. If we have two different notifiers,
>>>> vhost will be on the device iotlb notifier so we don't need it at all?
>>> But we can still have iommu notifier that only registers to UNMAP even after we
>>> introduce dev-iotlb notifier?  We don't want to do page walk for them as well.
>>> TCG should be the only one so far, but I don't know.. maybe there can still be
>>> new ones?
>>
>> I think you're right. But looking at the codes, it looks like the check of
>> vtd_as_has_map_notifier() was only used in:
>>
>> 1) vtd_iommu_replay()
>> 2) vtd_iotlb_page_invalidate_notify() (PSI)
>>
>> For the replay, it's expensive anyhow. For PSI, I think it's just about one
>> or few mappings, not sure it will have obvious performance impact.
>>
>> And I had two questions:
>>
>> 1) The codes doesn't check map for DSI or GI, does this match what spec
>> said? (It looks to me the spec is unclear in this part)
> Both DSI/GI should cover maps too?  E.g. vtd_sync_shadow_page_table() in
> vtd_iotlb_domain_invalidate().


I meant the code doesn't check whether there's a MAP notifier :)


>
>> 2) for the replay() I don't see other implementations (either spapr or
>> generic one) that did unmap (actually they skip unmap explicitly), any
>> reason for doing this in intel IOMMU?
> I could be wrong, but I'd guess it's because vt-d implemented the caching mode
> by leveraging the same invalidation structure, so it's harder to make all
> things right (IOW, we can't clearly distinguish MAP from UNMAP when we receive an
> invalidation request, because MAP/UNMAP requests look the same).
>
> I didn't check others, but I believe spapr is doing it differently by using
> some hypercalls to deliver IOMMU map/unmap requests, which seems a bit close to
> what virtio-iommu is doing.  Anyway, the point is if we have explicit MAP/UNMAP
> from the guest, logically the replay indeed does not need to do any unmap
> because we don't need to call replay() on an already existing device but only
> for e.g. hot plug.


But this looks in conflict with what memory_region_iommu_replay() does: for an
IOMMU that doesn't have a replay method, it skips UNMAP requests:

     for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
         iotlb = imrc->translate(iommu_mr, addr, IOMMU_NONE, n->iommu_idx);
         if (iotlb.perm != IOMMU_NONE) {
             n->notify(n, &iotlb);
         }

I guess this generic code has no knowledge of whether the guest has an
explicit MAP/UNMAP. Or does replay imply that the guest doesn't have
explicit MAP/UNMAP?

(btw, the code shortcuts memory_region_notify_one(); not sure of the reason)


>   VT-d does not have that clear interface, so VT-d needs to
> maintain its own mapping structures, and also vt-d is using the same replay &
> page_walk operations to sync all these structures, which complicated the vt-d
> replay a bit.  With that, we assume replay() can be called anytime on a device,
> and we won't notify duplicated MAPs to lower layer like vfio if it is mapped
> before.  In the meantime, since we'll compare the latest mapping with the one
> we cached in the iova tree, UNMAP becomes possible too.


AFAIK vtd_iommu_replay() does a complete UNMAP:

     /*
      * The replay can be triggered by either a invalidation or a newly
      * created entry. No matter what, we release existing mappings
      * (it means flushing caches for UNMAP-only registers).
      */
     vtd_address_space_unmap(vtd_as, n);

Since it doesn't do any comparison with the iova tree, will this cause
unnecessary UNMAPs to be sent to VFIO?

Thanks


>
> Thanks,
>




* Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
  2020-07-13  4:04                                           ` Jason Wang
@ 2020-07-16  1:00                                             ` Peter Xu
  2020-07-16  2:54                                               ` Jason Wang
  0 siblings, 1 reply; 68+ messages in thread
From: Peter Xu @ 2020-07-16  1:00 UTC (permalink / raw)
  To: Jason Wang
  Cc: Peter Maydell, Yan Zhao, Michael S. Tsirkin, libvir-list,
	Juan Quintela, qemu-devel, Eugenio Pérez, Eric Auger,
	Paolo Bonzini

On Mon, Jul 13, 2020 at 12:04:16PM +0800, Jason Wang wrote:
> 
> On 2020/7/10 9:30 PM, Peter Xu wrote:
> > On Fri, Jul 10, 2020 at 02:34:11PM +0800, Jason Wang wrote:
> > > On 2020/7/9 10:10 PM, Peter Xu wrote:
> > > > On Thu, Jul 09, 2020 at 01:58:33PM +0800, Jason Wang wrote:
> > > > > > > - If we care the performance, it's better to implement the MAP event for
> > > > > > > vhost, otherwise it could be a lot of IOTLB miss
> > > > > > I feel like these are two things.
> > > > > > 
> > > > > > So far what we are talking about is whether vt-d should have knowledge about
> > > > > > what kind of events one iommu notifier is interested in.  I still think we
> > > > > > should keep this as answered in question 1.
> > > > > > 
> > > > > > The other question is whether we want to switch vhost from UNMAP to MAP/UNMAP
> > > > > > events even without vDMA, so that vhost can establish the mapping even before
> > > > > > IO starts.  IMHO it's doable, but only if the guest runs DPDK workloads.  When
> > > > > > the guest is using dynamic iommu page mappings, I feel like that can be even
> > > > > > slower, because then the worst case is for each IO we'll need to vmexit twice:
> > > > > > 
> > > > > >      - The first vmexit caused by an invalidation to MAP the page tables, so vhost
> > > > > >        will setup the page table before IO starts
> > > > > > 
> > > > > >      - IO/DMA triggers and completes
> > > > > > 
> > > > > >      - The second vmexit caused by another invalidation to UNMAP the page tables
> > > > > > 
> > > > > > So it seems to be worse than when vhost only uses UNMAP like right now.  At
> > > > > > least we only have one vmexit (when UNMAP).  We'll have a vhost translate()
> > > > > > request from kernel to userspace, but IMHO that's cheaper than the vmexit.
> > > > > Right but then I would still prefer to have another notifier.
> > > > > 
> > > > > Since vtd_page_walk has nothing to do with device IOTLB. IOMMU have a
> > > > > dedicated command for flushing device IOTLB. But the check for
> > > > > vtd_as_has_map_notifier is used to skip the device which can do demand
> > > > > paging via ATS or device specific way. If we have two different notifiers,
> > > > > vhost will be on the device iotlb notifier so we don't need it at all?
> > > > But we can still have iommu notifier that only registers to UNMAP even after we
> > > > introduce dev-iotlb notifier?  We don't want to do page walk for them as well.
> > > > TCG should be the only one so far, but I don't know.. maybe there can still be
> > > > new ones?
> > > 
> > > I think you're right. But looking at the codes, it looks like the check of
> > > vtd_as_has_map_notifier() was only used in:
> > > 
> > > 1) vtd_iommu_replay()
> > > 2) vtd_iotlb_page_invalidate_notify() (PSI)
> > > 
> > > For the replay, it's expensive anyhow. For PSI, I think it's just about one
> > > or few mappings, not sure it will have obvious performance impact.
> > > 
> > > And I had two questions:
> > > 
> > > 1) The codes doesn't check map for DSI or GI, does this match what spec
> > > said? (It looks to me the spec is unclear in this part)
> > Both DSI/GI should cover maps too?  E.g. vtd_sync_shadow_page_table() in
> > vtd_iotlb_domain_invalidate().
> 
> 
> I meant the code doesn't check whether there's a MAP notifier :)

It's actually checked, because it loops over vtd_as_with_notifiers, and only
MAP notifiers register to that. :)

But I agree with you that it should be cleaner to introduce the dev-iotlb
notifier type.
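For reference, a dedicated dev-iotlb notifier could look roughly like this (a hypothetical sketch: the flag name and helper are illustrative, not the eventual QEMU API):

```c
/*
 * Hypothetical notifier flags: device-IOTLB (ATS) invalidations get
 * their own event type instead of riding on the plain UNMAP one, so
 * vhost could register for DEVIOTLB_UNMAP only, and vt-d could keep
 * page walks (and vtd_as_with_notifiers tracking) for MAP listeners.
 */
typedef enum {
    IOMMU_NOTIFIER_NONE           = 0,
    IOMMU_NOTIFIER_UNMAP          = 1 << 0, /* IOTLB invalidations */
    IOMMU_NOTIFIER_MAP            = 1 << 1, /* new mappings (shadowing) */
    IOMMU_NOTIFIER_DEVIOTLB_UNMAP = 1 << 2, /* device-IOTLB invalidations */
} IOMMUNotifierFlag;

/* Only MAP listeners would force vt-d to walk the page tables. */
static int vtd_needs_page_walk(int registered_flags)
{
    return (registered_flags & IOMMU_NOTIFIER_MAP) != 0;
}
```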

> 
> 
> > 
> > > 2) for the replay() I don't see other implementations (either spapr or
> > > generic one) that did unmap (actually they skip unmap explicitly), any
> > > reason for doing this in intel IOMMU?
> > I could be wrong, but I'd guess it's because vt-d implemented the caching mode
> > by leveraging the same invalidation structure, so it's harder to make all
> > things right (IOW, we can't clearly distinguish MAP from UNMAP when we receive an
> > invalidation request, because MAP/UNMAP requests look the same).
> > 
> > I didn't check others, but I believe spapr is doing it differently by using
> > some hypercalls to deliver IOMMU map/unmap requests, which seems a bit close to
> > what virtio-iommu is doing.  Anyway, the point is if we have explicit MAP/UNMAP
> > from the guest, logically the replay indeed does not need to do any unmap
> > because we don't need to call replay() on an already existing device but only
> > for e.g. hot plug.
> 
> 
> But this looks in conflict with what memory_region_iommu_replay() does: for an
> IOMMU that doesn't have a replay method, it skips UNMAP requests:
> 
>     for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
>         iotlb = imrc->translate(iommu_mr, addr, IOMMU_NONE, n->iommu_idx);
>         if (iotlb.perm != IOMMU_NONE) {
>             n->notify(n, &iotlb);
>         }
> 
> I guess this generic code has no knowledge of whether the guest has an
> explicit MAP/UNMAP. Or does replay imply that the guest doesn't have
> explicit MAP/UNMAP?

I think it matches exactly with the hot plug case?  Note that IOMMU_NONE
can also mean the translation does not exist.  So it's actually trying to map
everything that can be translated and then notify().
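As a toy model of that replay-on-hotplug flow (illustrative only; `translate` below is a stub, not QEMU's IOMMUMemoryRegionClass hook):

```c
#include <stdint.h>

#define REGION_SIZE   0x4000ULL
#define GRANULARITY   0x1000ULL
#define PERM_NONE     0

/* Stub translation: only the second page has a valid mapping. */
static int translate(uint64_t addr, uint64_t *pa)
{
    if (addr == 0x1000) {
        *pa = 0x80000;
        return 1; /* e.g. IOMMU_RW */
    }
    return PERM_NONE; /* no translation: nothing to notify */
}

/*
 * Generic replay: walk the whole region and notify MAP for every
 * address that translates.  Since the notifier is registered on a
 * freshly added region (hot plug), there is nothing to UNMAP first.
 */
static int replay_count_maps(void)
{
    int maps = 0;
    for (uint64_t addr = 0; addr < REGION_SIZE; addr += GRANULARITY) {
        uint64_t pa;
        if (translate(addr, &pa) != PERM_NONE) {
            maps++; /* stands in for n->notify(n, &iotlb) */
        }
    }
    return maps;
}
```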

> 
> (btw, the code shortcuts memory_region_notify_one(); not sure of the reason)

I think it's simply because memory_region_notify_one() came later. :)

> 
> 
> >   VT-d does not have that clear interface, so VT-d needs to
> > maintain its own mapping structures, and also vt-d is using the same replay &
> > page_walk operations to sync all these structures, which complicated the vt-d
> > replay a bit.  With that, we assume replay() can be called anytime on a device,
> > and we won't notify duplicated MAPs to lower layer like vfio if it is mapped
> > before.  In the meantime, since we'll compare the latest mapping with the one
> > we cached in the iova tree, UNMAP becomes possible too.
> 
> 
> AFAIK vtd_iommu_replay() does a complete UNMAP:
> 
>     /*
>      * The replay can be triggered by either a invalidation or a newly
>      * created entry. No matter what, we release existing mappings
>      * (it means flushing caches for UNMAP-only registers).
>      */
>     vtd_address_space_unmap(vtd_as, n);
> 
> Since it doesn't do any comparison with the iova tree, will this cause
> unnecessary UNMAPs to be sent to VFIO?

I feel like that can be removed now, but it needs some testing...

Thanks,

-- 
Peter Xu




* Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
  2020-07-16  1:00                                             ` Peter Xu
@ 2020-07-16  2:54                                               ` Jason Wang
  2020-07-17 14:18                                                 ` Peter Xu
  0 siblings, 1 reply; 68+ messages in thread
From: Jason Wang @ 2020-07-16  2:54 UTC (permalink / raw)
  To: Peter Xu
  Cc: Peter Maydell, Yan Zhao, Michael S. Tsirkin, libvir-list,
	Juan Quintela, qemu-devel, Eugenio Pérez, Eric Auger,
	Paolo Bonzini


On 2020/7/16 9:00 AM, Peter Xu wrote:
> On Mon, Jul 13, 2020 at 12:04:16PM +0800, Jason Wang wrote:
>> On 2020/7/10 9:30 PM, Peter Xu wrote:
>>> On Fri, Jul 10, 2020 at 02:34:11PM +0800, Jason Wang wrote:
>>>> On 2020/7/9 10:10 PM, Peter Xu wrote:
>>>>> On Thu, Jul 09, 2020 at 01:58:33PM +0800, Jason Wang wrote:
>>>>>>>> - If we care the performance, it's better to implement the MAP event for
>>>>>>>> vhost, otherwise it could be a lot of IOTLB miss
>>>>>>> I feel like these are two things.
>>>>>>>
>>>>>>> So far what we are talking about is whether vt-d should have knowledge about
>>>>>>> what kind of events one iommu notifier is interested in.  I still think we
>>>>>>> should keep this as answered in question 1.
>>>>>>>
>>>>>>> The other question is whether we want to switch vhost from UNMAP to MAP/UNMAP
>>>>>>> events even without vDMA, so that vhost can establish the mapping even before
>>>>>>> IO starts.  IMHO it's doable, but only if the guest runs DPDK workloads.  When
>>>>>>> the guest is using dynamic iommu page mappings, I feel like that can be even
>>>>>>> slower, because then the worst case is for each IO we'll need to vmexit twice:
>>>>>>>
>>>>>>>       - The first vmexit caused by an invalidation to MAP the page tables, so vhost
>>>>>>>         will setup the page table before IO starts
>>>>>>>
>>>>>>>       - IO/DMA triggers and completes
>>>>>>>
>>>>>>>       - The second vmexit caused by another invalidation to UNMAP the page tables
>>>>>>>
>>>>>>> So it seems to be worse than when vhost only uses UNMAP like right now.  At
>>>>>>> least we only have one vmexit (when UNMAP).  We'll have a vhost translate()
>>>>>>> request from kernel to userspace, but IMHO that's cheaper than the vmexit.
>>>>>> Right but then I would still prefer to have another notifier.
>>>>>>
>>>>>> Since vtd_page_walk has nothing to do with device IOTLB. IOMMU have a
>>>>>> dedicated command for flushing device IOTLB. But the check for
>>>>>> vtd_as_has_map_notifier is used to skip the device which can do demand
>>>>>> paging via ATS or device specific way. If we have two different notifiers,
>>>>>> vhost will be on the device iotlb notifier so we don't need it at all?
>>>>> But we can still have iommu notifier that only registers to UNMAP even after we
>>>>> introduce dev-iotlb notifier?  We don't want to do page walk for them as well.
>>>>> TCG should be the only one so far, but I don't know.. maybe there can still be
>>>>> new ones?
>>>> I think you're right. But looking at the codes, it looks like the check of
>>>> vtd_as_has_map_notifier() was only used in:
>>>>
>>>> 1) vtd_iommu_replay()
>>>> 2) vtd_iotlb_page_invalidate_notify() (PSI)
>>>>
>>>> For the replay, it's expensive anyhow. For PSI, I think it's just about one
>>>> or few mappings, not sure it will have obvious performance impact.
>>>>
>>>> And I had two questions:
>>>>
>>>> 1) The codes doesn't check map for DSI or GI, does this match what spec
>>>> said? (It looks to me the spec is unclear in this part)
>>> Both DSI/GI should cover maps too?  E.g. vtd_sync_shadow_page_table() in
>>> vtd_iotlb_domain_invalidate().
>>
>> I meant the code doesn't check whether there's a MAP notifier :)
> It's actually checked, because it loops over vtd_as_with_notifiers, and only
> MAP notifiers register to that. :)


I may have missed something, but I don't find the code that blocks UNMAP notifiers:

vhost_iommu_region_add()
     memory_region_register_iommu_notifier()
         memory_region_update_iommu_notify_flags()
             imrc->notify_flag_changed()
                 vtd_iommu_notify_flag_changed()

?


>
> But I agree with you that it should be cleaner to introduce the dev-iotlb
> notifier type.


Yes, it can solve the duplicated UNMAP issue of vhost.


>>
>>>> 2) for the replay() I don't see other implementations (either spapr or
>>>> generic one) that did unmap (actually they skip unmap explicitly), any
>>>> reason for doing this in intel IOMMU?
>>> I could be wrong, but I'd guess it's because vt-d implemented the caching mode
>>> by leveraging the same invalidation structure, so it's harder to make all
>>> things right (IOW, we can't clearly distinguish MAP from UNMAP when we receive an
>>> invalidation request, because MAP/UNMAP requests look the same).
>>>
>>> I didn't check others, but I believe spapr is doing it differently by using
>>> some hypercalls to deliver IOMMU map/unmap requests, which seems a bit close to
>>> what virtio-iommu is doing.  Anyway, the point is if we have explicit MAP/UNMAP
>>> from the guest, logically the replay indeed does not need to do any unmap
>>> because we don't need to call replay() on an already existing device but only
>>> for e.g. hot plug.
>>
>> But this looks in conflict with what memory_region_iommu_replay() does: for an
>> IOMMU that doesn't have a replay method, it skips UNMAP requests:
>>
>>      for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
>>          iotlb = imrc->translate(iommu_mr, addr, IOMMU_NONE, n->iommu_idx);
>>          if (iotlb.perm != IOMMU_NONE) {
>>              n->notify(n, &iotlb);
>>          }
>>
>> I guess this generic code has no knowledge of whether the guest has an
>> explicit MAP/UNMAP. Or does replay imply that the guest doesn't have
>> explicit MAP/UNMAP?
> I think it matches exactly with the hot plug case?  Note that IOMMU_NONE
> can also mean the translation does not exist.  So it's actually trying to map
> everything that can be translated and then notify().


Yes, so the question is what's the assumption before calling 
memory_region_iommu_replay(). If it assumes an empty mapping, there's 
probably no need for unmap.


>
>> (btw, the code shortcuts memory_region_notify_one(); not sure of the reason)
> I think it's simply because memory_region_notify_one() came later. :)


Ok, that explains.


>
>>
>>>    VT-d does not have that clear interface, so VT-d needs to
>>> maintain its own mapping structures, and also vt-d is using the same replay &
>>> page_walk operations to sync all these structures, which complicated the vt-d
>>> replay a bit.  With that, we assume replay() can be called anytime on a device,
>>> and we won't notify duplicated MAPs to lower layer like vfio if it is mapped
>>> before.  In the meantime, since we'll compare the latest mapping with the one
>>> we cached in the iova tree, UNMAP becomes possible too.
>>
>> AFAIK vtd_iommu_replay() does a complete UNMAP:
>>
>>      /*
>>       * The replay can be triggered by either a invalidation or a newly
>>       * created entry. No matter what, we release existing mappings
>>       * (it means flushing caches for UNMAP-only registers).
>>       */
>>      vtd_address_space_unmap(vtd_as, n);
>>
>> Since it doesn't do any comparison with the iova tree, will this cause
>> unnecessary UNMAPs to be sent to VFIO?
> I feel like that can be removed now, but it needs some testing...


Probably, but we need to answer the above question about replay first.

Thanks


>
> Thanks,
>




* Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
  2020-07-16  2:54                                               ` Jason Wang
@ 2020-07-17 14:18                                                 ` Peter Xu
  2020-07-20  4:02                                                   ` Jason Wang
  0 siblings, 1 reply; 68+ messages in thread
From: Peter Xu @ 2020-07-17 14:18 UTC (permalink / raw)
  To: Jason Wang
  Cc: Peter Maydell, Yan Zhao, Michael S. Tsirkin, libvir-list,
	Juan Quintela, qemu-devel, Eugenio Pérez, Eric Auger,
	Paolo Bonzini

On Thu, Jul 16, 2020 at 10:54:31AM +0800, Jason Wang wrote:
> 
> On 2020/7/16 9:00 AM, Peter Xu wrote:
> > On Mon, Jul 13, 2020 at 12:04:16PM +0800, Jason Wang wrote:
> > > On 2020/7/10 9:30 PM, Peter Xu wrote:
> > > > On Fri, Jul 10, 2020 at 02:34:11PM +0800, Jason Wang wrote:
> > > > > On 2020/7/9 10:10 PM, Peter Xu wrote:
> > > > > > On Thu, Jul 09, 2020 at 01:58:33PM +0800, Jason Wang wrote:
> > > > > > > > > - If we care the performance, it's better to implement the MAP event for
> > > > > > > > > vhost, otherwise it could be a lot of IOTLB miss
> > > > > > > > I feel like these are two things.
> > > > > > > > 
> > > > > > > > So far what we are talking about is whether vt-d should have knowledge about
> > > > > > > > what kind of events one iommu notifier is interested in.  I still think we
> > > > > > > > should keep this as answered in question 1.
> > > > > > > > 
> > > > > > > > The other question is whether we want to switch vhost from UNMAP to MAP/UNMAP
> > > > > > > > events even without vDMA, so that vhost can establish the mapping even before
> > > > > > > > IO starts.  IMHO it's doable, but only if the guest runs DPDK workloads.  When
> > > > > > > > the guest is using dynamic iommu page mappings, I feel like that can be even
> > > > > > > > slower, because then the worst case is for each IO we'll need to vmexit twice:
> > > > > > > > 
> > > > > > > >       - The first vmexit caused by an invalidation to MAP the page tables, so vhost
> > > > > > > >         will setup the page table before IO starts
> > > > > > > > 
> > > > > > > >       - IO/DMA triggers and completes
> > > > > > > > 
> > > > > > > >       - The second vmexit caused by another invalidation to UNMAP the page tables
> > > > > > > > 
> > > > > > > > So it seems to be worse than when vhost only uses UNMAP like right now.  At
> > > > > > > > least we only have one vmexit (when UNMAP).  We'll have a vhost translate()
> > > > > > > > request from kernel to userspace, but IMHO that's cheaper than the vmexit.
> > > > > > > Right but then I would still prefer to have another notifier.
> > > > > > > 
> > > > > > > Since vtd_page_walk has nothing to do with device IOTLB. IOMMU have a
> > > > > > > dedicated command for flushing device IOTLB. But the check for
> > > > > > > vtd_as_has_map_notifier is used to skip the device which can do demand
> > > > > > > paging via ATS or device specific way. If we have two different notifiers,
> > > > > > > vhost will be on the device iotlb notifier so we don't need it at all?
> > > > > > But we can still have iommu notifier that only registers to UNMAP even after we
> > > > > > introduce dev-iotlb notifier?  We don't want to do page walk for them as well.
> > > > > > TCG should be the only one so far, but I don't know.. maybe there can still be
> > > > > > new ones?
> > > > > I think you're right. But looking at the codes, it looks like the check of
> > > > > vtd_as_has_map_notifier() was only used in:
> > > > > 
> > > > > 1) vtd_iommu_replay()
> > > > > 2) vtd_iotlb_page_invalidate_notify() (PSI)
> > > > > 
> > > > > For the replay, it's expensive anyhow. For PSI, I think it's just about one
> > > > > or few mappings, not sure it will have obvious performance impact.
> > > > > 
> > > > > And I had two questions:
> > > > > 
> > > > > 1) The codes doesn't check map for DSI or GI, does this match what spec
> > > > > said? (It looks to me the spec is unclear in this part)
> > > > Both DSI/GI should cover maps too?  E.g. vtd_sync_shadow_page_table() in
> > > > vtd_iotlb_domain_invalidate().
> > > 
> > > I meant the code doesn't check whether there's a MAP notifier :)
> > It's actually checked, because it loops over vtd_as_with_notifiers, and only
> > MAP notifiers register to that. :)
> 
> 
> I may have missed something, but I don't find the code that blocks UNMAP notifiers:
> 
> vhost_iommu_region_add()
>     memory_region_register_iommu_notifier()
>         memory_region_update_iommu_notify_flags()
>             imrc->notify_flag_changed()
>                 vtd_iommu_notify_flag_changed()
> 
> ?

Yeah I think you're right.  I might have confused it with some previous
implementations.  Maybe we should also do a similar thing for DSI/GI, just like
what we do in PSI.

> > > > > 2) for the replay() I don't see other implementations (either spapr or
> > > > > generic one) that did unmap (actually they skip unmap explicitly), any
> > > > > reason for doing this in intel IOMMU?
> > > > I could be wrong, but I'd guess it's because vt-d implemented the caching mode
> > > > by leveraging the same invalidation structure, so it's harder to make all
> > > > things right (IOW, we can't clearly distinguish MAP from UNMAP when we receive an
> > > > invalidation request, because MAP/UNMAP requests look the same).
> > > > 
> > > > I didn't check others, but I believe spapr is doing it differently by using
> > > > some hypercalls to deliver IOMMU map/unmap requests, which seems a bit close to
> > > > what virtio-iommu is doing.  Anyway, the point is if we have explicit MAP/UNMAP
> > > > from the guest, logically the replay indeed does not need to do any unmap
> > > > because we don't need to call replay() on an already existing device but only
> > > > for e.g. hot plug.
> > > 
> > > But this looks in conflict with what memory_region_iommu_replay() does: for an
> > > IOMMU that doesn't have a replay method, it skips UNMAP requests:
> > > 
> > >      for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
> > >          iotlb = imrc->translate(iommu_mr, addr, IOMMU_NONE, n->iommu_idx);
> > >          if (iotlb.perm != IOMMU_NONE) {
> > >              n->notify(n, &iotlb);
> > >          }
> > > 
> > > I guess this generic code has no knowledge of whether the guest has an
> > > explicit MAP/UNMAP. Or does replay imply that the guest doesn't have
> > > explicit MAP/UNMAP?
> > I think it matches exactly with the hot plug case?  Note that IOMMU_NONE
> > can also mean the translation does not exist.  So it's actually trying to map
> > everything that can be translated and then notify().
> 
> 
> Yes, so the question is what's the assumption before calling
> memory_region_iommu_replay(). If it assumes an empty mapping, there's
> probably no need for unmap.

The only caller of memory_region_iommu_replay() is vfio_listener_region_add(),
when there's a new vIOMMU memory region detected.  So IIUC that guarantees the
previous state should be all empty.

Thanks,

-- 
Peter Xu




* Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
  2020-07-17 14:18                                                 ` Peter Xu
@ 2020-07-20  4:02                                                   ` Jason Wang
  2020-07-20 13:03                                                     ` Peter Xu
  0 siblings, 1 reply; 68+ messages in thread
From: Jason Wang @ 2020-07-20  4:02 UTC (permalink / raw)
  To: Peter Xu
  Cc: Peter Maydell, Yan Zhao, Michael S. Tsirkin, libvir-list,
	Juan Quintela, qemu-devel, Eugenio Pérez, Eric Auger,
	Paolo Bonzini


On 2020/7/17 10:18 PM, Peter Xu wrote:
> On Thu, Jul 16, 2020 at 10:54:31AM +0800, Jason Wang wrote:
>> On 2020/7/16 9:00 AM, Peter Xu wrote:
>>> On Mon, Jul 13, 2020 at 12:04:16PM +0800, Jason Wang wrote:
>>>> On 2020/7/10 9:30 PM, Peter Xu wrote:
>>>>> On Fri, Jul 10, 2020 at 02:34:11PM +0800, Jason Wang wrote:
>>>>>> On 2020/7/9 10:10 PM, Peter Xu wrote:
>>>>>>> On Thu, Jul 09, 2020 at 01:58:33PM +0800, Jason Wang wrote:
>>>>>>>>>> - If we care the performance, it's better to implement the MAP event for
>>>>>>>>>> vhost, otherwise it could be a lot of IOTLB miss
>>>>>>>>> I feel like these are two things.
>>>>>>>>>
>>>>>>>>> So far what we are talking about is whether vt-d should have knowledge about
>>>>>>>>> what kind of events one iommu notifier is interested in.  I still think we
>>>>>>>>> should keep this as answered in question 1.
>>>>>>>>>
>>>>>>>>> The other question is whether we want to switch vhost from UNMAP to MAP/UNMAP
>>>>>>>>> events even without vDMA, so that vhost can establish the mapping even before
>>>>>>>>> IO starts.  IMHO it's doable, but only if the guest runs DPDK workloads.  When
>>>>>>>>> the guest is using dynamic iommu page mappings, I feel like that can be even
>>>>>>>>> slower, because then the worst case is for each IO we'll need to vmexit twice:
>>>>>>>>>
>>>>>>>>>        - The first vmexit caused by an invalidation to MAP the page tables, so vhost
>>>>>>>>>          will setup the page table before IO starts
>>>>>>>>>
>>>>>>>>>        - IO/DMA triggers and completes
>>>>>>>>>
>>>>>>>>>        - The second vmexit caused by another invalidation to UNMAP the page tables
>>>>>>>>>
>>>>>>>>> So it seems to be worse than when vhost only uses UNMAP like right now.  At
>>>>>>>>> least we only have one vmexit (when UNMAP).  We'll have a vhost translate()
>>>>>>>>> request from kernel to userspace, but IMHO that's cheaper than the vmexit.
>>>>>>>> Right but then I would still prefer to have another notifier.
>>>>>>>>
>>>>>>>> Since vtd_page_walk has nothing to do with device IOTLB (the IOMMU has a
>>>>>>>> dedicated command for flushing device IOTLB), the check for
>>>>>>>> vtd_as_has_map_notifier() is only used to skip devices which can do demand
>>>>>>>> paging via ATS or in a device-specific way. If we have two different notifiers,
>>>>>>>> vhost will be on the device-iotlb notifier, so we don't need it at all?
>>>>>>> But we can still have an iommu notifier that only registers for UNMAP even after we
>>>>>>> introduce a dev-iotlb notifier?  We don't want to do the page walk for them as well.
>>>>>>> TCG should be the only one so far, but I don't know.. maybe there can still be
>>>>>>> new ones?
>>>>>> I think you're right. But looking at the code, it looks like the check of
>>>>>> vtd_as_has_map_notifier() was only used in:
>>>>>>
>>>>>> 1) vtd_iommu_replay()
>>>>>> 2) vtd_iotlb_page_invalidate_notify() (PSI)
>>>>>>
>>>>>> For the replay, it's expensive anyhow. For PSI, I think it's just about one
>>>>>> or a few mappings, so I'm not sure it will have an obvious performance impact.
>>>>>>
>>>>>> And I had two questions:
>>>>>>
>>>>>> 1) The code doesn't check for a MAP notifier for DSI or GI; does this match
>>>>>> what the spec says? (It looks to me the spec is unclear in this part)
>>>>> Both DSI/GI should cover maps too?  E.g. vtd_sync_shadow_page_table() in
>>>>> vtd_iotlb_domain_invalidate().
>>>> I meant the code doesn't check whether there's a MAP notifier :)
>>> It's actually checked, because it loops over vtd_as_with_notifiers, and only
>>> MAP notifiers register to that. :)
>>
>> I may be missing something, but I can't find the code that blocks UNMAP notifiers?
>>
>> vhost_iommu_region_add()
>>      memory_region_register_iommu_notifier()
>>          memory_region_update_iommu_notify_flags()
>>              imrc->notify_flag_changed()
>>                  vtd_iommu_notify_flag_changed()
>>
>> ?
> Yeah I think you're right.  I might have confused it with some previous
> implementations.  Maybe we should also do a similar thing for DSI/GI, just like
> what we do in PSI.


Ok.


>
>>>>>> 2) for the replay() I don't see other implementations (either spapr or the
>>>>>> generic one) that do unmap (actually they skip unmap explicitly); any
>>>>>> reason for doing this in the Intel IOMMU?
>>>>> I could be wrong, but I'd guess it's because vt-d implemented the caching mode
>>>>> by leveraging the same invalidation structure, so it's harder to make all
>>>>> things right (IOW, we can't clearly identify MAP with UNMAP when we receive an
>>>>> invalidation request, because MAP/UNMAP requests look the same).
>>>>>
>>>>> I didn't check others, but I believe spapr is doing it differently by using
>>>>> some hypercalls to deliver IOMMU map/unmap requests, which seems a bit close to
>>>>> what virtio-iommu is doing.  Anyway, the point is if we have explicit MAP/UNMAP
>>>>> from the guest, logically the replay indeed does not need to do any unmap
>>>>> because we don't need to call replay() on an already existing device but only
>>>>> for e.g. hot plug.
>>>> But this looks in conflict with what memory_region_iommu_replay() does: for an
>>>> IOMMU that doesn't have a replay method, it skips UNMAP requests:
>>>>
>>>>       for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
>>>>           iotlb = imrc->translate(iommu_mr, addr, IOMMU_NONE, n->iommu_idx);
>>>>           if (iotlb.perm != IOMMU_NONE) {
>>>>               n->notify(n, &iotlb);
>>>>           }
>>>>
>>>> I guess there's no knowledge of whether the guest has an explicit MAP/UNMAP for
>>>> this generic code. Or does replay imply that the guest doesn't have explicit
>>>> MAP/UNMAP?
>>> I think it matches exactly with a hot plug case?  Note that IOMMU_NONE
>>> could also mean the translation does not exist.  So it's actually trying to map
>>> everything that can be translated and then notify().
>>
>> Yes, so the question is what the assumption is before calling
>> memory_region_iommu_replay(). If it assumes an empty mapping, there's
>> probably no need for unmap.
> The only caller of memory_region_iommu_replay() is vfio_listener_region_add(),
> when there's a new vIOMMU memory region detected.  So IIUC that guarantees the
> previous state should be all empty.


Right, so there's no need to deal with unmap in vtd's replay 
implementation (as the generic one does).

Thanks


>
> Thanks,
>
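[Editor's note] The replay behaviour discussed above can be modelled in a few lines of plain C. This is a hedged sketch, not QEMU code: translate(), the page-table model, and the size/granularity values are made-up stand-ins. The point is only that a replay over an assumed-empty mapping state (the hot-plug case) never needs to emit UNMAP events; it just notifies the translations that exist:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Minimal model of the generic replay loop quoted above: walk the
 * address space and notify only translations that exist (perm != NONE),
 * assuming replay starts from an empty mapping state. */

enum { IOMMU_NONE = 0, IOMMU_RO = 1, IOMMU_WO = 2, IOMMU_RW = 3 };

typedef struct { uint64_t iova; int perm; } TLBEntry;

/* Hypothetical translate(): in this model, only the page at 0x1000 is mapped. */
static TLBEntry translate(uint64_t addr)
{
    TLBEntry t = { .iova = addr,
                   .perm = (addr == 0x1000) ? IOMMU_RW : IOMMU_NONE };
    return t;
}

/* Walk [0, size) in steps of granularity; notify present mappings only. */
static size_t replay(uint64_t size, uint64_t granularity,
                     void (*notify)(const TLBEntry *))
{
    size_t notified = 0;
    for (uint64_t addr = 0; addr < size; addr += granularity) {
        TLBEntry t = translate(addr);
        if (t.perm != IOMMU_NONE) {     /* skip holes: no unmap events */
            if (notify) {
                notify(&t);
            }
            notified++;
        }
    }
    return notified;
}
```

With four pages and a single mapped page, exactly one notification is delivered, and no UNMAP is ever generated.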



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
  2020-07-20  4:02                                                   ` Jason Wang
@ 2020-07-20 13:03                                                     ` Peter Xu
  2020-07-21  6:20                                                       ` Jason Wang
  0 siblings, 1 reply; 68+ messages in thread
From: Peter Xu @ 2020-07-20 13:03 UTC (permalink / raw)
  To: Jason Wang
  Cc: Peter Maydell, Yan Zhao, Michael S. Tsirkin, libvir-list,
	Juan Quintela, qemu-devel, Eugenio Pérez, Eric Auger,
	Paolo Bonzini

On Mon, Jul 20, 2020 at 12:02:06PM +0800, Jason Wang wrote:
> Right, so there's no need to deal with unmap in vtd's replay implementation
> (as the generic one does).

We don't even do that for now; see vtd_page_walk_info.notify_unmap.  Thanks,

-- 
Peter Xu
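[Editor's note] For context, vtd_page_walk_info.notify_unmap gates whether the page walk reports non-present ranges at all. A toy model of that gate (the field name is borrowed from the discussion; the "page table" and walk are made-up stand-ins, not QEMU's implementation):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model: walk a tiny "page table" (an array of present bits) and
 * count the events a notifier would receive.  UNMAP events for holes
 * are only generated when info->notify_unmap is set, mirroring the
 * role of vtd_page_walk_info.notify_unmap. */
typedef struct {
    bool notify_unmap;
} WalkInfo;

static size_t page_walk_events(const bool *present, size_t npages,
                               const WalkInfo *info)
{
    size_t events = 0;
    for (size_t i = 0; i < npages; i++) {
        if (present[i]) {
            events++;                    /* MAP event */
        } else if (info->notify_unmap) {
            events++;                    /* UNMAP event for the hole */
        }
    }
    return events;
}
```

With notify_unmap clear, holes in the mapping are silently skipped; with it set, every hole produces an extra UNMAP notification.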




* Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
  2020-07-20 13:03                                                     ` Peter Xu
@ 2020-07-21  6:20                                                       ` Jason Wang
  2020-07-21 15:10                                                         ` Peter Xu
  0 siblings, 1 reply; 68+ messages in thread
From: Jason Wang @ 2020-07-21  6:20 UTC (permalink / raw)
  To: Peter Xu
  Cc: Peter Maydell, Yan Zhao, Juan Quintela, libvir-list,
	Michael S. Tsirkin, qemu-devel, Eugenio Pérez, Eric Auger,
	Paolo Bonzini


On 2020/7/20 下午9:03, Peter Xu wrote:
> On Mon, Jul 20, 2020 at 12:02:06PM +0800, Jason Wang wrote:
>> Right, so there's no need to deal with unmap in vtd's replay implementation
>> (as the generic one does).
> We don't even do that for now; see vtd_page_walk_info.notify_unmap.  Thanks,


Right, but I meant the vtd_address_space_unmap() in vtd_iommu_replay(). 
It looks to me it will trigger UNMAP notifiers.

Thanks





* Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
  2020-07-21  6:20                                                       ` Jason Wang
@ 2020-07-21 15:10                                                         ` Peter Xu
  0 siblings, 0 replies; 68+ messages in thread
From: Peter Xu @ 2020-07-21 15:10 UTC (permalink / raw)
  To: Jason Wang
  Cc: Peter Maydell, Yan Zhao, Juan Quintela, libvir-list,
	Michael S. Tsirkin, qemu-devel, Eugenio Pérez, Eric Auger,
	Paolo Bonzini

On Tue, Jul 21, 2020 at 02:20:01PM +0800, Jason Wang wrote:
> 
> On 2020/7/20 下午9:03, Peter Xu wrote:
> > On Mon, Jul 20, 2020 at 12:02:06PM +0800, Jason Wang wrote:
> > > Right, so there's no need to deal with unmap in vtd's replay implementation
> > > (as the generic one does).
> > We don't even do that for now; see vtd_page_walk_info.notify_unmap.  Thanks,
> 
> 
> Right, but I meant the vtd_address_space_unmap() in vtd_iommu_replay(). It
> looks to me it will trigger UNMAP notifiers.

Should be pretty safe for now, but I agree it seems optional (as we've
discussed previously).  Thanks,

-- 
Peter Xu




* Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
  2020-07-03  7:24                       ` Jason Wang
  2020-07-03 13:03                         ` Peter Xu
@ 2020-08-03 16:00                         ` Eugenio Pérez
  2020-08-04 20:30                           ` Peter Xu
  1 sibling, 1 reply; 68+ messages in thread
From: Eugenio Pérez @ 2020-08-03 16:00 UTC (permalink / raw)
  To: Jason Wang, Peter Xu
  Cc: Peter Maydell, Yan Zhao, Michael S. Tsirkin, libvir-list,
	Juan Quintela, qemu-devel, Eric Auger, Paolo Bonzini

On Fri, 2020-07-03 at 15:24 +0800, Jason Wang wrote:
> On 2020/7/2 下午11:45, Peter Xu wrote:
> > On Thu, Jul 02, 2020 at 11:01:54AM +0800, Jason Wang wrote:
> > > So I think we agree that a new notifier is needed?
> > Good to me, or a new flag should be easier (IOMMU_NOTIFIER_DEV_IOTLB)?
> 
> That should work, but I wonder whether something like the following is better.
> 
> Instead of introducing new flags, how about carrying the type of event in 
> the notifier, so the device (vhost) can choose the messages it wants to 
> process, like:
> 
> static vhost_iommu_event(IOMMUNotifier *n, IOMMUTLBEvent *event)
> 
> {
> 
> switch (event->type) {
> 
> case IOMMU_MAP:
> case IOMMU_UNMAP:
> case IOMMU_DEV_IOTLB_UNMAP:
> ...
> 
> }
> 
> Thanks
> 
> 

Hi!

Sorry, I thought I had this clear but now it seems not so clear to me. Do you mean to add that switch to the current
vhost_iommu_unmap_notify, and then the "type" field to the IOMMUTLBEntry? Is that the scope of the changes, or is there
something I'm missing?

If that is correct, what is the advantage for vhost or other notifiers? I understand that moving the IOMMUTLBEntry (addr,
len) -> (iova, mask) split/transformation into the different notifier implementations could pollute them, but this is an even deeper change, and vhost is not interested in any event but IOMMU_UNMAP, is it?

On the other hand, who decides what the type of the event is? If I follow the backtrace of the assert in 
https://lists.gnu.org/archive/html/qemu-devel/2020-07/msg01015.html, it seems to me that it should be
vtd_process_device_iotlb_desc. How do I know if it should be IOMMU_UNMAP or IOMMU_DEV_IOTLB_UNMAP? If I set it in some
function of memory.c, I would have to decide the type by looking at the actual notifier, wouldn't I?

Thanks!
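[Editor's note] The event-dispatch idea quoted above can be sketched as a self-contained program. Everything here is an assumption for illustration only: IOMMUEventType, the struct layout, and vhost_iommu_event's signature are stand-ins, not the actual QEMU API:

```c
#include <assert.h>
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical event types, mirroring the proposal quoted above. */
typedef enum {
    IOMMU_MAP,
    IOMMU_UNMAP,
    IOMMU_DEV_IOTLB_UNMAP,
} IOMMUEventType;

typedef struct {
    IOMMUEventType type;
    uint64_t iova;
    uint64_t addr_mask;   /* for dev-iotlb: an arbitrary length, not a mask */
} IOMMUTLBEvent;

/* A vhost-style handler that only reacts to the events it cares about;
 * returns whether the event was consumed. */
static bool vhost_iommu_event(const IOMMUTLBEvent *event)
{
    switch (event->type) {
    case IOMMU_UNMAP:
    case IOMMU_DEV_IOTLB_UNMAP:
        /* invalidate the vhost IOTLB for [iova, iova + addr_mask] */
        return true;
    case IOMMU_MAP:
    default:
        /* vhost ignores MAP events; it translates on demand */
        return false;
    }
}
```

The design point is that the decision about which events matter moves into the device handler, so memory.c can deliver every event unconditionally.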




* Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
  2020-08-03 16:00                         ` Eugenio Pérez
@ 2020-08-04 20:30                           ` Peter Xu
  2020-08-05  5:45                             ` Jason Wang
  0 siblings, 1 reply; 68+ messages in thread
From: Peter Xu @ 2020-08-04 20:30 UTC (permalink / raw)
  To: Eugenio Pérez
  Cc: Peter Maydell, Yan Zhao, Michael S. Tsirkin, libvir-list,
	Jason Wang, Juan Quintela, qemu-devel, Eric Auger, Paolo Bonzini

On Mon, Aug 03, 2020 at 06:00:34PM +0200, Eugenio Pérez wrote:
> On Fri, 2020-07-03 at 15:24 +0800, Jason Wang wrote:
> > On 2020/7/2 下午11:45, Peter Xu wrote:
> > > On Thu, Jul 02, 2020 at 11:01:54AM +0800, Jason Wang wrote:
> > > > So I think we agree that a new notifier is needed?
> > > Good to me, or a new flag should be easier (IOMMU_NOTIFIER_DEV_IOTLB)?
> > 
> > That should work, but I wonder whether something like the following is better.
> > 
> > Instead of introducing new flags, how about carrying the type of event in 
> > the notifier, so the device (vhost) can choose the messages it wants to 
> > process, like:
> > 
> > static vhost_iommu_event(IOMMUNotifier *n, IOMMUTLBEvent *event)
> > 
> > {
> > 
> > switch (event->type) {
> > 
> > case IOMMU_MAP:
> > case IOMMU_UNMAP:
> > case IOMMU_DEV_IOTLB_UNMAP:
> > ...
> > 
> > }
> > 
> > Thanks
> > 
> > 
> 
> Hi!
> 
> Sorry, I thought I had this clear but now it seems not so clear to me. Do you mean to add that switch to the current
> vhost_iommu_unmap_notify, and then the "type" field to the IOMMUTLBEntry? Is that the scope of the changes, or is there
> something I'm missing?
> 
> If that is correct, what is the advantage for vhost or other notifiers? I understand that moving the IOMMUTLBEntry (addr,
> len) -> (iova, mask) split/transformation into the different notifier implementations could pollute them, but this is an even deeper change, and vhost is not interested in any event but IOMMU_UNMAP, is it?
> 
> On the other hand, who decides what the type of the event is? If I follow the backtrace of the assert in 
> https://lists.gnu.org/archive/html/qemu-devel/2020-07/msg01015.html, it seems to me that it should be
> vtd_process_device_iotlb_desc. How do I know if it should be IOMMU_UNMAP or IOMMU_DEV_IOTLB_UNMAP? If I set it in some
> function of memory.c, I would have to decide the type by looking at the actual notifier, wouldn't I?

(Since Jason didn't reply yesterday, I'll try to; Jason, feel free to correct
 me...)

IMHO whether to put the type into the IOMMUTLBEntry is not important.  The
important change should be that we introduce IOMMU_DEV_IOTLB_UNMAP (or I'd
rather call it IOMMU_DEV_IOTLB directly which is shorter and cleaner).  With
that information we can make the failing assertion conditional for MAP/UNMAP
only.  We can also allow dev-iotlb messages to take an arbitrary addr_mask (so it
becomes the length of an address range; imho we can keep using addr_mask for
simplicity, but we can note for addr_mask that for dev-iotlb it may not be a
real mask).

Thanks,

-- 
Peter Xu
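[Editor's note] A minimal model of the invariant being discussed: ordinary MAP/UNMAP entries carry an addr_mask of the form 2^n - 1 with an iova aligned to it, while dev-iotlb messages would be allowed an arbitrary length. The IOMMUTLBEntry below is a stripped-down stand-in, and the dev_iotlb flag is hypothetical (in QEMU it would come from the notifier or event type):

```c
#include <assert.h>
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint64_t iova;
    uint64_t addr_mask; /* 2^n - 1 for MAP/UNMAP; arbitrary length for dev-iotlb */
    bool dev_iotlb;     /* hypothetical marker for a device-IOTLB message */
} IOMMUTLBEntry;

/* Check the invariant discussed above: ordinary entries must carry a
 * power-of-two-minus-one mask with an aligned iova; dev-iotlb entries
 * are exempt and may describe any range. */
static bool entry_mask_ok(const IOMMUTLBEntry *e)
{
    if (e->dev_iotlb) {
        return true;                   /* arbitrary length allowed */
    }
    /* addr_mask must be of the form 2^n - 1, and iova aligned to it */
    return ((e->addr_mask + 1) & e->addr_mask) == 0 &&
           (e->iova & e->addr_mask) == 0;
}
```

This is exactly the check the failing assertion implicitly relies on; making it conditional on the message type is what lets dev-iotlb invalidations pass through unsplit.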




* Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
  2020-08-04 20:30                           ` Peter Xu
@ 2020-08-05  5:45                             ` Jason Wang
  0 siblings, 0 replies; 68+ messages in thread
From: Jason Wang @ 2020-08-05  5:45 UTC (permalink / raw)
  To: Peter Xu, Eugenio Pérez
  Cc: Peter Maydell, Yan Zhao, Michael S. Tsirkin, libvir-list,
	Juan Quintela, qemu-devel, Eric Auger, Paolo Bonzini


On 2020/8/5 上午4:30, Peter Xu wrote:
> On Mon, Aug 03, 2020 at 06:00:34PM +0200, Eugenio Pérez wrote:
>> On Fri, 2020-07-03 at 15:24 +0800, Jason Wang wrote:
>>> On 2020/7/2 下午11:45, Peter Xu wrote:
>>>> On Thu, Jul 02, 2020 at 11:01:54AM +0800, Jason Wang wrote:
>>>>> So I think we agree that a new notifier is needed?
>>>> Good to me, or a new flag should be easier (IOMMU_NOTIFIER_DEV_IOTLB)?
>>> That should work, but I wonder whether something like the following is better.
>>>
>>> Instead of introducing new flags, how about carrying the type of event in
>>> the notifier, so the device (vhost) can choose the messages it wants to
>>> process, like:
>>>
>>> static vhost_iommu_event(IOMMUNotifier *n, IOMMUTLBEvent *event)
>>>
>>> {
>>>
>>> switch (event->type) {
>>>
>>> case IOMMU_MAP:
>>> case IOMMU_UNMAP:
>>> case IOMMU_DEV_IOTLB_UNMAP:
>>> ...
>>>
>>> }
>>>
>>> Thanks
>>>
>>>
>> Hi!
>>
>> Sorry, I thought I had this clear but now it seems not so clear to me. Do you mean to add that switch to the current
>> vhost_iommu_unmap_notify, and then the "type" field to the IOMMUTLBEntry? Is that the scope of the changes, or is there
>> something I'm missing?
>>
>> If that is correct, what is the advantage for vhost or other notifiers? I understand that moving the IOMMUTLBEntry (addr,
>> len) -> (iova, mask) split/transformation into the different notifier implementations could pollute them, but this is an even deeper change, and vhost is not interested in any event but IOMMU_UNMAP, is it?
>>
>> On the other hand, who decides what the type of the event is? If I follow the backtrace of the assert in
>> https://lists.gnu.org/archive/html/qemu-devel/2020-07/msg01015.html, it seems to me that it should be
>> vtd_process_device_iotlb_desc. How do I know if it should be IOMMU_UNMAP or IOMMU_DEV_IOTLB_UNMAP? If I set it in some
>> function of memory.c, I would have to decide the type by looking at the actual notifier, wouldn't I?
> (Since Jason didn't reply yesterday, I'll try to; Jason, feel free to correct
>   me...)
>
> IMHO whether to put the type into the IOMMUTLBEntry is not important.  The
> important change should be that we introduce IOMMU_DEV_IOTLB_UNMAP (or I'd
> rather call it IOMMU_DEV_IOTLB directly which is shorter and cleaner).  With
> that information we can make the failing assertion conditional for MAP/UNMAP
> only.


Or having another dedicated device IOTLB notifier.


>    We can also allow dev-iotlb messages to take an arbitrary addr_mask (so it
> becomes the length of an address range; imho we can keep using addr_mask for
> simplicity, but we can note for addr_mask that for dev-iotlb it may not be a
> real mask).


Yes.

Thanks


>
> Thanks,
>




* Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
  2020-06-26 21:29   ` Peter Xu
  2020-06-27  7:26     ` Yan Zhao
  2020-06-28  7:03     ` Jason Wang
@ 2020-08-11 17:01     ` Eugenio Perez Martin
  2020-08-11 17:10       ` Eugenio Perez Martin
  2 siblings, 1 reply; 68+ messages in thread
From: Eugenio Perez Martin @ 2020-08-11 17:01 UTC (permalink / raw)
  To: Peter Xu
  Cc: Peter Maydell, Yan Zhao, Juan Quintela, Jason Wang,
	Michael S. Tsirkin, qemu-devel, Eric Auger, Paolo Bonzini

On Fri, Jun 26, 2020 at 11:29 PM Peter Xu <peterx@redhat.com> wrote:
>
> Hi, Eugenio,
>
> (CCing Eric, Yan and Michael too)
>
> On Fri, Jun 26, 2020 at 08:41:22AM +0200, Eugenio Pérez wrote:
> > diff --git a/memory.c b/memory.c
> > index 2f15a4b250..7f789710d2 100644
> > --- a/memory.c
> > +++ b/memory.c
> > @@ -1915,8 +1915,6 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
> >          return;
> >      }
> >
> > -    assert(entry->iova >= notifier->start && entry_end <= notifier->end);
>
> I can understand removing the assertion should solve the issue, however imho
> the major issue is not about this single assertion but the whole addr_mask
> issue behind with virtio...
>
> For normal IOTLB invalidations, we were trying our best to always make
> IOMMUTLBEntry contain a valid addr_mask to be 2**N-1.  E.g., that's what we're
> doing with the loop in vtd_address_space_unmap().
>
> But this is not the first time that we may want to break this assumption for
> virtio, so that we make the IOTLB a tuple of (start, len); then that len need
> not be an address mask any more.  That seems to be more efficient for things like
> vhost, because iotlbs there are not page based, so it'll be inefficient if we
> always guarantee the addr_mask, since that means quite a lot more roundtrips
> for the same range of invalidation.  Here we've encountered another issue of
> triggering the assertion with virtio-net, but only with the old RHEL7 guest.
>
> I'm thinking whether we can make the IOTLB invalidation configurable by
> specifying whether the backend of the notifier can handle an arbitrary address
> range in some way.  So we still have the guaranteed addr_masks by default
> (since I still don't think totally breaking the addr_mask restriction is wise...),
> however we can allow the special backends to take advantage of using arbitrary
> (start, len) ranges for reasons like performance.
>
> To do that, a quick idea is to introduce a flag IOMMU_NOTIFIER_ARBITRARY_MASK
> to IOMMUNotifierFlag, to declare that the iommu notifier (and its backend) can
> take arbitrary address mask, then it can be any value and finally becomes a
> length rather than an addr_mask.  Then for every iommu notify() we can directly
> deliver whatever we've got from the upper layer to this notifier.  With the new
> flag, vhost can do iommu_notifier_init() with UNMAP|ARBITRARY_MASK so it
> declares this capability.  Then no matter for device iotlb or normal iotlb, we
> skip the complicated procedure to split a big range into small ranges that are
> with strict addr_mask, but directly deliver the message to the iommu notifier.
> E.g., we can skip the loop in vtd_address_space_unmap() if the notifier is with
> ARBITRARY flag set.
>
> Then, the assert() is not accurate either, and may become something like:
>
> diff --git a/memory.c b/memory.c
> index 2f15a4b250..99d0492509 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1906,6 +1906,7 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
>  {
>      IOMMUNotifierFlag request_flags;
>      hwaddr entry_end = entry->iova + entry->addr_mask;
> +    IOMMUTLBEntry tmp = *entry;
>
>      /*
>       * Skip the notification if the notification does not overlap
> @@ -1915,7 +1916,13 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
>          return;
>      }
>
> -    assert(entry->iova >= notifier->start && entry_end <= notifier->end);
> +    if (notifier->notifier_flags & IOMMU_NOTIFIER_ARBITRARY_MASK) {
> +        tmp.iova = MAX(tmp.iova, notifier->start);

Hi!

If I modify the tmp.iova, the guest will complain (in dmesg):
[  154.426828] DMAR: DRHD: handling fault status reg 2
[  154.427700] DMAR: [DMA Read] Request device [01:00.0] fault addr
ffff90d53fada000 [fault reason 04] Access beyond MGAW

And will not forward packets anymore on that interface. Guests are
totally ok if I only modify addr_mask.

Still investigating the issue.

Thanks!


> +        tmp.addr_mask = MIN(tmp.addr_mask, notifier->end);
> +        assert(tmp.iova <= tmp.addr_mask);
> +    } else {
> +        assert(entry->iova >= notifier->start && entry_end <= notifier->end);
> +    }
>
>      if (entry->perm & IOMMU_RW) {
>          request_flags = IOMMU_NOTIFIER_MAP;
> @@ -1924,7 +1931,7 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
>      }
>
>      if (notifier->notifier_flags & request_flags) {
> -        notifier->notify(notifier, entry);
> +        notifier->notify(notifier, &tmp);
>      }
>  }
>
> Then we can keep the assert() for e.g. vfio, while vhost can skip it and even
> get some further performance boosts.  Does that make sense?
>
> Thanks,
>
> --
> Peter Xu
>
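[Editor's note] One detail worth noting in the quoted diff: `MIN(tmp.addr_mask, notifier->end)` compares a relative quantity (a mask/length) against an absolute address, and the subsequent `assert(tmp.iova <= tmp.addr_mask)` mixes the two as well, which may be related to the guest fault reported above. A hedged sketch of clamping that keeps iova and addr_mask consistent, treating addr_mask as length - 1 per the ARBITRARY_MASK idea (the types here are simplified stand-ins, not QEMU's):

```c
#include <assert.h>
#include <stdint.h>
#include <stdbool.h>

typedef struct { uint64_t start, end; } Notifier;   /* inclusive range */
typedef struct { uint64_t iova, addr_mask; } Entry; /* end = iova + addr_mask */

/* Clamp an invalidation entry to the notifier's range, recomputing
 * addr_mask from the clamped absolute bounds so both fields stay
 * consistent.  Returns false if the ranges do not overlap. */
static bool clamp_entry(Entry *e, const Notifier *n)
{
    uint64_t entry_end = e->iova + e->addr_mask;

    if (entry_end < n->start || e->iova > n->end) {
        return false;                     /* no overlap: skip the notify */
    }
    if (e->iova < n->start) {
        e->iova = n->start;
    }
    if (entry_end > n->end) {
        entry_end = n->end;
    }
    e->addr_mask = entry_end - e->iova;   /* recompute the length - 1 */
    return true;
}
```

The key point is that both bounds are clamped in absolute terms first, and the relative addr_mask is derived last, so it can never exceed the notifier window.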




* Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
  2020-08-11 17:01     ` Eugenio Perez Martin
@ 2020-08-11 17:10       ` Eugenio Perez Martin
  0 siblings, 0 replies; 68+ messages in thread
From: Eugenio Perez Martin @ 2020-08-11 17:10 UTC (permalink / raw)
  To: Peter Xu
  Cc: Peter Maydell, Yan Zhao, Juan Quintela, Jason Wang,
	Michael S. Tsirkin, qemu-devel, Eric Auger, Paolo Bonzini

On Tue, Aug 11, 2020 at 7:01 PM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> On Fri, Jun 26, 2020 at 11:29 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > Hi, Eugenio,
> >
> > (CCing Eric, Yan and Michael too)
> >
> > On Fri, Jun 26, 2020 at 08:41:22AM +0200, Eugenio Pérez wrote:
> > > diff --git a/memory.c b/memory.c
> > > index 2f15a4b250..7f789710d2 100644
> > > --- a/memory.c
> > > +++ b/memory.c
> > > @@ -1915,8 +1915,6 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
> > >          return;
> > >      }
> > >
> > > -    assert(entry->iova >= notifier->start && entry_end <= notifier->end);
> >
> > I can understand removing the assertion should solve the issue, however imho
> > the major issue is not about this single assertion but the whole addr_mask
> > issue behind with virtio...
> >
> > For normal IOTLB invalidations, we were trying our best to always make
> > IOMMUTLBEntry contain a valid addr_mask to be 2**N-1.  E.g., that's what we're
> > doing with the loop in vtd_address_space_unmap().
> >
> > But this is not the first time that we may want to break this assumption for
> > virtio, so that we make the IOTLB a tuple of (start, len); then that len need
> > not be an address mask any more.  That seems to be more efficient for things like
> > vhost, because iotlbs there are not page based, so it'll be inefficient if we
> > always guarantee the addr_mask, since that means quite a lot more roundtrips
> > for the same range of invalidation.  Here we've encountered another issue of
> > triggering the assertion with virtio-net, but only with the old RHEL7 guest.
> >
> > I'm thinking whether we can make the IOTLB invalidation configurable by
> > specifying whether the backend of the notifier can handle an arbitrary address
> > range in some way.  So we still have the guaranteed addr_masks by default
> > (since I still don't think totally breaking the addr_mask restriction is wise...),
> > however we can allow the special backends to take advantage of using arbitrary
> > (start, len) ranges for reasons like performance.
> >
> > To do that, a quick idea is to introduce a flag IOMMU_NOTIFIER_ARBITRARY_MASK
> > to IOMMUNotifierFlag, to declare that the iommu notifier (and its backend) can
> > take arbitrary address mask, then it can be any value and finally becomes a
> > length rather than an addr_mask.  Then for every iommu notify() we can directly
> > deliver whatever we've got from the upper layer to this notifier.  With the new
> > flag, vhost can do iommu_notifier_init() with UNMAP|ARBITRARY_MASK so it
> > declares this capability.  Then no matter for device iotlb or normal iotlb, we
> > skip the complicated procedure to split a big range into small ranges that are
> > with strict addr_mask, but directly deliver the message to the iommu notifier.
> > E.g., we can skip the loop in vtd_address_space_unmap() if the notifier is with
> > ARBITRARY flag set.
> >
> > Then, the assert() is not accurate either, and may become something like:
> >
> > diff --git a/memory.c b/memory.c
> > index 2f15a4b250..99d0492509 100644
> > --- a/memory.c
> > +++ b/memory.c
> > @@ -1906,6 +1906,7 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
> >  {
> >      IOMMUNotifierFlag request_flags;
> >      hwaddr entry_end = entry->iova + entry->addr_mask;
> > +    IOMMUTLBEntry tmp = *entry;
> >
> >      /*
> >       * Skip the notification if the notification does not overlap
> > @@ -1915,7 +1916,13 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
> >          return;
> >      }
> >
> > -    assert(entry->iova >= notifier->start && entry_end <= notifier->end);
> > +    if (notifier->notifier_flags & IOMMU_NOTIFIER_ARBITRARY_MASK) {
> > +        tmp.iova = MAX(tmp.iova, notifier->start);
>
> Hi!
>
> If I modify the tmp.iova, the guest will complain (in dmesg):
> [  154.426828] DMAR: DRHD: handling fault status reg 2
> [  154.427700] DMAR: [DMA Read] Request device [01:00.0] fault addr
> ffff90d53fada000 [fault reason 04] Access beyond MGAW
>
> And will not forward packets anymore on that interface. Guests are
> totally ok if I only modify addr_mask.
>
> Still investigating the issue.
>
> Thanks!
>

Sorry it seems that I lost the nitpick Yan pointed out :).

Sending RFC v3.

>
> > +        tmp.addr_mask = MIN(tmp.addr_mask, notifier->end);
> > +        assert(tmp.iova <= tmp.addr_mask);
> > +    } else {
> > +        assert(entry->iova >= notifier->start && entry_end <= notifier->end);
> > +    }
> >
> >      if (entry->perm & IOMMU_RW) {
> >          request_flags = IOMMU_NOTIFIER_MAP;
> > @@ -1924,7 +1931,7 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
> >      }
> >
> >      if (notifier->notifier_flags & request_flags) {
> > -        notifier->notify(notifier, entry);
> > +        notifier->notify(notifier, &tmp);
> >      }
> >  }
> >
> > Then we can keep the assert() for e.g. vfio, while vhost can skip it and even
> > get some further performance boosts.  Does that make sense?
> >
> > Thanks,
> >
> > --
> > Peter Xu
> >




* [RFC v3 0/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
  2020-06-26  6:41 [RFC v2 0/1] memory: Delete assertion in memory_region_unregister_iommu_notifier Eugenio Pérez
  2020-06-26  6:41 ` [RFC v2 1/1] " Eugenio Pérez
  2020-06-29 15:05 ` [RFC v2 0/1] " Paolo Bonzini
@ 2020-08-11 17:55 ` Eugenio Pérez
  2020-08-11 17:55   ` [RFC v3 1/1] memory: Skip bad range assertion if notifier supports arbitrary masks Eugenio Pérez
  2020-08-11 18:10   ` [RFC v3 0/1] memory: Delete assertion in memory_region_unregister_iommu_notifier Eugenio Perez Martin
  2 siblings, 2 replies; 68+ messages in thread
From: Eugenio Pérez @ 2020-08-11 17:55 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: Peter Maydell, Yan Zhao, Michael S. Tsirkin, Jason Wang,
	Juan Quintela, Eric Auger, Avi Kivity, Paolo Bonzini

I am able to hit this assertion when a Red Hat 7 guest virtio_net device
raises an "Invalidation" of all the TLB entries. This happens in the
guest's startup if 'intel_iommu=on' argument is passed to the guest
kernel and right IOMMU/ATS devices are declared in qemu's command line.

Command line:
/home/qemu/x86_64-softmmu/qemu-system-x86_64 -name \
guest=rhel7-test,debug-threads=on -machine \
pc-q35-5.1,accel=kvm,usb=off,dump-guest-core=off,kernel_irqchip=split \
-cpu \
Broadwell,vme=on,ss=on,vmx=on,f16c=on,rdrand=on,hypervisor=on,arat=on,tsc-adjust=on,umip=on,arch-capabilities=on,xsaveopt=on,pdpe1gb=on,abm=on,skip-l1dfl-vmentry=on,rtm=on,hle=on \
-m 8096 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1 -uuid \
d022ecbf-679e-4755-87ce-eb87fc5bbc5d -display none -no-user-config \
-nodefaults -rtc base=utc,driftfix=slew -global \
kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -global \
ICH9-LPC.disable_s3=1 -global ICH9-LPC.disable_s4=1 -boot strict=on \
-device intel-iommu,intremap=on,device-iotlb=on -device \
pcie-root-port,port=0x8,chassis=1,id=pci.1,bus=pcie.0,multifunction=on,addr=0x1 \
-device \
pcie-root-port,port=0x9,chassis=2,id=pci.2,bus=pcie.0,addr=0x1.0x1 \
-device \
pcie-root-port,port=0xa,chassis=3,id=pci.3,bus=pcie.0,addr=0x1.0x2 \
-device \
pcie-root-port,port=0xb,chassis=4,id=pci.4,bus=pcie.0,addr=0x1.0x3 \
-device \
pcie-root-port,port=0xc,chassis=5,id=pci.5,bus=pcie.0,addr=0x1.0x4 \
-device \
pcie-root-port,port=0xd,chassis=6,id=pci.6,bus=pcie.0,addr=0x1.0x5 \
-device \
pcie-root-port,port=0xe,chassis=7,id=pci.7,bus=pcie.0,addr=0x1.0x6 \
-device qemu-xhci,p2=15,p3=15,id=usb,bus=pci.2,addr=0x0 -device \
virtio-serial-pci,id=virtio-serial0,bus=pci.3,addr=0x0 -drive \
file=/home/virtio-test2.qcow2,format=qcow2,if=none,id=drive-virtio-disk0 \
-device \
virtio-blk-pci,scsi=off,bus=pci.4,addr=0x0,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
-netdev tap,id=hostnet0,vhost=on,vhostforce=on -device \
virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:0d:1d:f2,bus=pci.1,addr=0x0,iommu_platform=on,ats=on \
-device virtio-balloon-pci,id=balloon0,bus=pci.5,addr=0x0 -object \
rng-random,id=objrng0,filename=/dev/urandom -device \
virtio-rng-pci,rng=objrng0,id=rng0,bus=pci.6,addr=0x0 -s -msg \
timestamp=on

Full backtrace:

    at /home/qemu/hw/i386/intel_iommu.c:2468
    (mr=0x555557609330, addr=136, value=0x7ffde5dfe478, size=4, shift=0, mask=4294967295, attrs=...) at /home/qemu/memory.c:483
    (addr=136, value=0x7ffde5dfe478, size=4, access_size_min=4, access_size_max=8, access_fn=
    0x555555883d38 <memory_region_write_accessor>, mr=0x555557609330, attrs=...) at /home/qemu/memory.c:544
    at /home/qemu/memory.c:1476
    (fv=0x7ffde00935d0, addr=4275634312, attrs=..., ptr=0x7ffff7ff0028, len=4, addr1=136, l=4, mr=0x555557609330) at /home/qemu/exec.c:3146
    at /home/qemu/exec.c:3186
    (as=0x5555567ca640 <address_space_memory>, addr=4275634312, attrs=..., buf=0x7ffff7ff0028, len=4) at /home/qemu/exec.c:3277
    (as=0x5555567ca640 <address_space_memory>, addr=4275634312, attrs=..., buf=0x7ffff7ff0028, len=4, is_write=true)
    at /home/qemu/exec.c:3287

--

Tested with vhost-net, with a linux bridge to forward packets.
Forwarding with vhostuser interfaces + dpdk-testpmd io forwarding mode also seems
broken in v5.1.0-rc3.

v3: Skip the assertion in case the notifier is an IOTLB one, since those can
    manage arbitrary ranges. Using a flag in the notifier for now, as Peter
    suggested.

v2: Actually delete the assertion instead of just commenting it out using C99

Eugenio Pérez (1):
  memory: Skip bad range assertion if notifier supports arbitrary masks

 hw/virtio/vhost.c     |  2 +-
 include/exec/memory.h |  2 ++
 softmmu/memory.c      | 10 ++++++++--
 3 files changed, 11 insertions(+), 3 deletions(-)

-- 
2.18.1




* [RFC v3 1/1] memory: Skip bad range assertion if notifier supports arbitrary masks
  2020-08-11 17:55 ` [RFC v3 " Eugenio Pérez
@ 2020-08-11 17:55   ` Eugenio Pérez
  2020-08-12  2:24     ` Jason Wang
  2020-08-11 18:10   ` [RFC v3 0/1] memory: Delete assertion in memory_region_unregister_iommu_notifier Eugenio Perez Martin
  1 sibling, 1 reply; 68+ messages in thread
From: Eugenio Pérez @ 2020-08-11 17:55 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: Peter Maydell, Yan Zhao, Michael S. Tsirkin, Jason Wang,
	Juan Quintela, Eric Auger, Avi Kivity, Paolo Bonzini

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost.c     |  2 +-
 include/exec/memory.h |  2 ++
 softmmu/memory.c      | 10 ++++++++--
 3 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 1a1384e7a6..e74ad9e09b 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -729,7 +729,7 @@ static void vhost_iommu_region_add(MemoryListener *listener,
     iommu_idx = memory_region_iommu_attrs_to_index(iommu_mr,
                                                    MEMTXATTRS_UNSPECIFIED);
     iommu_notifier_init(&iommu->n, vhost_iommu_unmap_notify,
-                        IOMMU_NOTIFIER_UNMAP,
+                        IOMMU_NOTIFIER_UNMAP | IOMMU_NOTIFIER_IOTLB,
                         section->offset_within_region,
                         int128_get64(end),
                         iommu_idx);
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 307e527835..4d94c1e984 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -87,6 +87,8 @@ typedef enum {
     IOMMU_NOTIFIER_UNMAP = 0x1,
     /* Notify entry changes (newly created entries) */
     IOMMU_NOTIFIER_MAP = 0x2,
+    /* Notify changes on IOTLB entries */
+    IOMMU_NOTIFIER_IOTLB = 0x04,
 } IOMMUNotifierFlag;
 
 #define IOMMU_NOTIFIER_ALL (IOMMU_NOTIFIER_MAP | IOMMU_NOTIFIER_UNMAP)
diff --git a/softmmu/memory.c b/softmmu/memory.c
index af25987518..e2c5f6d0e7 100644
--- a/softmmu/memory.c
+++ b/softmmu/memory.c
@@ -1895,6 +1895,7 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
 {
     IOMMUNotifierFlag request_flags;
     hwaddr entry_end = entry->iova + entry->addr_mask;
+    IOMMUTLBEntry tmp = *entry;
 
     /*
      * Skip the notification if the notification does not overlap
@@ -1904,7 +1905,12 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
         return;
     }
 
-    assert(entry->iova >= notifier->start && entry_end <= notifier->end);
+    if (notifier->notifier_flags & IOMMU_NOTIFIER_IOTLB) {
+        tmp.iova = MAX(tmp.iova, notifier->start);
+        tmp.addr_mask = MIN(entry_end, notifier->end) - tmp.iova;
+    } else {
+        assert(entry->iova >= notifier->start && entry_end <= notifier->end);
+    }
 
     if (entry->perm & IOMMU_RW) {
         request_flags = IOMMU_NOTIFIER_MAP;
@@ -1913,7 +1919,7 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
     }
 
     if (notifier->notifier_flags & request_flags) {
-        notifier->notify(notifier, entry);
+        notifier->notify(notifier, &tmp);
     }
 }
 
-- 
2.18.1




* Re: [RFC v3 0/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
  2020-08-11 17:55 ` [RFC v3 " Eugenio Pérez
  2020-08-11 17:55   ` [RFC v3 1/1] memory: Skip bad range assertion if notifier supports arbitrary masks Eugenio Pérez
@ 2020-08-11 18:10   ` Eugenio Perez Martin
  2020-08-11 19:27     ` Peter Xu
  1 sibling, 1 reply; 68+ messages in thread
From: Eugenio Perez Martin @ 2020-08-11 18:10 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: Peter Maydell, Yan Zhao, Juan Quintela, Jason Wang,
	Michael S. Tsirkin, Eric Auger, Avi Kivity, Paolo Bonzini

On Tue, Aug 11, 2020 at 7:56 PM Eugenio Pérez <eperezma@redhat.com> wrote:
>
> I am able to hit this assertion when a Red Hat 7 guest virtio_net device
> raises an "Invalidation" of all the TLB entries. This happens in the
> guest's startup if 'intel_iommu=on' argument is passed to the guest
> kernel and right IOMMU/ATS devices are declared in qemu's command line.
>
> Command line:
> /home/qemu/x86_64-softmmu/qemu-system-x86_64 -name \
> guest=rhel7-test,debug-threads=on -machine \
> pc-q35-5.1,accel=kvm,usb=off,dump-guest-core=off,kernel_irqchip=split \
> -cpu \
> Broadwell,vme=on,ss=on,vmx=on,f16c=on,rdrand=on,hypervisor=on,arat=on,tsc-adjust=on,umip=on,arch-capabilities=on,xsaveopt=on,pdpe1gb=on,abm=on,skip-l1dfl-vmentry=on,rtm=on,hle=on \
> -m 8096 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1 -uuid \
> d022ecbf-679e-4755-87ce-eb87fc5bbc5d -display none -no-user-config \
> -nodefaults -rtc base=utc,driftfix=slew -global \
> kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -global \
> ICH9-LPC.disable_s3=1 -global ICH9-LPC.disable_s4=1 -boot strict=on \
> -device intel-iommu,intremap=on,device-iotlb=on -device \
> pcie-root-port,port=0x8,chassis=1,id=pci.1,bus=pcie.0,multifunction=on,addr=0x1 \
> -device \
> pcie-root-port,port=0x9,chassis=2,id=pci.2,bus=pcie.0,addr=0x1.0x1 \
> -device \
> pcie-root-port,port=0xa,chassis=3,id=pci.3,bus=pcie.0,addr=0x1.0x2 \
> -device \
> pcie-root-port,port=0xb,chassis=4,id=pci.4,bus=pcie.0,addr=0x1.0x3 \
> -device \
> pcie-root-port,port=0xc,chassis=5,id=pci.5,bus=pcie.0,addr=0x1.0x4 \
> -device \
> pcie-root-port,port=0xd,chassis=6,id=pci.6,bus=pcie.0,addr=0x1.0x5 \
> -device \
> pcie-root-port,port=0xe,chassis=7,id=pci.7,bus=pcie.0,addr=0x1.0x6 \
> -device qemu-xhci,p2=15,p3=15,id=usb,bus=pci.2,addr=0x0 -device \
> virtio-serial-pci,id=virtio-serial0,bus=pci.3,addr=0x0 -drive \
> file=/home/virtio-test2.qcow2,format=qcow2,if=none,id=drive-virtio-disk0 \
> -device \
> virtio-blk-pci,scsi=off,bus=pci.4,addr=0x0,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
> -netdev tap,id=hostnet0,vhost=on,vhostforce=on -device \
> virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:0d:1d:f2,bus=pci.1,addr=0x0,iommu_platform=on,ats=on \
> -device virtio-balloon-pci,id=balloon0,bus=pci.5,addr=0x0 -object \
> rng-random,id=objrng0,filename=/dev/urandom -device \
> virtio-rng-pci,rng=objrng0,id=rng0,bus=pci.6,addr=0x0 -s -msg \
> timestamp=on
>
> Full backtrace:
>
>     at /home/qemu/hw/i386/intel_iommu.c:2468
>     (mr=0x555557609330, addr=136, value=0x7ffde5dfe478, size=4, shift=0, mask=4294967295, attrs=...) at /home/qemu/memory.c:483
>     (addr=136, value=0x7ffde5dfe478, size=4, access_size_min=4, access_size_max=8, access_fn=
>     0x555555883d38 <memory_region_write_accessor>, mr=0x555557609330, attrs=...) at /home/qemu/memory.c:544
>     at /home/qemu/memory.c:1476
>     (fv=0x7ffde00935d0, addr=4275634312, attrs=..., ptr=0x7ffff7ff0028, len=4, addr1=136, l=4, mr=0x555557609330) at /home/qemu/exec.c:3146
>     at /home/qemu/exec.c:3186
>     (as=0x5555567ca640 <address_space_memory>, addr=4275634312, attrs=..., buf=0x7ffff7ff0028, len=4) at /home/qemu/exec.c:3277
>     (as=0x5555567ca640 <address_space_memory>, addr=4275634312, attrs=..., buf=0x7ffff7ff0028, len=4, is_write=true)
>     at /home/qemu/exec.c:3287
>
> --
>
> Tested with vhost-net, with a linux bridge to forward packets.
> Forwarding with vhostuser interfaces + dpdk-testpmd io forwarding mode seems
> broken also in v5.1.0-rc3.
>
> v3: Skip the assertion in case notifier is a IOTLB one, since they can manage
>     arbitrary ranges. Using a flag in the notifier for now, as Peter suggested.
>
> v2: Actually delete assertion instead of just commenting out using C99
>
> Eugenio Pérez (1):
>   memory: Skip bad range assertion if notifier supports arbitrary masks
>
>  hw/virtio/vhost.c     |  2 +-
>  include/exec/memory.h |  2 ++
>  softmmu/memory.c      | 10 ++++++++--
>  3 files changed, 11 insertions(+), 3 deletions(-)
>
> --
> 2.18.1
>
>

Using this patch as a reference, I'm having trouble understanding:

- I'm not sure the flag name clearly expresses the notifier's capability.
- What would be the advantages of using another field (NotifierType?)
in the notifier to express that it accepts arbitrary ranges for
unmapping? (If I understood Jason's proposal correctly.)
- Is it possible (or advisable) to skip all the page splitting in
vtd_page_walk if the memory range notifier supports these arbitrary
ranges? What would be the disadvantages? (Maybe in a future patch.) It
seems advisable to me, but I would like to double-confirm.

I don't think I have missed anything, but please let me know otherwise.

Thanks!

PS: Sorry I forgot to recover the backtrace properly, it will be
included in the next RFC/patch version. In case somebody misses it, it
is here: https://lists.gnu.org/archive/html/qemu-devel/2020-07/msg01015.html
.




* Re: [RFC v3 0/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
  2020-08-11 18:10   ` [RFC v3 0/1] memory: Delete assertion in memory_region_unregister_iommu_notifier Eugenio Perez Martin
@ 2020-08-11 19:27     ` Peter Xu
  2020-08-12 14:33       ` Eugenio Perez Martin
  0 siblings, 1 reply; 68+ messages in thread
From: Peter Xu @ 2020-08-11 19:27 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Peter Maydell, Yan Zhao, Juan Quintela, Jason Wang,
	Michael S. Tsirkin, qemu-devel, Eric Auger, Avi Kivity,
	Paolo Bonzini

Hi, Eugenio,

On Tue, Aug 11, 2020 at 08:10:44PM +0200, Eugenio Perez Martin wrote:
> Using this patch as a reference, I'm having problems to understand:
> 
> - I'm not sure that the flag name expresses clearly the notifier capability.

The old code is kind of messed up for dev-iotlb invalidations, by always
sending UNMAP notifications for both iotlb and dev-iotlb invalidations.

Now if we introduce the new DEV_IOTLB type, we can separate the two:

  - We notify IOMMU_NOTIFIER_UNMAP for iotlb invalidations

  - We notify IOMMU_NOTIFIER_DEV_IOTLB for dev-iotlb invalidations

Vhost should always be used with ats=on when the vIOMMU is enabled, so it
will enable device iotlb.  Then it does not need iotlb (UNMAP) invalidations
any more, because they would normally be duplicated (one invalidates the
vIOMMU cache, the other invalidates the device cache).

Also, we can drop the UNMAP type for vhost if we introduce DEV_IOTLB.  It
works just like real hardware: the device won't be able to receive iotlb
invalidation messages, only device iotlb invalidation messages.  Here
vhost (or, virtio-pci) is the device.

> - What would be the advantages of using another field (NotifierType?)
> in the notifier to express that it accepts arbitrary ranges for
> unmapping? (If I understood correctly Jason's proposal)

(Please refer to above too..)

> - Is it possible (or advisable) to skip all the page splitting in
> vtd_page_walk if the memory range notifier supports these arbitrary
> ranges? What would be the disadvantages? (Maybe in a future patch). It
> seems it is advisable to me, but I would like to double confirm.

vtd_page_walk is not used for dev-iotlb, we don't need to change that.  We also
want to explicitly keep the page granularity of vtd_page_walk for the other
IOMMU notifiers, e.g. vfio.

Though we'll need to modify vtd_process_device_iotlb_desc() to only send
notifications to the notifiers that registered with DEV_IOTLB flag.

Hope it helps..

Thanks,

-- 
Peter Xu




* Re: [RFC v3 1/1] memory: Skip bad range assertion if notifier supports arbitrary masks
  2020-08-11 17:55   ` [RFC v3 1/1] memory: Skip bad range assertion if notifier supports arbitrary masks Eugenio Pérez
@ 2020-08-12  2:24     ` Jason Wang
  2020-08-12  8:49       ` Eugenio Perez Martin
  0 siblings, 1 reply; 68+ messages in thread
From: Jason Wang @ 2020-08-12  2:24 UTC (permalink / raw)
  To: Eugenio Pérez, Peter Xu, qemu-devel
  Cc: Peter Maydell, Yan Zhao, Juan Quintela, Michael S. Tsirkin,
	Eric Auger, Avi Kivity, Paolo Bonzini


On 2020/8/12 上午1:55, Eugenio Pérez wrote:
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>   hw/virtio/vhost.c     |  2 +-
>   include/exec/memory.h |  2 ++
>   softmmu/memory.c      | 10 ++++++++--
>   3 files changed, 11 insertions(+), 3 deletions(-)
>
> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> index 1a1384e7a6..e74ad9e09b 100644
> --- a/hw/virtio/vhost.c
> +++ b/hw/virtio/vhost.c
> @@ -729,7 +729,7 @@ static void vhost_iommu_region_add(MemoryListener *listener,
>       iommu_idx = memory_region_iommu_attrs_to_index(iommu_mr,
>                                                      MEMTXATTRS_UNSPECIFIED);
>       iommu_notifier_init(&iommu->n, vhost_iommu_unmap_notify,
> -                        IOMMU_NOTIFIER_UNMAP,
> +                        IOMMU_NOTIFIER_UNMAP | IOMMU_NOTIFIER_IOTLB,


I think we can safely drop IOMMU_NOTIFIER_UNMAP here since device IOTLB 
is sufficient.

Btw, IOMMU_NOTIFIER_IOTLB is kind of confusing, maybe something like 
IOMMU_NOTIFIER_DEVIOTLB.


>                           section->offset_within_region,
>                           int128_get64(end),
>                           iommu_idx);
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index 307e527835..4d94c1e984 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -87,6 +87,8 @@ typedef enum {
>       IOMMU_NOTIFIER_UNMAP = 0x1,
>       /* Notify entry changes (newly created entries) */
>       IOMMU_NOTIFIER_MAP = 0x2,
> +    /* Notify changes on IOTLB entries */
> +    IOMMU_NOTIFIER_IOTLB = 0x04,
>   } IOMMUNotifierFlag;
>   
>   #define IOMMU_NOTIFIER_ALL (IOMMU_NOTIFIER_MAP | IOMMU_NOTIFIER_UNMAP)
> diff --git a/softmmu/memory.c b/softmmu/memory.c
> index af25987518..e2c5f6d0e7 100644
> --- a/softmmu/memory.c
> +++ b/softmmu/memory.c
> @@ -1895,6 +1895,7 @@ void memory_region_notify_one(IOMMUNotifier *notifier,


(we probably need a better name for this function, at least something 
like "memory_region_iommu_notify_one").


>   {
>       IOMMUNotifierFlag request_flags;
>       hwaddr entry_end = entry->iova + entry->addr_mask;
> +    IOMMUTLBEntry tmp = *entry;
>   
>       /*
>        * Skip the notification if the notification does not overlap
> @@ -1904,7 +1905,12 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
>           return;
>       }
>   
> -    assert(entry->iova >= notifier->start && entry_end <= notifier->end);
> +    if (notifier->notifier_flags & IOMMU_NOTIFIER_IOTLB) {
> +        tmp.iova = MAX(tmp.iova, notifier->start);
> +        tmp.addr_mask = MIN(entry_end, notifier->end) - tmp.iova;


Any reason for doing such re-calculation here, a comment would be helpful.


> +    } else {
> +        assert(entry->iova >= notifier->start && entry_end <= notifier->end);


I wonder if it's better to convert the assert to some kind of log or 
warning here.

Thanks


> +    }
>   
>       if (entry->perm & IOMMU_RW) {
>           request_flags = IOMMU_NOTIFIER_MAP;
> @@ -1913,7 +1919,7 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
>       }
>   
>       if (notifier->notifier_flags & request_flags) {
> -        notifier->notify(notifier, entry);
> +        notifier->notify(notifier, &tmp);
>       }
>   }
>   




* Re: [RFC v3 1/1] memory: Skip bad range assertion if notifier supports arbitrary masks
  2020-08-12  2:24     ` Jason Wang
@ 2020-08-12  8:49       ` Eugenio Perez Martin
  2020-08-18 14:24         ` Eugenio Perez Martin
  0 siblings, 1 reply; 68+ messages in thread
From: Eugenio Perez Martin @ 2020-08-12  8:49 UTC (permalink / raw)
  To: Jason Wang
  Cc: Peter Maydell, Yan Zhao, Juan Quintela, Michael S. Tsirkin,
	qemu-devel, Peter Xu, Eric Auger, Avi Kivity, Paolo Bonzini

On Wed, Aug 12, 2020 at 4:24 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2020/8/12 上午1:55, Eugenio Pérez wrote:
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> >   hw/virtio/vhost.c     |  2 +-
> >   include/exec/memory.h |  2 ++
> >   softmmu/memory.c      | 10 ++++++++--
> >   3 files changed, 11 insertions(+), 3 deletions(-)
> >
> > diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> > index 1a1384e7a6..e74ad9e09b 100644
> > --- a/hw/virtio/vhost.c
> > +++ b/hw/virtio/vhost.c
> > @@ -729,7 +729,7 @@ static void vhost_iommu_region_add(MemoryListener *listener,
> >       iommu_idx = memory_region_iommu_attrs_to_index(iommu_mr,
> >                                                      MEMTXATTRS_UNSPECIFIED);
> >       iommu_notifier_init(&iommu->n, vhost_iommu_unmap_notify,
> > -                        IOMMU_NOTIFIER_UNMAP,
> > +                        IOMMU_NOTIFIER_UNMAP | IOMMU_NOTIFIER_IOTLB,
>
>
> I think we can safely drop IOMMU_NOTIFIER_UNMAP here since device IOTLB
> is sufficient.
>
> Btw, IOMMU_NOTIFIER_IOTLB is kind of confusing, maybe something like
> IOMMU_NOTIFIER_DEVIOTLB.
>

Got it, will change.

>
> >                           section->offset_within_region,
> >                           int128_get64(end),
> >                           iommu_idx);
> > diff --git a/include/exec/memory.h b/include/exec/memory.h
> > index 307e527835..4d94c1e984 100644
> > --- a/include/exec/memory.h
> > +++ b/include/exec/memory.h
> > @@ -87,6 +87,8 @@ typedef enum {
> >       IOMMU_NOTIFIER_UNMAP = 0x1,
> >       /* Notify entry changes (newly created entries) */
> >       IOMMU_NOTIFIER_MAP = 0x2,
> > +    /* Notify changes on IOTLB entries */
> > +    IOMMU_NOTIFIER_IOTLB = 0x04,
> >   } IOMMUNotifierFlag;
> >
> >   #define IOMMU_NOTIFIER_ALL (IOMMU_NOTIFIER_MAP | IOMMU_NOTIFIER_UNMAP)
> > diff --git a/softmmu/memory.c b/softmmu/memory.c
> > index af25987518..e2c5f6d0e7 100644
> > --- a/softmmu/memory.c
> > +++ b/softmmu/memory.c
> > @@ -1895,6 +1895,7 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
>
>
> (we probably need a better name of this function, at least something
> like "memory_region_iommu_notify_one").
>

Ok will change.

>
> >   {
> >       IOMMUNotifierFlag request_flags;
> >       hwaddr entry_end = entry->iova + entry->addr_mask;
> > +    IOMMUTLBEntry tmp = *entry;
> >
> >       /*
> >        * Skip the notification if the notification does not overlap
> > @@ -1904,7 +1905,12 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
> >           return;
> >       }
> >
> > -    assert(entry->iova >= notifier->start && entry_end <= notifier->end);
> > +    if (notifier->notifier_flags & IOMMU_NOTIFIER_IOTLB) {
> > +        tmp.iova = MAX(tmp.iova, notifier->start);
> > +        tmp.addr_mask = MIN(entry_end, notifier->end) - tmp.iova;
>
>
> Any reason for doing such re-calculation here, a comment would be helpful.
>

It was proposed by Peter; I understand it as limiting the
address+range we pass to the notifier. Although vhost seems to support
it as long as the range contains (notifier->start, notifier->end), a
future notifier might not.

It could be done as iommu_entry_crop(IOMMUTLBEntry *entry, const
IOMMUNotifier *notifier), though.

>
> > +    } else {
> > +        assert(entry->iova >= notifier->start && entry_end <= notifier->end);
>
>
> I wonder if it's better to convert the assert so some kind of log or
> warn here.
>

I think that if we transform that assert into a log, we should also tell
the guest that something went wrong. Or would it be enough to notify
the bad range?

If a malicious guest cannot reach that point, I think that leaving it
as an assertion allows us to detect the failure earlier
(assert early and assert often).

Thanks!

> Thanks
>
>
> > +    }
> >
> >       if (entry->perm & IOMMU_RW) {
> >           request_flags = IOMMU_NOTIFIER_MAP;
> > @@ -1913,7 +1919,7 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
> >       }
> >
> >       if (notifier->notifier_flags & request_flags) {
> > -        notifier->notify(notifier, entry);
> > +        notifier->notify(notifier, &tmp);
> >       }
> >   }
> >
>




* Re: [RFC v3 0/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
  2020-08-11 19:27     ` Peter Xu
@ 2020-08-12 14:33       ` Eugenio Perez Martin
  2020-08-12 21:12         ` Peter Xu
  0 siblings, 1 reply; 68+ messages in thread
From: Eugenio Perez Martin @ 2020-08-12 14:33 UTC (permalink / raw)
  To: Peter Xu
  Cc: Peter Maydell, Yan Zhao, Juan Quintela, Jason Wang,
	Michael S. Tsirkin, qemu-devel, Eric Auger, Avi Kivity,
	Paolo Bonzini

On Tue, Aug 11, 2020 at 9:28 PM Peter Xu <peterx@redhat.com> wrote:
>
> Hi, Eugenio,
>
> On Tue, Aug 11, 2020 at 08:10:44PM +0200, Eugenio Perez Martin wrote:
> > Using this patch as a reference, I'm having problems to understand:
> >
> > - I'm not sure that the flag name expresses clearly the notifier capability.
>
> The old code is kind of messed up for dev-iotlb invalidations, by always
> sending UNMAP notifications for both iotlb and dev-iotlb invalidations.
>
> Now if we introduce the new DEV_IOTLB type, we can separate the two:
>
>   - We notify IOMMU_NOTIFIER_UNMAP for iotlb invalidations
>
>   - We notify IOMMU_NOTIFIER_DEV_IOTLB for dev-iotlb invalidations
>
> Vhost should always be with ats=on when vIOMMU enabled, so it will enable
> device iotlb.  Then it does not need iotlb (UNMAP) invalidation any more
> because they'll normally be duplicated (one is to invalidate vIOMMU cache, one
> is to invalidate device cache).
>
> Also, we can drop UNMAP type for vhost if we introduce DEV_IOTLB.  It works
> just like on the real hardwares - the device won't be able to receive iotlb
> invalidation messages, but only device iotlb invalidation messages.  Here
> vhost (or, virtio-pci) is the device.
>
> > - What would be the advantages of using another field (NotifierType?)
> > in the notifier to express that it accepts arbitrary ranges for
> > unmapping? (If I understood correctly Jason's proposal)
>
> (Please refer to above too..)
>
> > - Is it possible (or advisable) to skip all the page splitting in
> > vtd_page_walk if the memory range notifier supports these arbitrary
> > ranges? What would be the disadvantages? (Maybe in a future patch). It
> > seems it is advisable to me, but I would like to double confirm.
>
> vtd_page_walk is not used for dev-iotlb, we don't need to change that.  We also
> want to explicitly keep the page granularity of vtd_page_walk for the other
> IOMMU notifiers, e.g. vfio.
>

I'm not sure I understand.

I have here a backtrace in a regular call (not [0,~0ULL]):
#0  0x000055555597ca63 in memory_region_notify_one_iommu
(notifier=0x7fffe4976f08, entry=0x7fffddff9d20)
    at /home/qemu/softmmu/memory.c:1895
#1  0x000055555597cc87 in memory_region_notify_iommu
(iommu_mr=0x55555728f2e0, iommu_idx=0, entry=...) at
/home/qemu/softmmu/memory.c:1938
#2  0x000055555594b95a in vtd_sync_shadow_page_hook
(entry=0x7fffddff9e70, private=0x55555728f2e0) at
/home/qemu/hw/i386/intel_iommu.c:1436
#3  0x000055555594af7b in vtd_page_walk_one (entry=0x7fffddff9e70,
info=0x7fffddffa140) at /home/qemu/hw/i386/intel_iommu.c:1173
#4  0x000055555594b2b3 in vtd_page_walk_level
    (addr=10531758080, start=4292870144, end=4294967296, level=1,
read=true, write=true, info=0x7fffddffa140)
    at /home/qemu/hw/i386/intel_iommu.c:1254
#5  0x000055555594b225 in vtd_page_walk_level
    (addr=10530922496, start=3221225472, end=4294967296, level=2,
read=true, write=true, info=0x7fffddffa140)
    at /home/qemu/hw/i386/intel_iommu.c:1236
#6  0x000055555594b225 in vtd_page_walk_level
    (addr=10529021952, start=0, end=549755813888, level=3, read=true,
write=true, info=0x7fffddffa140)
    at /home/qemu/hw/i386/intel_iommu.c:1236
#7  0x000055555594b3f8 in vtd_page_walk (s=0x555557565210,
ce=0x7fffddffa1a0, start=0, end=549755813888, info=0x7fffddffa140)
    at /home/qemu/hw/i386/intel_iommu.c:1293
#8  0x000055555594ba77 in vtd_sync_shadow_page_table_range
(vtd_as=0x55555728f270, ce=0x7fffddffa1a0, addr=0,
size=18446744073709551615)
    at /home/qemu/hw/i386/intel_iommu.c:1467
#9  0x000055555594bb50 in vtd_sync_shadow_page_table
(vtd_as=0x55555728f270) at /home/qemu/hw/i386/intel_iommu.c:1498
#10 0x000055555594cc5f in vtd_iotlb_domain_invalidate
(s=0x555557565210, domain_id=3) at
/home/qemu/hw/i386/intel_iommu.c:1965
#11 0x000055555594dbae in vtd_process_iotlb_desc (s=0x555557565210,
inv_desc=0x7fffddffa2b0) at /home/qemu/hw/i386/intel_iommu.c:2371
#12 0x000055555594dfd3 in vtd_process_inv_desc (s=0x555557565210) at
/home/qemu/hw/i386/intel_iommu.c:2499
#13 0x000055555594e1e9 in vtd_fetch_inv_desc (s=0x555557565210) at
/home/qemu/hw/i386/intel_iommu.c:2568
#14 0x000055555594e330 in vtd_handle_iqt_write (s=0x555557565210) at
/home/qemu/hw/i386/intel_iommu.c:2595
#15 0x000055555594ed90 in vtd_mem_write (opaque=0x555557565210,
addr=136, val=1888, size=4) at /home/qemu/hw/i386/intel_iommu.c:2842
#16 0x00005555559787b9 in memory_region_write_accessor
    (mr=0x555557565540, addr=136, value=0x7fffddffa478, size=4,
shift=0, mask=4294967295, attrs=...) at
/home/qemu/softmmu/memory.c:483
#17 0x00005555559789d7 in access_with_adjusted_size
    (addr=136, value=0x7fffddffa478, size=4, access_size_min=4,
access_size_max=8, access_fn=
    0x5555559786da <memory_region_write_accessor>, mr=0x555557565540,
attrs=...) at /home/qemu/softmmu/memory.c:544
#18 0x000055555597b8a5 in memory_region_dispatch_write
(mr=0x555557565540, addr=136, data=1888, op=MO_32, attrs=...)
    at /home/qemu/softmmu/memory.c:1465
#19 0x000055555582b1bf in flatview_write_continue
    (fv=0x7fffc447c470, addr=4275634312, attrs=...,
ptr=0x7ffff7dfd028, len=4, addr1=136, l=4, mr=0x555557565540) at
/home/qemu/exec.c:3176
#20 0x000055555582b304 in flatview_write (fv=0x7fffc447c470,
addr=4275634312, attrs=..., buf=0x7ffff7dfd028, len=4)
    at /home/qemu/exec.c:3216
#21 0x000055555582b659 in address_space_write
    (as=0x5555567a9380 <address_space_memory>, addr=4275634312,
attrs=..., buf=0x7ffff7dfd028, len=4) at /home/qemu/exec.c:3307
#22 0x000055555582b6c6 in address_space_rw
    (as=0x5555567a9380 <address_space_memory>, addr=4275634312,
attrs=..., buf=0x7ffff7dfd028, len=4, is_write=true)
    at /home/qemu/exec.c:3317
#23 0x000055555588e3b8 in kvm_cpu_exec (cpu=0x555556bfe9f0) at
/home/qemu/accel/kvm/kvm-all.c:2518
#24 0x0000555555972bcf in qemu_kvm_cpu_thread_fn (arg=0x555556bfe9f0)
at /home/qemu/softmmu/cpus.c:1188
#25 0x0000555555e08fbd in qemu_thread_start (args=0x555556c24c60) at
util/qemu-thread-posix.c:521
#26 0x00007ffff55a714a in start_thread () at /lib64/libpthread.so.0
#27 0x00007ffff52d8f23 in clone () at /lib64/libc.so.6

with entry = {target_as = 0x5555567a9380, iova = 0xfff0b000,
translated_addr = 0x0, addr_mask = 0xfff, perm = 0x0}

Here we got 3 levels of vtd_page_walk (frames #4-#6). The #6 parameters are:

(addr=10529021952, start=0, end=549755813888, level=3, read=true, write=true,
    info=0x7fffddffa140)

If I understand correctly, the while (iova < end) {} loop in
vtd_page_walk will break the big range into small pages (4K because of
level=1, and (end - iova) / subpage_size = 245 pages or iterations).
That could be a lot of write(2) calls in vhost_kernel_send_device_iotlb_msg
in the worst case, or a lot of useless returns from
memory_region_notify_one_iommu because of (notifier->start > entry_end
|| notifier->end < entry->iova) in the best.

Am I right about this? I understand that other notifiers (you mention
vfio) need the granularity, but would a check in some vtd_* function
help with performance? (Not suggesting to introduce it in this patch
series.)

Thank you very much.

> Though we'll need to modify vtd_process_device_iotlb_desc() to only send
> notifications to the notifiers that registered with DEV_IOTLB flag.
>
> Hope it helps..
>
> Thanks,
>
> --
> Peter Xu
>




* Re: [RFC v3 0/1] memory: Delete assertion in memory_region_unregister_iommu_notifier
  2020-08-12 14:33       ` Eugenio Perez Martin
@ 2020-08-12 21:12         ` Peter Xu
  0 siblings, 0 replies; 68+ messages in thread
From: Peter Xu @ 2020-08-12 21:12 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Peter Maydell, Yan Zhao, Juan Quintela, Jason Wang,
	Michael S. Tsirkin, qemu-devel, Eric Auger, Avi Kivity,
	Paolo Bonzini

On Wed, Aug 12, 2020 at 04:33:24PM +0200, Eugenio Perez Martin wrote:
> On Tue, Aug 11, 2020 at 9:28 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > Hi, Eugenio,
> >
> > On Tue, Aug 11, 2020 at 08:10:44PM +0200, Eugenio Perez Martin wrote:
> > > Using this patch as a reference, I'm having problems to understand:
> > >
> > > - I'm not sure that the flag name expresses clearly the notifier capability.
> >
> > The old code is kind of messed up for dev-iotlb invalidations, by always
> > sending UNMAP notifications for both iotlb and dev-iotlb invalidations.
> >
> > Now if we introduce the new DEV_IOTLB type, we can separate the two:
> >
> >   - We notify IOMMU_NOTIFIER_UNMAP for iotlb invalidations
> >
> >   - We notify IOMMU_NOTIFIER_DEV_IOTLB for dev-iotlb invalidations
> >
> > Vhost should always be with ats=on when vIOMMU enabled, so it will enable
> > device iotlb.  Then it does not need iotlb (UNMAP) invalidation any more
> > because they'll normally be duplicated (one is to invalidate vIOMMU cache, one
> > is to invalidate device cache).
> >
> > Also, we can drop UNMAP type for vhost if we introduce DEV_IOTLB.  It works
> > just like on the real hardwares - the device won't be able to receive iotlb
> > invalidation messages, but only device iotlb invalidation messages.  Here
> > vhost (or, virtio-pci) is the device.
> >
> > > - What would be the advantages of using another field (NotifierType?)
> > > in the notifier to express that it accepts arbitrary ranges for
> > > unmapping? (If I understood correctly Jason's proposal)
> >
> > (Please refer to above too..)
> >
> > > - Is it possible (or advisable) to skip all the page splitting in
> > > vtd_page_walk if the memory range notifier supports these arbitrary
> > > ranges? What would be the disadvantages? (Maybe in a future patch). It
> > > seems it is advisable to me, but I would like to double confirm.
> >
> > vtd_page_walk is not used for dev-iotlb, we don't need to change that.  We also
> > want to explicitly keep the page granularity of vtd_page_walk for the other
> > IOMMU notifiers, e.g. vfio.
> >
> 
> I'm not sure if I'm understanding it.
> 
> I have here a backtrace in a regular call (not [0,~0ULL]):
> #0  0x000055555597ca63 in memory_region_notify_one_iommu
> (notifier=0x7fffe4976f08, entry=0x7fffddff9d20)
>     at /home/qemu/softmmu/memory.c:1895
> #1  0x000055555597cc87 in memory_region_notify_iommu
> (iommu_mr=0x55555728f2e0, iommu_idx=0, entry=...) at
> /home/qemu/softmmu/memory.c:1938
> #2  0x000055555594b95a in vtd_sync_shadow_page_hook
> (entry=0x7fffddff9e70, private=0x55555728f2e0) at
> /home/qemu/hw/i386/intel_iommu.c:1436
> #3  0x000055555594af7b in vtd_page_walk_one (entry=0x7fffddff9e70,
> info=0x7fffddffa140) at /home/qemu/hw/i386/intel_iommu.c:1173
> #4  0x000055555594b2b3 in vtd_page_walk_level
>     (addr=10531758080, start=4292870144, end=4294967296, level=1,
> read=true, write=true, info=0x7fffddffa140)
>     at /home/qemu/hw/i386/intel_iommu.c:1254
> #5  0x000055555594b225 in vtd_page_walk_level
>     (addr=10530922496, start=3221225472, end=4294967296, level=2,
> read=true, write=true, info=0x7fffddffa140)
>     at /home/qemu/hw/i386/intel_iommu.c:1236
> #6  0x000055555594b225 in vtd_page_walk_level
>     (addr=10529021952, start=0, end=549755813888, level=3, read=true,
> write=true, info=0x7fffddffa140)
>     at /home/qemu/hw/i386/intel_iommu.c:1236
> #7  0x000055555594b3f8 in vtd_page_walk (s=0x555557565210,
> ce=0x7fffddffa1a0, start=0, end=549755813888, info=0x7fffddffa140)
>     at /home/qemu/hw/i386/intel_iommu.c:1293
> #8  0x000055555594ba77 in vtd_sync_shadow_page_table_range
> (vtd_as=0x55555728f270, ce=0x7fffddffa1a0, addr=0,
> size=18446744073709551615)
>     at /home/qemu/hw/i386/intel_iommu.c:1467
> #9  0x000055555594bb50 in vtd_sync_shadow_page_table
> (vtd_as=0x55555728f270) at /home/qemu/hw/i386/intel_iommu.c:1498
> #10 0x000055555594cc5f in vtd_iotlb_domain_invalidate
> (s=0x555557565210, domain_id=3) at
> /home/qemu/hw/i386/intel_iommu.c:1965
> #11 0x000055555594dbae in vtd_process_iotlb_desc (s=0x555557565210,
> inv_desc=0x7fffddffa2b0) at /home/qemu/hw/i386/intel_iommu.c:2371
> #12 0x000055555594dfd3 in vtd_process_inv_desc (s=0x555557565210) at
> /home/qemu/hw/i386/intel_iommu.c:2499
> #13 0x000055555594e1e9 in vtd_fetch_inv_desc (s=0x555557565210) at
> /home/qemu/hw/i386/intel_iommu.c:2568
> #14 0x000055555594e330 in vtd_handle_iqt_write (s=0x555557565210) at
> /home/qemu/hw/i386/intel_iommu.c:2595
> #15 0x000055555594ed90 in vtd_mem_write (opaque=0x555557565210,
> addr=136, val=1888, size=4) at /home/qemu/hw/i386/intel_iommu.c:2842
> #16 0x00005555559787b9 in memory_region_write_accessor
>     (mr=0x555557565540, addr=136, value=0x7fffddffa478, size=4,
> shift=0, mask=4294967295, attrs=...) at
> /home/qemu/softmmu/memory.c:483
> #17 0x00005555559789d7 in access_with_adjusted_size
>     (addr=136, value=0x7fffddffa478, size=4, access_size_min=4,
> access_size_max=8, access_fn=
>     0x5555559786da <memory_region_write_accessor>, mr=0x555557565540,
> attrs=...) at /home/qemu/softmmu/memory.c:544
> #18 0x000055555597b8a5 in memory_region_dispatch_write
> (mr=0x555557565540, addr=136, data=1888, op=MO_32, attrs=...)
>     at /home/qemu/softmmu/memory.c:1465
> #19 0x000055555582b1bf in flatview_write_continue
>     (fv=0x7fffc447c470, addr=4275634312, attrs=...,
> ptr=0x7ffff7dfd028, len=4, addr1=136, l=4, mr=0x555557565540) at
> /home/qemu/exec.c:3176
> #20 0x000055555582b304 in flatview_write (fv=0x7fffc447c470,
> addr=4275634312, attrs=..., buf=0x7ffff7dfd028, len=4)
>     at /home/qemu/exec.c:3216
> #21 0x000055555582b659 in address_space_write
>     (as=0x5555567a9380 <address_space_memory>, addr=4275634312,
> attrs=..., buf=0x7ffff7dfd028, len=4) at /home/qemu/exec.c:3307
> #22 0x000055555582b6c6 in address_space_rw
>     (as=0x5555567a9380 <address_space_memory>, addr=4275634312,
> attrs=..., buf=0x7ffff7dfd028, len=4, is_write=true)
>     at /home/qemu/exec.c:3317
> #23 0x000055555588e3b8 in kvm_cpu_exec (cpu=0x555556bfe9f0) at
> /home/qemu/accel/kvm/kvm-all.c:2518
> #24 0x0000555555972bcf in qemu_kvm_cpu_thread_fn (arg=0x555556bfe9f0)
> at /home/qemu/softmmu/cpus.c:1188
> #25 0x0000555555e08fbd in qemu_thread_start (args=0x555556c24c60) at
> util/qemu-thread-posix.c:521
> #26 0x00007ffff55a714a in start_thread () at /lib64/libpthread.so.0
> #27 0x00007ffff52d8f23 in clone () at /lib64/libc.so.6
> 
> with entry = {target_as = 0x5555567a9380, iova = 0xfff0b000,
> translated_addr = 0x0, addr_mask = 0xfff, perm = 0x0}
> 
> Here we got 3 levels of vtd_page_walk (frames #4-#6). The #6 parameters are:
> 
> (addr=10529021952, start=0, end=549755813888, level=3, read=true, write=true,
>     info=0x7fffddffa140)
> 
> If I understand correctly, the while (iova < end) {} loop in
> vtd_page_walk will break the big range into small pages (4K because of
> level=1, and (end - iova) / subpage_size = 245 pages or iterations).
> That could be a lot of write(2) calls in vhost_kernel_send_device_iotlb_msg
> in the worst case, or a lot of useless returns in
> memory_region_notify_one because of (notifier->start > entry_end
> || notifier->end < entry->iova) in the best.
> 
> Am I right about this? I understand that other notifiers (you mention
> vfio) need the granularity, but would a check in some vtd_* function
> help with the performance? (not suggesting to introduce it in
> this patch series).

Yeah, I think you're right that we need to touch vtd_page_walk(), since
vtd_page_walk() should only notify MAP/UNMAP events, not DEV_IOTLB.
However, we don't need to touch the vtd_page_walk() internal logic any
further, so the page granularities will stay the same as before.
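The splitting discussed above can be sketched roughly as follows. The 4K/9-bit constants follow the VT-d page-table layout, but count_notifications() is a hypothetical helper for illustration, not the actual QEMU code:

```c
#include <stdint.h>

/* Rough sketch of how a vtd_page_walk()-style loop splits a large
 * invalidation range into per-page notifications at a given level.
 * Level 1 leaves cover 4K; each level above multiplies by 512. */
static uint64_t count_notifications(uint64_t start, uint64_t end,
                                    int level)
{
    uint64_t subpage_size = 1ULL << (12 + 9 * (level - 1));
    uint64_t iova = start & ~(subpage_size - 1);
    uint64_t count = 0;

    while (iova < end) {
        count++;             /* one hook call / notifier invocation */
        iova += subpage_size;
    }
    return count;
}
```

With frame #4's range (start=4292870144, end=4294967296) at level 1 this yields one notification per 4K page, which is the cost being questioned above.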

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC v3 1/1] memory: Skip bad range assertion if notifier supports arbitrary masks
  2020-08-12  8:49       ` Eugenio Perez Martin
@ 2020-08-18 14:24         ` Eugenio Perez Martin
  2020-08-19  7:15           ` Jason Wang
  0 siblings, 1 reply; 68+ messages in thread
From: Eugenio Perez Martin @ 2020-08-18 14:24 UTC (permalink / raw)
  To: Jason Wang
  Cc: Peter Maydell, Yan Zhao, Juan Quintela, Michael S. Tsirkin,
	qemu-devel, Peter Xu, Eric Auger, Avi Kivity, Paolo Bonzini

On Wed, Aug 12, 2020 at 10:49 AM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> On Wed, Aug 12, 2020 at 4:24 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> >
> > On 2020/8/12 上午1:55, Eugenio Pérez wrote:
> > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > > ---
> > >   hw/virtio/vhost.c     |  2 +-
> > >   include/exec/memory.h |  2 ++
> > >   softmmu/memory.c      | 10 ++++++++--
> > >   3 files changed, 11 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> > > index 1a1384e7a6..e74ad9e09b 100644
> > > --- a/hw/virtio/vhost.c
> > > +++ b/hw/virtio/vhost.c
> > > @@ -729,7 +729,7 @@ static void vhost_iommu_region_add(MemoryListener *listener,
> > >       iommu_idx = memory_region_iommu_attrs_to_index(iommu_mr,
> > >                                                      MEMTXATTRS_UNSPECIFIED);
> > >       iommu_notifier_init(&iommu->n, vhost_iommu_unmap_notify,
> > > -                        IOMMU_NOTIFIER_UNMAP,
> > > +                        IOMMU_NOTIFIER_UNMAP | IOMMU_NOTIFIER_IOTLB,
> >
> >
> > I think we can safely drop IOMMU_NOTIFIER_UNMAP here since device IOTLB
> > is sufficient.
> >
> > Btw, IOMMU_NOTIFIER_IOTLB is kind of confusing, maybe something like
> > IOMMU_NOTIFIER_DEVIOTLB.
> >
>
> Got it, will change.
>
> >
> > >                           section->offset_within_region,
> > >                           int128_get64(end),
> > >                           iommu_idx);
> > > diff --git a/include/exec/memory.h b/include/exec/memory.h
> > > index 307e527835..4d94c1e984 100644
> > > --- a/include/exec/memory.h
> > > +++ b/include/exec/memory.h
> > > @@ -87,6 +87,8 @@ typedef enum {
> > >       IOMMU_NOTIFIER_UNMAP = 0x1,
> > >       /* Notify entry changes (newly created entries) */
> > >       IOMMU_NOTIFIER_MAP = 0x2,
> > > +    /* Notify changes on IOTLB entries */
> > > +    IOMMU_NOTIFIER_IOTLB = 0x04,
> > >   } IOMMUNotifierFlag;
> > >
> > >   #define IOMMU_NOTIFIER_ALL (IOMMU_NOTIFIER_MAP | IOMMU_NOTIFIER_UNMAP)
> > > diff --git a/softmmu/memory.c b/softmmu/memory.c
> > > index af25987518..e2c5f6d0e7 100644
> > > --- a/softmmu/memory.c
> > > +++ b/softmmu/memory.c
> > > @@ -1895,6 +1895,7 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
> >
> >
> > (we probably need a better name of this function, at least something
> > like "memory_region_iommu_notify_one").
> >
>
> Ok will change.
>
> >
> > >   {
> > >       IOMMUNotifierFlag request_flags;
> > >       hwaddr entry_end = entry->iova + entry->addr_mask;
> > > +    IOMMUTLBEntry tmp = *entry;
> > >
> > >       /*
> > >        * Skip the notification if the notification does not overlap
> > > @@ -1904,7 +1905,12 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
> > >           return;
> > >       }
> > >
> > > -    assert(entry->iova >= notifier->start && entry_end <= notifier->end);
> > > +    if (notifier->notifier_flags & IOMMU_NOTIFIER_IOTLB) {
> > > +        tmp.iova = MAX(tmp.iova, notifier->start);
> > > +        tmp.addr_mask = MIN(entry_end, notifier->end) - tmp.iova;
> >
> >
> > Any reason for doing such re-calculation here, a comment would be helpful.
> >
>
> It was proposed by Peter, but I understand it as limiting the
> address+range we pass to the notifier. Although vhost seems to support
> it as long as it contains (notifier->start, notifier->end) in range, a
> future notifier might not.
>
> It could be done as iommu_entry_crop(IOMMUTLBEntry *entry, const
> IOMMUNotifier *notifier) though.
>
> >
> > > +    } else {
> > > +        assert(entry->iova >= notifier->start && entry_end <= notifier->end);
> >
> >
> > I wonder if it's better to convert the assert so some kind of log or
> > warn here.
> >
>
> I think that if we transform that assert into a log, we should also
> tell the guest that something went wrong. Or would it be enough to
> notify the bad range?
> 
> If a malicious guest cannot reach that point, leaving it as an
> assertion lets us detect the failure earlier, in my opinion
> (assert early and assert often).
>
> Thanks!
>

Hi Jason.

I just sent v4; please let me know if the changes are ok for you or
whether you think it should be done differently.

Is it ok for you to just leave the assert, or do you think we should
still change it to a conditional?

Thanks!
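The iommu_entry_crop() helper floated in the quoted discussion above could look roughly like this. The struct fields are reduced stand-ins for the QEMU types, and the helper name was only a suggestion in the thread, so treat this as a sketch rather than the actual implementation:

```c
#include <stdint.h>

#define MAX(a, b) ((a) > (b) ? (a) : (b))
#define MIN(a, b) ((a) < (b) ? (a) : (b))

typedef uint64_t hwaddr;

/* Minimal stand-ins for the QEMU types; only the fields the
 * clamping touches are kept. */
typedef struct {
    hwaddr iova;
    hwaddr addr_mask;   /* entry covers [iova, iova + addr_mask] */
} IOMMUTLBEntry;

typedef struct {
    hwaddr start;
    hwaddr end;
} IOMMUNotifier;

/* Clamp the entry to the notifier's registered [start, end] window,
 * mirroring the MAX/MIN re-calculation in the patch hunk above. */
static void iommu_entry_crop(IOMMUTLBEntry *entry,
                             const IOMMUNotifier *notifier)
{
    hwaddr entry_end = entry->iova + entry->addr_mask;

    entry->iova = MAX(entry->iova, notifier->start);
    entry->addr_mask = MIN(entry_end, notifier->end) - entry->iova;
}
```

This assumes the caller has already checked that the entry overlaps the notifier's range, as memory_region_notify_one() does before the assertion being discussed.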

> > Thanks
> >
> >
> > > +    }
> > >
> > >       if (entry->perm & IOMMU_RW) {
> > >           request_flags = IOMMU_NOTIFIER_MAP;
> > > @@ -1913,7 +1919,7 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
> > >       }
> > >
> > >       if (notifier->notifier_flags & request_flags) {
> > > -        notifier->notify(notifier, entry);
> > > +        notifier->notify(notifier, &tmp);
> > >       }
> > >   }
> > >
> >




* Re: [RFC v3 1/1] memory: Skip bad range assertion if notifier supports arbitrary masks
  2020-08-18 14:24         ` Eugenio Perez Martin
@ 2020-08-19  7:15           ` Jason Wang
  2020-08-19  8:22             ` Eugenio Perez Martin
  2020-08-19 15:50             ` Peter Xu
  0 siblings, 2 replies; 68+ messages in thread
From: Jason Wang @ 2020-08-19  7:15 UTC (permalink / raw)
  To: Eugenio Perez Martin, Peter Xu
  Cc: Peter Maydell, Yan Zhao, Juan Quintela, Michael S. Tsirkin,
	qemu-devel, Eric Auger, Avi Kivity, Paolo Bonzini


On 2020/8/18 下午10:24, Eugenio Perez Martin wrote:
> On Wed, Aug 12, 2020 at 10:49 AM Eugenio Perez Martin
> <eperezma@redhat.com>  wrote:
>> On Wed, Aug 12, 2020 at 4:24 AM Jason Wang<jasowang@redhat.com>  wrote:
>>> On 2020/8/12 上午1:55, Eugenio Pérez wrote:
>>>> Signed-off-by: Eugenio Pérez<eperezma@redhat.com>
>>>> ---
>>>>    hw/virtio/vhost.c     |  2 +-
>>>>    include/exec/memory.h |  2 ++
>>>>    softmmu/memory.c      | 10 ++++++++--
>>>>    3 files changed, 11 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
>>>> index 1a1384e7a6..e74ad9e09b 100644
>>>> --- a/hw/virtio/vhost.c
>>>> +++ b/hw/virtio/vhost.c
>>>> @@ -729,7 +729,7 @@ static void vhost_iommu_region_add(MemoryListener *listener,
>>>>        iommu_idx = memory_region_iommu_attrs_to_index(iommu_mr,
>>>>                                                       MEMTXATTRS_UNSPECIFIED);
>>>>        iommu_notifier_init(&iommu->n, vhost_iommu_unmap_notify,
>>>> -                        IOMMU_NOTIFIER_UNMAP,
>>>> +                        IOMMU_NOTIFIER_UNMAP | IOMMU_NOTIFIER_IOTLB,
>>> I think we can safely drop IOMMU_NOTIFIER_UNMAP here since device IOTLB
>>> is sufficient.
>>>
>>> Btw, IOMMU_NOTIFIER_IOTLB is kind of confusing, maybe something like
>>> IOMMU_NOTIFIER_DEVIOTLB.
>>>
>> Got it, will change.
>>
>>>>                            section->offset_within_region,
>>>>                            int128_get64(end),
>>>>                            iommu_idx);
>>>> diff --git a/include/exec/memory.h b/include/exec/memory.h
>>>> index 307e527835..4d94c1e984 100644
>>>> --- a/include/exec/memory.h
>>>> +++ b/include/exec/memory.h
>>>> @@ -87,6 +87,8 @@ typedef enum {
>>>>        IOMMU_NOTIFIER_UNMAP = 0x1,
>>>>        /* Notify entry changes (newly created entries) */
>>>>        IOMMU_NOTIFIER_MAP = 0x2,
>>>> +    /* Notify changes on IOTLB entries */
>>>> +    IOMMU_NOTIFIER_IOTLB = 0x04,
>>>>    } IOMMUNotifierFlag;
>>>>
>>>>    #define IOMMU_NOTIFIER_ALL (IOMMU_NOTIFIER_MAP | IOMMU_NOTIFIER_UNMAP)
>>>> diff --git a/softmmu/memory.c b/softmmu/memory.c
>>>> index af25987518..e2c5f6d0e7 100644
>>>> --- a/softmmu/memory.c
>>>> +++ b/softmmu/memory.c
>>>> @@ -1895,6 +1895,7 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
>>> (we probably need a better name of this function, at least something
>>> like "memory_region_iommu_notify_one").
>>>
>> Ok will change.
>>
>>>>    {
>>>>        IOMMUNotifierFlag request_flags;
>>>>        hwaddr entry_end = entry->iova + entry->addr_mask;
>>>> +    IOMMUTLBEntry tmp = *entry;
>>>>
>>>>        /*
>>>>         * Skip the notification if the notification does not overlap
>>>> @@ -1904,7 +1905,12 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
>>>>            return;
>>>>        }
>>>>
>>>> -    assert(entry->iova >= notifier->start && entry_end <= notifier->end);
>>>> +    if (notifier->notifier_flags & IOMMU_NOTIFIER_IOTLB) {
>>>> +        tmp.iova = MAX(tmp.iova, notifier->start);
>>>> +        tmp.addr_mask = MIN(entry_end, notifier->end) - tmp.iova;
>>> Any reason for doing such re-calculation here, a comment would be helpful.
>>>
>> It was proposed by Peter, but I understand as limiting the
>> address+range we pass to the notifier. Although vhost seems to support
>> it as long as it contains (notifier->start, notifier->end) in range, a
>> future notifier might not.


Yes, actually, I feel confused after reading the code. Is
notifier->start IOVA or GPA?

In vfio.c, we did:

         iommu_notifier_init(&giommu->n, vfio_iommu_map_notify,
                             IOMMU_NOTIFIER_ALL,
                             section->offset_within_region,
                             int128_get64(llend),
                             iommu_idx);

So it looks to me like the start and end are GPA, but the assertion
above checks it against IOVA, which seems to be wrong ....

Thanks


>>
>> It could be done as iommu_entry_crop(IOMMUTLBEntry *entry, const
>> IOMMUNotifier *notifier) though.




* Re: [RFC v3 1/1] memory: Skip bad range assertion if notifier supports arbitrary masks
  2020-08-19  7:15           ` Jason Wang
@ 2020-08-19  8:22             ` Eugenio Perez Martin
  2020-08-19  9:36               ` Jason Wang
  2020-08-19 15:50             ` Peter Xu
  1 sibling, 1 reply; 68+ messages in thread
From: Eugenio Perez Martin @ 2020-08-19  8:22 UTC (permalink / raw)
  To: Jason Wang
  Cc: Peter Maydell, Yan Zhao, Juan Quintela, Michael S. Tsirkin,
	qemu-devel, Peter Xu, Eric Auger, Avi Kivity, Paolo Bonzini

On Wed, Aug 19, 2020 at 9:15 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2020/8/18 下午10:24, Eugenio Perez Martin wrote:
> > On Wed, Aug 12, 2020 at 10:49 AM Eugenio Perez Martin
> > <eperezma@redhat.com>  wrote:
> >> On Wed, Aug 12, 2020 at 4:24 AM Jason Wang<jasowang@redhat.com>  wrote:
> >>> On 2020/8/12 上午1:55, Eugenio Pérez wrote:
> >>>> Signed-off-by: Eugenio Pérez<eperezma@redhat.com>
> >>>> ---
> >>>>    hw/virtio/vhost.c     |  2 +-
> >>>>    include/exec/memory.h |  2 ++
> >>>>    softmmu/memory.c      | 10 ++++++++--
> >>>>    3 files changed, 11 insertions(+), 3 deletions(-)
> >>>>
> >>>> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> >>>> index 1a1384e7a6..e74ad9e09b 100644
> >>>> --- a/hw/virtio/vhost.c
> >>>> +++ b/hw/virtio/vhost.c
> >>>> @@ -729,7 +729,7 @@ static void vhost_iommu_region_add(MemoryListener *listener,
> >>>>        iommu_idx = memory_region_iommu_attrs_to_index(iommu_mr,
> >>>>                                                       MEMTXATTRS_UNSPECIFIED);
> >>>>        iommu_notifier_init(&iommu->n, vhost_iommu_unmap_notify,
> >>>> -                        IOMMU_NOTIFIER_UNMAP,
> >>>> +                        IOMMU_NOTIFIER_UNMAP | IOMMU_NOTIFIER_IOTLB,
> >>> I think we can safely drop IOMMU_NOTIFIER_UNMAP here since device IOTLB
> >>> is sufficient.
> >>>
> >>> Btw, IOMMU_NOTIFIER_IOTLB is kind of confusing, maybe something like
> >>> IOMMU_NOTIFIER_DEVIOTLB.
> >>>
> >> Got it, will change.
> >>
> >>>>                            section->offset_within_region,
> >>>>                            int128_get64(end),
> >>>>                            iommu_idx);
> >>>> diff --git a/include/exec/memory.h b/include/exec/memory.h
> >>>> index 307e527835..4d94c1e984 100644
> >>>> --- a/include/exec/memory.h
> >>>> +++ b/include/exec/memory.h
> >>>> @@ -87,6 +87,8 @@ typedef enum {
> >>>>        IOMMU_NOTIFIER_UNMAP = 0x1,
> >>>>        /* Notify entry changes (newly created entries) */
> >>>>        IOMMU_NOTIFIER_MAP = 0x2,
> >>>> +    /* Notify changes on IOTLB entries */
> >>>> +    IOMMU_NOTIFIER_IOTLB = 0x04,
> >>>>    } IOMMUNotifierFlag;
> >>>>
> >>>>    #define IOMMU_NOTIFIER_ALL (IOMMU_NOTIFIER_MAP | IOMMU_NOTIFIER_UNMAP)
> >>>> diff --git a/softmmu/memory.c b/softmmu/memory.c
> >>>> index af25987518..e2c5f6d0e7 100644
> >>>> --- a/softmmu/memory.c
> >>>> +++ b/softmmu/memory.c
> >>>> @@ -1895,6 +1895,7 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
> >>> (we probably need a better name of this function, at least something
> >>> like "memory_region_iommu_notify_one").
> >>>
> >> Ok will change.
> >>
> >>>>    {
> >>>>        IOMMUNotifierFlag request_flags;
> >>>>        hwaddr entry_end = entry->iova + entry->addr_mask;
> >>>> +    IOMMUTLBEntry tmp = *entry;
> >>>>
> >>>>        /*
> >>>>         * Skip the notification if the notification does not overlap
> >>>> @@ -1904,7 +1905,12 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
> >>>>            return;
> >>>>        }
> >>>>
> >>>> -    assert(entry->iova >= notifier->start && entry_end <= notifier->end);
> >>>> +    if (notifier->notifier_flags & IOMMU_NOTIFIER_IOTLB) {
> >>>> +        tmp.iova = MAX(tmp.iova, notifier->start);
> >>>> +        tmp.addr_mask = MIN(entry_end, notifier->end) - tmp.iova;
> >>> Any reason for doing such re-calculation here, a comment would be helpful.
> >>>
> >> It was proposed by Peter, but I understand as limiting the
> >> address+range we pass to the notifier. Although vhost seems to support
> >> it as long as it contains (notifier->start, notifier->end) in range, a
> >> future notifier might not.
>
>
> Yes, actually, I feel confused after reading the codes. Is
> notifier->start IOVA or GPA?
>
> In vfio.c, we did:
>
>          iommu_notifier_init(&giommu->n, vfio_iommu_map_notify,
>                              IOMMU_NOTIFIER_ALL,
>                              section->offset_within_region,
>                              int128_get64(llend),
>                              iommu_idx);
>
> So it looks to me the start and end are GPA, but the assertion above
> check it against IOVA which seems to be wrong ....
>
> Thanks
>

I see.

I didn't go so deep; I just assumed that:
* all the addresses were GPA in the vhost-net+virtio-net case,
despite the name "iova" in IOMMUTLBEntry.
* the memory region was initialized with IOVA addresses in the VFIO case.

Maybe the comment should warn about the misleading "iova" name, if I'm right?

I assumed that nothing changed in the VFIO case, since its notifier has
no IOMMU_NOTIFIER_DEVIOTLB flag and so the new conditional in
memory_region_notify_one_iommu does not affect it, but I will test with
a device passthrough and DPDK again. Do you think another test would be
needed?

Maybe Peter can go deeper on this.

Thanks!

>
> >>
> >> It could be done as iommu_entry_crop(IOMMUTLBEntry *entry, const
> >> IOMMUNotifier *notifier) though.
>




* Re: [RFC v3 1/1] memory: Skip bad range assertion if notifier supports arbitrary masks
  2020-08-19  8:22             ` Eugenio Perez Martin
@ 2020-08-19  9:36               ` Jason Wang
  0 siblings, 0 replies; 68+ messages in thread
From: Jason Wang @ 2020-08-19  9:36 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Peter Maydell, Yan Zhao, Juan Quintela, Michael S. Tsirkin,
	qemu-devel, Peter Xu, Eric Auger, Avi Kivity, Paolo Bonzini


On 2020/8/19 下午4:22, Eugenio Perez Martin wrote:
> On Wed, Aug 19, 2020 at 9:15 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2020/8/18 下午10:24, Eugenio Perez Martin wrote:
>>> On Wed, Aug 12, 2020 at 10:49 AM Eugenio Perez Martin
>>> <eperezma@redhat.com>  wrote:
>>>> On Wed, Aug 12, 2020 at 4:24 AM Jason Wang<jasowang@redhat.com>  wrote:
>>>>> On 2020/8/12 上午1:55, Eugenio Pérez wrote:
>>>>>> Signed-off-by: Eugenio Pérez<eperezma@redhat.com>
>>>>>> ---
>>>>>>     hw/virtio/vhost.c     |  2 +-
>>>>>>     include/exec/memory.h |  2 ++
>>>>>>     softmmu/memory.c      | 10 ++++++++--
>>>>>>     3 files changed, 11 insertions(+), 3 deletions(-)
>>>>>>
>>>>>> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
>>>>>> index 1a1384e7a6..e74ad9e09b 100644
>>>>>> --- a/hw/virtio/vhost.c
>>>>>> +++ b/hw/virtio/vhost.c
>>>>>> @@ -729,7 +729,7 @@ static void vhost_iommu_region_add(MemoryListener *listener,
>>>>>>         iommu_idx = memory_region_iommu_attrs_to_index(iommu_mr,
>>>>>>                                                        MEMTXATTRS_UNSPECIFIED);
>>>>>>         iommu_notifier_init(&iommu->n, vhost_iommu_unmap_notify,
>>>>>> -                        IOMMU_NOTIFIER_UNMAP,
>>>>>> +                        IOMMU_NOTIFIER_UNMAP | IOMMU_NOTIFIER_IOTLB,
>>>>> I think we can safely drop IOMMU_NOTIFIER_UNMAP here since device IOTLB
>>>>> is sufficient.
>>>>>
>>>>> Btw, IOMMU_NOTIFIER_IOTLB is kind of confusing, maybe something like
>>>>> IOMMU_NOTIFIER_DEVIOTLB.
>>>>>
>>>> Got it, will change.
>>>>
>>>>>>                             section->offset_within_region,
>>>>>>                             int128_get64(end),
>>>>>>                             iommu_idx);
>>>>>> diff --git a/include/exec/memory.h b/include/exec/memory.h
>>>>>> index 307e527835..4d94c1e984 100644
>>>>>> --- a/include/exec/memory.h
>>>>>> +++ b/include/exec/memory.h
>>>>>> @@ -87,6 +87,8 @@ typedef enum {
>>>>>>         IOMMU_NOTIFIER_UNMAP = 0x1,
>>>>>>         /* Notify entry changes (newly created entries) */
>>>>>>         IOMMU_NOTIFIER_MAP = 0x2,
>>>>>> +    /* Notify changes on IOTLB entries */
>>>>>> +    IOMMU_NOTIFIER_IOTLB = 0x04,
>>>>>>     } IOMMUNotifierFlag;
>>>>>>
>>>>>>     #define IOMMU_NOTIFIER_ALL (IOMMU_NOTIFIER_MAP | IOMMU_NOTIFIER_UNMAP)
>>>>>> diff --git a/softmmu/memory.c b/softmmu/memory.c
>>>>>> index af25987518..e2c5f6d0e7 100644
>>>>>> --- a/softmmu/memory.c
>>>>>> +++ b/softmmu/memory.c
>>>>>> @@ -1895,6 +1895,7 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
>>>>> (we probably need a better name of this function, at least something
>>>>> like "memory_region_iommu_notify_one").
>>>>>
>>>> Ok will change.
>>>>
>>>>>>     {
>>>>>>         IOMMUNotifierFlag request_flags;
>>>>>>         hwaddr entry_end = entry->iova + entry->addr_mask;
>>>>>> +    IOMMUTLBEntry tmp = *entry;
>>>>>>
>>>>>>         /*
>>>>>>          * Skip the notification if the notification does not overlap
>>>>>> @@ -1904,7 +1905,12 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
>>>>>>             return;
>>>>>>         }
>>>>>>
>>>>>> -    assert(entry->iova >= notifier->start && entry_end <= notifier->end);
>>>>>> +    if (notifier->notifier_flags & IOMMU_NOTIFIER_IOTLB) {
>>>>>> +        tmp.iova = MAX(tmp.iova, notifier->start);
>>>>>> +        tmp.addr_mask = MIN(entry_end, notifier->end) - tmp.iova;
>>>>> Any reason for doing such re-calculation here, a comment would be helpful.
>>>>>
>>>> It was proposed by Peter, but I understand as limiting the
>>>> address+range we pass to the notifier. Although vhost seems to support
>>>> it as long as it contains (notifier->start, notifier->end) in range, a
>>>> future notifier might not.
>>
>> Yes, actually, I feel confused after reading the codes. Is
>> notifier->start IOVA or GPA?
>>
>> In vfio.c, we did:
>>
>>           iommu_notifier_init(&giommu->n, vfio_iommu_map_notify,
>>                               IOMMU_NOTIFIER_ALL,
>>                               section->offset_within_region,
>>                               int128_get64(llend),
>>                               iommu_idx);
>>
>> So it looks to me the start and end are GPA, but the assertion above
>> check it against IOVA which seems to be wrong ....
>>
>> Thanks
>>
> I see.
>
> I didn't go so deep, I just assumed that:
> * all the addresses were GPA in the vhost-net+virtio-net case,
> although the name iova in IOMMUTLBEntry.
> * memory region was initialized with IOVA addresses in case of VFIO.


If there's no vIOMMU, IOVA = GPA, so we're fine. But if vIOMMU is
enabled, IOVA allocation is done inside the guest, so the start/end here
are not IOVA anymore.


>
> Maybe the comment should warn about the bad "iova" name, if I'm right?
>
> I assumed that nothing changed in the VFIO case since its notifier has
> no IOMMU_NOTIFIER_DEVIOTLB flag and the new conditional in
> memory_region_notify_one_iommu, but I will test with a device
> passthrough and DPDK again. Do you think another test would be needed?


I'm not sure if it's easy, but it might be interesting to teach DPDK to
use an IOVA that is outside the range of [start, end] here.


>
> Maybe Peter can go deeper on this.


Yes, or wait for Peter's comment.

Thanks


>
> Thanks!
>
>>>> It could be done as iommu_entry_crop(IOMMUTLBEntry *entry, const
>>>> IOMMUNotifier *notifier) though.




* Re: [RFC v3 1/1] memory: Skip bad range assertion if notifier supports arbitrary masks
  2020-08-19  7:15           ` Jason Wang
  2020-08-19  8:22             ` Eugenio Perez Martin
@ 2020-08-19 15:50             ` Peter Xu
  2020-08-20  2:28               ` Jason Wang
  1 sibling, 1 reply; 68+ messages in thread
From: Peter Xu @ 2020-08-19 15:50 UTC (permalink / raw)
  To: Jason Wang
  Cc: Peter Maydell, Yan Zhao, Michael S. Tsirkin, Juan Quintela,
	qemu-devel, Eugenio Perez Martin, Avi Kivity, Eric Auger,
	Paolo Bonzini

On Wed, Aug 19, 2020 at 03:15:26PM +0800, Jason Wang wrote:
> Yes, actually, I feel confused after reading the codes. Is notifier->start
> IOVA or GPA?
> 
> In vfio.c, we did:
> 
>         iommu_notifier_init(&giommu->n, vfio_iommu_map_notify,
>                             IOMMU_NOTIFIER_ALL,
>                             section->offset_within_region,
>                             int128_get64(llend),
>                             iommu_idx);
> 
> So it looks to me the start and end are GPA, but the assertion above check
> it against IOVA which seems to be wrong ....

It should be iova; both section->offset_within_region and llend are for the
device's iova address space.  Thanks,

-- 
Peter Xu




* Re: [RFC v3 1/1] memory: Skip bad range assertion if notifier supports arbitrary masks
  2020-08-19 15:50             ` Peter Xu
@ 2020-08-20  2:28               ` Jason Wang
  2020-08-21 14:12                 ` Peter Xu
  0 siblings, 1 reply; 68+ messages in thread
From: Jason Wang @ 2020-08-20  2:28 UTC (permalink / raw)
  To: Peter Xu
  Cc: Peter Maydell, Yan Zhao, Michael S. Tsirkin, Juan Quintela,
	qemu-devel, Eugenio Perez Martin, Avi Kivity, Eric Auger,
	Paolo Bonzini


On 2020/8/19 下午11:50, Peter Xu wrote:
> On Wed, Aug 19, 2020 at 03:15:26PM +0800, Jason Wang wrote:
>> Yes, actually, I feel confused after reading the codes. Is notifier->start
>> IOVA or GPA?
>>
>> In vfio.c, we did:
>>
>>          iommu_notifier_init(&giommu->n, vfio_iommu_map_notify,
>>                              IOMMU_NOTIFIER_ALL,
>>                              section->offset_within_region,
>>                              int128_get64(llend),
>>                              iommu_idx);
>>
>> So it looks to me the start and end are GPA, but the assertion above check
>> it against IOVA which seems to be wrong ....
> It should be iova; both section->offset_within_region and llend are for the
> device's iova address space.  Thanks,
>

Interesting; how can the memory region know which IOVA is used by the
guest?

Thanks





* Re: [RFC v3 1/1] memory: Skip bad range assertion if notifier supports arbitrary masks
  2020-08-20  2:28               ` Jason Wang
@ 2020-08-21 14:12                 ` Peter Xu
  2020-09-01  3:05                   ` Jason Wang
  0 siblings, 1 reply; 68+ messages in thread
From: Peter Xu @ 2020-08-21 14:12 UTC (permalink / raw)
  To: Jason Wang
  Cc: Peter Maydell, Yan Zhao, Michael S. Tsirkin, Juan Quintela,
	qemu-devel, Eugenio Perez Martin, Avi Kivity, Eric Auger,
	Paolo Bonzini

On Thu, Aug 20, 2020 at 10:28:00AM +0800, Jason Wang wrote:
> 
> On 2020/8/19 下午11:50, Peter Xu wrote:
> > On Wed, Aug 19, 2020 at 03:15:26PM +0800, Jason Wang wrote:
> > > Yes, actually, I feel confused after reading the codes. Is notifier->start
> > > IOVA or GPA?
> > > 
> > > In vfio.c, we did:
> > > 
> > >          iommu_notifier_init(&giommu->n, vfio_iommu_map_notify,
> > >                              IOMMU_NOTIFIER_ALL,
> > >                              section->offset_within_region,
> > >                              int128_get64(llend),
> > >                              iommu_idx);
> > > 
> > > So it looks to me the start and end are GPA, but the assertion above check
> > > it against IOVA which seems to be wrong ....
> > It should be iova; both section->offset_within_region and llend are for the
> > device's iova address space.  Thanks,
> > 
> 
> Interesting, how can memory region know which IOVA is used by guest?

Does it need to know? :)

AFAICT what we do here is only register for the whole possible IOVA
address space (e.g., across the whole 64-bit address space).  Then vfio
will get notifications when there are new iova ranges mapped into it.
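The filtering that makes this work is the overlap check at the top of memory_region_notify_one(), which can be sketched as a standalone predicate (simplified, not the actual QEMU code):

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch of the overlap test used to decide whether an IOTLB entry
 * concerns a registered notifier window.  A notifier registered over
 * the whole 64-bit space overlaps every mapped range, which is why
 * registering [0, UINT64_MAX] still delivers each new mapping. */
static bool entry_overlaps_notifier(uint64_t entry_iova,
                                    uint64_t entry_addr_mask,
                                    uint64_t notifier_start,
                                    uint64_t notifier_end)
{
    uint64_t entry_end = entry_iova + entry_addr_mask;

    return !(notifier_start > entry_end || notifier_end < entry_iova);
}
```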

-- 
Peter Xu




* Re: [RFC v3 1/1] memory: Skip bad range assertion if notifier supports arbitrary masks
  2020-08-21 14:12                 ` Peter Xu
@ 2020-09-01  3:05                   ` Jason Wang
  2020-09-01 19:35                     ` Peter Xu
  0 siblings, 1 reply; 68+ messages in thread
From: Jason Wang @ 2020-09-01  3:05 UTC (permalink / raw)
  To: Peter Xu
  Cc: Peter Maydell, Yan Zhao, Michael S. Tsirkin, Juan Quintela,
	qemu-devel, Eugenio Perez Martin, Avi Kivity, Eric Auger,
	Paolo Bonzini


On 2020/8/21 下午10:12, Peter Xu wrote:
> On Thu, Aug 20, 2020 at 10:28:00AM +0800, Jason Wang wrote:
>> On 2020/8/19 下午11:50, Peter Xu wrote:
>>> On Wed, Aug 19, 2020 at 03:15:26PM +0800, Jason Wang wrote:
>>>> Yes, actually, I feel confused after reading the codes. Is notifier->start
>>>> IOVA or GPA?
>>>>
>>>> In vfio.c, we did:
>>>>
>>>>           iommu_notifier_init(&giommu->n, vfio_iommu_map_notify,
>>>>                               IOMMU_NOTIFIER_ALL,
>>>>                               section->offset_within_region,
>>>>                               int128_get64(llend),
>>>>                               iommu_idx);
>>>>
>>>> So it looks to me the start and end are GPA, but the assertion above check
>>>> it against IOVA which seems to be wrong ....
>>> It should be iova; both section->offset_within_region and llend are for the
>>> device's iova address space.  Thanks,
>>>
>> Interesting, how can memory region know which IOVA is used by guest?
> Does it need to know? :)
>
> AFAICT what we do here is only register with the whole possible IOVA address
> space (e.g., across the whole 64bit address space).  Then vfio will get
> notifications when there're new iova ranges mapped into it.


Right, but the whole IOVA address space should be something
vIOMMU-specific; e.g. for Intel it should be calculated from GAW, but I
found:

        memory_region_init_iommu(&vtd_dev_as->iommu,
                                 sizeof(vtd_dev_as->iommu),
                                 TYPE_INTEL_IOMMU_MEMORY_REGION, OBJECT(s),
                                 name, UINT64_MAX);

which assumes UINT64_MAX.

Thanks



>




* Re: [RFC v3 1/1] memory: Skip bad range assertion if notifier supports arbitrary masks
  2020-09-01  3:05                   ` Jason Wang
@ 2020-09-01 19:35                     ` Peter Xu
  2020-09-02  5:13                       ` Jason Wang
  0 siblings, 1 reply; 68+ messages in thread
From: Peter Xu @ 2020-09-01 19:35 UTC (permalink / raw)
  To: Jason Wang
  Cc: Peter Maydell, Yan Zhao, Michael S. Tsirkin, Juan Quintela,
	qemu-devel, Eugenio Perez Martin, Avi Kivity, Eric Auger,
	Paolo Bonzini

On Tue, Sep 01, 2020 at 11:05:18AM +0800, Jason Wang wrote:
> 
> On 2020/8/21 下午10:12, Peter Xu wrote:
> > On Thu, Aug 20, 2020 at 10:28:00AM +0800, Jason Wang wrote:
> > > On 2020/8/19 下午11:50, Peter Xu wrote:
> > > > On Wed, Aug 19, 2020 at 03:15:26PM +0800, Jason Wang wrote:
> > > > > Yes, actually, I feel confused after reading the code. Is notifier->start
> > > > > IOVA or GPA?
> > > > > 
> > > > > In vfio.c, we did:
> > > > > 
> > > > >           iommu_notifier_init(&giommu->n, vfio_iommu_map_notify,
> > > > >                               IOMMU_NOTIFIER_ALL,
> > > > >                               section->offset_within_region,
> > > > >                               int128_get64(llend),
> > > > >                               iommu_idx);
> > > > > 
> > > > > So it looks to me that start and end are GPA, but the assertion above checks
> > > > > it against IOVA, which seems to be wrong ....
> > > > It should be iova; both section->offset_within_region and llend are for the
> > > > device's iova address space.  Thanks,
> > > > 
> > > Interesting, how can the memory region know which IOVA is used by the guest?
> > Does it need to know? :)
> > 
> > AFAICT what we do here is only register with the whole possible IOVA address
> > space (e.g., across the whole 64bit address space).  Then vfio will get
> > notifications when there're new iova ranges mapped into it.
> 
> 
> Right, but the whole IOVA address space should be something vIOMMU-specific,
> e.g. for Intel it should be calculated from GAW, but I found:
> 
>         memory_region_init_iommu(&vtd_dev_as->iommu, sizeof(vtd_dev_as->iommu),
>                                  TYPE_INTEL_IOMMU_MEMORY_REGION, OBJECT(s),
>                                  name, UINT64_MAX);
> 
> which assumes UINT64_MAX.

Right.  AFAICT it can be reduced to gaw width, but I don't see a problem either
even with UINT64_MAX (as long as it covers the range specified by gaw).  Or did
I miss something?  Thanks,
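A rough sketch of why that holds (simplified; this is not QEMU's actual range check, and the type and function names are made up): a notifier registered over [0, UINT64_MAX] intersects every mapping a GAW-limited vIOMMU can produce:

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified model of a registered notifier range (inclusive bounds);
 * not QEMU's IOMMUNotifier. */
typedef struct {
    uint64_t start;
    uint64_t end;
} NotifierRange;

/* An event for [iova, iova + size - 1] is delivered iff it intersects
 * the registered range (size > 0 assumed, no overflow).  With
 * start = 0 and end = UINT64_MAX this is true for any mapping,
 * including everything below the GAW limit. */
static bool range_overlaps(const NotifierRange *n, uint64_t iova,
                           uint64_t size)
{
    uint64_t last = iova + size - 1;
    return iova <= n->end && last >= n->start;
}
```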

-- 
Peter Xu




* Re: [RFC v3 1/1] memory: Skip bad range assertion if notifier supports arbitrary masks
  2020-09-01 19:35                     ` Peter Xu
@ 2020-09-02  5:13                       ` Jason Wang
  0 siblings, 0 replies; 68+ messages in thread
From: Jason Wang @ 2020-09-02  5:13 UTC (permalink / raw)
  To: Peter Xu
  Cc: Peter Maydell, Yan Zhao, Juan Quintela, Michael S. Tsirkin,
	qemu-devel, Eugenio Perez Martin, Avi Kivity, Eric Auger,
	Paolo Bonzini


On 2020/9/2 3:35 AM, Peter Xu wrote:
> On Tue, Sep 01, 2020 at 11:05:18AM +0800, Jason Wang wrote:
>> On 2020/8/21 10:12 PM, Peter Xu wrote:
>>> On Thu, Aug 20, 2020 at 10:28:00AM +0800, Jason Wang wrote:
>>>> On 2020/8/19 11:50 PM, Peter Xu wrote:
>>>>> On Wed, Aug 19, 2020 at 03:15:26PM +0800, Jason Wang wrote:
>>>>>> Yes, actually, I feel confused after reading the code. Is notifier->start
>>>>>> IOVA or GPA?
>>>>>>
>>>>>> In vfio.c, we did:
>>>>>>
>>>>>>            iommu_notifier_init(&giommu->n, vfio_iommu_map_notify,
>>>>>>                                IOMMU_NOTIFIER_ALL,
>>>>>>                                section->offset_within_region,
>>>>>>                                int128_get64(llend),
>>>>>>                                iommu_idx);
>>>>>>
>>>>>> So it looks to me that start and end are GPA, but the assertion above checks
>>>>>> it against IOVA, which seems to be wrong ....
>>>>> It should be iova; both section->offset_within_region and llend are for the
>>>>> device's iova address space.  Thanks,
>>>>>
>>>> Interesting, how can the memory region know which IOVA is used by the guest?
>>> Does it need to know? :)
>>>
>>> AFAICT what we do here is only register with the whole possible IOVA address
>>> space (e.g., across the whole 64bit address space).  Then vfio will get
>>> notifications when there're new iova ranges mapped into it.
>>
>> Right, but the whole IOVA address space should be something vIOMMU-specific,
>> e.g. for Intel it should be calculated from GAW, but I found:
>>
>>         memory_region_init_iommu(&vtd_dev_as->iommu, sizeof(vtd_dev_as->iommu),
>>                                  TYPE_INTEL_IOMMU_MEMORY_REGION, OBJECT(s),
>>                                  name, UINT64_MAX);
>>
>> which assumes UINT64_MAX.
> Right.  AFAICT it can be reduced to gaw width, but I don't see a problem either
> even with UINT64_MAX (as long as it covers the range specified by gaw).  Or did
> I miss something?


Dunno :)

I just noticed this difference; for safety, maybe it's better to cap it with
GAW.

Btw, the naming of "vtd-ir" is kind of confusing, since it should work without IR.

Thanks


> Thanks,
>




end of thread, other threads:[~2020-09-02  5:14 UTC | newest]

Thread overview: 68+ messages
2020-06-26  6:41 [RFC v2 0/1] memory: Delete assertion in memory_region_unregister_iommu_notifier Eugenio Pérez
2020-06-26  6:41 ` [RFC v2 1/1] " Eugenio Pérez
2020-06-26 21:29   ` Peter Xu
2020-06-27  7:26     ` Yan Zhao
2020-06-27 12:57       ` Peter Xu
2020-06-28  1:36         ` Yan Zhao
2020-06-28  7:03     ` Jason Wang
2020-06-28 14:47       ` Peter Xu
2020-06-29  5:51         ` Jason Wang
2020-06-29 13:34           ` Peter Xu
2020-06-30  2:41             ` Jason Wang
2020-06-30  8:29               ` Jason Wang
2020-06-30  9:21                 ` Michael S. Tsirkin
2020-06-30  9:23                   ` Jason Wang
2020-06-30 15:20                     ` Peter Xu
2020-07-01  8:11                       ` Jason Wang
2020-07-01 12:16                         ` Peter Xu
2020-07-01 12:30                           ` Jason Wang
2020-07-01 12:41                             ` Peter Xu
2020-07-02  3:00                               ` Jason Wang
2020-06-30 15:39               ` Peter Xu
2020-07-01  8:09                 ` Jason Wang
2020-07-02  3:01                   ` Jason Wang
2020-07-02 15:45                     ` Peter Xu
2020-07-03  7:24                       ` Jason Wang
2020-07-03 13:03                         ` Peter Xu
2020-07-07  8:03                           ` Jason Wang
2020-07-07 19:54                             ` Peter Xu
2020-07-08  5:42                               ` Jason Wang
2020-07-08 14:16                                 ` Peter Xu
2020-07-09  5:58                                   ` Jason Wang
2020-07-09 14:10                                     ` Peter Xu
2020-07-10  6:34                                       ` Jason Wang
2020-07-10 13:30                                         ` Peter Xu
2020-07-13  4:04                                           ` Jason Wang
2020-07-16  1:00                                             ` Peter Xu
2020-07-16  2:54                                               ` Jason Wang
2020-07-17 14:18                                                 ` Peter Xu
2020-07-20  4:02                                                   ` Jason Wang
2020-07-20 13:03                                                     ` Peter Xu
2020-07-21  6:20                                                       ` Jason Wang
2020-07-21 15:10                                                         ` Peter Xu
2020-08-03 16:00                         ` Eugenio Pérez
2020-08-04 20:30                           ` Peter Xu
2020-08-05  5:45                             ` Jason Wang
2020-08-11 17:01     ` Eugenio Perez Martin
2020-08-11 17:10       ` Eugenio Perez Martin
2020-06-29 15:05 ` [RFC v2 0/1] " Paolo Bonzini
2020-07-03  7:39   ` Eugenio Perez Martin
2020-07-03 10:10     ` Paolo Bonzini
2020-08-11 17:55 ` [RFC v3 " Eugenio Pérez
2020-08-11 17:55   ` [RFC v3 1/1] memory: Skip bad range assertion if notifier supports arbitrary masks Eugenio Pérez
2020-08-12  2:24     ` Jason Wang
2020-08-12  8:49       ` Eugenio Perez Martin
2020-08-18 14:24         ` Eugenio Perez Martin
2020-08-19  7:15           ` Jason Wang
2020-08-19  8:22             ` Eugenio Perez Martin
2020-08-19  9:36               ` Jason Wang
2020-08-19 15:50             ` Peter Xu
2020-08-20  2:28               ` Jason Wang
2020-08-21 14:12                 ` Peter Xu
2020-09-01  3:05                   ` Jason Wang
2020-09-01 19:35                     ` Peter Xu
2020-09-02  5:13                       ` Jason Wang
2020-08-11 18:10   ` [RFC v3 0/1] memory: Delete assertion in memory_region_unregister_iommu_notifier Eugenio Perez Martin
2020-08-11 19:27     ` Peter Xu
2020-08-12 14:33       ` Eugenio Perez Martin
2020-08-12 21:12         ` Peter Xu
