* Bug report: VFIO map/unmap mem subject to race and DMA data goes to incorrect page (4.18.0)
@ 2022-03-25 20:06 Daniel F. Smith
  2022-03-25 22:10 ` Alex Williamson
  0 siblings, 1 reply; 5+ messages in thread
From: Daniel F. Smith @ 2022-03-25 20:06 UTC (permalink / raw)
  To: iommu

This email documents an insidious VFIO bug (incorrect data, no error or
warning) found when using the Intel IOMMU to perform DMA transfers, and
the associated workaround.

There may be security implications (unsure).

/sys/devices/virtual/iommu/dmar0/intel-iommu/version: 1:0
/sys/devices/virtual/iommu/dmar0/intel-iommu/cap: d2008c40660462
Linux xxxxx.ibm.com 4.18.0-348.20.1.el8_5.x86_64 #1 SMP Tue Mar 8 12:56:54 EST 2022 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 8.5 (Ootpa)

In our testing of VFIO DMA to an FPGA card in rootless mode, we discovered a
glitch where DMA data are transferred to/from the incorrect page.  It
appears to be timing-based.  Under some specific conditions the test could
trigger the bug every loop; sometimes the bug would only emerge after 20+ minutes
of testing.

Basics of test:
	Get memory with mmap(anonymous): size can change.
	VFIO_IOMMU_MAP_DMA with a block of memory, fixed IOVA.
	Fill memory with pattern.
	Do DMA transfer to FPGA from memory at IOVA.
	Do DMA transfer from FPGA to memory at IOVA+offset.
	Compare memory to ensure match.  Miscompare is bug.
	VFIO_IOMMU_UNMAP_DMA 
	munmap()
	Repeat.
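
For concreteness, here is a minimal C sketch of the loop above.  It is
not the actual test program: container/group setup and the FPGA-specific
descriptor/doorbell programming are omitted, and FIXED_IOVA, the fill
pattern, and the error handling are only illustrative.

#include <linux/vfio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

#define FIXED_IOVA 0x10000000ULL    /* placeholder fixed IOVA */

static int one_iteration(int container_fd, size_t size)
{
    void *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;

    struct vfio_iommu_type1_dma_map map = {
        .argsz = sizeof(map),
        .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
        .vaddr = (unsigned long)buf,
        .iova  = FIXED_IOVA,
        .size  = size,
    };
    if (ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map)) {
        munmap(buf, size);
        return -1;
    }

    memset(buf, 0xa5, size);    /* fill with pattern */

    /* ...tell the FPGA to DMA from FIXED_IOVA, then DMA back to
     * FIXED_IOVA + offset, and wait for completion (device specific)... */

    /* ...compare source and destination here; a miscompare is the bug... */

    struct vfio_iommu_type1_dma_unmap unmap = {
        .argsz = sizeof(unmap),
        .iova  = FIXED_IOVA,
        .size  = size,
    };
    ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, &unmap);
    munmap(buf, size);
    return 0;
}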

Using the fixed IOVA address* caused sporadic memory miscompares.  The
nature of the miscompares is that the received data was mixed with pages
that had been returned by mmap in a *previous* loop.

Workaround: Randomizing the IOVA eliminated the memory miscompares.

Hypothesis/conjecture: Possible race condition in UNMAP_DMA such that pages
can be released/munlocked *after* the MAP_DMA with the same IOVA has
occurred.

Suggestion: Document issue when using fixed IOVA, or fix if security is a
concern.

Daniel F. Smith
dfsmith@us.ibm.com

* We cannot use the physical page address for the IOVA since we are running
  without root, so /proc/pagemap is blanked out.  We also cannot use the VMA
  as the IOVA since MAP_DMA only permits us up to bit 39 in the IOVA.
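
For illustration only, the randomized IOVA from the workaround can be
picked as a page-aligned value inside that 39-bit window (a sketch; the
real test must also avoid colliding with IOVA ranges that are still
mapped):

#include <stdint.h>
#include <stdlib.h>

/* Random page-aligned IOVA below bit 39; purely illustrative. */
static uint64_t random_iova(void)
{
    return ((uint64_t)random() << 12) & ((1ULL << 39) - 1);
}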

VMA = virtual memory address (process space)
IOVA = IOV / IOMMU address

* Re: Bug report: VFIO map/unmap mem subject to race and DMA data goes to incorrect page (4.18.0)
  2022-03-25 20:06 Bug report: VFIO map/unmap mem subject to race and DMA data goes to incorrect page (4.18.0) Daniel F. Smith
@ 2022-03-25 22:10 ` Alex Williamson
  2022-03-28  9:01   ` Lu Baolu
  2022-03-28 19:14   ` Daniel F. Smith
  0 siblings, 2 replies; 5+ messages in thread
From: Alex Williamson @ 2022-03-25 22:10 UTC (permalink / raw)
  To: Daniel F. Smith; +Cc: iommu

Hi Daniel,

On Fri, 25 Mar 2022 13:06:40 -0700
"Daniel F. Smith" <dfsmith@us.ibm.com> wrote:

> This email is to document an insidious (incorrect data, no error or warning)
> VFIO bug found when using the Intel IOMMU to perform DMA transfers; and the
> associated workaround.
> 
> There may be security implications (unsure).
> 
> /sys/devices/virtual/iommu/dmar0/intel-iommu/version: 1:0
> /sys/devices/virtual/iommu/dmar0/intel-iommu/cap: d2008c40660462
> Linux xxxxx.ibm.com 4.18.0-348.20.1.el8_5.x86_64 #1 SMP Tue Mar 8 12:56:54 EST 2022 x86_64 x86_64 x86_64 GNU/Linux
> Red Hat Enterprise Linux release 8.5 (Ootpa)
> 
> In our testing of VFIO DMA to an FPGA card in rootless mode, we discovered a
> glitch where DMA data are transferred to/from the incorrect page.  It
> appears timing based.  Under some specific conditions the test could trigger
> the bug every loop.  Sometimes the bug would only emerge after 20+ minutes
> of testing.
> 
> Basics of test:
> 	Get memory with mmap(anonymous): size can change.
> 	VFIO_IOMMU_MAP_DMA with a block of memory, fixed IOVA.
> 	Fill memory with pattern.
> 	Do DMA transfer to FPGA from memory at IOVA.
> 	Do DMA transfer from FPGA to memory at IOVA+offset.
> 	Compare memory to ensure match.  Miscompare is bug.
> 	VFIO_IOMMU_UNMAP_DMA 
> 	unmap()
> 	Repeat.
> 
> Using the fixed IOVA address* caused sporadic memory miscompares.  The
> nature of the miscompares is that the received data was mixed with pages
> that had been returned by mmap in a *previous* loop.
> 
> Workaround: Randomizing the IOVA eliminated the memory miscompares.
> 
> Hypothesis/conjecture: Possible race condition in UNMAP_DMA such that pages
> can be released/munlocked *after* the MAP_DMA with the same IOVA has
> occurred.

Coherency possibly.

There's a possible coherency issue at the compare, depending on the
IOMMU capabilities, which could affect whether DMA is coherent to memory
or requires an explicit flush.  I'm a little suspicious whether dmar0 is
really the IOMMU controlling this device, since you mention a 39-bit
IOVA space; that's more typical of Intel client platforms, which can
also have integrated graphics with a dedicated IOMMU at dmar0 that isn't
necessarily representative of the other IOMMUs in the system, especially
with regard to snoop control.  Each dmar lists the devices it manages
under sysfs, so that can be verified.  Support for snoop control would
be identified in the ecap register rather than the cap register.  VFIO
can also report coherency via the VFIO_DMA_CC_IOMMU extension, queried
with the VFIO_CHECK_EXTENSION ioctl.
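
For example, the query is just (a sketch; container_fd is assumed to be
an already configured VFIO container file descriptor):

#include <linux/vfio.h>
#include <sys/ioctl.h>

/* Returns > 0 if the IOMMU backing this container enforces DMA
 * coherency (no-snoop transactions are overridden), 0 otherwise. */
static int iommu_enforces_coherency(int container_fd)
{
    return ioctl(container_fd, VFIO_CHECK_EXTENSION, VFIO_DMA_CC_IOMMU);
}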

However, a CPU coherency issue might lead to a miscompare, but not
necessarily one matching the previous iteration.  Still, for
completeness, let's make sure this isn't a gap where the test program
makes invalid assumptions about CPU/DMA coherency.

The fact that randomizing the IOVA provides a workaround, though, might
suggest something related to IOMMU page table coherency.  But for
the new mmap target to have the data from the previous iteration, the
IOMMU PTE would need to be stale on read, but correct on write in order
to land back in your new mmap.  That seems peculiar.  Are we sure the
FPGA device isn't caching the value at the IOVA or using any sort of
IOTLB caching such as ATS that might not be working correctly?

> Suggestion: Document issue when using fixed IOVA, or fix if security
> is a concern.

I don't know that there's enough information here to make any
conclusions.  Here are some further questions:

 * What size mappings are being used, both for the mmap and the VFIO
   MAP/UNMAP operations?

 * If the above is venturing into super page support (2MB), does the
   vfio_iommu_type1 module option disable_hugepages=1 affect the
   results?

 * Along the same lines, does the kernel command line option
   intel_iommu=sp_off produce different results?

 * Does this behavior also occur on upstream kernels (e.g. v5.17)?

 * Do additional CPU cache flushes in the test program produce different
   results?

 * Is this a consumer-available FPGA device that others might be able
   to use to reproduce this issue?  I've always wanted such a device for
   testing, but also we can't rule out that the FPGA itself or its
   programming is the source of the miscompare.

From the vfio perspective, UNMAP_DMA should first unmap the pages at
the IOMMU to prevent device access before unpinning the pages.  We do
make use of batch unmapping to reduce iotlb flushing, but the result is
expected to be that the IOMMU PTE entries are invalidated before the
UNMAP_DMA operation completes.  A stale IOVA mapping would be neither
expected nor correct operation.  Thanks,

Alex


* Re: Bug report: VFIO map/unmap mem subject to race and DMA data goes to incorrect page (4.18.0)
  2022-03-25 22:10 ` Alex Williamson
@ 2022-03-28  9:01   ` Lu Baolu
  2022-03-28 19:14   ` Daniel F. Smith
  1 sibling, 0 replies; 5+ messages in thread
From: Lu Baolu @ 2022-03-28  9:01 UTC (permalink / raw)
  To: Alex Williamson, Daniel F. Smith; +Cc: iommu

Hi Daniel,

On 2022/3/26 6:10, Alex Williamson wrote:
> Hi Daniel,
> 
> On Fri, 25 Mar 2022 13:06:40 -0700
> "Daniel F. Smith" <dfsmith@us.ibm.com> wrote:
> 
>> This email is to document an insidious (incorrect data, no error or warning)
>> VFIO bug found when using the Intel IOMMU to perform DMA transfers; and the
>> associated workaround.
>>
>> There may be security implications (unsure).
>>
>> /sys/devices/virtual/iommu/dmar0/intel-iommu/version: 1:0
>> /sys/devices/virtual/iommu/dmar0/intel-iommu/cap: d2008c40660462
>> Linux xxxxx.ibm.com 4.18.0-348.20.1.el8_5.x86_64 #1 SMP Tue Mar 8 12:56:54 EST 2022 x86_64 x86_64 x86_64 GNU/Linux
>> Red Hat Enterprise Linux release 8.5 (Ootpa)
>>
>> In our testing of VFIO DMA to an FPGA card in rootless mode, we discovered a
>> glitch where DMA data are transferred to/from the incorrect page.  It
>> appears timing based.  Under some specific conditions the test could trigger
>> the bug every loop.  Sometimes the bug would only emerge after 20+ minutes
>> of testing.
>>
>> Basics of test:
>> 	Get memory with mmap(anonymous): size can change.
>> 	VFIO_IOMMU_MAP_DMA with a block of memory, fixed IOVA.
>> 	Fill memory with pattern.
>> 	Do DMA transfer to FPGA from memory at IOVA.
>> 	Do DMA transfer from FPGA to memory at IOVA+offset.
>> 	Compare memory to ensure match.  Miscompare is bug.
>> 	VFIO_IOMMU_UNMAP_DMA
>> 	unmap()
>> 	Repeat.
>>
>> Using the fixed IOVA address* caused sporadic memory miscompares.  The
>> nature of the miscompares is that the received data was mixed with pages
>> that had been returned by mmap in a *previous* loop.
>>
>> Workaround: Randomizing the IOVA eliminated the memory miscompares.
>>
>> Hypothesis/conjecture: Possible race condition in UNMAP_DMA such that pages
>> can be released/munlocked *after* the MAP_DMA with the same IOVA has
>> occurred.
> 
> Coherency possibly.
> 
> There's a possible coherency issue at the compare depending on the
> IOMMU capabilities which could affect whether DMA is coherent to memory
> or requires an explicit flush.  I'm a little suspicious whether dmar0
> is really the IOMMU controlling this device since you mention a 39bit
> IOVA space, which is more typical of Intel client platforms which can
> also have integrated graphics which often have a dedicated IOMMU at
> dmar0 that isn't necessarily representative of the other IOMMUs in the
> system, especially with regard to snoop-control.  Each dmar lists the
> managed devices under it in sysfs to verify.  Support for snoop-control
> would be identified in the ecap register rather than the cap register.
> VFIO can also report coherency via the VFIO_DMA_CC_IOMMU extension
> reported by VFIO_CHECK_EXTENSION ioctl.
> 
> However, CPU coherency might lead to a miscompare, but not necessarily a
> miscompare matching the previous iteration.  Still, for completeness
> let's make sure this isn't a gap in the test programming making invalid
> assumptions about CPU/DMA coherency.
> 
> The fact that randomizing the IOVA provides a workaround though might
> suggest something relative to the IOMMU page table coherency.  But for
> the new mmap target to have the data from the previous iteration, the
> IOMMU PTE would need to be stale on read, but correct on write in order
> to land back in your new mmap.  That seems peculiar.  Are we sure the
> FPGA device isn't caching the value at the IOVA or using any sort of
> IOTLB caching such as ATS that might not be working correctly?
> 
>> Suggestion: Document issue when using fixed IOVA, or fix if security
>> is a concern.
> 
> I don't know that there's enough information here to make any
> conclusions.  Here are some further questions:
> 
>   * What size mappings are being used, both for the mmap and the VFIO
>     MAP/UNMAP operations.
> 
>   * If the above is venturing into super page support (2MB), does the
>     vfio_iommu_type1 module option disable_hugepages=1 affect the
>     results.
> 
>   * Along the same lines, does the kernel command line option
>     intel_iommu=sp_off produce different results.
> 
>   * Does this behavior also occur on upstream kernels (ie. v5.17)?
> 
>   * Do additional CPU cache flushes in the test program produce different
>     results?
> 
>   * Is this a consumer available FPGA device that others might be able
>     to reproduce this issue?  I've always wanted such a device for
>     testing, but also we can't rule out that the FPGA itself or its
>     programming is the source of the miscompare.
> 
>  From the vfio perspective, UNMAP_DMA should first unmap the pages at
> the IOMMU to prevent device access before unpinning the pages.  We do
> make use of batch unmapping to reduce iotlb flushing, but the result is
> expected to be that the IOMMU PTE entries are invalidated before the
> UNMAP_DMA operation completes.  A stale IOVA would not be expected or
> correct operation.  Thanks,
> 
> Alex
> 

As another suggestion, can you please try a patch posted here?

https://lore.kernel.org/linux-iommu/20220322063555.1422042-1-stevensd@google.com/

Best regards,
baolu

* Re: Bug report: VFIO map/unmap mem subject to race and DMA data goes to incorrect page (4.18.0)
  2022-03-25 22:10 ` Alex Williamson
  2022-03-28  9:01   ` Lu Baolu
@ 2022-03-28 19:14   ` Daniel F. Smith
  2022-03-28 23:05     ` Alex Williamson
  1 sibling, 1 reply; 5+ messages in thread
From: Daniel F. Smith @ 2022-03-28 19:14 UTC (permalink / raw)
  To: Alex Williamson; +Cc: iommu

Hi Alex,

Answers to the questions I can address are in-line.  First, an apology
though---the machine with the FPGA board is 1000 miles away, and I don't
have root access.  It's unlikely I will be able to do kernel patch testing.


Alex Williamson scribed the following, on or around Fri, Mar 25, 2022 at 04:10:22PM -0600:
> Hi Daniel,
> 
...
>
> Coherency possibly.
> 
> There's a possible coherency issue at the compare depending on the
> IOMMU capabilities which could affect whether DMA is coherent to memory
> or requires an explicit flush.  I'm a little suspicious whether dmar0
> is really the IOMMU controlling this device since you mention a 39bit
> IOVA space, which is more typical of Intel client platforms which can
> also have integrated graphics which often have a dedicated IOMMU at
> dmar0 that isn't necessarily representative of the other IOMMUs in the
> system, especially with regard to snoop-control.  Each dmar lists the
> managed devices under it in sysfs to verify.  Support for snoop-control
> would be identified in the ecap register rather than the cap register.
> VFIO can also report coherency via the VFIO_DMA_CC_IOMMU extension
> reported by VFIO_CHECK_EXTENSION ioctl.

$ cat /sys/devices/virtual/iommu/dmar0/intel-iommu/cap
d2008c40660462
$ cat /sys/devices/virtual/iommu/dmar0/intel-iommu/ecap
f050da
$ lscpu | grep Model
Model:               165
Model name:          Intel(R) Xeon(R) W-1290P CPU @ 3.70GHz
$ ls -l /sys/devices/virtual/iommu/dmar0/devices | wc -l
24
$ ... ioctl(container_fd, VFIO_CHECK_EXTENSION, VFIO_DMA_CC_IOMMU)
0

What are the implications of having no "IOMMU enforces DMA cache
coherence"?  On this machine there is no access to a PCIe bus analyzer, but
it's very unlikely that the TLPs would have NoSnoop set.

Is there a good way to tell which IOMMU I'm using?

(I did think it was strange that the IOMMU in this machine cannot handle
enough bits for mapping IOVA==VMA.  The test code is running in a podman
container, but (naively) I wouldn't expect that to make a difference.)

> However, CPU coherency might lead to a miscompare, but not necessarily a
> miscompare matching the previous iteration.  Still, for completeness
> let's make sure this isn't a gap in the test programming making invalid
> assumptions about CPU/DMA coherency.
> 
> The fact that randomizing the IOVA provides a workaround though might
> suggest something relative to the IOMMU page table coherency.  But for
> the new mmap target to have the data from the previous iteration, the
> IOMMU PTE would need to be stale on read, but correct on write in order
> to land back in your new mmap.  That seems peculiar.  Are we sure the
> FPGA device isn't caching the value at the IOVA or using any sort of
> IOTLB caching such as ATS that might not be working correctly?

I cannot say for certain what the FPGA caches, if anything.  The IP for that
part is closed (search for Xilinx PG302 QDMA).  It should (!) be
well-tested... oh for an analyzer!

> > Suggestion: Document issue when using fixed IOVA, or fix if security
> > is a concern.
> 
> I don't know that there's enough information here to make any
> conclusions.  Here are some further questions:
> 
>  * What size mappings are being used, both for the mmap and the VFIO
>    MAP/UNMAP operations.

The test would often fail when switching from an 8KB allocation to 12KB,
where the VMA would grow down by a page.  The mmap() always returned a
4KB-aligned VMA, and the requested mmap() size was always an exact number
of 4KB pages.  The VFIO map operations were always on the full extent of
the mmap'd memory (which likely makes Baolu's patch moot in this case).

A typical (not consistent) syndrome would be:
  1st page: ok
  2nd page: previous mmap'd data.
  3rd page: ok
We saw the issue on transfers both to and from the card.  I.e., we placed a
memory block in the FPGA that we could interrogate when data were corrupted.

(And as mentioned, just changing the IOVA fixed this issue.)

>  * If the above is venturing into super page support (2MB), does the
>    vfio_iommu_type1 module option disable_hugepages=1 affect the
>    results.

N/A.

>  * Along the same lines, does the kernel command line option
>    intel_iommu=sp_off produce different results.

Would this affect small pages?

>  * Does this behavior also occur on upstream kernels (ie. v5.17)?

Unknown, and (unfortunately) untestable at present.

>  * Do additional CPU cache flushes in the test program produce different
>    results?

We did a number of experiments using combinations of MAP_LOCKED, mlock(),
barrier(), and _mm_clflush().  They all affected the reliability of the test
(through timing?), but all ultimately failed.  I'm happy to try other
flushes that can be achieved in non-root user space!
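
For reference, the clflush experiments looked roughly like this (a
sketch; the 64-byte cache-line size is an assumption, and it is
x86-specific):

#include <emmintrin.h>
#include <stddef.h>

/* Flush a buffer from the CPU caches line by line, then fence. */
static void flush_buffer(const void *buf, size_t len)
{
    const char *p = buf;
    for (size_t i = 0; i < len; i += 64)
        _mm_clflush(p + i);
    _mm_mfence();
}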

>  * Is this a consumer available FPGA device that others might be able
>    to reproduce this issue?  I've always wanted such a device for
>    testing, but also we can't rule out that the FPGA itself or its
>    programming is the source of the miscompare.

https://www.xilinx.com/products/boards-and-kits/vcu118.html
Just don't look at the price too hard!

> From the vfio perspective, UNMAP_DMA should first unmap the pages at
> the IOMMU to prevent device access before unpinning the pages.  We do
> make use of batch unmapping to reduce iotlb flushing, but the result is
> expected to be that the IOMMU PTE entries are invalidated before the
> UNMAP_DMA operation completes.  A stale IOVA would not be expected or
> correct operation.  Thanks,
> 
> Alex

Thanks.

Daniel

* Re: Bug report: VFIO map/unmap mem subject to race and DMA data goes to incorrect page (4.18.0)
  2022-03-28 19:14   ` Daniel F. Smith
@ 2022-03-28 23:05     ` Alex Williamson
  0 siblings, 0 replies; 5+ messages in thread
From: Alex Williamson @ 2022-03-28 23:05 UTC (permalink / raw)
  To: Daniel F. Smith; +Cc: iommu

On Mon, 28 Mar 2022 12:14:51 -0700
"Daniel F. Smith" <dfsmith@us.ibm.com> wrote:

> Hi Alex,
> 
> Answers to questions I can answer are in-line.  First an apology
> though---the machine with the FPGA board is 1000 miles remote, and I don't
> have root access.  It's unlikely I will be able to do kernel patch testing.
> 
> 
> Alex Williamson scribed the following, on or around Fri, Mar 25, 2022 at 04:10:22PM -0600:
> > Hi Daniel,
> >   
> ...
> >
> > Coherency possibly.
> > 
> > There's a possible coherency issue at the compare depending on the
> > IOMMU capabilities which could affect whether DMA is coherent to memory
> > or requires an explicit flush.  I'm a little suspicious whether dmar0
> > is really the IOMMU controlling this device since you mention a 39bit
> > IOVA space, which is more typical of Intel client platforms which can
> > also have integrated graphics which often have a dedicated IOMMU at
> > dmar0 that isn't necessarily representative of the other IOMMUs in the
> > system, especially with regard to snoop-control.  Each dmar lists the
> > managed devices under it in sysfs to verify.  Support for snoop-control
> > would be identified in the ecap register rather than the cap register.
> > VFIO can also report coherency via the VFIO_DMA_CC_IOMMU extension
> > reported by VFIO_CHECK_EXTENSION ioctl.  
> 
> $ cat /sys/devices/virtual/iommu/dmar0/intel-iommu/cap
> d2008c40660462
> $ cat /sys/devices/virtual/iommu/dmar0/intel-iommu/ecap
> f050da
> $ lscpu | grep Model
> Model:               165
> Model name:          Intel(R) Xeon(R) W-1290P CPU @ 3.70GHz
> $ ls -l /sys/devices/virtual/iommu/dmar0/devices | wc -l
> 24
> $ ... ioctl(container_fd, VFIO_CHECK_EXTENSION, VFIO_DMA_CC_IOMMU)
> 0

Your ecap register reports bit 7 (Snoop Control) set, which should mean
that VT-d is enforcing coherency regardless of no-snoop transactions.
I suspect the different result from the ioctl could be due to testing
this extension before the IOMMU has been set for the container(?)
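
That is, something along these lines (a sketch; the group number and the
choice of VFIO_TYPE1v2_IOMMU are placeholders, and error handling is
omitted):

#include <fcntl.h>
#include <linux/vfio.h>
#include <sys/ioctl.h>

/* Attach the group and set the IOMMU type *before* asking about
 * VFIO_DMA_CC_IOMMU, so the answer reflects the IOMMU domain that
 * actually backs the container. */
static int check_cc_after_set_iommu(void)
{
    int container = open("/dev/vfio/vfio", O_RDWR);
    int group = open("/dev/vfio/26", O_RDWR);    /* placeholder group */

    ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
    ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1v2_IOMMU);

    return ioctl(container, VFIO_CHECK_EXTENSION, VFIO_DMA_CC_IOMMU);
}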

> What are the implications of having no "IOMMU enforces DMA cache
> coherence"?  On this machine there is no access to a PCIe bus analyzer, but
> it's very unlikely that the TLPs would have NoSnoop set.

There's also bit 11 (Enable No Snoop) that could be cleared in the PCI
device control register, which would theoretically prevent the device
from using no-snoop TLPs.
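
If it's useful, here's a sketch of how that bit might be poked from user
space through the vfio-pci config region (whether vfio-pci actually lets
an unprivileged user clear this bit is an assumption I haven't verified):

#include <linux/pci_regs.h>
#include <linux/vfio.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Clear Enable No Snoop (bit 11) in the PCIe Device Control register
 * via the vfio-pci config space region. */
static int clear_enable_no_snoop(int device_fd)
{
    struct vfio_region_info cfg = {
        .argsz = sizeof(cfg),
        .index = VFIO_PCI_CONFIG_REGION_INDEX,
    };
    uint8_t pos;

    if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &cfg))
        return -1;
    if (pread(device_fd, &pos, 1, cfg.offset + PCI_CAPABILITY_LIST) != 1)
        return -1;

    while (pos) {    /* walk the capability list to the PCIe capability */
        uint8_t id, next;
        uint16_t devctl;

        pread(device_fd, &id, 1, cfg.offset + pos);
        pread(device_fd, &next, 1, cfg.offset + pos + 1);
        if (id == PCI_CAP_ID_EXP) {
            pread(device_fd, &devctl, 2, cfg.offset + pos + PCI_EXP_DEVCTL);
            devctl &= ~PCI_EXP_DEVCTL_NOSNOOP_EN;
            pwrite(device_fd, &devctl, 2, cfg.offset + pos + PCI_EXP_DEVCTL);
            return 0;
        }
        pos = next;
    }
    return -1;
}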
 
> Is there a good way to tell which IOMMU I'm using?

Which DMAR?  Like this for example:

$ readlink -f /sys/bus/pci/devices/0000:04:00.0/iommu
/sys/devices/virtual/iommu/dmar3

Your listing of devices piped to wc would likewise include this device
in its output.  With 24 devices there's a fair chance that dmar0 is the
only one used.

> (I did think it was strange that the IOMMU in this machine cannot handle
> enough bits for mapping IOVA==VMA.  The test code is running in a podman
> container, but (naively) I wouldn't expect that to make a difference.)

Single-socket Xeon processors like this tend to have more in common
with consumer desktop processors than with the "Scalable" line of Xeons.

FWIW, there's a proposal[1] for a new, shared userspace IOMMU interface
that includes an option for the kernel to allocate IOVAs for these
cases.

[1]https://lore.kernel.org/all/0-v1-e79cd8d168e8+6-iommufd_jgg@nvidia.com/

> > However, CPU coherency might lead to a miscompare, but not necessarily a
> > miscompare matching the previous iteration.  Still, for completeness
> > let's make sure this isn't a gap in the test programming making invalid
> > assumptions about CPU/DMA coherency.
> > 
> > The fact that randomizing the IOVA provides a workaround though might
> > suggest something relative to the IOMMU page table coherency.  But for
> > the new mmap target to have the data from the previous iteration, the
> > IOMMU PTE would need to be stale on read, but correct on write in order
> > to land back in your new mmap.  That seems peculiar.  Are we sure the
> > FPGA device isn't caching the value at the IOVA or using any sort of
> > IOTLB caching such as ATS that might not be working correctly?  
> 
> I cannot say for certain what the FPGA caches, if anything.  The IP for that
> part is closed (search for Xilinx PG302 QDMA).  It should (!) be
> well-tested... oh for an analyzer!
> 
> > > Suggestion: Document issue when using fixed IOVA, or fix if security
> > > is a concern.  
> > 
> > I don't know that there's enough information here to make any
> > conclusions.  Here are some further questions:
> > 
> >  * What size mappings are being used, both for the mmap and the VFIO
> >    MAP/UNMAP operations.  
> 
> The test would often fail switching from an 8KB allocation to 12KB where the
> VMA would grow down by a page.  The mmap() always returned a 4KB aligned
> VMA, and the requested mmap() size was always an exact number of 4KB pages. 
> The VFIO map operations were always on the full extent of the mmap'd memory
> (likely makes Baulu's patch moot in this case).
> 
> A typical (not consistent) syndrome would be:
>   1st page: ok
>   2nd page: previous mmap'd data.
>   3rd page: ok
> We saw the issue on transfers both to and from the card.  I.e., we placed a
> memory block in the FPGA that we could interrogate when data were corrupted.

If we assume the previous mapping was for 8KB and the new mapping was
for 12KB, I might hypothesize that the extent of the IOTLB invalidation
when unmapping the 8KB mapping could have an off-by-one such that the
IOMMU has a stale entry for the 2nd page.  The 1st page would have been
invalidated correctly, and the behavior of the 3rd page might depend on
where it fell in the previous mapping, and otherwise on arbitrary
pressure on the IOTLB.

> (And as mentioned, just changing the IOVA fixed this issue.)

And that would also avoid a large number of IOTLB invalidation issues.

> >  * If the above is venturing into super page support (2MB), does the
> >    vfio_iommu_type1 module option disable_hugepages=1 affect the
> >    results.  
> 
> N/A.

Probably not, but I would be interested in whether the results are more
consistent if you were to call MAP_DMA for each page rather than for
the whole buffer.  This would result in UNMAP_DMA across the whole IOVA
range of the buffer making individual unmaps for each page from the
IOMMU, which may point to something like the hypothesis above.
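
Roughly like this (a sketch; 4KB pages are assumed and error handling is
omitted):

#include <linux/vfio.h>
#include <stddef.h>
#include <sys/ioctl.h>

/* Map the buffer with one MAP_DMA call per 4KB page instead of a single
 * call covering the whole range; a later UNMAP_DMA over the full IOVA
 * range then has to tear down each page mapping individually. */
static void map_per_page(int container_fd, void *buf, size_t size,
                         unsigned long long iova)
{
    for (size_t off = 0; off < size; off += 4096) {
        struct vfio_iommu_type1_dma_map map = {
            .argsz = sizeof(map),
            .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
            .vaddr = (unsigned long)buf + off,
            .iova  = iova + off,
            .size  = 4096,
        };
        ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
    }
}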
 
> >  * Along the same lines, does the kernel command line option
> >    intel_iommu=sp_off produce different results.  
> 
> Would this affect small pages?

Not likely.
 
> >  * Does this behavior also occur on upstream kernels (ie. v5.17)?  
> 
> Unknown, and (unfortunately) untestable at present.
> 
> >  * Do additional CPU cache flushes in the test program produce different
> >    results?  
> 
> We did a number of experiments using combinations of MAP_LOCKED, mlock(),
> barrier(), _mm_clflush().  They all affected reliability of the test
> (through timing?), but all ultimately failed.  I'm happy to try other
> flushes that can be achieved in non-root user space!
> 
> >  * Is this a consumer available FPGA device that others might be able
> >    to reproduce this issue?  I've always wanted such a device for
> >    testing, but also we can't rule out that the FPGA itself or its
> >    programming is the source of the miscompare.  
> 
> https://www.xilinx.com/products/boards-and-kits/vcu118.html
> Just don't look at the price too hard!

Yikes!  Thanks, and I'll be curious whether breaking down the MAP_DMA
calls gives us any further leads,

Alex
