* i.MX 6 and PCIe DMA issues
@ 2017-07-11 13:40 Moese, Michael
  2017-07-11 14:07 ` Andrew Lunn
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Moese, Michael @ 2017-07-11 13:40 UTC (permalink / raw)
  To: linux-arm-kernel

Hello ARM folks,
I turn to you in the hope that you have hints or directions on the problem I describe below.

I am currently investigating strange behavior of our i.MX 6 (Quad) board with an FPGA connected via PCI Express. This FPGA contains, among other things, an Ethernet (10/100 Mbps) IP core.
The Ethernet relies completely on DMA transfers. There is one buffer descriptor table containing pointers to 64 RX and TX buffers.
Buffers are allocated using dma_alloc_coherent() and mapped using dma_map_single().
Prior to access, dma_sync_single_for_cpu() is called on the memory regions, afterwards dma_sync_single_for_device().

If a new frame is received, the driver reads the RX buffer and passes the frame using skb_put().
When the issue was reported for an old (say 3.18.19) kernel, the buffer descriptor was read correctly (including a correct length), but the buffer contained all zeroes. When I map the physical address in userspace and dump the contents, I can see the correct buffer descriptor contents and, inside the buffers, valid Ethernet frames. So the DMA transfer itself is obviously working.
To avoid chasing after already-fixed bugs, I switched to a 4.12.0 kernel and observed almost the same behavior, but this time not even the length was read correctly.
On 3.18.19 I was able to read the buffers when I allocated them using kmalloc() instead of dma_alloc_coherent(); on 4.12 this did not have any impact.

I suspected the caches to be the root of my issue, but I was not able to resolve it with calls to flush_cache_all(), which I suppose should have invalidated the entire cache.

Unfortunately, our driver is legacy out-of-tree code and I started working on this driver to get it ready for submission. If it is of any help, I could send the code of the driver as well as our board's device tree.

I would highly appreciate any hint or direction that may help my troubleshooting.

Best Regards,
Michael


* i.MX 6 and PCIe DMA issues
  2017-07-11 13:40 i.MX 6 and PCIe DMA issues Moese, Michael
@ 2017-07-11 14:07 ` Andrew Lunn
  2017-07-11 14:50 ` Robin Murphy
  2017-07-11 17:56 ` Andrew Lunn
  2 siblings, 0 replies; 10+ messages in thread
From: Andrew Lunn @ 2017-07-11 14:07 UTC (permalink / raw)
  To: linux-arm-kernel

> I suspected the caches to be the root of my issue, but I was not able
> to resolve it with calls to flush_cache_all(), which I suppose should
> have invalidated the entire cache.

Flush and invalidate are different operations. flush_cache_all() will
not invalidate.

How are your RX buffers aligned? It is good to ensure that your
buffers don't share cache lines. What could be happening is that
reading the tail end of one buffer also brings the head of the next
buffer into the cache, if they share cache lines.

Also, ARMv7 processors can do speculative reads, so you must get your
dma_sync_single_for_cpu() and dma_sync_single_for_device() correct.
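
For illustration only, a rough and untested sketch of what I mean for
a single RX buffer (all of the names here are made up, not from your
driver):

#include <linux/kernel.h>
#include <linux/dma-mapping.h>
#include <linux/slab.h>

static void *rx_buf;
static dma_addr_t rx_dma;
static size_t rx_len;

static int example_rx_alloc(struct device *dev, size_t frame_size)
{
	/* Round up so nothing else shares the buffer's cache lines. */
	rx_len = ALIGN(frame_size, dma_get_cache_alignment());
	rx_buf = kmalloc(rx_len, GFP_KERNEL);
	if (!rx_buf)
		return -ENOMEM;

	rx_dma = dma_map_single(dev, rx_buf, rx_len, DMA_FROM_DEVICE);
	if (dma_mapping_error(dev, rx_dma)) {
		kfree(rx_buf);
		return -ENOMEM;
	}
	return 0;
}

static void example_rx_poll(struct device *dev)
{
	/* Hand the buffer to the CPU before looking at it... */
	dma_sync_single_for_cpu(dev, rx_dma, rx_len, DMA_FROM_DEVICE);

	/* ... copy the frame out of rx_buf here ... */

	/* ... and hand it back to the device before re-arming the
	 * descriptor. */
	dma_sync_single_for_device(dev, rx_dma, rx_len, DMA_FROM_DEVICE);
}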

    Andrew


* i.MX 6 and PCIe DMA issues
  2017-07-11 13:40 i.MX 6 and PCIe DMA issues Moese, Michael
  2017-07-11 14:07 ` Andrew Lunn
@ 2017-07-11 14:50 ` Robin Murphy
  2017-07-13  7:07   ` michael.moese at men.de
  2017-07-11 17:56 ` Andrew Lunn
  2 siblings, 1 reply; 10+ messages in thread
From: Robin Murphy @ 2017-07-11 14:50 UTC (permalink / raw)
  To: linux-arm-kernel

On 11/07/17 14:40, Moese, Michael wrote:
> Hello ARM folks, I turn to you in the hope that you have hints or
> directions on the problem I describe below.
> 
> I am currently investigating strange behavior of our i.MX 6 (Quad)
> board with an FPGA connected via PCI Express. This FPGA contains,
> among other things, an Ethernet (10/100 Mbps) IP core. The Ethernet
> relies completely on DMA transfers. There is one buffer descriptor
> table containing pointers to 64 RX and TX buffers. Buffers are
> allocated using dma_alloc_coherent() and mapped using
> dma_map_single(). Prior

I don't much like the sound of that "and" there - coherent DMA
allocations are, as the name implies, already coherent for CPU and DMA
accesses, and require no maintenance; the streaming DMA API
(dma_{map,unmap,sync}_*) on the other hand is *only* for use on
kmalloced memory.

I'm far more of a DMA guy than a networking guy, so I don't know much
about the details of ethernet drivers, but typically, long-lived things
like descriptor rings would usually use coherent allocations, whilst
streaming DMA is used for the transient mapping/unmapping of individual
skbs.
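
To illustrate, a rough and untested sketch of that usual split (the
structure names, descriptor layout and sizes below are invented for
the example, not taken from any real driver):

#include <linux/kernel.h>
#include <linux/dma-mapping.h>
#include <linux/skbuff.h>

#define RX_RING_SIZE	64
#define RX_BUF_SIZE	1536

struct foo_desc {
	__le32 addr;
	__le32 len_flags;
};

struct foo_priv {
	struct device *dev;
	struct foo_desc *ring;			/* CPU pointer */
	dma_addr_t ring_dma;			/* what the device gets */
	struct sk_buff *rx_skb[RX_RING_SIZE];
	dma_addr_t rx_dma[RX_RING_SIZE];
};

static int foo_rx_init(struct foo_priv *p)
{
	int i;

	/* Descriptor ring: coherent, never needs dma_sync_*(). */
	p->ring = dma_alloc_coherent(p->dev,
				     RX_RING_SIZE * sizeof(*p->ring),
				     &p->ring_dma, GFP_KERNEL);
	if (!p->ring)
		return -ENOMEM;

	for (i = 0; i < RX_RING_SIZE; i++) {
		/* Packet buffers: streaming mappings of the skb data. */
		struct sk_buff *skb = dev_alloc_skb(RX_BUF_SIZE);

		if (!skb)
			return -ENOMEM;
		p->rx_dma[i] = dma_map_single(p->dev, skb->data,
					      RX_BUF_SIZE, DMA_FROM_DEVICE);
		if (dma_mapping_error(p->dev, p->rx_dma[i])) {
			dev_kfree_skb(skb);
			return -ENOMEM;
		}
		p->rx_skb[i] = skb;
		p->ring[i].addr = cpu_to_le32(lower_32_bits(p->rx_dma[i]));
	}
	return 0;
}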

> to access, dma_sync_single_for_cpu() is called on the memory regions,
> afterwards dma_sync_single_for_device().
> 
> If a new frame is received, the driver reads the RX buffer and passes
> the frame using skb_put(). When the issue was reported for an old
> (say 3.18.19) kernel, the buffer descriptor was read correctly
> (including a correct length), but the buffer contained all zeroes.
> When I map the physical address in userspace and dump the contents, I
> can see the correct buffer descriptor contents and, inside the
> buffers, valid Ethernet frames. So the DMA transfer itself is
> obviously working. To avoid chasing after already-fixed bugs, I
> switched to a 4.12.0 kernel and observed almost the same behavior,
> but this time not even the length was read correctly. On 3.18.19 I
> was able to read the buffers when I allocated them using kmalloc()
> instead of dma_alloc_coherent(); on 4.12 this did not have any
> impact.
> 
> I suspected the caches to be the root of my issue, but I was not able
> to resolve it with calls to flush_cache_all(), which I suppose should
> have invalidated the entire cache.

i.MX6Q has a PL310 L2 outer cache, which brings with it a whole load of
extra fun, but the primary effect is that it'll be extremely hard to
bodge things if the DMA API usage is incorrect in the first place.
AFAICS flush_cache_all() on a v7 CPU performs set/way operations on the
CPU caches, which means a) on that system it will only affect L1, and b)
it shouldn't really be used from an SMP context with the MMU on anyway.

The PL310 does have more than its fair share of wackiness, but unless
you also see DMA going wrong for the on-chip peripherals, the problem is
almost certainly down to the driver itself rather than the cache
configuration.

Robin.

> Unfortunately, our driver is legacy out-of-tree code and I started
> working on this driver to get it ready for submission. If it is of
> any help, I could send the code of the driver as well as our board's
> device tree.
> 
> I would highly appreciate any hint or direction that may help my
> troubleshooting.
> 
> Best Regards, Michael


* i.MX 6 and PCIe DMA issues
  2017-07-11 13:40 i.MX 6 and PCIe DMA issues Moese, Michael
  2017-07-11 14:07 ` Andrew Lunn
  2017-07-11 14:50 ` Robin Murphy
@ 2017-07-11 17:56 ` Andrew Lunn
  2 siblings, 0 replies; 10+ messages in thread
From: Andrew Lunn @ 2017-07-11 17:56 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Jul 11, 2017 at 01:40:57PM +0000, Moese, Michael wrote:
> Hello ARM folks,
> I turn to you in the hope that you have hints or directions on the problem I describe below.
> 
> I am currently investigating strange behavior of our i.MX 6 (Quad) board with an FPGA connected via PCI Express. This FPGA contains, among other things, an Ethernet (10/100 Mbps) IP core.

FYI: I'm using an i.MX6 Quad with an Intel i210 PCIe Ethernet
controller. It works fine, so you might want to look at how the igb
driver does its buffer handling.

       Andrew


* i.MX 6 and PCIe DMA issues
  2017-07-11 14:50 ` Robin Murphy
@ 2017-07-13  7:07   ` michael.moese at men.de
  2017-07-13  9:04     ` Russell King - ARM Linux
  2017-07-13 14:57     ` Robin Murphy
  0 siblings, 2 replies; 10+ messages in thread
From: michael.moese at men.de @ 2017-07-13  7:07 UTC (permalink / raw)
  To: linux-arm-kernel



On Tue, Jul 11, 2017 at 03:50:19PM +0100, Robin Murphy wrote:
> I don't much like the sound of that "and" there - coherent DMA
> allocations are, as the name implies, already coherent for CPU and DMA
> accesses, and require no maintenance; the streaming DMA API
> (dma_{map,unmap,sync}_*) on the other hand is *only* for use on
> kmalloced memory.
OK, I hope I have this right: I allocate my memory using
dma_alloc_coherent() once, pass the dma_handle to the device, and no
other dma_map_*() or dma_sync_*() calls are needed?

Yesterday I observed some strange behavior when I added some debug
prints to the driver, printing the result of phys_to_virt() along with
my virtual address and the dma_handle. I know these may be different,
but when I dump the memory (from userspace using mmap), I can see the
data at the address of my dma_handle, while the memory the driver
holds the pointer to has different contents. I don't understand this.

 
> The PL310 does have more than its fair share of wackiness, but unless
> you also see DMA going wrong for the on-chip peripherals, the problem is
> almost certainly down to the driver itself rather than the cache
> configuration.
Well, I think I need to dive into this as well; my former co-worker
disabled DMA for SPI, for example, in the device tree. That may be
another hint. I will need to find out what is set up here.


Thank you for your advice, I will try to find out more.

Michael


* i.MX 6 and PCIe DMA issues
  2017-07-13  7:07   ` michael.moese at men.de
@ 2017-07-13  9:04     ` Russell King - ARM Linux
  2017-07-13 10:00       ` michael.moese at men.de
  2017-07-13 14:57     ` Robin Murphy
  1 sibling, 1 reply; 10+ messages in thread
From: Russell King - ARM Linux @ 2017-07-13  9:04 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Jul 13, 2017 at 09:07:19AM +0200, michael.moese at men.de wrote:
> OK, I hope I have this right: I allocate my memory using
> dma_alloc_coherent() once, pass the dma_handle to the device, and no
> other dma_map_*() or dma_sync_*() calls are needed?

Correct.

> Yesterday I observed some strange behavior when I added some debug
> prints to the driver, printing the result of phys_to_virt() along
> with my virtual address and the dma_handle. I know these may be
> different, but when I dump the memory (from userspace using mmap), I
> can see the data at the address of my dma_handle, while the memory
> the driver holds the pointer to has different contents. I don't
> understand this.

phys_to_virt() and virt_to_phys() are only defined for the linear map
region.

DMA coherent allocations may or may not return addresses in the linear
region, which means virt_to_phys() on a virtual address returned from
dma_alloc_coherent() is not defined.

DMA addresses are not physical addresses, they are device addresses -
this is an important distinction when IOMMUs or address translators are
in the path between the device and memory.  phys_to_virt() must not be
used with dma_addr_t data types.

You must always use the two addresses returned from dma_alloc_coherent()
and never try to translate them - the DMA address is for programming
into the hardware, and the virtual address for CPU accesses.  If you do
try to pass these to the various translators (like virt_to_phys()) the
result is undefined.

Moreover, passing the dma_addr_t returned from dma_alloc_coherent() to
the dma_sync_*() functions also results in undefined behaviour, and
potential data corruption.
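
Purely as an untested illustration (the register offset and structure
names below are made up), the pattern looks like this:

#include <linux/kernel.h>
#include <linux/dma-mapping.h>
#include <linux/io.h>
#include <linux/string.h>

#define FOO_BD_BASE_REG		0x10	/* made-up register offset */

struct foo_eth {
	struct device *dev;
	void __iomem *regs;
	void *bd_cpu;		/* CPU accesses go through this */
	dma_addr_t bd_dma;	/* only ever given to the hardware */
};

static int foo_alloc_bd_table(struct foo_eth *fe, size_t size)
{
	fe->bd_cpu = dma_alloc_coherent(fe->dev, size, &fe->bd_dma,
					GFP_KERNEL);
	if (!fe->bd_cpu)
		return -ENOMEM;

	/* The DMA address is programmed into the device... */
	writel(lower_32_bits(fe->bd_dma), fe->regs + FOO_BD_BASE_REG);

	/* ...and the memory is only ever touched via the CPU pointer.
	 * Never phys_to_virt(fe->bd_dma), never virt_to_phys(fe->bd_cpu),
	 * never dma_sync_*() on fe->bd_dma. */
	memset(fe->bd_cpu, 0, size);

	return 0;
}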

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.


* i.MX 6 and PCIe DMA issues
  2017-07-13  9:04     ` Russell King - ARM Linux
@ 2017-07-13 10:00       ` michael.moese at men.de
  2017-07-13 10:18         ` Russell King - ARM Linux
  0 siblings, 1 reply; 10+ messages in thread
From: michael.moese at men.de @ 2017-07-13 10:00 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Jul 13, 2017 at 10:04:23AM +0100, Russell King - ARM Linux wrote:
> On Thu, Jul 13, 2017 at 09:07:19AM +0200, michael.moese at men.de wrote:
> > OK, I hope I have this right: I allocate my memory using
> > dma_alloc_coherent() once, pass the dma_handle to the device, and
> > no other dma_map_*() or dma_sync_*() calls are needed?
> 
> Correct.
Great. Thanks for the clarification. I will make sure the usage is done
the right way.

> You must always use the two addresses returned from dma_alloc_coherent()
> and never try to translate them - the DMA address is for programming
> into the hardware, and the virtual address for CPU accesses.  If you do
> try to pass these to the various translators (like virt_to_phys()) the
> result is undefined.

I was just trying to print some addresses for debugging; I have
removed this again. I'm just confused, because the DMA device address
seems to be the physical one: at least that memory contains my data,
while the memory at the virtual address contains different data.

Could it be possible that this mapping on my board is going wrong?

Michael


* i.MX 6 and PCIe DMA issues
  2017-07-13 10:00       ` michael.moese at men.de
@ 2017-07-13 10:18         ` Russell King - ARM Linux
  0 siblings, 0 replies; 10+ messages in thread
From: Russell King - ARM Linux @ 2017-07-13 10:18 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Jul 13, 2017 at 12:00:31PM +0200, michael.moese at men.de wrote:
> On Thu, Jul 13, 2017 at 10:04:23AM +0100, Russell King - ARM Linux wrote:
> > You must always use the two addresses returned from dma_alloc_coherent()
> > and never try to translate them - the DMA address is for programming
> > into the hardware, and the virtual address for CPU accesses.  If you do
> > try to pass these to the various translators (like virt_to_phys()) the
> > result is undefined.
> 
> I was just trying to print some addresses for debugging; I have
> removed this again. I'm just confused, because the DMA device address
> seems to be the physical one: at least that memory contains my data,
> while the memory at the virtual address contains different data.

The DMA address might be a physical address if that's what the device
requires, but equally, it might not be.

If it is a physical address, phys_to_virt() will return the virtual
address in the lowmem mapping provided that the physical address
corresponds with a lowmem mapped page.  If it's a highmem page, then
phys_to_virt() will still return something that looks like a virtual
address, but it could be anything.  It's undefined.

Accessing memory through that pointer could end up hitting a cacheable
mapping, which would destroy the semantics of the DMA coherent mapping,
thereby leading to unpredictable behaviour.

As I've already said, use _both_ pointers for a DMA coherent allocation.
If you want to dump the data in the DMA coherent memory, use the virtual
address returned from dma_alloc_coherent() - this will have the least
side effects.
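
For example (illustrative only; "cpu_addr" here stands for whatever
virtual address your dma_alloc_coherent() call returned):

	/* dump the first 64 bytes through the coherent CPU pointer */
	print_hex_dump(KERN_DEBUG, "bd: ", DUMP_PREFIX_OFFSET,
		       16, 1, cpu_addr, 64, true);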

If you access it through phys_to_virt(dma_addr) then you could be
dragging the data into the caches, and the first time you print it, it
would look correct, but when the DMA device updates it, it becomes stale,
and even if you subsequently access it through the correct pointer, you
could continue receiving the stale data.  Just don't do this - using
phys_to_virt() may seem like a short-cut, but it really isn't.

> Could it be possible that this mapping on my board is going wrong?

Unlikely.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.


* i.MX 6 and PCIe DMA issues
  2017-07-13  7:07   ` michael.moese at men.de
  2017-07-13  9:04     ` Russell King - ARM Linux
@ 2017-07-13 14:57     ` Robin Murphy
  2017-07-14 11:07       ` michael.moese at men.de
  1 sibling, 1 reply; 10+ messages in thread
From: Robin Murphy @ 2017-07-13 14:57 UTC (permalink / raw)
  To: linux-arm-kernel

On 13/07/17 08:07, michael.moese at men.de wrote:
> 
> 
> On Tue, Jul 11, 2017 at 03:50:19PM +0100, Robin Murphy wrote:
>> I don't much like the sound of that "and" there - coherent DMA
>> allocations are, as the name implies, already coherent for CPU and DMA
>> accesses, and require no maintenance; the streaming DMA API
>> (dma_{map,unmap,sync}_*) on the other hand is *only* for use on
>> kmalloced memory.
> OK, I hope I have this right: I allocate my memory using
> dma_alloc_coherent() once, pass the dma_handle to the device, and no
> other dma_map_*() or dma_sync_*() calls are needed?
> 
> Yesterday I observed some strange behavior when I added some debug
> prints to the driver, printing the result of phys_to_virt() along
> with my virtual address and the dma_handle. I know these may be
> different, but when I dump the memory (from userspace using mmap), I
> can see the data at the address of my dma_handle, while the memory
> the driver holds the pointer to has different contents. I don't
> understand this.
> 
>  
>> The PL310 does have more than its fair share of wackiness, but unless
>> you also see DMA going wrong for the on-chip peripherals, the problem is
>> almost certainly down to the driver itself rather than the cache
>> configuration.
> Well, I think I need to dive into this as well; my former co-worker
> disabled DMA for SPI, for example, in the device tree. That may be
> another hint. I will need to find out what is set up here.

OK, the first thing I'd check is that you have the "arm,shared-override"
property on the DT node for the PL310 and that it's being applied
correctly (e.g. you're not entering the kernel with L2 already enabled).
Otherwise I think it might be possible to end up in a situation where
you get stale data into L2 that even reads from the non-cacheable remap
of the buffer can hit, which would probably look rather like what you're
describing.

Robin.


* i.MX 6 and PCIe DMA issues
  2017-07-13 14:57     ` Robin Murphy
@ 2017-07-14 11:07       ` michael.moese at men.de
  0 siblings, 0 replies; 10+ messages in thread
From: michael.moese at men.de @ 2017-07-14 11:07 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Jul 13, 2017 at 03:57:19PM +0100, Robin Murphy wrote:
> >> I don't much like the sound of that "and" there - coherent DMA
> >> allocations are, as the name implies, already coherent for CPU and DMA
> >> accesses, and require no maintenance; the streaming DMA API
> >> (dma_{map,unmap,sync}_*) on the other hand is *only* for use on
> >> kmalloced memory.
I removed all the calls to dma_{map,unmap,sync}_* and took special
care with the handling of the dma_addr_t's. I now dma_alloc_coherent()
my memory and do nothing else with the dma_addr_t's. It seems to work,
at least partly, now. I had to grab a newer board, because we killed
the old one while soldering wires onto it to trace PCI Express
traffic.

Maybe I had some weird old CPU or a strange U-Boot, or I simply missed
removing all the dma_sync_* calls or similar before.

I have other issues now, but at least the DMA seems to be working.
Thanks to you, I think I finally understand this topic.

Now I can try to figure out why a read over PCI Express may stop all
traffic - but that's a different story.


Thanks,
Michael

