All of lore.kernel.org
 help / color / mirror / Atom feed
* DMA remote memcpy requests
@ 2018-10-11  7:28 Adam Cottrel
  2018-10-12  9:09 ` Will Deacon
  0 siblings, 1 reply; 18+ messages in thread
From: Adam Cottrel @ 2018-10-11  7:28 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

I am using the ATH10K on Linux 14.4 with an Arm Cavium processor. During heavy loading, I am seeing that target initiated DMA requests are being silently dropped under extreme IO memory pressure and it is proving very difficult to isolate the root cause.

The ATH10K firmware uses the DMA API to set up phy_addr_t pointers (32-bit) which are then copied to a shared ring buffer. The target then initiates the memcpy operation (for target-to-host reads), but I do not have any means of debugging the target directly, and so I am looking for software hooks on the host that might help debug this complex problem.

Please can someone explain the low-level operation of DMA once it becomes a target initiated memcpy function?

Best,
Adam

p.s. I have tested with and without the IOMMU, and I have eliminated issues such as cache coherency being the root cause.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* DMA remote memcpy requests
  2018-10-11  7:28 DMA remote memcpy requests Adam Cottrel
@ 2018-10-12  9:09 ` Will Deacon
  2018-10-12  9:48   ` Adam Cottrel
  0 siblings, 1 reply; 18+ messages in thread
From: Will Deacon @ 2018-10-12  9:09 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Adam,

[+Robin and Cavium folks -- it's usually best to cc people as well as
 mailing the list]

On Thu, Oct 11, 2018 at 07:28:37AM +0000, Adam Cottrel wrote:
> I am using the ATH10K on Linux 14.4 with an Arm Cavium processor. During
> heavy loading, I am seeing that target initiated DMA requests are being
> silently dropped under extreme IO memory pressure and it is proving very
> difficult to isolate the root cause.

Is this ThunderX 1 or 2 or something else? Can you reproduce the issue with
mainline?

> The ATH10K firmware uses the DMA API to set up phy_addr_t pointers
> (32-bit) which are then copied to a shared ring buffer. The target then
> initiates the memcpy operation (for target-to-host reads), but I do not
> have any means of debugging the target directly, and so I am looking for
> software hooks on the host that might help debug this complex problem.

How does the firmware use the DMA API, or are you referring to a driver? If
the latter, could you point us to the code, please? Is it using the
streaming API, or is this a coherent allocation?

> Please can someone explain the low-level operation of DMA once it becomes
> a target initiated memcpy function?

I think we need a better handle on the issue first.

> p.s. I have tested with and without the IOMMU, and I have eliminated
> issues such as cache coherency being the root cause.

Right, not sure how the SMMU would help here.

Will

^ permalink raw reply	[flat|nested] 18+ messages in thread

* DMA remote memcpy requests
  2018-10-12  9:09 ` Will Deacon
@ 2018-10-12  9:48   ` Adam Cottrel
  2018-10-12 10:46     ` Robin Murphy
  2018-10-12 11:03     ` Jan Glauber
  0 siblings, 2 replies; 18+ messages in thread
From: Adam Cottrel @ 2018-10-12  9:48 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Will,

Thank you for getting back to me.

> [+Robin and Cavium folks -- it's usually best to cc people as well as  mailing
> the list]
I will remember this for future. Thanks for the advice.

> > I am using the ATH10K on Linux 14.4 with an Arm Cavium processor.
> > During heavy loading, I am seeing that target initiated DMA requests
> > are being silently dropped under extreme IO memory pressure and it is
> > proving very difficult to isolate the root cause.
> 
> Is this ThunderX 1 or 2 or something else? Can you reproduce the issue with
> mainline?
I am using:-
        model = "Cavium ThunderX CN81XX board";
        compatible = "cavium,thunder-81xx";

Yes - the issue can be reproduced on the mainline, but here is a link to the code branch that I am using:-
https://elixir.bootlin.com/linux/v4.14.4/source/drivers/net/wireless/ath/ath10k

> > The ATH10K firmware uses the DMA API to set up phy_addr_t pointers
> > (32-bit) which are then copied to a shared ring buffer. The target
> > then initiates the memcpy operation (for target-to-host reads), but I
> > do not have any means of debugging the target directly, and so I am
> > looking for software hooks on the host that might help debug this complex
> problem.
> 
> How does the firmware use the DMA API, or are you referring to a driver? If
> the latter, could you point us to the code, please? Is it using the streaming
> API, or is this a coherent allocation?
The code is using the ARM64 DMA API. It cuts corners in places (!!) but for the most part, it follows the rules. In local tests, I have added memory barriers (e.g. dmb(SY)) and even put in low-level flush/invalidate calls (DC CIVAC) to try and eliminate cache-coherency type problems.

The receive fault can be observed in the Rx handler which can be found on line 528 of ce.c:-
https://elixir.bootlin.com/linux/v4.14.4/source/drivers/net/wireless/ath/ath10k/ce.c

The memory is allocated by the Rx post buffer function which is on line 760 of pci.c:-
https://elixir.bootlin.com/linux/v4.14.4/source/drivers/net/wireless/ath/ath10k/pci.c

To better observe the fault, I made the following change:-
 + On allocation, I use memset to clear the skb->data (pci.c::770)
 + On receive, I check that the data is not zero (ce.c::555)
 + If the data is not yet written, I exit the Rx IRQ handler and try again.

In tests, the code works as expected under normal operation, however once I start to simulate a heavy memory pressure situation then the Rx handler starts to fail. This failure (if allowed to continue) will eventually tear down the entire module and crash the target firmware because presumably they are seeing similar dropouts on the transmit path.

When the fault is happening, if I poll the target registers (e.g. write counters over MMIO) I can see that they are still sending us new messages. In other words, they have silently failed to send the data, or rather we have silently failed to accept the memory copy. I am not able to access the target firmware directly, but I have been reliably informed that the DMA memcpy operation is initiated by the target.

My memory pressure test uses a large dd copy to create a lot of dirty memory pages. This always creates the fault, however without any memory pressure the code runs beautifully...

> > Please can someone explain the low-level operation of DMA once it
> > becomes a target initiated memcpy function?
> 
> I think we need a better handle on the issue first.

I fully agree - please tell me what you want to know :-D

> > p.s. I have tested with and without the IOMMU, and I have eliminated
> > issues such as cache coherency being the root cause.
> 
> Right, not sure how the SMMU would help here.

Understood, and thanks for taking the time to reply, and I look forward to hearing your thoughts as I would like to fix this issue once and for all.

Best,
Adam

^ permalink raw reply	[flat|nested] 18+ messages in thread

* DMA remote memcpy requests
  2018-10-12  9:48   ` Adam Cottrel
@ 2018-10-12 10:46     ` Robin Murphy
  2018-10-12 11:06       ` Adam Cottrel
  2018-10-15 14:34       ` Adam Cottrel
  2018-10-12 11:03     ` Jan Glauber
  1 sibling, 2 replies; 18+ messages in thread
From: Robin Murphy @ 2018-10-12 10:46 UTC (permalink / raw)
  To: linux-arm-kernel

On 12/10/18 10:48, Adam Cottrel wrote:
> Hi Will,
> 
> Thank you for getting back to me.
> 
>> [+Robin and Cavium folks -- it's usually best to cc people as well as  mailing
>> the list]
> I will remember this for future. Thanks for the advice.
> 
>>> I am using the ATH10K on Linux 14.4 with an Arm Cavium processor.
>>> During heavy loading, I am seeing that target initiated DMA requests
>>> are being silently dropped under extreme IO memory pressure and it is
>>> proving very difficult to isolate the root cause.
>>
>> Is this ThunderX 1 or 2 or something else? Can you reproduce the issue with
>> mainline?
> I am using:-
>          model = "Cavium ThunderX CN81XX board";
>          compatible = "cavium,thunder-81xx";
> 
> Yes - the issue can be reproduced on the mainline, but here is a link to the code branch that I am using:-
> https://elixir.bootlin.com/linux/v4.14.4/source/drivers/net/wireless/ath/ath10k
> 
>>> The ATH10K firmware uses the DMA API to set up phy_addr_t pointers
>>> (32-bit) which are then copied to a shared ring buffer. The target

That's the first alarm bell - phys_addr_t is still going to be 64-bit on 
any arm64 platform. If the device is expecting 32-bit addresses but 
somehow doesn't have its DMA mask set appropriately, then if you have 
more than 3GB or so of RAM there's the potential for addresses to get 
truncated such that the DMA *does* happen, but to the wrong place.

However, with SMMU translation enabled (i.e. not just passthrough), then 
I'd expect that same situation to cause more or less all DMA to fail, so 
if you've successfully tested that setup it must be something much more 
subtle :/

>>> then initiates the memcpy operation (for target-to-host reads), but I
>>> do not have any means of debugging the target directly, and so I am
>>> looking for software hooks on the host that might help debug this complex
>> problem.
>>
>> How does the firmware use the DMA API, or are you referring to a driver? If
>> the latter, could you point us to the code, please? Is it using the streaming
>> API, or is this a coherent allocation?
> The code is using the ARM64 DMA API. It cuts corners in places (!!) but for the most part, it follows the rules. In local tests, I have added memory barriers (e.g. dmb(SY)) and even put in low-level flush/invalidate calls (DC CIVAC) to try and eliminate cache-coherency type problems.
> 
> The receive fault can be observed in the Rx handler which can be found on line 528 of ce.c:-
> https://elixir.bootlin.com/linux/v4.14.4/source/drivers/net/wireless/ath/ath10k/ce.c
> 
> The memory is allocated by the Rx post buffer function which is on line 760 of pci.c:-
> https://elixir.bootlin.com/linux/v4.14.4/source/drivers/net/wireless/ath/ath10k/pci.c
> 
> To better observe the fault, I made the following change:-
>   + On allocation, I use memset to clear the skb->data (pci.c::770)
>   + On receive, I check that the data is not zero (ce.c::555)
>   + If the data is not yet written, I exit the Rx IRQ handler and try again.
> 
> In tests, the code works as expected under normal operation, however once I start to simulate a heavy memory pressure situation then the Rx handler starts to fail. This failure (if allowed to continue) will eventually tear down the entire module and crash the target firmware because presumably they are seeing similar dropouts on the transmit path.
> 
> When the fault is happening, if I poll the target registers (e.g. write counters over MMIO) I can see that they are still sending us new messages. In other words, they have silently failed to send the data, or rather we have silently failed to accept the memory copy. I am not able to access the target firmware directly, but I have been reliably informed that the DMA memcpy operation is initiated by the target.
> 
> My memory pressure test uses a large dd copy to create a lot of dirty memory pages. This always creates the fault, however without any memory pressure the code runs beautifully...

Are you able to characterise whether it's actually the memory pressure 
itself that changes the behaviour (e.g. difficulty in allocating new 
SKBs), or is it just that there's suddenly a lot more work going on in 
general? Those aren't exactly the most powerful CPU cores, and with only 
2 or 4 of them it doesn't seem impossible that the system could simply 
get loaded to the point where it can't keep up and starts dropping 
things on the floor.

Robin.

>>> Please can someone explain the low-level operation of DMA once it
>>> becomes a target initiated memcpy function?
>>
>> I think we need a better handle on the issue first.
> 
> I fully agree - please tell me what you want to know :-D
> 
>>> p.s. I have tested with and without the IOMMU, and I have eliminated
>>> issues such as cache coherency being the root cause.
>>
>> Right, not sure how the SMMU would help here.
> 
> Understood, and thanks for taking the time to reply, and I look forward to hearing your thoughts as I would like to fix this issue once and for all.
> 
> Best,
> Adam
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* DMA remote memcpy requests
  2018-10-12  9:48   ` Adam Cottrel
  2018-10-12 10:46     ` Robin Murphy
@ 2018-10-12 11:03     ` Jan Glauber
  2018-10-12 11:07       ` Adam Cottrel
  1 sibling, 1 reply; 18+ messages in thread
From: Jan Glauber @ 2018-10-12 11:03 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Oct 12, 2018 at 09:48:01AM +0000, Adam Cottrel wrote:
> Hi Will,
> 
> Thank you for getting back to me.
> 
> > [+Robin and Cavium folks -- it's usually best to cc people as well as  mailing
> > the list]
> I will remember this for future. Thanks for the advice.
> 
> > > I am using the ATH10K on Linux 14.4 with an Arm Cavium processor.
> > > During heavy loading, I am seeing that target initiated DMA requests
> > > are being silently dropped under extreme IO memory pressure and it is
> > > proving very difficult to isolate the root cause.
> >
> > Is this ThunderX 1 or 2 or something else? Can you reproduce the issue with
> > mainline?
> I am using:-
>         model = "Cavium ThunderX CN81XX board";
>         compatible = "cavium,thunder-81xx";
> 

Hi Adam,

what is the exact hardware revision (shown in /proc/cpuinfo) ?

--Jan

^ permalink raw reply	[flat|nested] 18+ messages in thread

* DMA remote memcpy requests
  2018-10-12 10:46     ` Robin Murphy
@ 2018-10-12 11:06       ` Adam Cottrel
  2018-10-15 14:34       ` Adam Cottrel
  1 sibling, 0 replies; 18+ messages in thread
From: Adam Cottrel @ 2018-10-12 11:06 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Robin,

Thank you for taking the time to reply.

> >>> I am using the ATH10K on Linux 14.4 with an Arm Cavium processor.
> >>> During heavy loading, I am seeing that target initiated DMA requests
> >>> are being silently dropped under extreme IO memory pressure and it
> >>> is proving very difficult to isolate the root cause.
> >>
> >> Is this ThunderX 1 or 2 or something else? Can you reproduce the
> >> issue with mainline?
> > I am using:-
> >          model = "Cavium ThunderX CN81XX board";
> >          compatible = "cavium,thunder-81xx";
> >
> > Yes - the issue can be reproduced on the mainline, but here is a link
> > to the code branch that I am using:-
> > https://elixir.bootlin.com/linux/v4.14.4/source/drivers/net/wireless/a
> > th/ath10k
> >
> >>> The ATH10K firmware uses the DMA API to set up phy_addr_t pointers
> >>> (32-bit) which are then copied to a shared ring buffer. The target
> 
> That's the first alarm bell - phys_addr_t is still going to be 64-bit on any arm64
> platform. If the device is expecting 32-bit addresses but somehow doesn't
> have its DMA mask set appropriately, then if you have more than 3GB or so
> of RAM there's the potential for addresses to get truncated such that the
> DMA *does* happen, but to the wrong place.

The firmware is setting the mask to be 32-bit. This was the first thing that I checked, but it is a good observation - the ATH10K hardware is limited to 32-bit.

> However, with SMMU translation enabled (i.e. not just passthrough), then
> I'd expect that same situation to cause more or less all DMA to fail, so if
> you've successfully tested that setup it must be something much more
> subtle :/

Agreed - like I said, under normal conditions (fair weather) there is no issue with DMA.

> 
> >>> then initiates the memcpy operation (for target-to-host reads), but
> >>> I do not have any means of debugging the target directly, and so I
> >>> am looking for software hooks on the host that might help debug this
> >>> complex
> >> problem.
> >>
> >> How does the firmware use the DMA API, or are you referring to a
> >> driver? If the latter, could you point us to the code, please? Is it
> >> using the streaming API, or is this a coherent allocation?
> > The code is using the ARM64 DMA API. It cuts corners in places (!!) but for
> the most part, it follows the rules. In local tests, I have added memory
> barriers (e.g. dmb(SY)) and even put in low-level flush/invalidate calls (DC
> CIVAC) to try and eliminate cache-coherency type problems.
> >
> > The receive fault can be observed in the Rx handler which can be found
> > on line 528 of ce.c:-
> > https://elixir.bootlin.com/linux/v4.14.4/source/drivers/net/wireless/a
> > th/ath10k/ce.c
> >
> > The memory is allocated by the Rx post buffer function which is on
> > line 760 of pci.c:-
> > https://elixir.bootlin.com/linux/v4.14.4/source/drivers/net/wireless/a
> > th/ath10k/pci.c
> >
> > To better observe the fault, I made the following change:-
> >   + On allocation, I use memset to clear the skb->data (pci.c::770)
> >   + On receive, I check that the data is not zero (ce.c::555)
> >   + If the data is not yet written, I exit the Rx IRQ handler and try again.
> >
> > In tests, the code works as expected under normal operation, however
> once I start to simulate a heavy memory pressure situation then the Rx
> handler starts to fail. This failure (if allowed to continue) will eventually tear
> down the entire module and crash the target firmware because presumably
> they are seeing similar dropouts on the transmit path.
> >
> > When the fault is happening, if I poll the target registers (e.g. write
> counters over MMIO) I can see that they are still sending us new messages.
> In other words, they have silently failed to send the data, or rather we have
> silently failed to accept the memory copy. I am not able to access the target
> firmware directly, but I have been reliably informed that the DMA memcpy
> operation is initiated by the target.
> >
> > My memory pressure test uses a large dd copy to create a lot of dirty
> memory pages. This always creates the fault, however without any memory
> pressure the code runs beautifully...
> 
> Are you able to characterise whether it's actually the memory pressure itself
> that changes the behaviour (e.g. difficulty in allocating new SKBs), or is it just
> that there's suddenly a lot more work going on in general? Those aren't
> exactly the most powerful CPU cores, and with only
> 2 or 4 of them it doesn't seem impossible that the system could simply get
> loaded to the point where it can't keep up and starts dropping things on the
> floor.

I have done many tests, so hopefully I can help to characterise the fault in detail.

The fault happens when I read or write a large file to Flash. For example, if I run the following:-
+ Running: dd bs=1M count=1700 if=/dev/zero of=/tmp.bin
 
However, the fault does not happen with the following tests:-
 + Overloaded CPU cores (e.g. stress)
 + Stressing userspace RAM
 + Running: dd bs=1M count=1700 if=/dev/zero oflag=direct of=/tmp.bin      <<-- this dd will limit the cache size by writing to the Flash direct
 + Running: dd bs=1M count=1700 if=/dev/zero of=/mnt/usbflash                   <<-- this dd writes to my USB Flash which does is not cached

I can reduce the likelihood of the crash with the following:-
 + Remounting the rootfs (MMC) with the SYNC flag
 + Increasing vm.free_mem_kbytes to 200MB
 + Regularly calling vm.drop_cache to force cache memory to be freed

However, when the fault condition occurs, I do not see page allocation faults.
The code (in so far as I can tell) does try to report memory allocation failures e.g. returns from the DMA API and skt_buf code are all checked

Best,
Adam

^ permalink raw reply	[flat|nested] 18+ messages in thread

* DMA remote memcpy requests
  2018-10-12 11:03     ` Jan Glauber
@ 2018-10-12 11:07       ` Adam Cottrel
  0 siblings, 0 replies; 18+ messages in thread
From: Adam Cottrel @ 2018-10-12 11:07 UTC (permalink / raw)
  To: linux-arm-kernel

Dear Jan,

> what is the exact hardware revision (shown in /proc/cpuinfo) ?

cat /proc/cpuinfo

processor       : 0
BogoMIPS        : 200.00
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
CPU implementer : 0x43
CPU architecture: 8
CPU variant     : 0x0
CPU part        : 0x0a2
CPU revision    : 2

processor       : 1
BogoMIPS        : 200.00
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
CPU implementer : 0x43
CPU architecture: 8
CPU variant     : 0x0
CPU part        : 0x0a2
CPU revision    : 2

processor       : 2
BogoMIPS        : 200.00
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
CPU implementer : 0x43
CPU architecture: 8
CPU variant     : 0x0
CPU part        : 0x0a2
CPU revision    : 2

processor       : 3
BogoMIPS        : 200.00
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
CPU implementer : 0x43
CPU architecture: 8
CPU variant     : 0x0
CPU part        : 0x0a2
CPU revision    : 2

Best,
Adam

^ permalink raw reply	[flat|nested] 18+ messages in thread

* DMA remote memcpy requests
  2018-10-12 10:46     ` Robin Murphy
  2018-10-12 11:06       ` Adam Cottrel
@ 2018-10-15 14:34       ` Adam Cottrel
  2018-10-15 15:09         ` Jan Glauber
       [not found]         ` <DM6PR07MB4923F3328079199090D6D2CA9EFE0@DM6PR07MB4923.namprd07.prod.outlook.com>
  1 sibling, 2 replies; 18+ messages in thread
From: Adam Cottrel @ 2018-10-15 14:34 UTC (permalink / raw)
  To: linux-arm-kernel

Dear Robin/Jan/Will,

Any thoughts on what I can do to further diagnose the root cause?

Best,
Adam

> -----Original Message-----
> From: Robin Murphy <robin.murphy@arm.com>
> Sent: 12 October 2018 11:47
> To: Adam Cottrel <adam.cottrel@veea.com>; Will Deacon
> <will.deacon@arm.com>
> Cc: linux-arm-kernel at lists.infradead.org; rric at kernel.org;
> jglauber at cavium.com; jnair at caviumnetworks.com; sgoutham at cavium.com
> Subject: Re: DMA remote memcpy requests
> 
> On 12/10/18 10:48, Adam Cottrel wrote:
> > Hi Will,
> >
> > Thank you for getting back to me.
> >
> >> [+Robin and Cavium folks -- it's usually best to cc people as well as
> >> mailing the list]
> > I will remember this for future. Thanks for the advice.
> >
> >>> I am using the ATH10K on Linux 14.4 with an Arm Cavium processor.
> >>> During heavy loading, I am seeing that target initiated DMA requests
> >>> are being silently dropped under extreme IO memory pressure and it
> >>> is proving very difficult to isolate the root cause.
> >>
> >> Is this ThunderX 1 or 2 or something else? Can you reproduce the
> >> issue with mainline?
> > I am using:-
> >          model = "Cavium ThunderX CN81XX board";
> >          compatible = "cavium,thunder-81xx";
> >
> > Yes - the issue can be reproduced on the mainline, but here is a link
> > to the code branch that I am using:-
> > https://elixir.bootlin.com/linux/v4.14.4/source/drivers/net/wireless/a
> > th/ath10k
> >
> >>> The ATH10K firmware uses the DMA API to set up phy_addr_t pointers
> >>> (32-bit) which are then copied to a shared ring buffer. The target
> 
> That's the first alarm bell - phys_addr_t is still going to be 64-bit on any arm64
> platform. If the device is expecting 32-bit addresses but somehow doesn't
> have its DMA mask set appropriately, then if you have more than 3GB or so
> of RAM there's the potential for addresses to get truncated such that the
> DMA *does* happen, but to the wrong place.
> 
> However, with SMMU translation enabled (i.e. not just passthrough), then
> I'd expect that same situation to cause more or less all DMA to fail, so if
> you've successfully tested that setup it must be something much more
> subtle :/
> 
> >>> then initiates the memcpy operation (for target-to-host reads), but
> >>> I do not have any means of debugging the target directly, and so I
> >>> am looking for software hooks on the host that might help debug this
> >>> complex
> >> problem.
> >>
> >> How does the firmware use the DMA API, or are you referring to a
> >> driver? If the latter, could you point us to the code, please? Is it
> >> using the streaming API, or is this a coherent allocation?
> > The code is using the ARM64 DMA API. It cuts corners in places (!!) but for
> the most part, it follows the rules. In local tests, I have added memory
> barriers (e.g. dmb(SY)) and even put in low-level flush/invalidate calls (DC
> CIVAC) to try and eliminate cache-coherency type problems.
> >
> > The receive fault can be observed in the Rx handler which can be found
> > on line 528 of ce.c:-
> > https://elixir.bootlin.com/linux/v4.14.4/source/drivers/net/wireless/a
> > th/ath10k/ce.c
> >
> > The memory is allocated by the Rx post buffer function which is on
> > line 760 of pci.c:-
> > https://elixir.bootlin.com/linux/v4.14.4/source/drivers/net/wireless/a
> > th/ath10k/pci.c
> >
> > To better observe the fault, I made the following change:-
> >   + On allocation, I use memset to clear the skb->data (pci.c::770)
> >   + On receive, I check that the data is not zero (ce.c::555)
> >   + If the data is not yet written, I exit the Rx IRQ handler and try again.
> >
> > In tests, the code works as expected under normal operation, however
> once I start to simulate a heavy memory pressure situation then the Rx
> handler starts to fail. This failure (if allowed to continue) will eventually tear
> down the entire module and crash the target firmware because presumably
> they are seeing similar dropouts on the transmit path.
> >
> > When the fault is happening, if I poll the target registers (e.g. write
> counters over MMIO) I can see that they are still sending us new messages.
> In other words, they have silently failed to send the data, or rather we have
> silently failed to accept the memory copy. I am not able to access the target
> firmware directly, but I have been reliably informed that the DMA memcpy
> operation is initiated by the target.
> >
> > My memory pressure test uses a large dd copy to create a lot of dirty
> memory pages. This always creates the fault, however without any memory
> pressure the code runs beautifully...
> 
> Are you able to characterise whether it's actually the memory pressure itself
> that changes the behaviour (e.g. difficulty in allocating new SKBs), or is it just
> that there's suddenly a lot more work going on in general? Those aren't
> exactly the most powerful CPU cores, and with only
> 2 or 4 of them it doesn't seem impossible that the system could simply get
> loaded to the point where it can't keep up and starts dropping things on the
> floor.
> 
> Robin.
> 
> >>> Please can someone explain the low-level operation of DMA once it
> >>> becomes a target initiated memcpy function?
> >>
> >> I think we need a better handle on the issue first.
> >
> > I fully agree - please tell me what you want to know :-D
> >
> >>> p.s. I have tested with and without the IOMMU, and I have eliminated
> >>> issues such as cache coherency being the root cause.
> >>
> >> Right, not sure how the SMMU would help here.
> >
> > Understood, and thanks for taking the time to reply, and I look forward to
> hearing your thoughts as I would like to fix this issue once and for all.
> >
> > Best,
> > Adam
> >

^ permalink raw reply	[flat|nested] 18+ messages in thread

* DMA remote memcpy requests
  2018-10-15 14:34       ` Adam Cottrel
@ 2018-10-15 15:09         ` Jan Glauber
  2018-10-15 15:24           ` Adam Cottrel
       [not found]         ` <DM6PR07MB4923F3328079199090D6D2CA9EFE0@DM6PR07MB4923.namprd07.prod.outlook.com>
  1 sibling, 1 reply; 18+ messages in thread
From: Jan Glauber @ 2018-10-15 15:09 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Oct 15, 2018 at 02:34:35PM +0000, Adam Cottrel wrote:
> Dear Robin/Jan/Will,
> 
> Any thoughts on what I can do to further diagnose the root cause?

Hi Adam,

from your description this sound like it:
- only happens under memory pressure
- only happens when you combine atheros DMA with something else (or does
  the MMC stress test trigger any faults on its own?)

With that I would look through all the allocations in the atheros
driver and especially look for any missing error handling. But that's
just my 2 cents, maybe Robin or Will can give better advise here...

Regards,
Jan

> 
> Best,
> Adam
> 
> > -----Original Message-----
> > From: Robin Murphy <robin.murphy@arm.com>
> > Sent: 12 October 2018 11:47
> > To: Adam Cottrel <adam.cottrel@veea.com>; Will Deacon
> > <will.deacon@arm.com>
> > Cc: linux-arm-kernel at lists.infradead.org; rric at kernel.org;
> > jglauber at cavium.com; jnair at caviumnetworks.com; sgoutham at cavium.com
> > Subject: Re: DMA remote memcpy requests
> >
> > On 12/10/18 10:48, Adam Cottrel wrote:
> > > Hi Will,
> > >
> > > Thank you for getting back to me.
> > >
> > >> [+Robin and Cavium folks -- it's usually best to cc people as well as
> > >> mailing the list]
> > > I will remember this for future. Thanks for the advice.
> > >
> > >>> I am using the ATH10K on Linux 14.4 with an Arm Cavium processor.
> > >>> During heavy loading, I am seeing that target initiated DMA requests
> > >>> are being silently dropped under extreme IO memory pressure and it
> > >>> is proving very difficult to isolate the root cause.
> > >>
> > >> Is this ThunderX 1 or 2 or something else? Can you reproduce the
> > >> issue with mainline?
> > > I am using:-
> > >          model = "Cavium ThunderX CN81XX board";
> > >          compatible = "cavium,thunder-81xx";
> > >
> > > Yes - the issue can be reproduced on the mainline, but here is a link
> > > to the code branch that I am using:-
> > > https://elixir.bootlin.com/linux/v4.14.4/source/drivers/net/wireless/a
> > > th/ath10k
> > >
> > >>> The ATH10K firmware uses the DMA API to set up phy_addr_t pointers
> > >>> (32-bit) which are then copied to a shared ring buffer. The target
> >
> > That's the first alarm bell - phys_addr_t is still going to be 64-bit on any arm64
> > platform. If the device is expecting 32-bit addresses but somehow doesn't
> > have its DMA mask set appropriately, then if you have more than 3GB or so
> > of RAM there's the potential for addresses to get truncated such that the
> > DMA *does* happen, but to the wrong place.
> >
> > However, with SMMU translation enabled (i.e. not just passthrough), then
> > I'd expect that same situation to cause more or less all DMA to fail, so if
> > you've successfully tested that setup it must be something much more
> > subtle :/
> >
> > >>> then initiates the memcpy operation (for target-to-host reads), but
> > >>> I do not have any means of debugging the target directly, and so I
> > >>> am looking for software hooks on the host that might help debug this
> > >>> complex
> > >> problem.
> > >>
> > >> How does the firmware use the DMA API, or are you referring to a
> > >> driver? If the latter, could you point us to the code, please? Is it
> > >> using the streaming API, or is this a coherent allocation?
> > > The code is using the ARM64 DMA API. It cuts corners in places (!!) but for
> > the most part, it follows the rules. In local tests, I have added memory
> > barriers (e.g. dmb(SY)) and even put in low-level flush/invalidate calls (DC
> > CIVAC) to try and eliminate cache-coherency type problems.
> > >
> > > The receive fault can be observed in the Rx handler which can be found
> > > on line 528 of ce.c:-
> > > https://elixir.bootlin.com/linux/v4.14.4/source/drivers/net/wireless/a
> > > th/ath10k/ce.c
> > >
> > > The memory is allocated by the Rx post buffer function which is on
> > > line 760 of pci.c:-
> > > https://elixir.bootlin.com/linux/v4.14.4/source/drivers/net/wireless/a
> > > th/ath10k/pci.c
> > >
> > > To better observe the fault, I made the following change:-
> > >   + On allocation, I use memset to clear the skb->data (pci.c::770)
> > >   + On receive, I check that the data is not zero (ce.c::555)
> > >   + If the data is not yet written, I exit the Rx IRQ handler and try again.
> > >
> > > In tests, the code works as expected under normal operation, however
> > once I start to simulate a heavy memory pressure situation then the Rx
> > handler starts to fail. This failure (if allowed to continue) will eventually tear
> > down the entire module and crash the target firmware because presumably
> > they are seeing similar dropouts on the transmit path.
> > >
> > > When the fault is happening, if I poll the target registers (e.g. write
> > counters over MMIO) I can see that they are still sending us new messages.
> > In other words, they have silently failed to send the data, or rather we have
> > silently failed to accept the memory copy. I am not able to access the target
> > firmware directly, but I have been reliably informed that the DMA memcpy
> > operation is initiated by the target.
> > >
> > > My memory pressure test uses a large dd copy to create a lot of dirty
> > memory pages. This always creates the fault, however without any memory
> > pressure the code runs beautifully...
> >
> > Are you able to characterise whether it's actually the memory pressure itself
> > that changes the behaviour (e.g. difficulty in allocating new SKBs), or is it just
> > that there's suddenly a lot more work going on in general? Those aren't
> > exactly the most powerful CPU cores, and with only
> > 2 or 4 of them it doesn't seem impossible that the system could simply get
> > loaded to the point where it can't keep up and starts dropping things on the
> > floor.
> >
> > Robin.
> >
> > >>> Please can someone explain the low-level operation of DMA once it
> > >>> becomes a target initiated memcpy function?
> > >>
> > >> I think we need a better handle on the issue first.
> > >
> > > I fully agree - please tell me what you want to know :-D
> > >
> > >>> p.s. I have tested with and without the IOMMU, and I have eliminated
> > >>> issues such as cache coherency being the root cause.
> > >>
> > >> Right, not sure how the SMMU would help here.
> > >
> > > Understood, and thanks for taking the time to reply, and I look forward to
> > hearing your thoughts as I would like to fix this issue once and for all.
> > >
> > > Best,
> > > Adam
> > >

^ permalink raw reply	[flat|nested] 18+ messages in thread

* DMA remote memcpy requests
  2018-10-15 15:09         ` Jan Glauber
@ 2018-10-15 15:24           ` Adam Cottrel
  2018-10-15 15:39             ` Jan Glauber
  0 siblings, 1 reply; 18+ messages in thread
From: Adam Cottrel @ 2018-10-15 15:24 UTC (permalink / raw)
  To: linux-arm-kernel

Dear Jan,

> from your description this sound like it:
> - only happens under memory pressure
> - only happens when you combine atheros DMA with something else (or
> does
>   the MMC stress test trigger any faults on its own?)
> 
> With that I would look through all the allocations in the atheros driver and
> especially look for any missing error handling. But that's just my 2 cents,
> maybe Robin or Will can give better advise here...

That is good advice.

>From what I can see, there are checks made on every alloc, however, it is possible that the failure is silently handled.

For example, memory is allocated with __GFP_IGNORE and the error flag is lost because the called returned void...

I have put in a lot of debug code to look for this type of fault - it is possible that I have missed the exact point of failure...

Is there some kind of queue of outstanding remote DMA requests? And if so, is it possible that the request queue can overflow in some way?

Best,
Adam

^ permalink raw reply	[flat|nested] 18+ messages in thread

* DMA remote memcpy requests
  2018-10-15 15:24           ` Adam Cottrel
@ 2018-10-15 15:39             ` Jan Glauber
  2018-10-15 15:51               ` Adam Cottrel
  0 siblings, 1 reply; 18+ messages in thread
From: Jan Glauber @ 2018-10-15 15:39 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Oct 15, 2018 at 03:24:55PM +0000, Adam Cottrel wrote:
> Dear Jan,
> 
> > from your description this sound like it:
> > - only happens under memory pressure
> > - only happens when you combine atheros DMA with something else (or
> > does
> >   the MMC stress test trigger any faults on its own?)
> >
> > With that I would look through all the allocations in the atheros driver and
> > especially look for any missing error handling. But that's just my 2 cents,
> > maybe Robin or Will can give better advise here...
> 
> That is good advice.
> 
> >From what I can see, there are checks made on every alloc, however, it is possible that the failure is silently handled.
> 
> For example, memory is allocated with __GFP_IGNORE and the error flag is lost because the called returned void...
> 
> I have put in a lot of debug code to look for this type of fault - it is possible that I have missed the exact point of failure...
> 
> Is there some kind of queue of outstanding remote DMA requests? And if so, is it possible that the request queue can overflow in some way?

I'm not sure where that point would be where DMA request could be lost here.
The MMC and PCIe only meet in the NCB (near coprocessor bus) which goes
to the Coherent memory interconnect and L2 cache.

I've looked for any known errata but didn't find anything that would match
your problem.

--Jan

^ permalink raw reply	[flat|nested] 18+ messages in thread

* DMA remote memcpy requests
  2018-10-15 15:39             ` Jan Glauber
@ 2018-10-15 15:51               ` Adam Cottrel
  2018-10-18 15:36                 ` Adam Cottrel
  0 siblings, 1 reply; 18+ messages in thread
From: Adam Cottrel @ 2018-10-15 15:51 UTC (permalink / raw)
  To: linux-arm-kernel

Dear Jan,

> I'm not sure where that point would be where DMA request could be lost
> here.
> The MMC and PCIe only meet in the NCB (near coprocessor bus) which goes
> to the Coherent memory interconnect and L2 cache.
> 
> I've looked for any known errata but didn't find anything that would match
> your problem.

For the purposes of debug, is it possible for me to turn off the MMC? Or the L2 cache? Or put it into pass through mode? Or get any kind of stack trace on its operation?

Best,
Adam

^ permalink raw reply	[flat|nested] 18+ messages in thread

* DMA remote memcpy requests
       [not found]         ` <DM6PR07MB4923F3328079199090D6D2CA9EFE0@DM6PR07MB4923.namprd07.prod.outlook.com>
@ 2018-10-16 16:52           ` Adam Cottrel
  2018-10-16 17:08             ` Robin Murphy
  0 siblings, 1 reply; 18+ messages in thread
From: Adam Cottrel @ 2018-10-16 16:52 UTC (permalink / raw)
  To: linux-arm-kernel

Dear Sunil,

That is a great suggestion. Can someone advise on how to turn off the SMMU for testing purposes?

Best,
Adam

From: Goutham, Sunil <Sunil.Goutham@cavium.com> 
Sent: 16 October 2018 17:51
To: Adam Cottrel <adam.cottrel@veea.com>; Robin Murphy <robin.murphy@arm.com>; Will Deacon <will.deacon@arm.com>
Cc: linux-arm-kernel at lists.infradead.org; rric at kernel.org; Jan Glauber <Jan.Glauber@cavium.com>; Nair, Jayachandran <Jayachandran.Nair@cavium.com>; Goutham, Sunil <Sunil.Goutham@cavium.com>
Subject: Re: DMA remote memcpy requests

Hi Adam,

Is it possible for you to disable SMMU and do the same test ?
It might help in narrowing down whether transaction is lost at PCIeRC itself or
SMMU translation.

Thanks,
Sunil.


Sent from my Samsung Galaxy smartphone.


-------- Original message --------
From: Adam Cottrel <mailto:adam.cottrel@veea.com> 
Date: 15/10/2018 20:04 (GMT+05:30) 
To: Robin Murphy <mailto:robin.murphy@arm.com>, Will Deacon <mailto:will.deacon@arm.com> 
Cc: mailto:linux-arm-kernel at lists.infradead.org, mailto:rric at kernel.org, Jan Glauber <mailto:Jan.Glauber@cavium.com>, "Nair, Jayachandran" <mailto:Jayachandran.Nair@cavium.com>, "Goutham, Sunil" <mailto:Sunil.Goutham@cavium.com> 
Subject: RE: DMA remote memcpy requests 

External Email

Dear Robin/Jan/Will,

Any thoughts on what I can do to further diagnose the root cause?

Best,
Adam

> -----Original Message-----
> From: Robin Murphy <mailto:robin.murphy@arm.com>
> Sent: 12 October 2018 11:47
> To: Adam Cottrel <mailto:adam.cottrel@veea.com>; Will Deacon
> <mailto:will.deacon@arm.com>
> Cc: mailto:linux-arm-kernel at lists.infradead.org; mailto:rric at kernel.org;
> mailto:jglauber at cavium.com; mailto:jnair at caviumnetworks.com; mailto:sgoutham at cavium.com
> Subject: Re: DMA remote memcpy requests
>
> On 12/10/18 10:48, Adam Cottrel wrote:
> > Hi Will,
> >
> > Thank you for getting back to me.
> >
> >> [+Robin and Cavium folks -- it's usually best to cc people as well as
> >> mailing the list]
> > I will remember this for future. Thanks for the advice.
> >
> >>> I am using the ATH10K on Linux 14.4 with an Arm Cavium processor.
> >>> During heavy loading, I am seeing that target initiated DMA requests
> >>> are being silently dropped under extreme IO memory pressure and it
> >>> is proving very difficult to isolate the root cause.
> >>
> >> Is this ThunderX 1 or 2 or something else? Can you reproduce the
> >> issue with mainline?
> > I am using:-
> >????????? model = "Cavium ThunderX CN81XX board";
> >????????? compatible = "cavium,thunder-81xx";
> >
> > Yes - the issue can be reproduced on the mainline, but here is a link
> > to the code branch that I am using:-
> > https://elixir.bootlin.com/linux/v4.14.4/source/drivers/net/wireless/a
> > th/ath10k
> >
> >>> The ATH10K firmware uses the DMA API to set up phy_addr_t pointers
> >>> (32-bit) which are then copied to a shared ring buffer. The target
>
> That's the first alarm bell - phys_addr_t is still going to be 64-bit on any arm64
> platform. If the device is expecting 32-bit addresses but somehow doesn't
> have its DMA mask set appropriately, then if you have more than 3GB or so
> of RAM there's the potential for addresses to get truncated such that the
> DMA *does* happen, but to the wrong place.
>
> However, with SMMU translation enabled (i.e. not just passthrough), then
> I'd expect that same situation to cause more or less all DMA to fail, so if
> you've successfully tested that setup it must be something much more
> subtle :/
>
> >>> then initiates the memcpy operation (for target-to-host reads), but
> >>> I do not have any means of debugging the target directly, and so I
> >>> am looking for software hooks on the host that might help debug this
> >>> complex
> >> problem.
> >>
> >> How does the firmware use the DMA API, or are you referring to a
> >> driver? If the latter, could you point us to the code, please? Is it
> >> using the streaming API, or is this a coherent allocation?
> > The code is using the ARM64 DMA API. It cuts corners in places (!!) but for
> the most part, it follows the rules. In local tests, I have added memory
> barriers (e.g. dmb(SY)) and even put in low-level flush/invalidate calls (DC
> CIVAC) to try and eliminate cache-coherency type problems.
> >
> > The receive fault can be observed in the Rx handler which can be found
> > on line 528 of ce.c:-
> > https://elixir.bootlin.com/linux/v4.14.4/source/drivers/net/wireless/a
> > th/ath10k/ce.c
> >
> > The memory is allocated by the Rx post buffer function which is on
> > line 760 of pci.c:-
> > https://elixir.bootlin.com/linux/v4.14.4/source/drivers/net/wireless/a
> > th/ath10k/pci.c
> >
> > To better observe the fault, I made the following change:-
> >?? + On allocation, I use memset to clear the skb->data (pci.c::770)
> >?? + On receive, I check that the data is not zero (ce.c::555)
> >?? + If the data is not yet written, I exit the Rx IRQ handler and try again.
> >
> > In tests, the code works as expected under normal operation, however
> once I start to simulate a heavy memory pressure situation then the Rx
> handler starts to fail. This failure (if allowed to continue) will eventually tear
> down the entire module and crash the target firmware because presumably
> they are seeing similar dropouts on the transmit path.
> >
> > When the fault is happening, if I poll the target registers (e.g. write
> counters over MMIO) I can see that they are still sending us new messages.
> In other words, they have silently failed to send the data, or rather we have
> silently failed to accept the memory copy. I am not able to access the target
> firmware directly, but I have been reliably informed that the DMA memcpy
> operation is initiated by the target.
> >
> > My memory pressure test uses a large dd copy to create a lot of dirty
> memory pages. This always creates the fault, however without any memory
> pressure the code runs beautifully...
>
> Are you able to characterise whether it's actually the memory pressure itself
> that changes the behaviour (e.g. difficulty in allocating new SKBs), or is it just
> that there's suddenly a lot more work going on in general? Those aren't
> exactly the most powerful CPU cores, and with only
> 2 or 4 of them it doesn't seem impossible that the system could simply get
> loaded to the point where it can't keep up and starts dropping things on the
> floor.
>
> Robin.
>
> >>> Please can someone explain the low-level operation of DMA once it
> >>> becomes a target initiated memcpy function?
> >>
> >> I think we need a better handle on the issue first.
> >
> > I fully agree - please tell me what you want to know :-D
> >
> >>> p.s. I have tested with and without the IOMMU, and I have eliminated
> >>> issues such as cache coherency being the root cause.
> >>
> >> Right, not sure how the SMMU would help here.
> >
> > Understood, and thanks for taking the time to reply, and I look forward to
> hearing your thoughts as I would like to fix this issue once and for all.
> >
> > Best,
> > Adam
> >

^ permalink raw reply	[flat|nested] 18+ messages in thread

* DMA remote memcpy requests
  2018-10-16 16:52           ` Adam Cottrel
@ 2018-10-16 17:08             ` Robin Murphy
  0 siblings, 0 replies; 18+ messages in thread
From: Robin Murphy @ 2018-10-16 17:08 UTC (permalink / raw)
  To: linux-arm-kernel

On 16/10/18 17:52, Adam Cottrel wrote:
> Dear Sunil,
> 
> That is a great suggestion. Can someone advise on how to turn off the SMMU for testing purposes?

Unless the firmware does something funky, simply removing the driver 
from your kernel config should result in the SMMU remaining in its 
disabled and fully-bypassed state out of reset. That's a fair bit 
different from having the driver present with "iommu.passthrough=1", 
where the SMMU is enabled and actively permitting things to pass 
untranslated on a per-transaction basis, which involves a lot more going 
on under the covers.

Robin.

> 
> Best,
> Adam
> 
> From: Goutham, Sunil <Sunil.Goutham@cavium.com>
> Sent: 16 October 2018 17:51
> To: Adam Cottrel <adam.cottrel@veea.com>; Robin Murphy <robin.murphy@arm.com>; Will Deacon <will.deacon@arm.com>
> Cc: linux-arm-kernel at lists.infradead.org; rric at kernel.org; Jan Glauber <Jan.Glauber@cavium.com>; Nair, Jayachandran <Jayachandran.Nair@cavium.com>; Goutham, Sunil <Sunil.Goutham@cavium.com>
> Subject: Re: DMA remote memcpy requests
> 
> Hi Adam,
> 
> Is it possible for you to disable SMMU and do the same test ?
> It might help in narrowing down whether transaction is lost at PCIeRC itself or
> SMMU translation.
> 
> Thanks,
> Sunil.
> 
> 
> Sent from my Samsung Galaxy smartphone.
> 
> 
> -------- Original message --------
> From: Adam Cottrel <mailto:adam.cottrel@veea.com>
> Date: 15/10/2018 20:04 (GMT+05:30)
> To: Robin Murphy <mailto:robin.murphy@arm.com>, Will Deacon <mailto:will.deacon@arm.com>
> Cc: mailto:linux-arm-kernel at lists.infradead.org, mailto:rric at kernel.org, Jan Glauber <mailto:Jan.Glauber@cavium.com>, "Nair, Jayachandran" <mailto:Jayachandran.Nair@cavium.com>, "Goutham, Sunil" <mailto:Sunil.Goutham@cavium.com>
> Subject: RE: DMA remote memcpy requests
> 
> External Email
> 
> Dear Robin/Jan/Will,
> 
> Any thoughts on what I can do to further diagnose the root cause?
> 
> Best,
> Adam
> 
>> -----Original Message-----
>> From: Robin Murphy <mailto:robin.murphy@arm.com>
>> Sent: 12 October 2018 11:47
>> To: Adam Cottrel <mailto:adam.cottrel@veea.com>; Will Deacon
>> <mailto:will.deacon@arm.com>
>> Cc: mailto:linux-arm-kernel at lists.infradead.org; mailto:rric at kernel.org;
>> mailto:jglauber at cavium.com; mailto:jnair at caviumnetworks.com; mailto:sgoutham at cavium.com
>> Subject: Re: DMA remote memcpy requests
>>
>> On 12/10/18 10:48, Adam Cottrel wrote:
>>> Hi Will,
>>>
>>> Thank you for getting back to me.
>>>
>>>> [+Robin and Cavium folks -- it's usually best to cc people as well as
>>>> mailing the list]
>>> I will remember this for future. Thanks for the advice.
>>>
>>>>> I am using the ATH10K on Linux 14.4 with an Arm Cavium processor.
>>>>> During heavy loading, I am seeing that target initiated DMA requests
>>>>> are being silently dropped under extreme IO memory pressure and it
>>>>> is proving very difficult to isolate the root cause.
>>>>
>>>> Is this ThunderX 1 or 2 or something else? Can you reproduce the
>>>> issue with mainline?
>>> I am using:-
>>>  ????????? model = "Cavium ThunderX CN81XX board";
>>>  ????????? compatible = "cavium,thunder-81xx";
>>>
>>> Yes - the issue can be reproduced on the mainline, but here is a link
>>> to the code branch that I am using:-
>>> https://elixir.bootlin.com/linux/v4.14.4/source/drivers/net/wireless/a
>>> th/ath10k
>>>
>>>>> The ATH10K firmware uses the DMA API to set up phy_addr_t pointers
>>>>> (32-bit) which are then copied to a shared ring buffer. The target
>>
>> That's the first alarm bell - phys_addr_t is still going to be 64-bit on any arm64
>> platform. If the device is expecting 32-bit addresses but somehow doesn't
>> have its DMA mask set appropriately, then if you have more than 3GB or so
>> of RAM there's the potential for addresses to get truncated such that the
>> DMA *does* happen, but to the wrong place.
>>
>> However, with SMMU translation enabled (i.e. not just passthrough), then
>> I'd expect that same situation to cause more or less all DMA to fail, so if
>> you've successfully tested that setup it must be something much more
>> subtle :/
>>
>>>>> then initiates the memcpy operation (for target-to-host reads), but
>>>>> I do not have any means of debugging the target directly, and so I
>>>>> am looking for software hooks on the host that might help debug this
>>>>> complex
>>>> problem.
>>>>
>>>> How does the firmware use the DMA API, or are you referring to a
>>>> driver? If the latter, could you point us to the code, please? Is it
>>>> using the streaming API, or is this a coherent allocation?
>>> The code is using the ARM64 DMA API. It cuts corners in places (!!) but for
>> the most part, it follows the rules. In local tests, I have added memory
>> barriers (e.g. dmb(SY)) and even put in low-level flush/invalidate calls (DC
>> CIVAC) to try and eliminate cache-coherency type problems.
>>>
>>> The receive fault can be observed in the Rx handler which can be found
>>> on line 528 of ce.c:-
>>> https://elixir.bootlin.com/linux/v4.14.4/source/drivers/net/wireless/a
>>> th/ath10k/ce.c
>>>
>>> The memory is allocated by the Rx post buffer function which is on
>>> line 760 of pci.c:-
>>> https://elixir.bootlin.com/linux/v4.14.4/source/drivers/net/wireless/a
>>> th/ath10k/pci.c
>>>
>>> To better observe the fault, I made the following change:-
>>>  ?? + On allocation, I use memset to clear the skb->data (pci.c::770)
>>>  ?? + On receive, I check that the data is not zero (ce.c::555)
>>>  ?? + If the data is not yet written, I exit the Rx IRQ handler and try again.
>>>
>>> In tests, the code works as expected under normal operation, however
>> once I start to simulate a heavy memory pressure situation then the Rx
>> handler starts to fail. This failure (if allowed to continue) will eventually tear
>> down the entire module and crash the target firmware because presumably
>> they are seeing similar dropouts on the transmit path.
>>>
>>> When the fault is happening, if I poll the target registers (e.g. write
>> counters over MMIO) I can see that they are still sending us new messages.
>> In other words, they have silently failed to send the data, or rather we have
>> silently failed to accept the memory copy. I am not able to access the target
>> firmware directly, but I have been reliably informed that the DMA memcpy
>> operation is initiated by the target.
>>>
>>> My memory pressure test uses a large dd copy to create a lot of dirty
>> memory pages. This always creates the fault, however without any memory
>> pressure the code runs beautifully...
>>
>> Are you able to characterise whether it's actually the memory pressure itself
>> that changes the behaviour (e.g. difficulty in allocating new SKBs), or is it just
>> that there's suddenly a lot more work going on in general? Those aren't
>> exactly the most powerful CPU cores, and with only
>> 2 or 4 of them it doesn't seem impossible that the system could simply get
>> loaded to the point where it can't keep up and starts dropping things on the
>> floor.
>>
>> Robin.
>>
>>>>> Please can someone explain the low-level operation of DMA once it
>>>>> becomes a target initiated memcpy function?
>>>>
>>>> I think we need a better handle on the issue first.
>>>
>>> I fully agree - please tell me what you want to know :-D
>>>
>>>>> p.s. I have tested with and without the IOMMU, and I have eliminated
>>>>> issues such as cache coherency being the root cause.
>>>>
>>>> Right, not sure how the SMMU would help here.
>>>
>>> Understood, and thanks for taking the time to reply, and I look forward to
>> hearing your thoughts as I would like to fix this issue once and for all.
>>>
>>> Best,
>>> Adam
>>>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* DMA remote memcpy requests
  2018-10-15 15:51               ` Adam Cottrel
@ 2018-10-18 15:36                 ` Adam Cottrel
  2018-10-22 14:28                   ` Jan Glauber
  0 siblings, 1 reply; 18+ messages in thread
From: Adam Cottrel @ 2018-10-18 15:36 UTC (permalink / raw)
  To: linux-arm-kernel

Dear Jan,

Sorry for the delay in getting back to you - the patch took longer than expected to finish.

 $ cat /proc/config.gz | gunzip | grep ERRATUM
CONFIG_ARM64_ERRATUM_826319=y
CONFIG_ARM64_ERRATUM_827319=y
CONFIG_ARM64_ERRATUM_824069=y
CONFIG_ARM64_ERRATUM_819472=y
CONFIG_ARM64_ERRATUM_832075=y
CONFIG_ARM64_ERRATUM_834220=y
CONFIG_ARM64_ERRATUM_845719=y
CONFIG_ARM64_ERRATUM_843419=y
CONFIG_CAVIUM_ERRATUM_22375=y
CONFIG_CAVIUM_ERRATUM_23144=y
CONFIG_CAVIUM_ERRATUM_23154=y
CONFIG_CAVIUM_ERRATUM_27456=y
CONFIG_CAVIUM_ERRATUM_28168=y                               <--------------------- HERE!!
CONFIG_CAVIUM_ERRATUM_30115=y
CONFIG_QCOM_FALKOR_ERRATUM_1003=y
CONFIG_QCOM_FALKOR_ERRATUM_1009=y
CONFIG_QCOM_QDF2400_ERRATUM_0065=y
CONFIG_FSL_ERRATUM_A008585=y
CONFIG_HISILICON_ERRATUM_161010101=y
CONFIG_ARM64_ERRATUM_858921=y

As you can see, the fix is enabled, but when I test, it is not making any difference to the original issue. The ath10k driver is still dropping inward DMA under memory pressure.

As an aside, I had to make one small change to the patch due to differences between earlier kernel versions. Please see cavium.diff attached. Is this an acceptable change?

Before we discount this as being a fix, please can you tell me how I can prove that the patch is actually working on my platform?

Best,
Adam

> -----Original Message-----
> From: linux-arm-kernel <linux-arm-kernel-bounces@lists.infradead.org> On
> Behalf Of Adam Cottrel
> Sent: 15 October 2018 16:51
> To: Jan Glauber <Jan.Glauber@cavium.com>
> Cc: Nair, Jayachandran <Jayachandran.Nair@cavium.com>; rric at kernel.org;
> Goutham, Sunil <Sunil.Goutham@cavium.com>; Will Deacon
> <will.deacon@arm.com>; Robin Murphy <robin.murphy@arm.com>; linux-
> arm-kernel at lists.infradead.org
> Subject: RE: DMA remote memcpy requests
> 
> Dear Jan,
> 
> > I'm not sure where that point would be where DMA request could be lost
> > here.
> > The MMC and PCIe only meet in the NCB (near coprocessor bus) which
> > goes to the Coherent memory interconnect and L2 cache.
> >
> > I've looked for any known errata but didn't find anything that would
> > match your problem.
> 
> For the purposes of debug, is it possible for me to turn off the MMC? Or the
> L2 cache? Or put it into pass through mode? Or get any kind of stack trace on
> its operation?
> 
> Best,
> Adam
> 
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cavium.diff
Type: application/octet-stream
Size: 741 bytes
Desc: cavium.diff
URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20181018/b2b6c895/attachment.obj>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* DMA remote memcpy requests
  2018-10-18 15:36                 ` Adam Cottrel
@ 2018-10-22 14:28                   ` Jan Glauber
  2018-10-22 14:39                     ` Adam Cottrel
  0 siblings, 1 reply; 18+ messages in thread
From: Jan Glauber @ 2018-10-22 14:28 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Oct 18, 2018 at 03:36:25PM +0000, Adam Cottrel wrote:
> Dear Jan,
> 
> Sorry for the delay in getting back to you - the patch took longer than expected to finish.
> 
>  $ cat /proc/config.gz | gunzip | grep ERRATUM
> CONFIG_ARM64_ERRATUM_826319=y
> CONFIG_ARM64_ERRATUM_827319=y
> CONFIG_ARM64_ERRATUM_824069=y
> CONFIG_ARM64_ERRATUM_819472=y
> CONFIG_ARM64_ERRATUM_832075=y
> CONFIG_ARM64_ERRATUM_834220=y
> CONFIG_ARM64_ERRATUM_845719=y
> CONFIG_ARM64_ERRATUM_843419=y
> CONFIG_CAVIUM_ERRATUM_22375=y
> CONFIG_CAVIUM_ERRATUM_23144=y
> CONFIG_CAVIUM_ERRATUM_23154=y
> CONFIG_CAVIUM_ERRATUM_27456=y
> CONFIG_CAVIUM_ERRATUM_28168=y                               <--------------------- HERE!!
> CONFIG_CAVIUM_ERRATUM_30115=y
> CONFIG_QCOM_FALKOR_ERRATUM_1003=y
> CONFIG_QCOM_FALKOR_ERRATUM_1009=y
> CONFIG_QCOM_QDF2400_ERRATUM_0065=y
> CONFIG_FSL_ERRATUM_A008585=y
> CONFIG_HISILICON_ERRATUM_161010101=y
> CONFIG_ARM64_ERRATUM_858921=y
> 
> As you can see, the fix is enabled, but when I test, it is not making any difference to the original issue. The ath10k driver is still dropping inward DMA under memory pressure.

OK, it was just a guess from my side.

> As an aside, I had to make one small change to the patch due to differences between earlier kernel versions. Please see cavium.diff attached. Is this an acceptable change?

Your resolution looks fine.

> Before we discount this as being a fix, please can you tell me how I can prove that the patch is actually working on my platform?

It looks like it doesn't solve your issue. I just wanted to rule this
one out.

Have you tried the other suggestion of completely turning of the SMMU?

Regards,
Jan


> Best,
> Adam
> 
> > -----Original Message-----
> > From: linux-arm-kernel <linux-arm-kernel-bounces@lists.infradead.org> On
> > Behalf Of Adam Cottrel
> > Sent: 15 October 2018 16:51
> > To: Jan Glauber <Jan.Glauber@cavium.com>
> > Cc: Nair, Jayachandran <Jayachandran.Nair@cavium.com>; rric at kernel.org;
> > Goutham, Sunil <Sunil.Goutham@cavium.com>; Will Deacon
> > <will.deacon@arm.com>; Robin Murphy <robin.murphy@arm.com>; linux-
> > arm-kernel at lists.infradead.org
> > Subject: RE: DMA remote memcpy requests
> >
> > Dear Jan,
> >
> > > I'm not sure where that point would be where DMA request could be lost
> > > here.
> > > The MMC and PCIe only meet in the NCB (near coprocessor bus) which
> > > goes to the Coherent memory interconnect and L2 cache.
> > >
> > > I've looked for any known errata but didn't find anything that would
> > > match your problem.
> >
> > For the purposes of debug, is it possible for me to turn off the MMC? Or the
> > L2 cache? Or put it into pass through mode? Or get any kind of stack trace on
> > its operation?
> >
> > Best,
> > Adam
> >
> > _______________________________________________
> > linux-arm-kernel mailing list
> > linux-arm-kernel at lists.infradead.org
> > http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 18+ messages in thread

* DMA remote memcpy requests
  2018-10-22 14:28                   ` Jan Glauber
@ 2018-10-22 14:39                     ` Adam Cottrel
  2018-10-22 15:33                       ` Jan Glauber
  0 siblings, 1 reply; 18+ messages in thread
From: Adam Cottrel @ 2018-10-22 14:39 UTC (permalink / raw)
  To: linux-arm-kernel

Dear Jan,

As I understand this, to turn off I need to re-compile without SMMU driver support via the kernel config flags.

However, if I turn the SMU off, how will the system cope?

Best,
Adam

> -----Original Message-----
> From: Jan Glauber <Jan.Glauber@cavium.com>
> Sent: 22 October 2018 15:29
> To: Adam Cottrel <adam.cottrel@veea.com>
> Cc: Nair, Jayachandran <Jayachandran.Nair@cavium.com>; rric at kernel.org;
> Goutham, Sunil <Sunil.Goutham@cavium.com>; Will Deacon
> <will.deacon@arm.com>; Robin Murphy <robin.murphy@arm.com>; linux-
> arm-kernel at lists.infradead.org
> Subject: Re: DMA remote memcpy requests
> 
> On Thu, Oct 18, 2018 at 03:36:25PM +0000, Adam Cottrel wrote:
> > Dear Jan,
> >
> > Sorry for the delay in getting back to you - the patch took longer than
> expected to finish.
> >
> >  $ cat /proc/config.gz | gunzip | grep ERRATUM
> > CONFIG_ARM64_ERRATUM_826319=y
> CONFIG_ARM64_ERRATUM_827319=y
> > CONFIG_ARM64_ERRATUM_824069=y
> CONFIG_ARM64_ERRATUM_819472=y
> > CONFIG_ARM64_ERRATUM_832075=y
> CONFIG_ARM64_ERRATUM_834220=y
> > CONFIG_ARM64_ERRATUM_845719=y
> CONFIG_ARM64_ERRATUM_843419=y
> > CONFIG_CAVIUM_ERRATUM_22375=y
> CONFIG_CAVIUM_ERRATUM_23144=y
> > CONFIG_CAVIUM_ERRATUM_23154=y
> CONFIG_CAVIUM_ERRATUM_27456=y
> > CONFIG_CAVIUM_ERRATUM_28168=y                               <---------------------
> HERE!!
> > CONFIG_CAVIUM_ERRATUM_30115=y
> > CONFIG_QCOM_FALKOR_ERRATUM_1003=y
> > CONFIG_QCOM_FALKOR_ERRATUM_1009=y
> > CONFIG_QCOM_QDF2400_ERRATUM_0065=y
> > CONFIG_FSL_ERRATUM_A008585=y
> > CONFIG_HISILICON_ERRATUM_161010101=y
> > CONFIG_ARM64_ERRATUM_858921=y
> >
> > As you can see, the fix is enabled, but when I test, it is not making any
> difference to the original issue. The ath10k driver is still dropping inward DMA
> under memory pressure.
> 
> OK, it was just a guess from my side.
> 
> > As an aside, I had to make one small change to the patch due to differences
> between earlier kernel versions. Please see cavium.diff attached. Is this an
> acceptable change?
> 
> Your resolution looks fine.
> 
> > Before we discount this as being a fix, please can you tell me how I can
> prove that the patch is actually working on my platform?
> 
> It looks like it doesn't solve your issue. I just wanted to rule this one out.
> 
> Have you tried the other suggestion of completely turning of the SMMU?
> 
> Regards,
> Jan
> 
> 
> > Best,
> > Adam
> >
> > > -----Original Message-----
> > > From: linux-arm-kernel
> > > <linux-arm-kernel-bounces@lists.infradead.org> On Behalf Of Adam
> > > Cottrel
> > > Sent: 15 October 2018 16:51
> > > To: Jan Glauber <Jan.Glauber@cavium.com>
> > > Cc: Nair, Jayachandran <Jayachandran.Nair@cavium.com>;
> > > rric at kernel.org; Goutham, Sunil <Sunil.Goutham@cavium.com>; Will
> > > Deacon <will.deacon@arm.com>; Robin Murphy
> <robin.murphy@arm.com>;
> > > linux- arm-kernel at lists.infradead.org
> > > Subject: RE: DMA remote memcpy requests
> > >
> > > Dear Jan,
> > >
> > > > I'm not sure where that point would be where DMA request could be
> > > > lost here.
> > > > The MMC and PCIe only meet in the NCB (near coprocessor bus) which
> > > > goes to the Coherent memory interconnect and L2 cache.
> > > >
> > > > I've looked for any known errata but didn't find anything that
> > > > would match your problem.
> > >
> > > For the purposes of debug, is it possible for me to turn off the
> > > MMC? Or the
> > > L2 cache? Or put it into pass through mode? Or get any kind of stack
> > > trace on its operation?
> > >
> > > Best,
> > > Adam
> > >
> > > _______________________________________________
> > > linux-arm-kernel mailing list
> > > linux-arm-kernel at lists.infradead.org
> > > http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* DMA remote memcpy requests
  2018-10-22 14:39                     ` Adam Cottrel
@ 2018-10-22 15:33                       ` Jan Glauber
  0 siblings, 0 replies; 18+ messages in thread
From: Jan Glauber @ 2018-10-22 15:33 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Oct 22, 2018 at 02:39:00PM +0000, Adam Cottrel wrote:
> Dear Jan,
> 
> As I understand this, to turn off I need to re-compile without SMMU driver support via the kernel config flags.
> 
> However, if I turn the SMU off, how will the system cope?

The SMMU should be optional, but I didn't try it myselfs so bad things
might happen...

Regards,
Jan

> Best,
> Adam
> 
> > -----Original Message-----
> > From: Jan Glauber <Jan.Glauber@cavium.com>
> > Sent: 22 October 2018 15:29
> > To: Adam Cottrel <adam.cottrel@veea.com>
> > Cc: Nair, Jayachandran <Jayachandran.Nair@cavium.com>; rric at kernel.org;
> > Goutham, Sunil <Sunil.Goutham@cavium.com>; Will Deacon
> > <will.deacon@arm.com>; Robin Murphy <robin.murphy@arm.com>; linux-
> > arm-kernel at lists.infradead.org
> > Subject: Re: DMA remote memcpy requests
> >
> > On Thu, Oct 18, 2018 at 03:36:25PM +0000, Adam Cottrel wrote:
> > > Dear Jan,
> > >
> > > Sorry for the delay in getting back to you - the patch took longer than
> > expected to finish.
> > >
> > >  $ cat /proc/config.gz | gunzip | grep ERRATUM
> > > CONFIG_ARM64_ERRATUM_826319=y
> > CONFIG_ARM64_ERRATUM_827319=y
> > > CONFIG_ARM64_ERRATUM_824069=y
> > CONFIG_ARM64_ERRATUM_819472=y
> > > CONFIG_ARM64_ERRATUM_832075=y
> > CONFIG_ARM64_ERRATUM_834220=y
> > > CONFIG_ARM64_ERRATUM_845719=y
> > CONFIG_ARM64_ERRATUM_843419=y
> > > CONFIG_CAVIUM_ERRATUM_22375=y
> > CONFIG_CAVIUM_ERRATUM_23144=y
> > > CONFIG_CAVIUM_ERRATUM_23154=y
> > CONFIG_CAVIUM_ERRATUM_27456=y
> > > CONFIG_CAVIUM_ERRATUM_28168=y                               <---------------------
> > HERE!!
> > > CONFIG_CAVIUM_ERRATUM_30115=y
> > > CONFIG_QCOM_FALKOR_ERRATUM_1003=y
> > > CONFIG_QCOM_FALKOR_ERRATUM_1009=y
> > > CONFIG_QCOM_QDF2400_ERRATUM_0065=y
> > > CONFIG_FSL_ERRATUM_A008585=y
> > > CONFIG_HISILICON_ERRATUM_161010101=y
> > > CONFIG_ARM64_ERRATUM_858921=y
> > >
> > > As you can see, the fix is enabled, but when I test, it is not making any
> > difference to the original issue. The ath10k driver is still dropping inward DMA
> > under memory pressure.
> >
> > OK, it was just a guess from my side.
> >
> > > As an aside, I had to make one small change to the patch due to differences
> > between earlier kernel versions. Please see cavium.diff attached. Is this an
> > acceptable change?
> >
> > Your resolution looks fine.
> >
> > > Before we discount this as being a fix, please can you tell me how I can
> > prove that the patch is actually working on my platform?
> >
> > It looks like it doesn't solve your issue. I just wanted to rule this one out.
> >
> > Have you tried the other suggestion of completely turning of the SMMU?
> >
> > Regards,
> > Jan
> >
> >
> > > Best,
> > > Adam
> > >
> > > > -----Original Message-----
> > > > From: linux-arm-kernel
> > > > <linux-arm-kernel-bounces@lists.infradead.org> On Behalf Of Adam
> > > > Cottrel
> > > > Sent: 15 October 2018 16:51
> > > > To: Jan Glauber <Jan.Glauber@cavium.com>
> > > > Cc: Nair, Jayachandran <Jayachandran.Nair@cavium.com>;
> > > > rric at kernel.org; Goutham, Sunil <Sunil.Goutham@cavium.com>; Will
> > > > Deacon <will.deacon@arm.com>; Robin Murphy
> > <robin.murphy@arm.com>;
> > > > linux- arm-kernel at lists.infradead.org
> > > > Subject: RE: DMA remote memcpy requests
> > > >
> > > > Dear Jan,
> > > >
> > > > > I'm not sure where that point would be where DMA request could be
> > > > > lost here.
> > > > > The MMC and PCIe only meet in the NCB (near coprocessor bus) which
> > > > > goes to the Coherent memory interconnect and L2 cache.
> > > > >
> > > > > I've looked for any known errata but didn't find anything that
> > > > > would match your problem.
> > > >
> > > > For the purposes of debug, is it possible for me to turn off the
> > > > MMC? Or the
> > > > L2 cache? Or put it into pass through mode? Or get any kind of stack
> > > > trace on its operation?
> > > >
> > > > Best,
> > > > Adam
> > > >
> > > > _______________________________________________
> > > > linux-arm-kernel mailing list
> > > > linux-arm-kernel at lists.infradead.org
> > > > http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
> >

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2018-10-22 15:33 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-10-11  7:28 DMA remote memcpy requests Adam Cottrel
2018-10-12  9:09 ` Will Deacon
2018-10-12  9:48   ` Adam Cottrel
2018-10-12 10:46     ` Robin Murphy
2018-10-12 11:06       ` Adam Cottrel
2018-10-15 14:34       ` Adam Cottrel
2018-10-15 15:09         ` Jan Glauber
2018-10-15 15:24           ` Adam Cottrel
2018-10-15 15:39             ` Jan Glauber
2018-10-15 15:51               ` Adam Cottrel
2018-10-18 15:36                 ` Adam Cottrel
2018-10-22 14:28                   ` Jan Glauber
2018-10-22 14:39                     ` Adam Cottrel
2018-10-22 15:33                       ` Jan Glauber
     [not found]         ` <DM6PR07MB4923F3328079199090D6D2CA9EFE0@DM6PR07MB4923.namprd07.prod.outlook.com>
2018-10-16 16:52           ` Adam Cottrel
2018-10-16 17:08             ` Robin Murphy
2018-10-12 11:03     ` Jan Glauber
2018-10-12 11:07       ` Adam Cottrel

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.