All of lore.kernel.org
 help / color / mirror / Atom feed
* [patch 0/6] dma: edma: Provide granular residue accounting
@ 2014-04-17 14:40 Thomas Gleixner
  2014-04-17 14:40 ` [patch 2/6] dma: edma: Check the current decriptor first in tx_status() Thomas Gleixner
                   ` (6 more replies)
  0 siblings, 7 replies; 21+ messages in thread
From: Thomas Gleixner @ 2014-04-17 14:40 UTC (permalink / raw)
  To: linux-arm-kernel

The residue reporting in edma_tx_status() is broken and the
implementation is beyond silly. See patch 1/n and 2/n

The following series addresses this and adds on top granular
accounting to the driver.

The motivation behind this is that I tried to get the DMA mode of the
DCAN peripheral in beaglebone working. The DCAN device driver
implements a network device via the net/can infrastructure.

So the obvious choice would have been scatter gather lists. But that
has the same issue as the stupid "FIFO" implementation of the CAN IP.

Once the last SG element is processed, the EDMA interrupt needs to
establish the next SG list. In fastest mode the CAN packets come with
less than 50us over the wire and there is no buffering in the CAN
IP. So if the interrupt gets delayed a bit we can lose packets. With
enough load on the bus its observable.

The next obstacle was the missing per SG element reporting. We really
can't wait for a full SG list for notification.

I couldn't be bothered to fix this, as this would defeat the whole
idea of NAPI: disable interrupts and poll the device until all pending
packets are done.

So we'd trade the CAN interrupt per packet against the EDMA interrupt
per packet. And the notification which is done via a tasklet is not
really helpful either.

Interrupt
     schedule tasklet

  softirq
     run tasklet
     	 napi_schedule()
	    raise RX softirq

     run rx-action 
  	 poll one packet
	 napi_complete()

So even if another interrupt comes in before we leave the NAPI poll
there is no way that we can see it as it merily schedules the
tasklet. Not what you really want.

The same applies to cyclic buffers where the period is one CAN
frame. That actually works without the packet loss due to SG reload.

But the interrupt load is amazing and with only max 20 periods an
overrun is to observe when the softirq goes into the ksoftirq
thread. It just takes 1ms away from the CPU to happen, which is less
than a full timeslot with HZ=250.

Next idea was to utilize a single larger cyclic buffer and avoid the
EDMA interrupt alltogether as the CAN chip can signal the state change
via its own interrupt which is then handled simply via the normal NAPI
mechanisms. Now the CAN IP has no packet counter so I decided to use
dma_tx_status to track the DMA progress.

That failed to work because the residue reporting was only descriptor
granular and returned even the wrong size for the circular buffer.

So I digged into the details and found a rather simple solution to
make granular accounting useable for both circular and SG style work.

With that the DCAN DMA works reliably and the system load decreases
significantly as the main contributor to that (the slow read from the
DCAN interface) is gone.

As a side note:

The DCAN readout is 4 consecutive 32bit registers. The only way I got
that working is by configuring the engine with:

       cfg.direction = DMA_DEV_TO_MEM;
       cfg.src_addr_width = 16;
       cfg.src_maxburst = 1;

With
       cfg.src_addr_width = 4;
       cfg.src_maxburst = 4;

it reads just 4 times the first register.

I have my doubts that this is correct API wise, so it'd be nice if
someone could enlighten me on that.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [patch 1/6] dma: edma: Sanitize residue reporting
  2014-04-17 14:40 [patch 0/6] dma: edma: Provide granular residue accounting Thomas Gleixner
  2014-04-17 14:40 ` [patch 2/6] dma: edma: Check the current decriptor first in tx_status() Thomas Gleixner
@ 2014-04-17 14:40 ` Thomas Gleixner
  2014-04-18  0:02   ` Joel Fernandes
  2014-04-17 14:40 ` [patch 3/6] dma: edma: Create private pset struct Thomas Gleixner
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 21+ messages in thread
From: Thomas Gleixner @ 2014-04-17 14:40 UTC (permalink / raw)
  To: linux-arm-kernel

An embedded and charset-unspecified text was scrubbed...
Name: dma-edma-sanitize-residue-reporting.patch
URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20140417/17c99cd7/attachment.ksh>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [patch 2/6] dma: edma: Check the current decriptor first in tx_status()
  2014-04-17 14:40 [patch 0/6] dma: edma: Provide granular residue accounting Thomas Gleixner
@ 2014-04-17 14:40 ` Thomas Gleixner
  2014-04-17 14:40 ` [patch 1/6] dma: edma: Sanitize residue reporting Thomas Gleixner
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 21+ messages in thread
From: Thomas Gleixner @ 2014-04-17 14:40 UTC (permalink / raw)
  To: linux-arm-kernel

An embedded and charset-unspecified text was scrubbed...
Name: dma-edma-tx-status-check-current-desc-first.patch
URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20140417/135fd61c/attachment.ksh>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [patch 3/6] dma: edma: Create private pset struct
  2014-04-17 14:40 [patch 0/6] dma: edma: Provide granular residue accounting Thomas Gleixner
  2014-04-17 14:40 ` [patch 2/6] dma: edma: Check the current decriptor first in tx_status() Thomas Gleixner
  2014-04-17 14:40 ` [patch 1/6] dma: edma: Sanitize residue reporting Thomas Gleixner
@ 2014-04-17 14:40 ` Thomas Gleixner
  2014-04-17 23:56   ` Joel Fernandes
  2014-04-17 14:40 ` [patch 5/6] edma: Make reading the position of active channels work Thomas Gleixner
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 21+ messages in thread
From: Thomas Gleixner @ 2014-04-17 14:40 UTC (permalink / raw)
  To: linux-arm-kernel

An embedded and charset-unspecified text was scrubbed...
Name: dma-edma-create-private-pset-struct.patch
URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20140417/a145045c/attachment.ksh>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [patch 5/6] edma: Make reading the position of active channels work
  2014-04-17 14:40 [patch 0/6] dma: edma: Provide granular residue accounting Thomas Gleixner
                   ` (2 preceding siblings ...)
  2014-04-17 14:40 ` [patch 3/6] dma: edma: Create private pset struct Thomas Gleixner
@ 2014-04-17 14:40 ` Thomas Gleixner
  2014-04-18  0:47   ` Joel Fernandes
  2014-04-17 14:40 ` [patch 4/6] dma: edma: Store transfer data in edma_desc and edma_pset Thomas Gleixner
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 21+ messages in thread
From: Thomas Gleixner @ 2014-04-17 14:40 UTC (permalink / raw)
  To: linux-arm-kernel

An embedded and charset-unspecified text was scrubbed...
Name: arm-edma-make-read-position-useful.patch
URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20140417/362699e1/attachment.ksh>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [patch 4/6] dma: edma: Store transfer data in edma_desc and edma_pset
  2014-04-17 14:40 [patch 0/6] dma: edma: Provide granular residue accounting Thomas Gleixner
                   ` (3 preceding siblings ...)
  2014-04-17 14:40 ` [patch 5/6] edma: Make reading the position of active channels work Thomas Gleixner
@ 2014-04-17 14:40 ` Thomas Gleixner
  2014-04-17 14:40 ` [patch 6/6] dma: edma: Provide granular accounting Thomas Gleixner
  2014-04-17 20:07 ` [patch 0/6] dma: edma: Provide granular residue accounting Russell King - ARM Linux
  6 siblings, 0 replies; 21+ messages in thread
From: Thomas Gleixner @ 2014-04-17 14:40 UTC (permalink / raw)
  To: linux-arm-kernel

An embedded and charset-unspecified text was scrubbed...
Name: dma-edma-store-per-pset-info.patch
URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20140417/462ab06d/attachment.ksh>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [patch 6/6] dma: edma: Provide granular accounting
  2014-04-17 14:40 [patch 0/6] dma: edma: Provide granular residue accounting Thomas Gleixner
                   ` (4 preceding siblings ...)
  2014-04-17 14:40 ` [patch 4/6] dma: edma: Store transfer data in edma_desc and edma_pset Thomas Gleixner
@ 2014-04-17 14:40 ` Thomas Gleixner
  2014-04-18  0:45   ` Joel Fernandes
  2014-04-17 20:07 ` [patch 0/6] dma: edma: Provide granular residue accounting Russell King - ARM Linux
  6 siblings, 1 reply; 21+ messages in thread
From: Thomas Gleixner @ 2014-04-17 14:40 UTC (permalink / raw)
  To: linux-arm-kernel

An embedded and charset-unspecified text was scrubbed...
Name: dma-edma-provide-granular-accounting.patch
URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20140417/5e6c058f/attachment.ksh>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [patch 0/6] dma: edma: Provide granular residue accounting
  2014-04-17 14:40 [patch 0/6] dma: edma: Provide granular residue accounting Thomas Gleixner
                   ` (5 preceding siblings ...)
  2014-04-17 14:40 ` [patch 6/6] dma: edma: Provide granular accounting Thomas Gleixner
@ 2014-04-17 20:07 ` Russell King - ARM Linux
  2014-04-17 20:31   ` Thomas Gleixner
  6 siblings, 1 reply; 21+ messages in thread
From: Russell King - ARM Linux @ 2014-04-17 20:07 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Apr 17, 2014 at 02:40:43PM -0000, Thomas Gleixner wrote:
> The next obstacle was the missing per SG element reporting. We really
> can't wait for a full SG list for notification.

Err, dmaengine doesn't have per-SG element reporting.

What it does allow is several transactions to be submitted consecutively,
so that the DMA engine can move to the next transaction once the previous
one has been submitted.

Where it's important that this happens with the minimum of delay, there's
nothing in the API that prevents the hardware scatterlist of the previous
transaction being linked directly to the following transaction, provided
of course the hardware can do that.

Many DMA engine implementations are just lazy - they implement stuff as:
setup hardware, run scatter list, get to the end, raise interrupt.  Fire
off tasklet.  Tasklet runs, calls the callback, checks to see if there's
another transaction, sets up hardware for the next one.  That (as you
would expect) gives quite a high latency to the following transaction.

I've coded at least one DMA engine driver to start the next transaction
immediately that the previous one completes, before the tasklet is run.
As I say above, there's really no reason to even wait for the interrupt...
if people can be bothered to think about all the implications that brings
(f.e. reporting completion status, and how many bytes remaining of a
transaction, etc.)

> So we'd trade the CAN interrupt per packet against the EDMA interrupt
> per packet. And the notification which is done via a tasklet is not
> really helpful either.

Again, the DMA engine API allows for the reception of the interrupt to
be turned off (not every DMA engine implementation supports it though.)
However, no interrupts means no callbacks (or an implementation may
decide that the presence of a callback means that you do want interrupts -
that's because a number of buggy drivers forget to give DMA_PREP_INTERRUPT.)

However, not using the callbacks then means you need to poll for
completion - that's where tracking the cookies for the submitted
transactions, calling dmaengine_tx_status() with a state argument, or
dma_async_is_tx_complete() then allows you to use dma_async_is_complete()
to quickly ascertain whether any of the other cookies have completed.

What you can't do though is decide whether you want an interrupt after
submission.

> The DCAN readout is 4 consecutive 32bit registers. The only way I got
> that working is by configuring the engine with:
> 
>        cfg.direction = DMA_DEV_TO_MEM;
>        cfg.src_addr_width = 16;

Hmm.  This is an enum, and the expected values are limited to 1, 2, 4,
and 8 for 8-bit, 16-bit, 32-bit and 64-bit respectively.  The width is
supposed to represent the width of the access used on the bus to the
peripheral.

So, if we did have a DMA engine which supported 128-bit accesses, it
would end up doing a 128-bit access for the above...

>        cfg.src_maxburst = 1;

This is supposed to indicate the maximum number of accesses in a burst
to the same register.  It kind of makes sense, but I'm not sure it's in
the spirit of the DMA engine API.  I think strictly, to have a DMA engine
perform 4 consecutive 32-bit accesses is quite a special requirement that
we don't have a way to represent it at present.

> With
>        cfg.src_addr_width = 4;
>        cfg.src_maxburst = 4;
> 
> it reads just 4 times the first register.

If it's just that the FIFO is spread over 4 consecutive locations
(effectively due to not decoding bits 2,3 of the address bus for the
register) then reading the first register four times is just as
acceptable as reading them consecutively.

The reason that kind of thing was done in old days was to allow the
ARM ldmia/stmia instructions to be used to access FIFOs, thereby
allowing multiple words to be transferred with a single instruction.
I can't believe that there's still people designing for that
especially if they have a DMA engine...

-- 
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [patch 0/6] dma: edma: Provide granular residue accounting
  2014-04-17 20:07 ` [patch 0/6] dma: edma: Provide granular residue accounting Russell King - ARM Linux
@ 2014-04-17 20:31   ` Thomas Gleixner
  2014-04-17 20:38     ` Russell King - ARM Linux
  0 siblings, 1 reply; 21+ messages in thread
From: Thomas Gleixner @ 2014-04-17 20:31 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 17 Apr 2014, Russell King - ARM Linux wrote:

> On Thu, Apr 17, 2014 at 02:40:43PM -0000, Thomas Gleixner wrote:
> > The next obstacle was the missing per SG element reporting. We really
> > can't wait for a full SG list for notification.
> 
> Err, dmaengine doesn't have per-SG element reporting.

enum dma_residue_granularity {
        DMA_RESIDUE_GRANULARITY_DESCRIPTOR = 0,
	DMA_RESIDUE_GRANULARITY_SEGMENT = 1,
        DMA_RESIDUE_GRANULARITY_BURST = 2,
};

tells a different story.
 
> What it does allow is several transactions to be submitted consecutively,
> so that the DMA engine can move to the next transaction once the previous
> one has been submitted.
> 
> Where it's important that this happens with the minimum of delay, there's
> nothing in the API that prevents the hardware scatterlist of the previous
> transaction being linked directly to the following transaction, provided
> of course the hardware can do that.

Right. I hoped that this would be the case, as you would expect from
DMA, but as you observed correctly:
 
> Many DMA engine implementations are just lazy - they implement stuff as:
> setup hardware, run scatter list, get to the end, raise interrupt.  Fire
> off tasklet.  Tasklet runs, calls the callback, checks to see if there's
> another transaction, sets up hardware for the next one.  That (as you
> would expect) gives quite a high latency to the following transaction.

Yep. It's just unusable for low latency applications.
 
> I've coded at least one DMA engine driver to start the next transaction
> immediately that the previous one completes, before the tasklet is run.
> As I say above, there's really no reason to even wait for the interrupt...
> if people can be bothered to think about all the implications that brings
> (f.e. reporting completion status, and how many bytes remaining of a
> transaction, etc.)

The EDMA HW would allow that as well, but the driver is definitely not
up to it and to be honest I didnt have the cycles to rewrite it from
scratch as that would be the only way to make that work.
 
> If it's just that the FIFO is spread over 4 consecutive locations
> (effectively due to not decoding bits 2,3 of the address bus for the
> register) then reading the first register four times is just as
> acceptable as reading them consecutively.

It's not a FIFO. It's four different consecutive registers, which are
DMA readable. And you need to read all of them...
 
> The reason that kind of thing was done in old days was to allow the
> ARM ldmia/stmia instructions to be used to access FIFOs, thereby
> allowing multiple words to be transferred with a single instruction.
> I can't believe that there's still people designing for that
> especially if they have a DMA engine...

In that case it's a magic DMA extension superglued beside the already
horrible register interface of that particular IP block.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [patch 0/6] dma: edma: Provide granular residue accounting
  2014-04-17 20:31   ` Thomas Gleixner
@ 2014-04-17 20:38     ` Russell King - ARM Linux
  2014-04-17 21:14       ` Thomas Gleixner
  0 siblings, 1 reply; 21+ messages in thread
From: Russell King - ARM Linux @ 2014-04-17 20:38 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Apr 17, 2014 at 10:31:44PM +0200, Thomas Gleixner wrote:
> On Thu, 17 Apr 2014, Russell King - ARM Linux wrote:
> 
> > On Thu, Apr 17, 2014 at 02:40:43PM -0000, Thomas Gleixner wrote:
> > > The next obstacle was the missing per SG element reporting. We really
> > > can't wait for a full SG list for notification.
> > 
> > Err, dmaengine doesn't have per-SG element reporting.
> 
> enum dma_residue_granularity {
>         DMA_RESIDUE_GRANULARITY_DESCRIPTOR = 0,
> 	DMA_RESIDUE_GRANULARITY_SEGMENT = 1,
>         DMA_RESIDUE_GRANULARITY_BURST = 2,
> };
> 
> tells a different story.

That's to do with the residue though, not to do with callbacks.

> > If it's just that the FIFO is spread over 4 consecutive locations
> > (effectively due to not decoding bits 2,3 of the address bus for the
> > register) then reading the first register four times is just as
> > acceptable as reading them consecutively.
> 
> It's not a FIFO. It's four different consecutive registers, which are
> DMA readable. And you need to read all of them...
>  
> > The reason that kind of thing was done in old days was to allow the
> > ARM ldmia/stmia instructions to be used to access FIFOs, thereby
> > allowing multiple words to be transferred with a single instruction.
> > I can't believe that there's still people designing for that
> > especially if they have a DMA engine...
> 
> In that case it's a magic DMA extension superglued beside the already
> horrible register interface of that particular IP block.

So it's more a copy-from-peripheral-to-memory - a DMA copy operation
triggered in a similar manner to the DMA slave mode.  There are a
number of use cases for this, but no one has yet put their head above
the parapet to spear-head that cause. :)

-- 
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [patch 0/6] dma: edma: Provide granular residue accounting
  2014-04-17 20:38     ` Russell King - ARM Linux
@ 2014-04-17 21:14       ` Thomas Gleixner
  0 siblings, 0 replies; 21+ messages in thread
From: Thomas Gleixner @ 2014-04-17 21:14 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 17 Apr 2014, Russell King - ARM Linux wrote:

> On Thu, Apr 17, 2014 at 10:31:44PM +0200, Thomas Gleixner wrote:
> > On Thu, 17 Apr 2014, Russell King - ARM Linux wrote:
> > 
> > > On Thu, Apr 17, 2014 at 02:40:43PM -0000, Thomas Gleixner wrote:
> > > > The next obstacle was the missing per SG element reporting. We really
> > > > can't wait for a full SG list for notification.
> > > 
> > > Err, dmaengine doesn't have per-SG element reporting.
> > 
> > enum dma_residue_granularity {
> >         DMA_RESIDUE_GRANULARITY_DESCRIPTOR = 0,
> > 	DMA_RESIDUE_GRANULARITY_SEGMENT = 1,
> >         DMA_RESIDUE_GRANULARITY_BURST = 2,
> > };
> > 
> > tells a different story.
> 
> That's to do with the residue though, not to do with callbacks.

Right. That's what I tripped over. Sorry for mixing up the naming
conventions.

I don't care about per element reporting in terms of callbacks as
that's completely counterproductive if you deal with a network
device.

What I care about is the ability to figure out how many packets have
been transmitted into the DMA buffer.

In the particular case of DCAN the cyclic buffer is the most optimal
one, because if I use skb based SG then I cannot just drop the skb as
is into the network stack. I still need to get the information from
the skb and bring it into the required CAN skb frame format, write it
back and then submit a new SG element. So just reading the data from
the cyclic buffer, formatting it and storing the result into the skb
is way more efficient.
 
> > > If it's just that the FIFO is spread over 4 consecutive locations
> > > (effectively due to not decoding bits 2,3 of the address bus for the
> > > register) then reading the first register four times is just as
> > > acceptable as reading them consecutively.
> > 
> > It's not a FIFO. It's four different consecutive registers, which are
> > DMA readable. And you need to read all of them...
> >  
> > > The reason that kind of thing was done in old days was to allow the
> > > ARM ldmia/stmia instructions to be used to access FIFOs, thereby
> > > allowing multiple words to be transferred with a single instruction.
> > > I can't believe that there's still people designing for that
> > > especially if they have a DMA engine...
> > 
> > In that case it's a magic DMA extension superglued beside the already
> > horrible register interface of that particular IP block.
> 
> So it's more a copy-from-peripheral-to-memory - a DMA copy operation
> triggered in a similar manner to the DMA slave mode.  There are a
> number of use cases for this, but no one has yet put their head above
> the parapet to spear-head that cause. :)

The EDMA and most other DMA engines have HW support for this, we just
have no interface to make use of it. My burst=1/width=16 workaround is
just abusing the internal implementation details of EDMA.

That's why I was asking.

Would extending the dma_slave_config struct by the following fields
make sense?

 struct dma_slave_config {
        enum dma_transfer_direction direction;
        dma_addr_t src_addr;
        dma_addr_t dst_addr;
        enum dma_slave_buswidth src_addr_width;
        enum dma_slave_buswidth dst_addr_width;
	u32 src_maxburst;
        u32 dst_maxburst;
+       u32 src_stride;
+       u32 dst_stride;
        bool device_fc;
        unsigned int slave_id;
 };

The default for these fields would be 0, so no change for existing
implementations. If set to !0 the hardware driver can either bail out
if not supported (in HW or SW) or setup the transfer in the right way
to increment the src/dst field for the burst by stride.

Thoughts?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [patch 3/6] dma: edma: Create private pset struct
  2014-04-17 14:40 ` [patch 3/6] dma: edma: Create private pset struct Thomas Gleixner
@ 2014-04-17 23:56   ` Joel Fernandes
  0 siblings, 0 replies; 21+ messages in thread
From: Joel Fernandes @ 2014-04-17 23:56 UTC (permalink / raw)
  To: linux-arm-kernel

On 04/17/2014 09:40 AM, Thomas Gleixner wrote:
> Preparatory patch to support finer grained accounting.
> 
> Move the edma_params array out of edma_desc so we can add further per
> pset data to it.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  drivers/dma/edma.c |   67 ++++++++++++++++++++++++++++-------------------------
>  1 file changed, 36 insertions(+), 31 deletions(-)
> 
> Index: linux-2.6/drivers/dma/edma.c
> ===================================================================
> --- linux-2.6.orig/drivers/dma/edma.c
> +++ linux-2.6/drivers/dma/edma.c
> @@ -57,6 +57,10 @@
>  #define EDMA_MAX_SLOTS		MAX_NR_SG
>  #define EDMA_DESCRIPTORS	16
>  
> +struct edma_pset {
> +	struct edmacc_param		hwpar;

If its ok, can this be renamed to param instead of hwpar through out the
patch? I feel that's more readable. Thanks.


> +};
> +
>  struct edma_desc {
>  	struct virt_dma_desc		vdesc;
>  	struct list_head		node;
> @@ -65,7 +69,7 @@ struct edma_desc {
>  	int				pset_nr;
>  	int				processed;
>  	u32				residue;
> -	struct edmacc_param		pset[0];
> +	struct edma_pset		pset[0];
>  };
>  
>  struct edma_cc;
> @@ -141,7 +145,7 @@ static void edma_execute(struct edma_cha
>  	/* Write descriptor PaRAM set(s) */
>  	for (i = 0; i < nslots; i++) {
>  		j = i + edesc->processed;
> -		edma_write_slot(echan->slot[i], &edesc->pset[j]);
> +		edma_write_slot(echan->slot[i], &edesc->pset[j].hwpar);
>  		dev_dbg(echan->vchan.chan.device->dev,
>  			"\n pset[%d]:\n"
>  			"  chnum\t%d\n"
> @@ -155,14 +159,14 @@ static void edma_execute(struct edma_cha
>  			"  cidx\t%08x\n"
>  			"  lkrld\t%08x\n",
>  			j, echan->ch_num, echan->slot[i],
> -			edesc->pset[j].opt,
> -			edesc->pset[j].src,
> -			edesc->pset[j].dst,
> -			edesc->pset[j].a_b_cnt,
> -			edesc->pset[j].ccnt,
> -			edesc->pset[j].src_dst_bidx,
> -			edesc->pset[j].src_dst_cidx,
> -			edesc->pset[j].link_bcntrld);
> +			edesc->pset[j].hwpar.opt,
> +			edesc->pset[j].hwpar.src,
> +			edesc->pset[j].hwpar.dst,
> +			edesc->pset[j].hwpar.a_b_cnt,
> +			edesc->pset[j].hwpar.ccnt,
> +			edesc->pset[j].hwpar.src_dst_bidx,
> +			edesc->pset[j].hwpar.src_dst_cidx,
> +			edesc->pset[j].hwpar.link_bcntrld);
>  		/* Link to the previous slot if not the last set */
>  		if (i != (nslots - 1))
>  			edma_link(echan->slot[i], echan->slot[i+1]);
> @@ -274,13 +278,14 @@ static int edma_control(struct dma_chan
>   * @dma_length: Total length of the DMA transfer
>   * @direction: Direction of the transfer
>   */
> -static int edma_config_pset(struct dma_chan *chan, struct edmacc_param *pset,
> +static int edma_config_pset(struct dma_chan *chan, struct edma_pset *epset,
>  	dma_addr_t src_addr, dma_addr_t dst_addr, u32 burst,
>  	enum dma_slave_buswidth dev_width, unsigned int dma_length,
>  	enum dma_transfer_direction direction)
>  {
>  	struct edma_chan *echan = to_edma_chan(chan);
>  	struct device *dev = chan->device->dev;
> +	struct edmacc_param *hwpar = &epset->hwpar;
>  	int acnt, bcnt, ccnt, cidx;
>  	int src_bidx, dst_bidx, src_cidx, dst_cidx;
>  	int absync;
> @@ -351,26 +356,26 @@ static int edma_config_pset(struct dma_c
>  		return -EINVAL;
>  	}
>  
> -	pset->opt = EDMA_TCC(EDMA_CHAN_SLOT(echan->ch_num));
> +	hwpar->opt = EDMA_TCC(EDMA_CHAN_SLOT(echan->ch_num));
>  	/* Configure A or AB synchronized transfers */
>  	if (absync)
> -		pset->opt |= SYNCDIM;
> +		hwpar->opt |= SYNCDIM;
>  
> -	pset->src = src_addr;
> -	pset->dst = dst_addr;
> +	hwpar->src = src_addr;
> +	hwpar->dst = dst_addr;
>  
> -	pset->src_dst_bidx = (dst_bidx << 16) | src_bidx;
> -	pset->src_dst_cidx = (dst_cidx << 16) | src_cidx;
> +	hwpar->src_dst_bidx = (dst_bidx << 16) | src_bidx;
> +	hwpar->src_dst_cidx = (dst_cidx << 16) | src_cidx;
>  
> -	pset->a_b_cnt = bcnt << 16 | acnt;
> -	pset->ccnt = ccnt;
> +	hwpar->a_b_cnt = bcnt << 16 | acnt;
> +	hwpar->ccnt = ccnt;
>  	/*
>  	 * Only time when (bcntrld) auto reload is required is for
>  	 * A-sync case, and in this case, a requirement of reload value
>  	 * of SZ_64K-1 only is assured. 'link' is initially set to NULL
>  	 * and then later will be populated by edma_execute.
>  	 */
> -	pset->link_bcntrld = 0xffffffff;
> +	hwpar->link_bcntrld = 0xffffffff;
>  	return absync;
>  }
>  
> @@ -457,11 +462,11 @@ static struct dma_async_tx_descriptor *e
>  		/* If this is the last in a current SG set of transactions,
>  		   enable interrupts so that next set is processed */
>  		if (!((i+1) % MAX_NR_SG))
> -			edesc->pset[i].opt |= TCINTEN;
> +			edesc->pset[i].hwpar.opt |= TCINTEN;
>  
>  		/* If this is the last set, enable completion interrupt flag */
>  		if (i == sg_len - 1)
> -			edesc->pset[i].opt |= TCINTEN;
> +			edesc->pset[i].hwpar.opt |= TCINTEN;
>  	}
>  
>  	return vchan_tx_prep(&echan->vchan, &edesc->vdesc, tx_flags);
> @@ -582,14 +587,14 @@ static struct dma_async_tx_descriptor *e
>  			"  cidx\t%08x\n"
>  			"  lkrld\t%08x\n",
>  			i, echan->ch_num, echan->slot[i],
> -			edesc->pset[i].opt,
> -			edesc->pset[i].src,
> -			edesc->pset[i].dst,
> -			edesc->pset[i].a_b_cnt,
> -			edesc->pset[i].ccnt,
> -			edesc->pset[i].src_dst_bidx,
> -			edesc->pset[i].src_dst_cidx,
> -			edesc->pset[i].link_bcntrld);
> +			edesc->pset[i].hwpar.opt,
> +			edesc->pset[i].hwpar.src,
> +			edesc->pset[i].hwpar.dst,
> +			edesc->pset[i].hwpar.a_b_cnt,
> +			edesc->pset[i].hwpar.ccnt,
> +			edesc->pset[i].hwpar.src_dst_bidx,
> +			edesc->pset[i].hwpar.src_dst_cidx,
> +			edesc->pset[i].hwpar.link_bcntrld);
>  
>  		edesc->absync = ret;
>  
> @@ -597,7 +602,7 @@ static struct dma_async_tx_descriptor *e
>  		 * Enable interrupts for every period because callback
>  		 * has to be called for every period.
>  		 */
> -		edesc->pset[i].opt |= TCINTEN;
> +		edesc->pset[i].hwpar.opt |= TCINTEN;
>  	}
>  
>  	return vchan_tx_prep(&echan->vchan, &edesc->vdesc, tx_flags);
> 
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [patch 1/6] dma: edma: Sanitize residue reporting
  2014-04-17 14:40 ` [patch 1/6] dma: edma: Sanitize residue reporting Thomas Gleixner
@ 2014-04-18  0:02   ` Joel Fernandes
  0 siblings, 0 replies; 21+ messages in thread
From: Joel Fernandes @ 2014-04-18  0:02 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Thomas,

On 04/17/2014 09:40 AM, Thomas Gleixner wrote:
> The residue reporting in edma_tx_status() is just broken. It blindly
> walks the psets and recalculates the lenght of the transfer from the
> hardware parameters. For cyclic transfers it adds the link pset, which
> results in interestingly large residues. For non-cyclic it adds the
> dummy pset, which is stupid as well.
> Aside of that it's silly to walk through the pset params when the per
> descriptor residue is known at the point of creating it.

Yes this bit had to be rewritten, after we added support to DMA SGs.
Thanks. It was written before I took over the driver ;-)

> 
> Store the information in edma_desc and use it.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  drivers/dma/edma.c |   34 +++++++++-------------------------
>  1 file changed, 9 insertions(+), 25 deletions(-)
> 
> Index: linux-2.6/drivers/dma/edma.c
> ===================================================================
> --- linux-2.6.orig/drivers/dma/edma.c
> +++ linux-2.6/drivers/dma/edma.c
> @@ -64,6 +64,7 @@ struct edma_desc {
>  	int				absync;
>  	int				pset_nr;
>  	int				processed;
> +	u32				residue;
>  	struct edmacc_param		pset[0];
>  };
>  
> @@ -416,6 +417,7 @@ static struct dma_async_tx_descriptor *e
>  	}
>  
>  	edesc->pset_nr = sg_len;
> +	edesc->residue = 0;
>  
>  	/* Allocate a PaRAM slot, if needed */
>  	nslots = min_t(unsigned, MAX_NR_SG, sg_len);
> @@ -450,6 +452,7 @@ static struct dma_async_tx_descriptor *e
>  		}
>  
>  		edesc->absync = ret;
> +		edesc->residue += sg_dma_len(sg);
>  
>  		/* If this is the last in a current SG set of transactions,
>  		   enable interrupts so that next set is processed */
> @@ -527,6 +530,7 @@ static struct dma_async_tx_descriptor *e
>  
>  	edesc->cyclic = 1;
>  	edesc->pset_nr = nslots;
> +	edesc->residue = buf_len;
>  
>  	dev_dbg(dev, "%s: nslots=%d\n", __func__, nslots);
>  	dev_dbg(dev, "%s: period_len=%d\n", __func__, period_len);
> @@ -622,6 +626,7 @@ static void edma_callback(unsigned ch_nu
>  				vchan_cyclic_callback(&edesc->vdesc);
>  			} else if (edesc->processed == edesc->pset_nr) {
>  				dev_dbg(dev, "Transfer complete, stopping channel %d\n", ch_num);
> +				edesc->residue = 0;
>  				edma_stop(echan->ch_num);
>  				vchan_cookie_complete(&edesc->vdesc);
>  				edma_execute(echan);
> @@ -754,25 +759,6 @@ static void edma_issue_pending(struct dm
>  	spin_unlock_irqrestore(&echan->vchan.lock, flags);
>  }
>  
> -static size_t edma_desc_size(struct edma_desc *edesc)
> -{
> -	int i;
> -	size_t size;
> -
> -	if (edesc->absync)
> -		for (size = i = 0; i < edesc->pset_nr; i++)
> -			size += (edesc->pset[i].a_b_cnt & 0xffff) *
> -				(edesc->pset[i].a_b_cnt >> 16) *
> -				 edesc->pset[i].ccnt;
> -	else
> -		size = (edesc->pset[0].a_b_cnt & 0xffff) *
> -			(edesc->pset[0].a_b_cnt >> 16) +
> -			(edesc->pset[0].a_b_cnt & 0xffff) *
> -			(SZ_64K - 1) * edesc->pset[0].ccnt;
> -
> -	return size;
> -}
> -
>  /* Check request completion status */
>  static enum dma_status edma_tx_status(struct dma_chan *chan,
>  				      dma_cookie_t cookie,
> @@ -789,12 +775,10 @@ static enum dma_status edma_tx_status(st
>  
>  	spin_lock_irqsave(&echan->vchan.lock, flags);
>  	vdesc = vchan_find_desc(&echan->vchan, cookie);
> -	if (vdesc) {
> -		txstate->residue = edma_desc_size(to_edma_desc(&vdesc->tx));
> -	} else if (echan->edesc && echan->edesc->vdesc.tx.cookie == cookie) {
> -		struct edma_desc *edesc = echan->edesc;
> -		txstate->residue = edma_desc_size(edesc);
> -	}
> +	if (vdesc)
> +		txstate->residue = to_edma_desc(&vdesc->tx)->residue;
> +	else if (echan->edesc && echan->edesc->vdesc.tx.cookie == cookie)
> +		txstate->residue = echan->edesc->residue;
>  	spin_unlock_irqrestore(&echan->vchan.lock, flags);
>  
>  	return ret;

Peter, its ok, can you ensure this works for Audio
(snd_dmaengine_pcm_pointer) users of EDMA (unless Thomas may already have).

Looks fine to me otherwise,
Reviewed-by: Joel Fernandes <joelf@ti.com>

thanks,
  -Joel

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [patch 6/6] dma: edma: Provide granular accounting
  2014-04-17 14:40 ` [patch 6/6] dma: edma: Provide granular accounting Thomas Gleixner
@ 2014-04-18  0:45   ` Joel Fernandes
  2014-04-18  7:03     ` Thomas Gleixner
  0 siblings, 1 reply; 21+ messages in thread
From: Joel Fernandes @ 2014-04-18  0:45 UTC (permalink / raw)
  To: linux-arm-kernel

On 04/17/2014 09:40 AM, Thomas Gleixner wrote:
> The first slot in the ParamRAM of EDMA holds the current active
> subtransfer. Depending on the direction we read either the source or
> the destination address from there. In the internat psets we have the
> address of the buffer(s).
> 
> In the cyclic case we only use the internal pset[0] which holds the
> start address of the circular buffer and calculate the remaining room
> to the end of the buffer.
> 
> In the SG case we read the current address and compare it to the
> internal psets address and length. If the current address is outside
> of this range, the pset has been processed already and we can mark it
> done, update the residue value and process the next set. If its inside
> the range we know that we look at the current active set and stop the
> walk. In case of intermediate transfers we update the stats in the
> callback function before starting the next batch of transfers.
> 
> The tx_status callback and the callback are serialized via vchan.lock.
> 
> In the unexpected case that the read of the paramram fails due to
> concurrent updates by the DMA engine, we return the last good value.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  drivers/dma/edma.c |   81 +++++++++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 79 insertions(+), 2 deletions(-)
> 
> Index: linux-2.6/drivers/dma/edma.c
> ===================================================================
> --- linux-2.6.orig/drivers/dma/edma.c
> +++ linux-2.6/drivers/dma/edma.c
> @@ -71,7 +71,10 @@ struct edma_desc {
>  	int				absync;
>  	int				pset_nr;
>  	int				processed;
> +	int				processed_stat;
>  	u32				residue;
> +	u32				residue_stat;
> +	int				slot0;

Instead of slot0, perhaps storing a pointer to the echan would be great
incase in the future we need to access something else in the channel.
Then you can just do edesc->echan->slot[0].

>  	struct edma_pset		pset[0];
>  };
>  
> @@ -448,6 +451,8 @@ static struct dma_async_tx_descriptor *e
>  		}
>  	}
>  
> +	edesc->slot0 = echan->slot[0];
> +
>  	/* Configure PaRAM sets for each SG */
>  	for_each_sg(sgl, sg, sg_len, i) {
>  		/* Get address for each SG */
> @@ -476,6 +481,7 @@ static struct dma_async_tx_descriptor *e
>  		if (i == sg_len - 1)
>  			edesc->pset[i].hwpar.opt |= TCINTEN;
>  	}
> +	edesc->residue_stat = edesc->residue;
>  
>  	return vchan_tx_prep(&echan->vchan, &edesc->vdesc, tx_flags);
>  }
> @@ -543,7 +549,7 @@ static struct dma_async_tx_descriptor *e
>  
>  	edesc->cyclic = 1;
>  	edesc->pset_nr = nslots;
> -	edesc->residue = buf_len;
> +	edesc->residue = edesc->residue_stat = buf_len;
>  	edesc->direction = direction;
>  
>  	dev_dbg(dev, "%s: nslots=%d\n", __func__, nslots);
> @@ -613,6 +619,7 @@ static struct dma_async_tx_descriptor *e
>  		 */
>  		edesc->pset[i].hwpar.opt |= TCINTEN;
>  	}
> +	edesc->slot0 = echan->slot[0];
>  
>  	return vchan_tx_prep(&echan->vchan, &edesc->vdesc, tx_flags);
>  }
> @@ -645,7 +652,17 @@ static void edma_callback(unsigned ch_nu
>  				vchan_cookie_complete(&edesc->vdesc);
>  				edma_execute(echan);
>  			} else {
> +				int i, n;
> +
>  				dev_dbg(dev, "Intermediate transfer complete on channel %d\n", ch_num);
> +
> +				/* Update statistics for tx_status */
> +				n = edesc->processed;
> +				for (i = edesc->processed_stat; i < n; i++)
> +					edesc->residue -= edesc->pset[i].len;
> +				edesc->processed_stat = n;
> +				edesc->residue_stat = edesc->residue;
> +

This is a bit of a NAK since it is O(n) on every intermediate transfer
completion. I'm not sure of a better way to do it but I certainly don't
want the IRQ slowed down anymore... Is there a way to pre-calculate the
size of the intermediate transfer and store it somewhere for accounting?

Also another thing is I don't think processed_stat is required at all,
because I think you introduced it for the possibility that there might
be delays in processing interrupts for intermediate transfers. That is
actually an impossibility because such a scenario would result in
catastrophic failure anyway. Because for intermediate transfer
interrupts shouldn't be missed, so in this case you can just do:

for (i = 1; i <= MAX_NR_SG; i++) {
	edesc->residue -= edesc->pset[edesc->processed - i].len;
}

I think that'll work.


>  				edma_execute(echan);
>  			}
>  		}
> @@ -773,6 +790,66 @@ static void edma_issue_pending(struct dm
>  	spin_unlock_irqrestore(&echan->vchan.lock, flags);
>  }
>  
> +static u32 edma_residue(struct edma_desc *edesc)
> +{
> +	bool dst = edesc->direction == DMA_DEV_TO_MEM;
> +	struct edma_pset *pset = edesc->pset;
> +	dma_addr_t done, pos;
> +	int ret, i;
> +
> +	/*
> +	 * We always read the dst/src position from the first RamPar
> +	 * pset. That's the one which is active now.
> +	 */
> +	ret = edma_get_position(edesc->slot0, &pos, dst);
> +
> +	/*
> +	 * edma_get_position() can fail due to concurrent
> +	 * updates to the pset. Unlikely, but can happen.
> +	 * Return the last known residue value.
> +	 */
> +	if (ret)
> +		return edesc->residue_stat;
> +

This check could be dropped following the discussion in earlier patch
and also residue_stat could be dropped if there's no other use for it.

> +	/*
> +	 * Cyclic is simple. Just subtract pset[0].addr from pos.
> +	 *
> +	 * We never update edesc->residue in the cyclic case, so we
> +	 * can tell the remaining room to the end of the circular
> +	 * buffer.
> +	 */
> +	if (edesc->cyclic) {
> +		done = pos - pset->addr;
> +		edesc->residue_stat = edesc->residue - done;
> +		return edesc->residue_stat;
> +	}
> +
> +	/*
> +	 * For SG operation we catch up with the last processed
> +	 * status.
> +	 */
> +	pset += edesc->processed_stat;
> +
> +	for (i = edesc->processed_stat; i < edesc->processed; i++, pset++) {
> +		/*
> +		 * If we are inside this pset address range, we know
> +		 * this is the active one. Get the current delta and
> +		 * stop walking the psets.
> +		 */
> +		if (pos >= pset->addr && pos < pset->addr + pset->len) {
> +			edesc->residue_stat = edesc->residue;
> +			edesc->residue_stat -= pos - pset->addr;
> +			break;
> +		}
> +
> +		/* Otherwise mark it done and update residue[_stat]. */
> +		edesc->processed_stat++;
> +		edesc->residue -= pset->len;
> +		edesc->residue_stat = edesc->residue;
> +	}


Also, just curious - doesn't calling status too aggressively result in
any slowness for you? Since vchan lock spinlock will be held during the
duration of this processing, and interrupts would be disabled causing
edma_callback interrupt delays possibly. I'm not saying you introduced
the lock or anything because status would previously also hold the lock.
I just haven't played status enough to know how it will behave when
happening concurrently with DMA transfers.

thanks,
  -Joel

> +	return edesc->residue_stat;
> +}
> +
>  /* Check request completion status */
>  static enum dma_status edma_tx_status(struct dma_chan *chan,
>  				      dma_cookie_t cookie,
> @@ -789,7 +866,7 @@ static enum dma_status edma_tx_status(st
>  
>  	spin_lock_irqsave(&echan->vchan.lock, flags);
>  	if (echan->edesc && echan->edesc->vdesc.tx.cookie == cookie)
> -		txstate->residue = echan->edesc->residue;
> +		txstate->residue = edma_residue(echan->edesc);
>  	else if ((vdesc = vchan_find_desc(&echan->vchan, cookie)))
>  		txstate->residue = to_edma_desc(&vdesc->tx)->residue;
>  	spin_unlock_irqrestore(&echan->vchan.lock, flags);
> 
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [patch 5/6] edma: Make reading the position of active channels work
  2014-04-17 14:40 ` [patch 5/6] edma: Make reading the position of active channels work Thomas Gleixner
@ 2014-04-18  0:47   ` Joel Fernandes
  2014-04-18  1:02     ` Joel Fernandes
  2014-04-18  6:40     ` Thomas Gleixner
  0 siblings, 2 replies; 21+ messages in thread
From: Joel Fernandes @ 2014-04-18  0:47 UTC (permalink / raw)
  To: linux-arm-kernel

On 04/17/2014 09:40 AM, Thomas Gleixner wrote:
> Reading destination and source is pointless. In DEV_TO_MEM transfers
> we are only interested in the destination, in MEM_TO_DEV we care about
> the source. In MEM_TO_MEM it really does not matter which one you
> read.
> 
> Remove the extra pointer and select dest/source via a bool.
> 
> Reading the src/dst data from an active parameter set in the EDMA
> parameter RAM is not reliable, as there might be a concurrent update
> from the controller.
> 
> But experimentation showed, that a double readout with comparing the
> results and a limited loop works nicely. I've actually never found a
> case where the loop limit triggered, but we have it there for sanity
> reasons. In case it triggers we return -EBUSY and let the caller deal
> with it.
> 
> Remove the export of this function while at it. The only potential
> user is the dmaengine and that's always builtin.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  arch/arm/common/edma.c             |   48 ++++++++++++++++++++++++++-----------
>  include/linux/platform_data/edma.h |    2 -
>  2 files changed, 35 insertions(+), 15 deletions(-)
> 
> Index: linux/arch/arm/common/edma.c
> ===================================================================
> --- linux.orig/arch/arm/common/edma.c
> +++ linux/arch/arm/common/edma.c
> @@ -994,29 +994,49 @@ void edma_set_dest(unsigned slot, dma_ad
>  EXPORT_SYMBOL(edma_set_dest);
>  
>  /**
> - * edma_get_position - returns the current transfer points
> + * edma_get_position - returns the current transfer point
>   * @slot: parameter RAM slot being examined
> - * @src: pointer to source port position
> - * @dst: pointer to destination port position
> + * @pos:  where to store the data
> + * @dst:  true selects the dest position, false the source
>   *
> - * Returns current source and destination addresses for a particular
> - * parameter RAM slot.  Its channel should not be active when this is called.
> + * Return 0 indicates a stable readout. -EBUSY indicates that the
> + * readout failed due to concurrent updates.
> + *
> + * Call this on active channels with care. For inactive channels this
> + * never fails.
>   */
> -void edma_get_position(unsigned slot, dma_addr_t *src, dma_addr_t *dst)
> +int edma_get_position(unsigned slot, dma_addr_t *pos, bool dst)
>  {
> -	struct edmacc_param temp;
> -	unsigned ctlr;
> +	u32 dat, ctlr, offs;
> +	int i;
>  
>  	ctlr = EDMA_CTLR(slot);
>  	slot = EDMA_CHAN_SLOT(slot);
>  
> -	edma_read_slot(EDMA_CTLR_CHAN(ctlr, slot), &temp);
> -	if (src != NULL)
> -		*src = temp.src;
> -	if (dst != NULL)
> -		*dst = temp.dst;
> +	if (slot >= edma_cc[ctlr]->num_slots)
> +		return -EINVAL;
> +
> +	offs = PARM_OFFSET(slot);
> +	offs += dst ? PARM_DST : PARM_SRC;
> +
> +	/*
> +	 * If the channel is active, we need to double read as we
> +	 * might see half updated data. We limit this to 5
> +	 * attempts. If that fails we return -EBUSY and let the caller
> +	 * deal with it.
> +	 */
> +	dat = edma_read(ctlr, offs);
> +	for (i = 0; i < 5; i++) {
> +		u32 tmp = edma_read(ctlr, offs);
> +
> +		if (tmp == dat) {
> +			*pos = dat;
> +			return 0;
> +		}
> +		dat = tmp;
> +	}
> +	return -EBUSY;
>  }
> -EXPORT_SYMBOL(edma_get_position);
>  
>  /**

The access is synchronized though and you shouldn't see unreliable
results, unless you _are_ seeing unreliable results in which case it
could be a silicon issue.

The original code also doesn't have the double read so I would just drop
that, unless you are seeing reliability issues.

Here's a thread that discusses it and the end conclusion is that it
should be fine (Kyle Kastile is one of the EDMA hardware designers).
http://e2e.ti.com/support/dsp/davinci_digital_media_processors/f/99/t/66011.aspx

thanks,
  -Joel

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [patch 5/6] edma: Make reading the position of active channels work
  2014-04-18  0:47   ` Joel Fernandes
@ 2014-04-18  1:02     ` Joel Fernandes
  2014-04-18  6:40     ` Thomas Gleixner
  1 sibling, 0 replies; 21+ messages in thread
From: Joel Fernandes @ 2014-04-18  1:02 UTC (permalink / raw)
  To: linux-arm-kernel

On 04/17/2014 07:47 PM, Joel Fernandes wrote:
> On 04/17/2014 09:40 AM, Thomas Gleixner wrote:
>> Reading destination and source is pointless. In DEV_TO_MEM transfers
>> we are only interested in the destination, in MEM_TO_DEV we care about
>> the source. In MEM_TO_MEM it really does not matter which one you
>> read.
>>
>> Remove the extra pointer and select dest/source via a bool.
>>
>> Reading the src/dst data from an active parameter set in the EDMA
>> parameter RAM is not reliable, as there might be a concurrent update
>> from the controller.
>>
>> But experimentation showed, that a double readout with comparing the
>> results and a limited loop works nicely. I've actually never found a
>> case where the loop limit triggered, but we have it there for sanity
>> reasons. In case it triggers we return -EBUSY and let the caller deal
>> with it.
>>
>> Remove the export of this function while at it. The only potential
>> user is the dmaengine and that's always builtin.
>>
>> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
>> ---
>>  arch/arm/common/edma.c             |   48 ++++++++++++++++++++++++++-----------
>>  include/linux/platform_data/edma.h |    2 -
>>  2 files changed, 35 insertions(+), 15 deletions(-)
>>
>> Index: linux/arch/arm/common/edma.c
>> ===================================================================
>> --- linux.orig/arch/arm/common/edma.c
>> +++ linux/arch/arm/common/edma.c
>> @@ -994,29 +994,49 @@ void edma_set_dest(unsigned slot, dma_ad
>>  EXPORT_SYMBOL(edma_set_dest);
>>  
>>  /**
>> - * edma_get_position - returns the current transfer points
>> + * edma_get_position - returns the current transfer point
>>   * @slot: parameter RAM slot being examined
>> - * @src: pointer to source port position
>> - * @dst: pointer to destination port position
>> + * @pos:  where to store the data
>> + * @dst:  true selects the dest position, false the source
>>   *
>> - * Returns current source and destination addresses for a particular
>> - * parameter RAM slot.  Its channel should not be active when this is called.
>> + * Return 0 indicates a stable readout. -EBUSY indicates that the
>> + * readout failed due to concurrent updates.
>> + *
>> + * Call this on active channels with care. For inactive channels this
>> + * never fails.
>>   */
>> -void edma_get_position(unsigned slot, dma_addr_t *src, dma_addr_t *dst)
>> +int edma_get_position(unsigned slot, dma_addr_t *pos, bool dst)
>>  {
>> -	struct edmacc_param temp;
>> -	unsigned ctlr;
>> +	u32 dat, ctlr, offs;
>> +	int i;
>>  
>>  	ctlr = EDMA_CTLR(slot);
>>  	slot = EDMA_CHAN_SLOT(slot);
>>  
>> -	edma_read_slot(EDMA_CTLR_CHAN(ctlr, slot), &temp);
>> -	if (src != NULL)
>> -		*src = temp.src;
>> -	if (dst != NULL)
>> -		*dst = temp.dst;
>> +	if (slot >= edma_cc[ctlr]->num_slots)
>> +		return -EINVAL;
>> +
>> +	offs = PARM_OFFSET(slot);
>> +	offs += dst ? PARM_DST : PARM_SRC;
>> +
>> +	/*
>> +	 * If the channel is active, we need to double read as we
>> +	 * might see half updated data. We limit this to 5
>> +	 * attempts. If that fails we return -EBUSY and let the caller
>> +	 * deal with it.
>> +	 */
>> +	dat = edma_read(ctlr, offs);
>> +	for (i = 0; i < 5; i++) {
>> +		u32 tmp = edma_read(ctlr, offs);
>> +
>> +		if (tmp == dat) {
>> +			*pos = dat;
>> +			return 0;
>> +		}
>> +		dat = tmp;
>> +	}
>> +	return -EBUSY;
>>  }
>> -EXPORT_SYMBOL(edma_get_position);
>>  
>>  /**
> 
> The access is synchronized though and you shouldn't see unreliable
> results, unless you _are_ seeing unreliable results in which case it
> could be a silicon issue.
> 
> The original code also doesn't have the double read so I would just drop
> that, unless you are seeing reliability issues.
> 
> Here's a thread that discusses it and the end conclusion is that it
> should be fine (Kyle Kastile is one of the EDMA hardware designers).
> http://e2e.ti.com/support/dsp/davinci_digital_media_processors/f/99/t/66011.aspx

I'm sorry to have messed up Kyle's name, its Kyle Castille ;-)

thanks,
 -Joel

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [patch 5/6] edma: Make reading the position of active channels work
  2014-04-18  0:47   ` Joel Fernandes
  2014-04-18  1:02     ` Joel Fernandes
@ 2014-04-18  6:40     ` Thomas Gleixner
  2014-04-18  9:24       ` Thomas Gleixner
  1 sibling, 1 reply; 21+ messages in thread
From: Thomas Gleixner @ 2014-04-18  6:40 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 17 Apr 2014, Joel Fernandes wrote:
> On 04/17/2014 09:40 AM, Thomas Gleixner wrote:
> > +	/*
> > +	 * If the channel is active, we need to double read as we
> > +	 * might see half updated data. We limit this to 5
> > +	 * attempts. If that fails we return -EBUSY and let the caller
> > +	 * deal with it.
> > +	 */
> > +	dat = edma_read(ctlr, offs);
> > +	for (i = 0; i < 5; i++) {
> > +		u32 tmp = edma_read(ctlr, offs);
> > +
> > +		if (tmp == dat) {
> > +			*pos = dat;
> > +			return 0;
> > +		}
> > +		dat = tmp;
> > +	}
> > +	return -EBUSY;
> >  }
> > -EXPORT_SYMBOL(edma_get_position);
> >  
> >  /**
> 
> The access is synchronized though and you shouldn't see unreliable
> results, unless you _are_ seeing unreliable results in which case it
> could be a silicon issue.
>
> The original code also doesn't have the double read so I would just drop
> that, unless you are seeing reliability issues.
>
> Here's a thread that discusses it and the end conclusion is that it
> should be fine (Kyle Kastile is one of the EDMA hardware designers).
> http://e2e.ti.com/support/dsp/davinci_digital_media_processors/f/99/t/66011.aspx

Gah, right I saw the issues with the original code, which does a
memcpy_fromio which copies byte by byte. Did not think about retrying
with the edma_read() based one.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [patch 6/6] dma: edma: Provide granular accounting
  2014-04-18  0:45   ` Joel Fernandes
@ 2014-04-18  7:03     ` Thomas Gleixner
  2014-04-18 16:22       ` Joel Fernandes
  0 siblings, 1 reply; 21+ messages in thread
From: Thomas Gleixner @ 2014-04-18  7:03 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 17 Apr 2014, Joel Fernandes wrote:
> On 04/17/2014 09:40 AM, Thomas Gleixner wrote:
> > +	u32				residue_stat;
> > +	int				slot0;
> 
> Instead of slot0, perhaps storing a pointer to the echan would be great
> incase in the future we need to access something else in the channel.
> Then you can just do edesc->echan->slot[0].

Sure.
 
> > @@ -645,7 +652,17 @@ static void edma_callback(unsigned ch_nu
> >  				vchan_cookie_complete(&edesc->vdesc);
> >  				edma_execute(echan);
> >  			} else {
> > +				int i, n;
> > +
> >  				dev_dbg(dev, "Intermediate transfer complete on channel %d\n", ch_num);
> > +
> > +				/* Update statistics for tx_status */
> > +				n = edesc->processed;
> > +				for (i = edesc->processed_stat; i < n; i++)
> > +					edesc->residue -= edesc->pset[i].len;
> > +				edesc->processed_stat = n;
> > +				edesc->residue_stat = edesc->residue;
> > +
> 
> This is a bit of a NAK since it is O(n) on every intermediate transfer
> completion. I'm not sure of a better way to do it but I certainly don't
> want the IRQ slowed down anymore... Is there a way to pre-calculate the
> size of the intermediate transfer and store it somewhere for accounting?

Yeah, I can do that. Adds an extra field to edma_pset, but that
shouldnt hurt.

> Also another thing is I don't think processed_stat is required at all,
> because I think you introduced it for the possibility that there might
> be delays in processing interrupts for intermediate transfers. That is

No, I introduced it to not walk the full list of psets in every
tx_status() call. It's keeping track of the already finished slots.

> actually an impossibility because such a scenario would result in
> catastrophic failure anyway. Because for intermediate transfer
> interrupts shouldn't be missed, so in this case you can just do:
> 
> for (i = 1; i <= MAX_NR_SG; i++) {
> 	edesc->residue -= edesc->pset[edesc->processed - i].len;
> }
> 
> I think that'll work.

No. The point is:

submit()

tx_status()
	
  We dont want to iterate through all psets when we are polling in a
  loop. So we progress processed_stat when we see a finished slot.
  And thereby we adjust edesc->residue.

Now in the intermediate interrupt we need to take processed_stat into
account when we update residue.

But when we store the sums in the psets then this should become O(1)
and avoid the loop business.

> > +	if (ret)
> > +		return edesc->residue_stat;
> > +
> 
> This check could be dropped following the discussion in earlier patch
> and also residue_stat could be dropped if there's no other use for it.

Right.
 
> > +
> > +		/* Otherwise mark it done and update residue[_stat]. */
> > +		edesc->processed_stat++;
> > +		edesc->residue -= pset->len;
> > +		edesc->residue_stat = edesc->residue;
> > +	}
> 
> 
> Also, just curious - doesn't calling status too aggressively result in
> any slowness for you? Since vchan lock spinlock will be held during the
> duration of this processing, and interrupts would be disabled causing
> edma_callback interrupt delays possibly. I'm not saying you introduced
> the lock or anything because status would previously also hold the lock.
> I just haven't played status enough to know how it will behave when
> happening concurrently with DMA transfers.

Well, in the case I'm looking at (cyclic transfer) I do not care about
the interrupt at all. I get a notification for the first incoming
packet via the peripheral and go from there.

But yes, it might cause slight delays for the interrupt in the SG
case. Though the irq disabled region is small if you actively poll as
we keep track of the processed slots and therefor only have one or two
readouts to process.

But as Russell pointed out, this should be rewritten by updating the
chain on the fly and not waiting until MAX_SG has been processed.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [patch 5/6] edma: Make reading the position of active channels work
  2014-04-18  6:40     ` Thomas Gleixner
@ 2014-04-18  9:24       ` Thomas Gleixner
  2014-04-18 16:40         ` Joel Fernandes
  0 siblings, 1 reply; 21+ messages in thread
From: Thomas Gleixner @ 2014-04-18  9:24 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, 18 Apr 2014, Thomas Gleixner wrote:
> On Thu, 17 Apr 2014, Joel Fernandes wrote:
> > Here's a thread that discusses it and the end conclusion is that it
> > should be fine (Kyle Kastile is one of the EDMA hardware designers).
> > http://e2e.ti.com/support/dsp/davinci_digital_media_processors/f/99/t/66011.aspx
> 
> Gah, right I saw the issues with the original code, which does a
> memcpy_fromio which copies byte by byte. Did not think about retrying
> with the edma_read() based one.

Actually I did expect memcpy_fromio to be more clever. But you are
right, using the 32bit wide read works like a charm.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [patch 6/6] dma: edma: Provide granular accounting
  2014-04-18  7:03     ` Thomas Gleixner
@ 2014-04-18 16:22       ` Joel Fernandes
  0 siblings, 0 replies; 21+ messages in thread
From: Joel Fernandes @ 2014-04-18 16:22 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Thomas,

On 04/18/2014 02:03 AM, Thomas Gleixner wrote:
[..]
>>> @@ -645,7 +652,17 @@ static void edma_callback(unsigned ch_nu
>>>  				vchan_cookie_complete(&edesc->vdesc);
>>>  				edma_execute(echan);
>>>  			} else {
>>> +				int i, n;
>>> +
>>>  				dev_dbg(dev, "Intermediate transfer complete on channel %d\n", ch_num);
>>> +
>>> +				/* Update statistics for tx_status */
>>> +				n = edesc->processed;
>>> +				for (i = edesc->processed_stat; i < n; i++)
>>> +					edesc->residue -= edesc->pset[i].len;
>>> +				edesc->processed_stat = n;
>>> +				edesc->residue_stat = edesc->residue;
>>> +
>>
>> This is a bit of a NAK since it is O(n) on every intermediate transfer
>> completion. I'm not sure of a better way to do it but I certainly don't
>> want the IRQ slowed down anymore... Is there a way to pre-calculate the
>> size of the intermediate transfer and store it somewhere for accounting?
> 
> Yeah, I can do that. Adds an extra field to edma_pset, but that
> shouldnt hurt.

Sure, I guess without adding any more space than you initially did
(added a note on len element in the structure below)

then do something like:
 edma_pset[edesc->processed_stat].sum - edema_pset[edesc->proccessed].sum

and subtract that from the residue in O(1).

len element in pset can be got rid off as the same sum can be used to
find it in tx_status by len = pset[i].sum - pset[i-1].sum.

>> Also another thing is I don't think processed_stat is required at all,
>> because I think you introduced it for the possibility that there might
>> be delays in processing interrupts for intermediate transfers. That is
> 
> No, I introduced it to not walk the full list of psets in every
> tx_status() call. It's keeping track of the already finished slots.

Ah, ok. cool.

>> actually an impossibility because such a scenario would result in
>> catastrophic failure anyway. Because for intermediate transfer
>> interrupts shouldn't be missed, so in this case you can just do:
>>
>> for (i = 1; i <= MAX_NR_SG; i++) {
>> 	edesc->residue -= edesc->pset[edesc->processed - i].len;
>> }
>>
>> I think that'll work.
> 
> No. The point is:
> 
> submit()
> 
> tx_status()
> 	
>   We dont want to iterate through all psets when we are polling in a
>   loop. So we progress processed_stat when we see a finished slot.
>   And thereby we adjust edesc->residue.

Ah ok, that's certainly fast to skip all progressive updates and jump to
the current.


>>> +
>>> +		/* Otherwise mark it done and update residue[_stat]. */
>>> +		edesc->processed_stat++;
>>> +		edesc->residue -= pset->len;
>>> +		edesc->residue_stat = edesc->residue;
>>> +	}
>>
>>
>> Also, just curious - doesn't calling status too aggressively result in
>> any slowness for you? Since vchan lock spinlock will be held during the
>> duration of this processing, and interrupts would be disabled causing
>> edma_callback interrupt delays possibly. I'm not saying you introduced
>> the lock or anything because status would previously also hold the lock.
>> I just haven't played status enough to know how it will behave when
>> happening concurrently with DMA transfers.
> 
> Well, in the case I'm looking at (cyclic transfer) I do not care about
> the interrupt at all. I get a notification for the first incoming
> packet via the peripheral and go from there.
> 
> But yes, it might cause slight delays for the interrupt in the SG
> case. Though the irq disabled region is small if you actively poll as
> we keep track of the processed slots and therefor only have one or two
> readouts to process.
> 
> But as Russell pointed out, this should be rewritten by updating the
> chain on the fly and not waiting until MAX_SG has been processed.

Ok, sounds good. Thanks.

regards,
  -Joel

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [patch 5/6] edma: Make reading the position of active channels work
  2014-04-18  9:24       ` Thomas Gleixner
@ 2014-04-18 16:40         ` Joel Fernandes
  0 siblings, 0 replies; 21+ messages in thread
From: Joel Fernandes @ 2014-04-18 16:40 UTC (permalink / raw)
  To: linux-arm-kernel

On 04/18/2014 04:24 AM, Thomas Gleixner wrote:
> On Fri, 18 Apr 2014, Thomas Gleixner wrote:
>> On Thu, 17 Apr 2014, Joel Fernandes wrote:
>>> Here's a thread that discusses it and the end conclusion is that it
>>> should be fine (Kyle Kastile is one of the EDMA hardware designers).
>>> http://e2e.ti.com/support/dsp/davinci_digital_media_processors/f/99/t/66011.aspx
>>
>> Gah, right I saw the issues with the original code, which does a
>> memcpy_fromio which copies byte by byte. Did not think about retrying
>> with the edma_read() based one.
> 
> Actually I did expect memcpy_fromio to be more clever. But you are
> right, using the 32bit wide read works like a charm.

Cool, glad it does :)

regards,
  -Joel

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2014-04-18 16:40 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-04-17 14:40 [patch 0/6] dma: edma: Provide granular residue accounting Thomas Gleixner
2014-04-17 14:40 ` [patch 2/6] dma: edma: Check the current decriptor first in tx_status() Thomas Gleixner
2014-04-17 14:40 ` [patch 1/6] dma: edma: Sanitize residue reporting Thomas Gleixner
2014-04-18  0:02   ` Joel Fernandes
2014-04-17 14:40 ` [patch 3/6] dma: edma: Create private pset struct Thomas Gleixner
2014-04-17 23:56   ` Joel Fernandes
2014-04-17 14:40 ` [patch 5/6] edma: Make reading the position of active channels work Thomas Gleixner
2014-04-18  0:47   ` Joel Fernandes
2014-04-18  1:02     ` Joel Fernandes
2014-04-18  6:40     ` Thomas Gleixner
2014-04-18  9:24       ` Thomas Gleixner
2014-04-18 16:40         ` Joel Fernandes
2014-04-17 14:40 ` [patch 4/6] dma: edma: Store transfer data in edma_desc and edma_pset Thomas Gleixner
2014-04-17 14:40 ` [patch 6/6] dma: edma: Provide granular accounting Thomas Gleixner
2014-04-18  0:45   ` Joel Fernandes
2014-04-18  7:03     ` Thomas Gleixner
2014-04-18 16:22       ` Joel Fernandes
2014-04-17 20:07 ` [patch 0/6] dma: edma: Provide granular residue accounting Russell King - ARM Linux
2014-04-17 20:31   ` Thomas Gleixner
2014-04-17 20:38     ` Russell King - ARM Linux
2014-04-17 21:14       ` Thomas Gleixner

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.