Re: ARM: bcm2835: DMA driver + spi_optimize_message - some questions/info

From: Mark Brown <broonie-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
To: martin sperl <kernel-TqfNSX0MhmxHKSADF0wUEw@public.gmane.org>
Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-rpi-kernel
	<linux-rpi-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org>,
	Stephen Warren <swarren-3lzwWm7+Weoh9ZMKESR00Q@public.gmane.org>
Subject: Re: ARM: bcm2835: DMA driver + spi_optimize_message - some questions/info
Date: Mon, 21 Apr 2014 23:20:26 +0100	[thread overview]
Message-ID: <20140421222026.GH12304@sirena.org.uk> (raw)
In-Reply-To: <25BB67E3-2929-49D2-BDDE-2D6B3D43534E-TqfNSX0MhmxHKSADF0wUEw@public.gmane.org>

[-- Attachment #1: Type: text/plain, Size: 6240 bytes --]

On Fri, Apr 11, 2014 at 02:40:07PM +0200, martin sperl wrote:
> On 11.04.2014 00:35, Mark Brown wrote:

> > The main thing I'd like to do is actually coalesce adjacent transfers within
> > a message if they are compatible and we're doing DMA so we end up with fewer
> > hardware interactions overall. There's a fairly common pattern of writing
> > multiple registers (for firmware download or setting coefficients) where you
> > have a small transfer for a register then a larger one for a data block which
> > would benefit.
> 
> I just found out something similar just a few days ago: 
> doing a write then read with 2+1 is more effort to handle than a read/write
> of 3 bytes and some moving of data arround. Also it actually saves memory,
> as an spi_transfer is bigger than those 3 extra bytes needed.
> With the DMA Driver this saved me 4us on overall transfer time -  It dropped
> from 52us to 47us. For comparisson: the real data transfer for those 36 
> bytes (without 7x CS) takes 35us - so we are pretty close to ideal...

It can be even nicer for drivers where there's some minimum limit on the
size of transfers (quite a lot of them), it can sometimes allow short
writes to be converted from being PIO to being combined with an adjacent
DMAs.  Possibly even two PIOs making a DMA, though that case is more on
the edge.

> There is not so much of a difference with the PIO drivers as they take 200us
> for the same, but it sees a drop on the number of interrupts (typically 
> 1 IRQ/transfer), which increases CPU availability for userspace...
> Redesigning the interrupt-handler to push bytes spanning multiple 
> spi_transfers into the fifo together would help, but then it becomes much 
> more complex - at least the spi-bcm2835 does call "complete(&bs->done);"
> when it has finished the transfer.

It's probably not worth doing unless it's factored out into the
framework and affects lots of drivers; it seems most likely to get done
as part of a general improvement in the ability to drive the queue from
interrupt context.  Most systems won't be able to do the fully DMAed
pipeline the Pi can.

> > Right, that's a peculiarity of your system (hopefully) which makes life
> > difficult for that. The Samsung controller has a vaugely similar thing
> > where the transmit interrupt goes off too early for what we want so we 
> > need to ensure there's always a read if we're doing interrupts but that
> > is a lot easier to work around.

> Specs says to use 2 Interrupts one for feeding data to FIFO the other to
> read data from FIFO. Probably one could use multiple interleaved DMAs for
> reads and writes, but then you would be limited to 32 byte transfers and
> would have some "gaps" between the fifos filling/emptying - this way at 
> least 65535 bytes can get sent with a single transfer...

Sure, that's standard though looking at the code in drivers it seems
there's a fairly common pattern of just ignoring one of the completions,
usually the transmit one, at the SPI level since we can guarantee that
the other will always come later.

> The interrupt DMA is more on the tricky side - if we had to have an 
> interrupt every time an spi_message finished, it would be possible with
> 2 DMAs, but as we have the possibility of chaining multiple spi_messages
> together for higher thruput, then this "waiting" for interrupt is not
> possible (IRQ flag gets sometimes cleared by the next DMA controlblock),
> so we need to have another "stable" IRQ source and hence DMA3.
> (actually I was using the TX DMA for that, which gets started by the
> RX-DMA, so we have a race-time between DMA for IRQ and the interrupt
> handler on the CPU reading the registers quickly enough to find out the
> source of the IRQ)

I think it's worth implementing the three DMA solution as an
optimisation on the two DMA version partly to get the big part of the
win from the two DMAs integrated but also because that's going to be the
more generally useful pattern.  It also lets the tricky bit with the
extra DMA be considered separately.

> One of those found is the fact that an optimized message can only reliably
> get used with spi_async. For spi_sync use it has to have the complete
> function defined when optimizing, as it would otherwise not implement a
> callback IRQ in DMA. And then when spi_sync sets the complete function
> the DMA would still issue the optimized message without interrupts.

Hrm.  We could provide a flag saying this will be used synchronously, or
more likely given driver use patterns the other way around.  Doesn't
seem quite elegant though.  Or insert a dummy transfer (eg, some
register read) which does interrupt into the chain after the real one to
do the callback.

> > On the other hand you may find that what you've got already is enough
> > for people, or get a feel for the sort of things people are looking for
> > to show a win.  Sometimes you'll even get "oh, that's a really good idea
> > I'll just add the feature" (if you're very lucky).

> So who else should I involve as soon as I have presentable code?
> (coding standard issues I have already fixed - mostly...)

The dmaengine maintainers for the DMA stuff - Vinod Koul
<vinod.koul-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> and Dan Williams <dan.j.williams-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> - and
the GPIO maintainers for that - Linus Walleij <linus.walleij-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org>
and Alexandre Courbot <gnurou-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> (check MAINTAINERS).  Plus
anyone working on the drivers you modified.

> the same, but is specific to the PI...
> http://skpang.co.uk/catalog/pican-canbus-board-for-raspberry-pi-p-1196.html

> And in both cases you probably will need a peer which accepts messages or
> sends them...

> It is a bit tricky - especially the setting up of the correct Baud rate
> on the CAN-Bus side - so people are complaining that it is hard to make
> it work...

Hrm, OK.  Might be too fiddly, dunno.  I do also keep thinking I should
get a FPGA board and impelement some test devices that way to let me
exercise the framework, it's not exactly realistic though.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]