On Tue, Nov 12, 2013 at 03:42:59PM +0100, Martin Sperl wrote:

> This "chaining" is _obviously_ not possible with the transfer_one
> message BUS-master-API (or it would require some "ugly hacks" to
> make use of the transfer_one_message API)

Meh, it doesn't seem that urgently different to be honest.  The basic
flow of interactions with the device is very much typical of SPI
controllers.

In any case, while it's not entirely clearly explained here, I *think*
the major thing that's being done differently is that the register
writes to the controller are themselves being done using DMA.  Is that
the case?

That's certainly a bit of an innovation and I don't think it's actually
all that device specific - while reusing the same channel for things
other than the actual data transfer is a bit cute, I can't see any
obvious reason why it shouldn't work, and it's something I'd expect a
great proportion of DMA controllers to be able to support.  The delays
are a bit dodgy for that, but untimed delays could probably be
implemented with dummy memcpy() type transfers easily enough.

I'd expect this to integrate nicely with the general work to split out
the pushing of transfers from the configuration of transfers that I'd
been kicking around - transfer_one() is intended to be split in two for
this, one bit to configure the transfer and another bit to actually do
it.  The original idea was more about recognising and exploiting
scatter/gather patterns, but pipelining the DMAs for configuration
would allow that to be extended a bit.

One thing that does concern me is the overhead of all the DMA work for
smaller transfers - once there's a queue built up it's more clearly
useful, but without a queue it's not obvious that the effort is going
to be worth it.

> Well - as said I am wondering if the DMA engine can really help -
> especially if you want to keep the "prepare" code fast.  I am using
> dmapool for memory alloc/free, but I fear we would need to run 2
> mallocs if I was using DMA-engine, and more setup/tear-down overhead,
> as well as more code that needs to get executed to go via DMA-engine.

The big win from dmaengine is that you don't need to implement
essentially the same DMA code in each and every driver - all the DMA
drivers end up behind a uniform API.  Since by and large the process of
mapping a SPI message for DMA is the same no matter what the controller
is, this ends up being a win and we don't require each and every master
driver author to jump through the same hoops.  If it doesn't do what's
desired we should fix it to make it better (same as here).
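For what it's worth, the per-driver side of that really does boil down
to something like the sketch below - map the buffer, hand it to the
channel through the generic calls.  This is an untested illustration
rather than code from any existing driver; sketch_dma_tx() and
sketch_dma_tx_done() are invented names and the error handling is
minimal:

#include <linux/dmaengine.h>
#include <linux/dma-mapping.h>
#include <linux/spi/spi.h>

/* Hypothetical completion hook; a real driver would unmap the buffer
 * and complete the transfer here. */
static void sketch_dma_tx_done(void *data)
{
}

/* Map one transmit buffer and push it through dmaengine. */
static int sketch_dma_tx(struct spi_master *master, struct dma_chan *chan,
                         struct spi_transfer *xfer)
{
        struct device *dev = master->dev.parent;
        struct dma_async_tx_descriptor *desc;
        dma_addr_t addr;

        addr = dma_map_single(dev, (void *)xfer->tx_buf, xfer->len,
                              DMA_TO_DEVICE);
        if (dma_mapping_error(dev, addr))
                return -ENOMEM;

        desc = dmaengine_prep_slave_single(chan, addr, xfer->len,
                                           DMA_MEM_TO_DEV,
                                           DMA_PREP_INTERRUPT);
        if (!desc) {
                dma_unmap_single(dev, addr, xfer->len, DMA_TO_DEVICE);
                return -EIO;
        }

        desc->callback = sketch_dma_tx_done;
        desc->callback_param = master;

        /* Cookie/error checking omitted for brevity. */
        dmaengine_submit(desc);
        dma_async_issue_pending(chan);
        return 0;
}

The point being that none of the above cares which DMA controller is
behind the channel, which is the uniformity we get for free.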
> As said: I will keep it as is for now and first implement the full
> DMA chaining approach.  Then we can see how much the move to
> DMA-Engine is costing...

That's fine for experimentation, but for an actual merged API I think
it does need to be generally applicable so that it's worth the effort
for client drivers to support it.

> not the prime "candidates" for the improvement in the first place.
> If they want to run with minimal latencies, they HAVE to run
> spi_async.  Anything else is not very efficient as it requires
> several context switches by way of the current design.

Well, the benefits of submitting things asynchronously are as much
about giving the stack something to play with and reducing turnaround
time as anything else - it means the stack can submit the next transfer
to the hardware before it tells the caller that the current one
completed, which reduces the idle time on the bus and means there's
much more scope for reducing needless context switching.  Right now
we're not doing as good a job of that as we could in most drivers,
though frankly even the current state is an improvement on what we had.

> > The other bit of easy to implement that concerns me is making sure
> > that it's easy to implement for the stack as a whole - one of the big

> If you can get away from the idea that prepare is ONLY related to the
> Master, then it may provide additional "advantages".  Also when you
> do have a non-prepared message, you can still call the prepare code
> from the framework and afterwards unprepare it - like I had proposed.
> That way any "checks" the driver would need to do to see if it can
> handle the message, it can do immediately in the "prepare" code.

I'm having a really hard time getting excited about the improvements
from saving the checks, though - they just seem so small for the effort
by themselves.  You are right, they do exist, but they are tiny in
relation to the overall transfer time.  That said, if a patch were to
appear which did that and didn't provide any master side hooks I think
that'd probably be OK.

> whatever the driver may find helpful for its own purpose.  One
> candidate here could also be to implement dma_map_single in the
> prepare call...

Of course this reduces the range of situations where the calling driver
can reuse a prepared message, since it means that if they want to
access the data they have to unprepare and then reprepare the message,
which rather seems to defeat the point of having them worry about doing
that.  This is why I was asking about nailing down the semantics - I
was worried about this.
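To be concrete about why that worries me, a prepare hook that does the
mapping would look roughly like the sketch below (untested;
spi_prepare_sketch() is an invented name, not a proposed API).  Once
the transfers are mapped the device owns the buffers, so the caller
can't legitimately touch tx_buf or rx_buf again until the message is
unprepared:

#include <linux/dma-mapping.h>
#include <linux/spi/spi.h>

/* Map every transfer in the message for DMA as part of "prepare".
 * After this the caller must not touch the data buffers until the
 * message is unprepared (error unwinding omitted for brevity). */
static int spi_prepare_sketch(struct spi_device *spi,
                              struct spi_message *msg)
{
        struct device *dev = spi->master->dev.parent;
        struct spi_transfer *xfer;

        list_for_each_entry(xfer, &msg->transfers, transfer_list) {
                if (xfer->tx_buf) {
                        xfer->tx_dma = dma_map_single(dev,
                                        (void *)xfer->tx_buf, xfer->len,
                                        DMA_TO_DEVICE);
                        if (dma_mapping_error(dev, xfer->tx_dma))
                                return -ENOMEM;
                }
                if (xfer->rx_buf) {
                        xfer->rx_dma = dma_map_single(dev, xfer->rx_buf,
                                        xfer->len, DMA_FROM_DEVICE);
                        if (dma_mapping_error(dev, xfer->rx_dma))
                                return -ENOMEM;
                }
        }

        msg->is_dma_mapped = 1;
        return 0;
}

Which is fine for genuinely static messages, but it makes the "prepare
once, keep updating the data" pattern a lot less useful.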
> So separating things out immediately reduces the code of the
> individual drivers if they move to the newer interface of
> schedule_message...  if they do not need prepare/unprepare, then it
> is (almost) a NOP.

Except for all the stuff that actually configures the hardware on
configuration changes, if the register writes aren't DMAed (which is
going to be the common case unless it's frameworked).

> > Adding new interfaces is a change too, on the driver side as well.

> Which is optional and would get first used only by device-drivers
> that need high throughput.  And those device-drivers that need this
> should already be heavily optimized to avoid memory allocations (even
> if it is just on the heap) and message setup for every call.

Equally well, is it really worth their while if you don't get much win
without a specially optimised controller driver?  Extra advantages with
a good driver are one thing, but requiring one makes it a harder sell
to drivers using the API.

> > These seem rather artificial and I'm not sure they'd tell us much
> > about why the effects were being observed.  What would be more
> > interesting would be seeing where the time is being spent at the
> > minute for real loads (for example by using trace points like those
> > that are currently being added in mainline for examining the
> > behaviour of the message pump)

> well - my use case is exactly that CAN issue where I have message
> loss when only sending 5 packets in sequence in 30% of all cases with
> stock drivers.  That is why I started investigating the issue in the
> first place - so do not tell me it is academic!

Sure, there's room for improvement, but let's make that as general an
improvement as possible.  Something that flows better for existing
drivers is much more of a win there.  I'm getting alarm bells that the
gains for this one system might be hard to realise on other systems.

> The proposed SPI-test driver is academic to some extent, but it still
> tries to simulate "real" device traffic with optimal throughput.  Or
> do you think that reading an ADC as fast as possible is an academic
> exercise?  (the test driver would essentially mimic that and allow
> for scheduling mechanisms to work that allow high throughput (if you
> configure multiple messages)

I'm suspicious of benchmarks which are tuned to the point where you end
up running so fast you don't have time to actually do anything with the
data - you can end up tuning things so that more realistic users suffer
(spending too much time in interrupt context, for example).

> > - if we really are spending a noticeable proportion of the time in
> > setup then that's obviously a good case for optimising it, but
> > right now I don't feel I understand which bits are slow enough to
> > register in the context of the overall transaction and hence if the
> > suggested approach is the best way to optimise them.

> well - you see the setup I have to make to transfer 2 writes 10
> reads?  that is 10+ calls to dmapool_alloc, then filling in the
> structure.  That takes some time (can't remember how many usec right
> now).  And then tearing it down I need to do the same again - ok
> mostly dmapool_release...

Yes, but then this is because the driver is doing lots of really small
DMAs when most only DMA for larger transfers, and then only for the
data.  I can see there being a benefit if you have a big enough DMA
queue to be busy for extended periods, but that's not always going to
be the case, and then you get into wondering about how much you can
keep mapped and so on.  Like I say, I can see this being a technique
which is useful over a range of drivers, but there's going to be a
point where it's simpler to just do the I/O.

> > An implementation that avoided the need to have the master drivers
> > updated in order to get benefits would be a lot more interesting
> > since that would make this much more generally useful, and of
> > course concrete patches would be good too, but the main thing I'm
> > missing is clarity on how the gains are being made.

> a master driver will never be able to profit from the new interface,
> besides the "small" gain of not having to validate the messages every
> time in __spi_async (if the device-driver does call prepare)

What about all the DMA pipelining stuff for the register I/O and so on?
That seems to be the biggest part of the win here.

> And as said: all the spi_sync methods are candidates for this
> optimization - at the cost of some memory used for caching multiple
> messages.  (say having 16 times 1xspi_message+1xspi_transfer
> structures for transfer length 1 to 16 in spi_write.  Similar for
> spi_read.  Or 256 times 1xspi_message+2xspi_transfer to transfer 1-16
> writes and 1-16 reads, plus some memory copy for the reads and
> writes)

But then that's a memory plus cache management cost (per master), not
to mention complexity.
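For concreteness, I read the caching scheme as roughly the sketch below
(untested, all names invented for illustration), which is exactly where
that memory and copying cost comes from:

#include <linux/spi/spi.h>
#include <linux/string.h>

#define SKETCH_MAX_LEN  16

/* One preallocated message per write length from 1 to SKETCH_MAX_LEN. */
struct sketch_cached_write {
        struct spi_message      msg;
        struct spi_transfer     xfer;
        u8                      buf[SKETCH_MAX_LEN];
};

struct sketch_dev {
        struct spi_device               *spi;
        struct sketch_cached_write      write[SKETCH_MAX_LEN];
};

/* Build the cached messages once at probe time. */
static void sketch_cache_init(struct sketch_dev *sd)
{
        int i;

        for (i = 0; i < SKETCH_MAX_LEN; i++) {
                struct sketch_cached_write *w = &sd->write[i];

                spi_message_init(&w->msg);
                w->xfer.tx_buf = w->buf;
                w->xfer.len = i + 1;
                spi_message_add_tail(&w->xfer, &w->msg);
        }
}

/* spi_write() replacement that reuses the cached message for this
 * length, at the price of a memcpy() on every call. */
static int sketch_write(struct sketch_dev *sd, const void *data, size_t len)
{
        struct sketch_cached_write *w;

        if (!len || len > SKETCH_MAX_LEN)
                return -EINVAL;

        w = &sd->write[len - 1];
        memcpy(w->buf, data, len);
        return spi_sync(sd->spi, &w->msg);
}

Even for just the spi_write() case that's 16 preallocated messages plus
a copy on every call, before you get to the read and write+read
variants.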