linux-spi.vger.kernel.org archive mirror
* Fwd: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
       [not found]   ` <CACRpkdb6y=o4__snBs2DR1f=xW_u7KdkHg3fb7XN5e2gicBJeg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-10-29 16:59     ` Linus Walleij
       [not found]       ` <CACRpkdb7y88oq7XyVFc_0Nx4pXtaebPe7KB2yizBRJGwWLqJig-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Linus Walleij @ 2013-10-29 16:59 UTC (permalink / raw)
  To: linux-spi-u79uwXL29TY76Z2rM5mHXA, Mark Brown, Martin Sperl

Let us discuss this on the new mailing list, BTW.

On Mon, Oct 28, 2013 at 11:42 AM, Martin Sperl <martin-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> wrote:

> (... ) I thought of moving away from spi_transfer_one_message and back to
> the simpler transfer interface, where the preprocessing would get done
> (DMA control block-chain generation) and then appending it to the
> existing (possibly running) DMA chain.

OK quite a cool idea.

But I hope that you have the necessary infrastructure using the dmaengine
subsystem for this, or that any required changes will be proposed to it
first or together with these changes.

As you will be using dmaengine (I guess?) maybe a lot of this can
actually be handled directly in the core since that code should be
pretty generic, or in a separate file like spi-dmaengine-chain.c?

> But just yesterday I was looking through the code and came to the message:
> "master is unqueued, this is deprecated" (drivers/spi/spi.c Line 1167).
> This came in with commit ffbbdd21329f3e15eeca6df2d4bc11c04d9d91c0 and got included in 3.4.
>
> So I am wondering why you would deprecate this interface

Simply because none of the in-kernel users was doing what you are
trying to do now. And no one said anything about such future use cases,
so how could we know?

> Now this brings me to a different question:
> Could we implement some additional "functions" for preparing
> an SPI message (...)
> The interface could look something like this:
> int spi_prepare_message(struct spi_device *, struct spi_message *);
> int spi_unprepare_message(struct spi_device *, struct spi_message *);

Maybe? I cannot tell from the above how this would look so
I think it is better if you send a patch showing how this improves
efficiency.

Yours,
Linus Walleij


* Re: Fwd: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
       [not found]       ` <CACRpkdb7y88oq7XyVFc_0Nx4pXtaebPe7KB2yizBRJGwWLqJig-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-10-29 17:56         ` Mark Brown
  2013-10-29 21:18         ` Martin Sperl
  1 sibling, 0 replies; 57+ messages in thread
From: Mark Brown @ 2013-10-29 17:56 UTC (permalink / raw)
  To: Linus Walleij; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA, Martin Sperl


On Tue, Oct 29, 2013 at 09:59:23AM -0700, Linus Walleij wrote:
> Let us discuss this on the new mailing list, BTW.

And with the maintainer...  I have zero context for this so I'm not 100%
sure what the original proposal was.

> On Mon, Oct 28, 2013 at 11:42 AM, Martin Sperl <martin-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> wrote:
> 
> > (... ) I thought of moving away from spi_transfer_one_message and back to
> > the simpler transfer interface, where the preprocessing would get done
> > (DMA control block-chain generation) and then appending it to the
> > existing (possibly running) DMA chain.

> OK quite a cool idea.

How is this going to interact with /CS manipulation?  We need /CS to
rise and fall as requested by the client; if we coalesce adjacent
messages that's going to be an issue...

> As you will be using dmaengine (I guess?) maybe a lot of this can
> actually be handled directly in the core since that code should be
> pretty generic, or in a separate file like spi-dmaengine-chain.c?

FWIW this is one of the things I've been aiming to tackle in my ongoing
refactoring. The initial two things were making the core able to do the
DMA mapping (probably adding a transfer API which passes in an sg_list; it
can then handle things like blocks that aren't physically contiguous and
coalesce scatter/gather for the controller) and trying to overlap the
cache flushing of the next transfer with the I/O for the current
transfer.
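
Purely as an illustration of the direction (no such core API exists yet, and
the names below are made up), the per-transfer mapping in the core could look
roughly like:

	/* core-side mapping of a transfer described as a scatterlist */
	int nents = dma_map_sg(master->dev.parent, xfer_sgl, xfer_nents,
			       is_tx ? DMA_TO_DEVICE : DMA_FROM_DEVICE);
	if (!nents)
		return -ENOMEM;
	/* hand xfer_sgl/nents on to the controller driver or dmaengine, which
	 * can then coalesce entries if the hardware supports gather */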

> > Now this brings me to a different question:
> > Could we implement some additional "functions" for preparing
> > an SPI message (...)
> > The interface could look something like this:
> > int spi_prepare_message(struct spi_device *, struct spi_message *);
> > int spi_unprepare_message(struct spi_device *, struct spi_message *);

> Maybe? I cannot tell from the above how this would look so
> I think it is better if you send a patch showing how this improves
> efficiency.

The current SPI API has exactly those functions...
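
(For reference, the hooks that landed in the SPI tree look roughly like this
in struct spi_master - note that they take the master rather than the device:)

	int	(*prepare_message)(struct spi_master *master,
				   struct spi_message *message);
	int	(*unprepare_message)(struct spi_master *master,
				     struct spi_message *message);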



* Fwd: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
       [not found] ` <F70C4469-6325-48D5-A7CA-52CC1FC404BD-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>
@ 2013-10-29 19:09   ` Linus Walleij
  0 siblings, 0 replies; 57+ messages in thread
From: Linus Walleij @ 2013-10-29 19:09 UTC (permalink / raw)
  To: linux-spi-u79uwXL29TY76Z2rM5mHXA, Mark Brown; +Cc: Martin Sperl

Here is Martin's original message.

Yours,
Linus Walleij


---------- Forwarded message ----------
From: Martin Sperl <martin-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>
Date: Mon, Oct 28, 2013 at 11:42 AM
Subject: Depreciated spi_master.transfer and "prepared spi messages"
for an optimized pipelined-SPI-DMA-driver
To: spi-devel-general-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org
Cc: grant.likely-s3s/WqlpOiPyB63q8FvJNQ@public.gmane.org, broonie-yzvPICuk2AATkU/dhu1WVueM+bqZidxxQQ4Iyu8u01E@public.gmane.org,
linus.walleij-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org


Hi!

I am currently writing an SPI driver for the Raspberry Pi that relies on
DMA alone for the whole transfer.
It can already handle a FULL spi_message with multiple transfers,
CS_CHANGE, and speed_hz changes purely by using DMA, so it is using the
DMA to the max...

Right now (for ease of "development") the driver is still using
the transfer_one_message interface.
But unfortunately this interface introduces gaps between
individual messages because of the latencies in the chain: interrupt ->
wake up spinlock -> schedule message_pump thread.

So as the DMA driver is already handling a single SPI transfer very
efficiently, there is also an option to chain multiple SPI transfers
in the DMA engine and just get interrupts for completion.

This is why I thought of moving away from spi_transfer_one_message and
back to the simpler transfer interface, where the preprocessing would
get done (DMA control block-chain generation) and then appending it to
the existing (possibly running) DMA chain.

But just yesterday I was looking through the code and came to the
message: "master is unqueued, this is deprecated" (drivers/spi/spi.c
Line 1167).
This came in with commit ffbbdd21329f3e15eeca6df2d4bc11c04d9d91c0 and
got included in 3.4.

So I am wondering why you would deprecate this interface - I agree
that for most drivers moving to the transfer_one_message interface is
the way forward and reduces code size and bugs.
And this deprecation message will hopefully motivate driver authors
to rewrite their code to make use of the new interface.

But there may be some drivers that would really benefit from using the
more complex interface without it getting deprecated and at some
point possibly removed.
And I believe that the pipelined DMA driver is a good candidate to
show that under some circumstances the transfer interface may still be
the better solution...
Obviously I could work around the inner workings of the message-pump by
massaging the data before some of the calls, which I would say is
even more deprecated!

Now this brings me to a different question:
Could we implement some additional "functions" for preparing an SPI
message, to reduce the overhead of repeatedly computing an spi_message
by keeping a "prepared" transaction that we just need to schedule?
Changing the data would not be a problem, but changing the data
block addresses would be a structural change that requires recalculating
the prepared state.

The interface could look something like this:
int spi_prepare_message(struct spi_device *, struct spi_message *);
int spi_unprepare_message(struct spi_device *, struct spi_message *);

together with an additional object (void *prepared) in the spi_message
structure for keeping such prepared data...
OK - the existing queue and state fields could be abused for this, but - if I
read the comments correctly - this data is only guaranteed to be
available to the driver between spi_async and the corresponding
callback (outside of that window it is used by the message pump and gets
cleared in the spi_finalize_current_message function).
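
For illustration, a client driver might use such an interface roughly like
this (a sketch only - spi_prepare_message()/spi_unprepare_message() and the
my_* helpers are hypothetical names):

	struct spi_message *msg = my_build_message(spi);	/* hypothetical setup */
	int ret;

	ret = spi_prepare_message(spi, msg);	/* compute the DMA chain once */
	if (ret)
		return ret;

	while (my_more_work()) {
		/* payload bytes may change, structure/addresses may not */
		my_fill_tx_buffers(msg);
		spi_async(spi, msg);		/* reuses the prepared chain */
		/* ... wait for the message's completion callback ... */
	}

	spi_unprepare_message(spi, msg);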


Thanks,
              Martin


* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
       [not found]       ` <CACRpkdb7y88oq7XyVFc_0Nx4pXtaebPe7KB2yizBRJGwWLqJig-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2013-10-29 17:56         ` Mark Brown
@ 2013-10-29 21:18         ` Martin Sperl
       [not found]           ` <06C7F4D3-EC91-46CF-90BE-FC24D54F2389-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>
  1 sibling, 1 reply; 57+ messages in thread
From: Martin Sperl @ 2013-10-29 21:18 UTC (permalink / raw)
  To: Linus Walleij; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA, Mark Brown

Hi!

> But I hope that you have the necessary infrastructure using the dmaengine
> subsystem for this, or that any required changes will be proposed to it
> first or together with these changes.
> 
> As you will be using dmaengine (I guess?) maybe a lot of this can
> actually be handled directly in the core since that code should be
> pretty generic, or in a separate file like spi-dmaengine-chain.c?

I have to admit I have not been using that infrastructure so far - I am a bit uncertain how to make it work yet, and I first wanted the prototype working.

Also the kernel on the Raspberry Pi (on which I do the development) is not fully upstream (I know) and some parts are not really implemented.
I know of people who are only using the upstream kernel on an RPI, but there are other limitations there with regard to some drivers. Also there would be a bit of experimentation required to get it into working order - time I do not want to spend at the moment...

Finally I am also not sure how dmaengine can work when 2 DMAs are required to work in parallel (one for RX, the other for TX).
All this means that I have to schedule the RX DMA from the TX DMA and vice versa - in addition to configuring the SPI registers via DMA and working around those HW bugs.

Whether the DMA engine would be able to support all of this, I have no idea yet.

But with all that said, I can do a single spi_message with 5 transfers, 2 of which have CS_CHANGE, using only DMA without any interrupts - besides the final interrupt to trigger the wakeup of the spi_message_pump thread.

So obviously from there it is not that much more complicated to coalesce multiple spi_messages into a single running DMA chain - it would just be adding the additional transfers to the DMA chain and making sure the DMA is still running after we have added them.

The complication then comes mostly in the form of memory management (especially releasing DMA control blocks), locking, ... - which one does not have to take too much care of with the transfer_one_message interface.

So to put it into perspective:

My main goal is to get an efficient CAN driver for the mcp2515 chip which sits on the SPI bus of the RPI.
I can right now receive about 3200 messages per second from the CAN bus (close to the absolute maximum for messages 8 bytes in length - I could more than double the packet count by reducing the packet size to 0) with this driver, plus a version of the mcp2515 driver that uses the async interface and advanced scheduling of messages to "leverage" concurrency - also self-written.

With the "stock" spi-bcm2835 driver that is upstream it uses around 50k interrupts and a similar number of context switches, and still loses packets.
With the current incarnation of the spi-bcm2835dma driver (using the transfer_one_message interface) I run at around 16500 interrupts/s and 22000 context switches.
Still, what is biting me the most from the transfer perspective is the fact that there are still too many interrupts and context switches, which introduces too much latency unnecessarily.

So with spi-bcm2835dma using the transfer interface I would estimate that I would get down to 6400 interrupts and 0 context switches. All this should also have a positive impact on CPU utilization - no longer 80% system load due to scheduling/DMA overhead...

As for prepare_spi_message - I was asking for something different than prepare_transfer_hardware (unless there is new code in 3.12 that already includes that - or it is in a separate tree).
The latter prepares the HW in some way - say waking up a separate thread, ...

What I would like to see is similar to prepared statements in SQL: prepare the DMA control blocks once for a message (you may not change the structure/addresses, but you may change the data you are transferring). Then, when the message gets submitted via spi_async and on to the driver via transfer, the driver makes use of the prepared DMA chain and simply attaches it to the DMA queue. This removes the need to calculate those DMA control blocks every time - in my case above I run the same computations 3200 times/s - including dma_pool_alloc/dma_pool_free/... Obviously it ONLY makes sense for spi_transfers that have is_dma_mapped=1 - otherwise I would have to go through the loops of dma_map_single_attrs every time...
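
To make the is_dma_mapped case concrete, the client would pre-map its buffers
once, roughly like this (sketch only; the device used for the mapping and the
buffer/length names are assumptions):

	struct device *dmadev = spi->master->dev.parent;	/* controller device, assumed */

	xfer->tx_buf = tx_buf;
	xfer->tx_dma = dma_map_single(dmadev, tx_buf, len, DMA_TO_DEVICE);
	xfer->rx_buf = rx_buf;
	xfer->rx_dma = dma_map_single(dmadev, rx_buf, len, DMA_FROM_DEVICE);
	msg->is_dma_mapped = 1;	/* bus addresses stay fixed for the life of the message */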

So this is to fill in the context for my questions regarding why the "transfer" interface is deprecated and to provide a rationale for why I would want to use it.

Ciao,
		Martin

P.s: and for completeness: yes, I can do speed_hz and delay_usecs on a per-spi_transfer basis as well in the DMA loop - probably coming closer to the "real" requested delay than the sequence interrupt -> interrupt handler -> wakeup pump_thread (inside the transfer_one_message handler) -> process the other arguments of the xfer that are applied after the transfer -> udelay(xfer->delay_usecs).
On the RPI something like this takes about 0.1ms from one message to the next to get delivered - OK, it includes some overhead (like calculating the DMA control blocks), but it still shows the order of magnitude you can expect when you have to wait for the message pump to get scheduled (with realtime priority).
So a delay_usecs=100 would already be in the range of what we would see naturally from the design for a low-power device with a single core - and that would result in waiting way too long. While with DMA I can get the timing correct to +-5% of the requested value (jitter due to other memory transfers that block the memory bus - but this can possibly get )

I will not even mention that I believe this SPI-DMA driver opens up the possibility to read two 24-bit ADCs at a rate of 200k with minimal jitter with this simple hardware at 20MHz SPI bus speed (some calibration may be required...).




* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
       [not found]           ` <06C7F4D3-EC91-46CF-90BE-FC24D54F2389-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>
@ 2013-10-29 22:53             ` Mark Brown
       [not found]               ` <20131029225353.GB11424-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>
  2013-10-30  8:40             ` Martin Sperl
  1 sibling, 1 reply; 57+ messages in thread
From: Mark Brown @ 2013-10-29 22:53 UTC (permalink / raw)
  To: Martin Sperl; +Cc: Linus Walleij, linux-spi-u79uwXL29TY76Z2rM5mHXA


On Tue, Oct 29, 2013 at 10:18:55PM +0100, Martin Sperl wrote:

Please fix your mailer to word wrap within paragraphs, it will greatly
improve readability of your messages.

> Finally I am also not sure how dmaengine can work when 2 DMAs are
> required working in parallel (one for RX the other for TX).  All this
> means that I have to schedule the RX DMA from the TX-DMA and vice
> versa - in addition to configuring the SPI registers via DMA and
> working arround those HW-bugs.

There are plenty of examples of this in mainline for SPI and other
subsystems; mostly this is fairly simple to implement - for example just
enable the DMAs prior to enabling the SPI controller, then the SPI
controller triggers the actual transfers.
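
Roughly the usual dmaengine pattern for this (a sketch with assumed channel
names and sg lists, slave config and error handling omitted):

	struct dma_chan *rx = dma_request_slave_channel(dev, "rx");
	struct dma_chan *tx = dma_request_slave_channel(dev, "tx");
	struct dma_async_tx_descriptor *rxd, *txd;

	rxd = dmaengine_prep_slave_sg(rx, rx_sgl, rx_nents,
				      DMA_DEV_TO_MEM, DMA_PREP_INTERRUPT);
	txd = dmaengine_prep_slave_sg(tx, tx_sgl, tx_nents,
				      DMA_MEM_TO_DEV, 0);
	rxd->callback = spi_dma_done;		/* completion handler, assumed name */
	dmaengine_submit(rxd);
	dmaengine_submit(txd);
	dma_async_issue_pending(rx);
	dma_async_issue_pending(tx);
	/* ...and only then enable the SPI controller's DMA requests */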

> But with all that said, I can do a single SPI_MESSAGE with 5 transfers
> ,  2 of which have CS_CHANGE only with DMA without any interrupts -
> besides the final interrupt to trigger the wakeup of the
> spi_message_pump thread.

How exactly does this work in concrete terms?  I'm particularly
interested in how the /CS manipulation is done.  The basic idea doesn't
sound terribly different to things a lot of devices can do with driving
the transfers from the IRQ handler (possibly hard IRQ handler) without
waking the tasklet, I'd hope we can make this work as a special case of
that.

> As for the prepare_spi_message - I was asking for something different
> than prepare_transfer_hardware (unless there is new code there in 3.12
> that already includes that - or it is in a separate tree).

You should be working against the latest development code, we're almost
at the merge window for v3.13, the v3.12 code is about 3 months old at
this point.  Also bear in mind that this is open source, you can extend
the frameworks if they don't do what you want.



* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
       [not found]           ` <06C7F4D3-EC91-46CF-90BE-FC24D54F2389-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>
  2013-10-29 22:53             ` Mark Brown
@ 2013-10-30  8:40             ` Martin Sperl
       [not found]               ` <02BFF0F6-3836-4DEC-AA53-FF100E037DE9-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>
  1 sibling, 1 reply; 57+ messages in thread
From: Martin Sperl @ 2013-10-30  8:40 UTC (permalink / raw)
  To: Linus Walleij; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA, Mark Brown

Maybe one more argument for the spi_prepare_message interface:

For my use case it takes about 0.053ms to get from CS disabled + interrupt + interrupt handler + wakeup to the point where I am starting to do the "real work" - in my case preparing DMA chains.
And this then takes - depending on the number of transfers and features requested for this message - another 0.026ms to compute the DMA chain.
And then finally another 0.001ms to the point where CS really gets "driven low" by the DMA.

So if we can "optimize" away this "CPU task" by preparing the messages, then we are faster and there is a definite saving of CPU cycles...

Some quick analysis shows that dma_pool_alloc takes quite a bit of the time out of those (now 0.029ms - due to an additional CS up/down):
* total time spent to create the chain of 48 DMA control blocks: 29us
* time spent in dma_pool_alloc: 9.1us
* the same, but also including the "initialization" of the control blocks: 18.6us

So the setup of DMA is definitely expensive in itself.

But if I compare with spi-bcm2835 "apples to apples", I get:
* spi-bcm2835dma.c (from start of the DMA calculation till final CS): 58us total
* spi-bcm2835.c (from CS down to CS up): 71us total

So there is an advantage to the DMA driver from the bus-efficiency perspective.
And it would be much bigger if we had the chains "prepared" already - in my example above it should go down to 27us, which is _almost_ optimal from the CPU perspective...

I can share some images from the logic analyzer if required for future reference.

Ciao,
		Martin

P.s: the spi_message presented in the above is of the following format:
* 4 spi_transfers
** 2 bytes write
** 2 bytes read+CS_CHANGE
** 2 bytes write
** 13 bytes read (implicit CS_change because of end of message)

also my test-workload produces the following statistics on vmstat:
on spi-bcm2835: 79000 interrupts/s and 22000 context-switches and 70% System load
on spi-bcm2835dma: 15700 interrupts/s and 20000 context-switches and 75% System load





* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
       [not found]               ` <20131029225353.GB11424-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>
@ 2013-10-30 17:11                 ` Martin Sperl
       [not found]                   ` <F64AD25A-C7EC-4A0D-9169-850C12F4D8A3-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Martin Sperl @ 2013-10-30 17:11 UTC (permalink / raw)
  To: Mark Brown; +Cc: Linus Walleij, linux-spi-u79uwXL29TY76Z2rM5mHXA


>> But with all that said, I can do a single SPI_MESSAGE with 5 transfers
>> ,  2 of which have CS_CHANGE only with DMA without any interrupts -
>> besides the final interrupt to trigger the wakeup of the
>> spi_message_pump thread.
> 
> How exactly does this work in concrete terms?  I'm particularly
> interested in how the /CS manipluation is done.  The basic idea doesn't
> sound terribly different to things a lot of devices can do with driving
> the transfers from the IRQ handler (possibly hard IRQ handler) without
> waking the tasklet, I'd hope we can make this work as a special case of
> that.
> 

Essentially what the DMA chain looks like is:
* if the clock divider changed:
   write 4 bytes to the CS register, resetting TX/RX
* write 4 bytes to the CS register, resetting TX/RX
* write 4 bytes to the LEN register, configuring the length of the next transfer
* write 4 bytes to the CS register, setting up the transfer via DMA (for the length above)
* write to the DMA-RX control block the next "target" control block for the RX DMA
* write to the DMA-RX control register to start the DMA reading the next control block
* add the following transfers in an inner loop, appending similar transfers
  to the RX/TX DMA queues until we reach the end of the xfers
  or we reach a reconfigured xfer (cs_change, delay_usecs);
  speed_hz is slightly different - we need to do a look-ahead in that case
  and abort as well
* note that the TX queue stops after the last transfer
  - we continue on the RX DMA chain
* schedule some "idle" DMA cycles on the RX queue to wait until the clock is finished
* if delay_usecs is set, schedule some more idle cycles on the RX queue
* if cs_change, write 4 bytes to the CS register resetting the "transfer" flags,
  and also add some idle cycles so that CS stays high for half a clock cycle
* write to the DMA-TX control block the next "target" control block for the TX DMA
* write to the DMA-TX control register to start the DMA reading the next control block
* create a transfer that keeps us informed which message, which transfer,
  and how many bytes we have transferred
* if not finished and there are more xfers, go back to the first step
* the last message gets the "send IRQ" flag as a final step
  (if a callback is set on the message)
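
Roughly, each of the steps above is one 32-byte hardware control block, and the
chain is just control blocks pointing at each other. A sketch of the layout as I
use it (field names are paraphrased from the BCM2835 documentation, and the
register offset/scratch-word names below are assumptions - check the datasheet):

struct bcm2835_dma_cb {		/* one control block, 32-byte aligned */
	u32 ti;			/* transfer information, e.g. INTEN for the final IRQ */
	u32 src;		/* source bus address */
	u32 dst;		/* destination bus address */
	u32 length;		/* transfer length in bytes */
	u32 stride;		/* 2D stride - unused here */
	u32 next;		/* bus address of the next control block, 0 = stop */
	u32 pad[2];
};

/* e.g. "write 4 bytes to the LEN register" is one such block, chained on: */
cb->ti     = 0;					/* plain 4-byte copy, no DREQ pacing */
cb->src    = cb_scratch_bus_addr;		/* scratch word holding the length value */
cb->dst    = spi_base_bus_addr + SPI_LEN_OFFSET;/* SPI register address (assumed name) */
cb->length = 4;
cb->next   = next_cb_bus_addr;			/* appending = patching this field of the tail */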

So it is a bit complicated, but this is the way it works in principle, and this is why
I need some CPU cycles to set it up - and for some of the slightly more
complicated messages I need a total of about 60 control blocks.
But - hey: it works!
And it is the reason why I am asking for a "prepared" SPI-message option
to reduce this kind of overhead and get down to the "real" transfer costs that are
handled by HW...

I believe in most drivers, spi_messages are potentially "prerendered"
if speed is important and are not created and torn down every time.
Otherwise - setting up the spi message itself becomes very computationally
intensive...

So adding this additional call "spi_message_prepare" would come cheap...


>> As for the prepare_spi_message - I was asking for something different
>> than prepare_transfer_hardware (unless there is new code there in 3.12
>> that already includes that - or it is in a separate tree).
> 
> You should be working against the latest development code, we're almost
> at the merge window for v3.13, the v3.12 code is about 3 months old at
> this point.  Also bear in mind that this is open source, you can extend
> the frameworks if they don't do what you want.

That is a bit more effort for me - if the RPI came with upstream code already,
then it would be much easier, but it does not...
Device-tree came just before the RPI tree could get integrated, so I am left
somewhat behind, because my main project relies on things that are not
upstream and might take a long time to get there (RPI camera, ...).

Still I want to get these things upstream, because I hope that it will
eventually happen...

So I will try to write this "spi_message_prepare interface" and provide a patch 
for inclusion, so that it may filter down again to the RPI in due time...

Ciao,
	Martin


* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
       [not found]               ` <02BFF0F6-3836-4DEC-AA53-FF100E037DE9-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>
@ 2013-10-30 17:19                 ` Mark Brown
       [not found]                   ` <20131030171902.GL2493-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Mark Brown @ 2013-10-30 17:19 UTC (permalink / raw)
  To: Martin Sperl; +Cc: Linus Walleij, linux-spi-u79uwXL29TY76Z2rM5mHXA


On Wed, Oct 30, 2013 at 09:40:21AM +0100, Martin Sperl wrote:

> Maybe one more argument for the spi_prepare_message interface:

Did you see my replies yesterday - I mentioned that we already have this
interface in mainline?



* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
       [not found]                   ` <20131030171902.GL2493-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>
@ 2013-10-30 18:33                     ` Martin Sperl
       [not found]                       ` <8D8B0BAD-0E00-4147-B4C8-FBB18F060C96-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Martin Sperl @ 2013-10-30 18:33 UTC (permalink / raw)
  To: Mark Brown; +Cc: Linus Walleij, linux-spi-u79uwXL29TY76Z2rM5mHXA

Hi Mark!

I have seen your email and I have checked 3.12-rc7:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/include/linux/spi/spi.h?id=refs/tags/v3.12-rc7
(assuming this is the latest, but I would say that Linus' tree would be
authoritative)

What I have found is:
int (*prepare_transfer_hardware)(struct spi_master *master);

This one I have mentioned before, but it prepares the hardware
(spi_master) every time we start after an empty queue,
which does NOT do what I was asking for.
It is useful for starting up clocks and other power-management tasks...

It does not prepare an spi_message once, allowing it to get processed faster
on calls to spi_async(...)!

So can you tell me which interface you are talking about that is supposed 
to be in the mainline, which would solve my needs?

I would need an interface addition similar in intent to this:

The following would go into struct spi_master:
  int (*prepare_message)(struct spi_device *dev, struct spi_message *mesg);
  int (*unprepare_message)(struct spi_device *dev, struct spi_message *mesg);

static int spi_message_prepare(struct spi_device *spi, struct spi_message *msg)
{
	return spi->master->prepare_message ?
		spi->master->prepare_message(spi, msg) : 0;
}

static int spi_message_unprepare(struct spi_device *spi, struct spi_message *msg)
{
	return spi->master->unprepare_message ?
		spi->master->unprepare_message(spi, msg) : 0;
}

Plus some additional information in spi_message, similar to this:
  void *prepared;
  struct spi_device *prepared_for_device;

The second is more to avoid "problems" when drivers prepare for one SPI device
and then use the message on a different SPI device - mostly to croak about misuse of the
prepared message... If this field is in there, then the "unprepare" call could even
do without the spi_device argument, as it would already be stored in
the message itself...

Thanks,
		Martin



* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
       [not found]                   ` <F64AD25A-C7EC-4A0D-9169-850C12F4D8A3-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>
@ 2013-10-30 21:51                     ` Mark Brown
  0 siblings, 0 replies; 57+ messages in thread
From: Mark Brown @ 2013-10-30 21:51 UTC (permalink / raw)
  To: Martin Sperl; +Cc: Linus Walleij, linux-spi-u79uwXL29TY76Z2rM5mHXA


On Wed, Oct 30, 2013 at 06:11:45PM +0100, Martin Sperl wrote:

> * schedule some "idle" DMA cycles to RX queue to wait until the clock is finished

I don't follow this, sorry.  What does 'schedule some "idle" DMA cycles
to RX queue' involve?  Please remember that I've never seen anything
about this specific hardware before...

> * if cs_change write 4 bytes to CS-register resetting the "transfer flags
>   also add some idle cycles so that CS does stay high for half a clock cycle

If you're writing to registers doesn't that mean you need to do this
after the transfer has completed rather than having the hardware do
things autonomously (which seemed to be what you were describing)?

With the exception of these 'idle cycles' this all sounds like fairly
standard hardware with a DMA controller capable of chaining transfers
for scatter/gather.

> > You should be working against the latest development code, we're almost
> > at the merge window for v3.13, the v3.12 code is about 3 months old at
> > this point.  Also bear in mind that this is open source, you can extend
> > the frameworks if they don't do what you want.

> That is a bit more effort for me - if the RPI came with upstream code already,
> then it would be much easier, but it does not...
> Device-tree came just before the RPI tree could get integrated, so I am left 
> somewhat behind, because my main project relies on things that are not 
> upstream and might take a long (RPI camera,...)

> Still I want to get these things upstream, because I hope that it will
> eventually happen...

Device tree is totally orthogonal to any of this, this is all about the
SPI core.  I can't think of anything in the SPI core that is even
particularly dependent on advanced changes in the rest of the kernel,
you could probably just drop a new version into an old kernel with
minimal fixups for integration with the old kernel.



* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
       [not found]                       ` <8D8B0BAD-0E00-4147-B4C8-FBB18F060C96-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>
@ 2013-10-30 21:57                         ` Mark Brown
       [not found]                           ` <20131030215716.GV2493-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Mark Brown @ 2013-10-30 21:57 UTC (permalink / raw)
  To: Martin Sperl; +Cc: Linus Walleij, linux-spi-u79uwXL29TY76Z2rM5mHXA


On Wed, Oct 30, 2013 at 07:33:30PM +0100, Martin Sperl wrote:

> I have seen your email and I have checked 3.12-rc7:
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/include/linux/spi/spi.h?id=refs/tags/v3.12-rc7
> (Assuming this is the latest, but I would say that the version of Linus would 
> be authoritative)

No, look in -next (or the SPI tree which you can see in MAINTAINERS):

  git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi.git

> The following would go into struct spi_master:
>   int (*prepare_message)(struct spi_device *dev,struct spi_message *mesg);
>   int (*unprepare_message)(struct spi_device *dev,struct spi_message *mesg);

Which, like I say, is exactly what's there in terms of a driver API.



* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
       [not found]                           ` <20131030215716.GV2493-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>
@ 2013-10-30 22:52                             ` Martin Sperl
       [not found]                               ` <3342FD19-4438-463B-89B2-A83D3475AC22-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Martin Sperl @ 2013-10-30 22:52 UTC (permalink / raw)
  To: Mark Brown; +Cc: Linus Walleij, linux-spi-u79uwXL29TY76Z2rM5mHXA

Hi Mark!

After a lot of digging I found the API you are referring to in:
http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/tree/include/linux/spi/spi.h

So when will this go in?

How would you recommend connecting your prepared structure to a specific message?
There is no flag/field in spi_message to link the two.

Would this mean that I would keep an internal list with the address of the SPI message
and my "prepared data", and I need to iterate over that list every time I need it?
This search might - worst case - become a scaling issue if too many prepared
statements are created...
OK, there are other constructs for storing data that make lookup faster,
but does this not make the driver more complex than necessary?

It could lead to all drivers starting "simple", and then at some point it becomes
an issue as one driver makes heavy use of this and it shows up as a bottleneck,
and then all SPI bus drivers would need to get touched to optimize things...

Why not provide a lookup function for the prepared data immediately and make
sure the drivers use this one? Then we only have to change it later in one place
and reduce code replication across drivers.
(Obviously you would also need put/delete methods to make that interface complete.)

Also those function pointers do not have wrapper functions for ease of use,
so that they can get used in the individual drivers that know that they can
prepare the spi_message, as they do not change the transfer structure...

Thanks,
		Martin




* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
       [not found]                               ` <3342FD19-4438-463B-89B2-A83D3475AC22-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>
@ 2013-10-31  0:10                                 ` Mark Brown
       [not found]                                   ` <20131031001004.GW2493-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Mark Brown @ 2013-10-31  0:10 UTC (permalink / raw)
  To: Martin Sperl; +Cc: Linus Walleij, linux-spi-u79uwXL29TY76Z2rM5mHXA


On Wed, Oct 30, 2013 at 11:52:33PM +0100, Martin Sperl wrote:

Please fix your mailer to word wrap within paragraphs, it makes things
much more legible.

> After a lot of digging I found the API you are referring to in:
> http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/tree/include/linux/spi/spi.h

> So when will this go in?

During the v3.13 merge window.

> How would you recommend to connect your prepared structure to a specific  message?
> There is no flag/field in spi_message to link that.

I'm working on the basis that we should add what we need to the
structure as we add code using it, right now there's no real usage so
nothing to do.

> Would this mean that I would keep an internal list with the address of
> the SPI message any my "prepared data" and I need to iterate over that
> list every time I need it?  This search might - worsted case - become
> a scaling issue if too many prepared 
> statements are created... 

Like I say just add stuff to the core as needed.  If we're getting
scaling problems due to too many transfers in flight we can probably
arrange to limit how many we do.  Hopefully that won't be too much of an
issue, though - we're mostly concerned with the last few entries and
with the ones currently in flight with the hardware.

I'd like to see all this stuff done in the framework as much as
possible, like I say a lot of this is very common and even where we
can't offload everything to hardware we can push a lot down into hard
IRQ context which will help avoid scheduling time.  We should build
things up piece by piece though rather than trying to do everything in
one fell swoop.

My general plan had been to start off by factoring out the prepare and
unprepare of DMA transfers for dmaengine based devices, including things
like the cache flushing, then add support for coalescing scatter/gather
operations - right now I think only one driver can do any form of gather
though a large proportion of the hardware with DMA has that support.

> It could lead to all drivers starting "simple" and then at some point it becomes
> an issue as one driver makes heavy use of this showing it becomes a bottle-neck
> and then all spi-bus-drivers would need to get touched to optimize things...

That's why I want to get as much into the core as possible, like was
done with Linus' work on the queue.  dmaengine helps a lot here since
it means all the DMA controllers look the same and hopefully we can also
model the operations we want to do at the hardware interaction level so
all the fancy queue/transfer management stuff is shared in the core.

> Why not provide a lookup function for the prepared data immediately and make 
> sure the drivers use this one. Then we only have to change it later in one place
> and reduce code-replication across drivers.
> (obviously you would also need put/delete methods to make that interface complete.)

That's the sort of stuff I'm talking about, yes (though I'm not 100%
sure what the "prepared data" is here).  

> Also those function pointers do not have wrappers functions for ease of use, 
> so that they can get used in the individual drivers that know that they can 
> prepare the spi_message, as they do not change the transfer structure....

This is part of pushing things into the core - the idea is that we
should have little if any code using these things in the driver and
instead the core could have the code for working out when to do things.



* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
       [not found]                                   ` <20131031001004.GW2493-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>
@ 2013-11-04 17:33                                     ` Martin Sperl
       [not found]                                       ` <18639D9A-630E-44F3-AA7A-ADFF5D5E8B56-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Martin Sperl @ 2013-11-04 17:33 UTC (permalink / raw)
  To: Mark Brown, Linus Walleij
  Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA, Stephen Warren

Hi!

(after some emails that went offline)
I have created the "prepared" message interface for the spi-bcm2835dma driver
on the 3.10 kernel I am working off right now
(so I had to do a bit of "linking" trickery).

So here is a link to an example showing how much prepared spi_messages
really can improve the SPI throughput - even without changing anything
besides preparing the messages for direct DMA use.

http://www.raspberrypi.org/phpBB3/viewtopic.php?f=44&t=19489&p=448328#p448328
The link goes to the RPI forum and includes the measurements plus some images
showing both cases (configured via a module parameter for now)

Quick summary:
For my test use case, just by enabling "prepared messages" I have reduced the
time for a "simple" transfer from 390us (without prepare) to 230us (with
prepare), and the driver is still using the threaded message pump.

OK - the "default" spi-bcm2835.c driver currently in mainline takes 245us
for the same thing, but it runs at 130k interrupts/s and 95% system load,
which makes it quite impractical to do anything else.
So getting this down to 80% system load plus shorter responses is already
quite an improvement.

It is a drawback that the DMA driver needs more overhead for the unprepared
spi_message case. But then, if a driver that is performance critical
does not make use of prepared messages yet and there is a performance issue,
it needs to get modified to call spi_prepare_message during setup.

Anyway, if a driver cares about throughput, it had better avoid allocating
new memory to create (and release) an SPI message in the interrupt handler
or callback in the first place.
Then adding the single extra statement to prepare and unprepare the message
comes cheap.

There is one hard assumption:
each and every xfer has to be configured with a DMA address for source
and destination (unless NULL).
Obviously the message is "static" as soon as prepare has been called.

Otherwise the overhead of iterating over the messages and calling
dma_(un)map_single becomes the limiting factor, and the difference in code
(compared to creating the whole chain from scratch) is minimal besides
additional allocations for the memory - we have to "walk", we have to parse,
...

(Also think about how it works if memory itself is fragmented on the bus
addresses - depending on the alignment of the data a different number of
DMA transfers would be needed - thus it seems quite impractical to implement.)

So from my experience I would recommend adding something like this to spi.h,
so that it can get used for real and not in the export-linked way I had to use
for this proof of concept.

static int bcm2835dma_spi_prepare_message(struct spi_device *spi,
	struct spi_message *msg)
{
	if (spi->master->prepare_message)
		return spi->master->prepare_message(spi, msg);
	else
		return 0;
}

static int bcm2835dma_spi_unprepare_message(struct spi_device *spi,
	struct spi_message *msg)
{
	if (spi->master->unprepare_message)
		return spi->master->unprepare_message(spi, msg);
	else
		return 0;
}

and finally also some management functionality for "finding" those prepared
messages - better to implement something and make it opaque, rather than
every driver implementing its own thing and then having to track that.
My mcp2515a driver is already allocating something like 14 prepared messages,
so walking that list is not that expensive, but at some point a binary
tree might be better from the performance perspective.


struct spi_prepared_message {
        /* the list in which we store the message */
        struct list_head prepared_list;
        /* the identification data for matching */
        struct spi_device *spi;
        struct spi_message *message;
};

something like this:
struct list_head prepared_messages_list;
plus a spinlock protecting the above in spi_master

and the following prototypes:
static struct spi_prepared_message *spi_find_prepared_message_nolock(
        struct spi_device *spi,
        struct spi_message *message);
static struct spi_prepared_message *spi_find_prepared_message(
        struct spi_device *spi,
        struct spi_message *message);
static int spi_add_prepared_message(
        struct spi_prepared_message * prepared);
static struct spi_prepared_message *spi_remove_prepared_message(
        struct spi_device *spi,
        struct spi_message *message);

I got the above prototypes implemented (but with the data sitting in the 
master-private data structure, so I am not sure if it is worth sharing them)
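
For illustration, the lookup itself is essentially just a list walk; a minimal
sketch (assuming the list and lock live in spi_master as proposed above):

static struct spi_prepared_message *spi_find_prepared_message_nolock(
	struct spi_device *spi,
	struct spi_message *message)
{
	struct spi_prepared_message *prep;

	list_for_each_entry(prep, &spi->master->prepared_messages_list,
			prepared_list) {
		if (prep->spi == spi && prep->message == message)
			return prep;
	}
	return NULL;
}

static struct spi_prepared_message *spi_find_prepared_message(
	struct spi_device *spi,
	struct spi_message *message)
{
	struct spi_prepared_message *prep;
	unsigned long flags;

	spin_lock_irqsave(&spi->master->prepared_messages_lock, flags);
	prep = spi_find_prepared_message_nolock(spi, message);
	spin_unlock_irqrestore(&spi->master->prepared_messages_lock, flags);

	return prep;
}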

Maybe we can get those into the "next" tree so that it can get pulled into 3.13?

Thanks,
	 Martin

P.s:
Note: I have not switched to the dmaengine interface for the driver.
There are too many things I would need to learn to make that switch now.
Also (as far as Mark said during our offline discussions) there seem to be some
gaps that would require extending the DMA engine, which would make getting the
driver to work hard at first. Finally I fear (from my limited knowledge) that scheduling
via the DMA engine itself would require a prepare, and also I would have to keep
up with double allocations to link everything in place, further reducing the
throughput for the case of non-prepared drivers.

The driver itself can get found at: https://github.com/msperl/spi-bcm2835




* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
       [not found]                                       ` <18639D9A-630E-44F3-AA7A-ADFF5D5E8B56-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>
@ 2013-11-04 18:45                                         ` Mark Brown
       [not found]                                           ` <20131104184511.GR2493-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Mark Brown @ 2013-11-04 18:45 UTC (permalink / raw)
  To: Martin Sperl
  Cc: Linus Walleij, linux-spi-u79uwXL29TY76Z2rM5mHXA, Stephen Warren


On Mon, Nov 04, 2013 at 06:33:19PM +0100, Martin Sperl wrote:

> The driver itself can get found at: https://github.com/msperl/spi-bcm2835

Please post code/patches normally rather than linking to git, it's much
less likely that people will look at some random gitweb thing and much
harder to respond.



* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
       [not found]                                           ` <20131104184511.GR2493-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>
@ 2013-11-04 21:43                                             ` Martin Sperl
       [not found]                                               ` <5A55A832-5313-499C-A483-BF5A6649D69D-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Martin Sperl @ 2013-11-04 21:43 UTC (permalink / raw)
  To: Mark Brown
  Cc: Linus Walleij, linux-spi-u79uwXL29TY76Z2rM5mHXA, Stephen Warren

Hi Mark!

I just found out that I am in a chicken-and-egg situation:
There is no DMA support whatsoever in the upstream RPI support.
So sending any patches for SPI with DMA becomes impossible right now.

Setting up a system that runs the upstream kernel without patches will be
some effort on my side, so it may take quite some time for me to start.

Until then I will need to "work" with the tainted kernel from the
foundation using their special APIs, which you obviously can not accept
for moving things upstream.

That is why I just wanted to share with you what may get achieved with your 
new interface and what is missing from my perspective to make it work in
a practical manner. 

And that is why I have added only the link to the driver.

I am in a sort of situation that makes it hard to move in either direction.

I believe we all have our own interests:
* you want your prepared interface tested in a driver
* I want to get the spi-dma driver into mainline
* some other people want the dmaengine work for the RPI to get started,
  so that they can get on with real productive work on USB and other drivers.

All I can give you right now is feedback from my experience with this
out-of-kernel driver, which is not a candidate for merging in its current
state due to the limitations mentioned above.

The results look promising and will improve further!

The only thing I can think of that would help me (and others) in the long 
run is getting those missing parts of your prepare API into the kernel 
early to avoid code duplication (see earlier email with the basics).

By then I should have the driver up and running quite well on the Foundation RPI
kernel, and hopefully some people will also have tested it with more devices.

As soon as that is stable and I get a "vanilla" kernel running with the
improvements I have recommended from my experience, I can start on some basics
of dmaengine for the RPI (allocating and releasing DMA channels to start with,
keeping everything else in the driver for now), and then we can really start
talking about patches for the SPI driver to send.

Again: the dmaengine part would first need to be upstream before I assume you
would accept merging the driver, as the functionality would NOT work otherwise.
So we are talking about 3.15 at the very earliest...

So is the presented idea an approach we can move forward on?
Or do you have a better approach which we could follow to get us out of this
deadlock?

Thanks,
	Martin

P.S.: Here are the pieces of code for which I had earlier only shown the
prototypes. These are untested and I am not sure they will even compile as-is,
because I had to modify the driver-specific version here in this email to make
it work with the list/lock variables in the master structure - the
driver-specific code itself works. Also, there is no documentation for the
functions yet.

That is the best I can do for now.

/* generic wrapper functions for device drivers to use */
static int bcm2835dma_spi_prepare_message(struct spi_device *spi,
	struct spi_message *msg)
{
	if (spi->master->prepare_message)
		return spi->master->prepare_message(spi, msg);

	return 0;
}

static int bcm2835dma_spi_unprepare_message(struct spi_device *spi,
	struct spi_message *msg)
{
	if (spi->master->unprepare_message)
		return spi->master->unprepare_message(spi, msg);

	return 0;
}
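
For illustration only, a protocol driver could use these wrappers roughly as in
the following sketch; the mydev_* names and the private structure are made up
for this example and are not part of any real driver:

#include <linux/spi/spi.h>

/* hypothetical protocol driver private data - assumed to be zero
 * initialized (e.g. allocated with kzalloc) before use
 */
struct mydev_priv {
	struct spi_device *spi;
	struct spi_message msg;
	struct spi_transfer xfer;
	u8 tx[4];
	u8 rx[4];
};

static int mydev_setup_message(struct mydev_priv *priv)
{
	spi_message_init(&priv->msg);
	priv->xfer.tx_buf = priv->tx;
	priv->xfer.rx_buf = priv->rx;
	priv->xfer.len = sizeof(priv->tx);
	spi_message_add_tail(&priv->xfer, &priv->msg);

	/* prepare once at setup time; afterwards only the data inside
	 * tx[]/rx[] may change, not the structure of the message
	 */
	return bcm2835dma_spi_prepare_message(priv->spi, &priv->msg);
}

static void mydev_release_message(struct mydev_priv *priv)
{
	/* hand ownership back before the memory goes away */
	bcm2835dma_spi_unprepare_message(priv->spi, &priv->msg);
}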

In struct spi_master:
/* note that initialization of these two fields is still missing */
	struct list_head	prepared_messages_list;
	spinlock_t		prepared_messages_lock;

/* functions used to store/find/remove prepared objects
 * for now putting them in a simple list.
 * only bus drivers should really call these in their (un)prepare code.
 * later on it may be worthwhile changing this structure to
 * a binary tree - the big question is: how many prepared messages
 * may we expect on a single SPI bus?
 * alternatively we could also move it over to the spi_device structure.
 */

/* the structure to embed in the bus specific implementation */
struct spi_prepared_message {
	/* the list in which we store the prepared messages */
	struct list_head prepared_list;
	/* the identification data for searching the list */
	struct spi_device *spi;
	struct spi_message *message;
	/* maybe some search statistics, so that we can reorder/optimize
	 * the list to make the searches for the most commonly used
	 * messages faster by keeping them near the front
	 */
};

static struct spi_prepared_message *spi_find_prepared_message_nolock(
        struct spi_device *spi,
        struct spi_message *message)
{
        struct spi_master *master=spi->master;
        /* ideally this would be in master */
        struct spi_prepared_message *prepared;
        /* now loop the entries */
        list_for_each_entry(prepared, &master->prepared_messages_list, prepared_list) {
                /* if we match, then return */
                if ((prepared->spi==spi)
                        && (prepared->message==message))
                        return prepared;
        }
        /* return not found */
        return NULL;
}

static struct spi_prepared_message *spi_find_prepared_message(
        struct spi_device *spi,
        struct spi_message *message)
{
        struct spi_prepared_message *prepared;
        struct spi_master *master=spi->master;
        unsigned long flags;
	/* lock */
        spin_lock_irqsave(&master->prepared_messages_lock,flags);
        /* try to find it */
        prepared=spi_find_prepared_message_nolock(spi,message);
        /* and unlock and return */
        spin_unlock_irqrestore(&master->prepared_messages_lock,flags);
        return prepared;
}

static int spi_add_prepared_message(
        struct spi_prepared_message * prepared)
{
        struct spi_device *spi=prepared->spi;
        struct spi_master *master=spi->master;
        struct spi_message *message=prepared->message;
        unsigned long flags;
	/* lock */
        spin_lock_irqsave(&master->prepared_messages_lock,flags);
        /* find the entry and croak if it has been prepared before */
        if (spi_find_prepared_message_nolock(spi,message)) {
	        spin_unlock_irqrestore(&master->prepared_messages_lock,flags);
                dev_err(&spi->dev,"SPI message has already been prepared once\n");
                return -EPERM;
        }
        /* now add it to the list at tail*/
        list_add_tail(&prepared->prepared_list,&master->prepared_messages_list);

        /* unlock and return */
        spin_unlock_irqrestore(&master->prepared_messages_lock,flags);
        return 0;
}

struct spi_prepared_message *spi_remove_prepared_message(
        struct spi_device *spi,
        struct spi_message *message
        )
{
        struct spi_prepared_message *prep=NULL;
        struct spi_master *master=spi->master;
        unsigned long flags;
	/* lock */
        spin_lock_irqsave(&master->prepared_messages_lock,flags);
	/* find the entry */
        prep=spi_find_prepared_message_nolock(spi,message);
        /* now unlink the prepared item if found - otherwise croak */
        if (prep) {
                list_del(&prep->prepared_list);
	} else {
                dev_err(&spi->dev,"SPI message has not been prepared\n");
	}
        /* unlock and return the version removed from the list*/
        spin_unlock_irqrestore(&master->prepared_messages_lock,flags);
        return prep;
}
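
And, again only as an illustration of how the helpers above would be used, a
bus driver's prepare/unprepare callbacks could be built on them roughly like
the following sketch; the bcm2835dma_* callback names and the DMA-related
comments are assumptions, not the actual driver code:

static int bcm2835dma_prepare_message(struct spi_device *spi,
	struct spi_message *msg)
{
	struct spi_prepared_message *prep;
	int ret;

	/* prepare runs from process context at setup time */
	prep = kzalloc(sizeof(*prep), GFP_KERNEL);
	if (!prep)
		return -ENOMEM;
	prep->spi = spi;
	prep->message = msg;

	/* here the real driver would precompute its DMA control-block
	 * chain for msg and keep it in a driver structure embedding
	 * spi_prepared_message
	 */

	ret = spi_add_prepared_message(prep);
	if (ret)
		kfree(prep);
	return ret;
}

static int bcm2835dma_unprepare_message(struct spi_device *spi,
	struct spi_message *msg)
{
	struct spi_prepared_message *prep;

	prep = spi_remove_prepared_message(spi, msg);
	if (!prep)
		return -EINVAL;

	/* free the precomputed DMA control blocks here */
	kfree(prep);
	return 0;
}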


On 04.11.2013, at 19:45, Mark Brown wrote:

> On Mon, Nov 04, 2013 at 06:33:19PM +0100, Martin Sperl wrote:
> 
>> The driver itself can get found at: https://github.com/msperl/spi-bcm2835
> 
> Please post code/patches normally rather than linking to git, it's much
> less likely that people will look at some random gitweb thing and much
> harder to respond.

--
To unsubscribe from this list: send the line "unsubscribe linux-spi" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
       [not found]                                               ` <5A55A832-5313-499C-A483-BF5A6649D69D-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>
@ 2013-11-05  1:03                                                 ` Linus Walleij
  2013-11-06  9:48                                                 ` Mark Brown
  1 sibling, 0 replies; 57+ messages in thread
From: Linus Walleij @ 2013-11-05  1:03 UTC (permalink / raw)
  To: Martin Sperl; +Cc: Mark Brown, linux-spi-u79uwXL29TY76Z2rM5mHXA, Stephen Warren

On Mon, Nov 4, 2013 at 10:43 PM, Martin Sperl <martin-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> wrote:

> I just found out that I am in a chicken or egg situation:
> There is no DMA support whatsoever in the upstream RPI support.
> So sending any patches for spi with DMA becomes impossible right now.

I wrote in my first reply:

 "But I hope that you have the necessary infrastructure using the dmaengine
  subsystem for this, or that changes requires will be proposed to that
  first or together with these changes."

So now you are back at this point :-)

Anyway, doing dmaengine support is a venture of its own and, as usual,
it falls to whoever first needs it. I suggest you
look at some of the newer dmaengines in drivers/dma for inspiration,
and get memcpy in place first so you can test the driver stand-alone
with drivers/dma/dmatest.c

Yours,
Linus Walleij
--
To unsubscribe from this list: send the line "unsubscribe linux-spi" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
       [not found]                                               ` <5A55A832-5313-499C-A483-BF5A6649D69D-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>
  2013-11-05  1:03                                                 ` Linus Walleij
@ 2013-11-06  9:48                                                 ` Mark Brown
       [not found]                                                   ` <20131106094854.GF11602-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>
  1 sibling, 1 reply; 57+ messages in thread
From: Mark Brown @ 2013-11-06  9:48 UTC (permalink / raw)
  To: Martin Sperl
  Cc: Linus Walleij, linux-spi-u79uwXL29TY76Z2rM5mHXA, Stephen Warren

[-- Attachment #1: Type: text/plain, Size: 1080 bytes --]

On Mon, Nov 04, 2013 at 10:43:03PM +0100, Martin Sperl wrote:

> again: first the dma-engine part would need to be upstream before I assume
> you would accept to merge the driver, as the functionality would NOT work
> otherwise. So we talk about 3.15 at the very earliest...

Well, v3.14 should be possible - we're just coming up to the merge
window for v3.13 so v3.14 is still wide open.

> So is the presented idea an approach to move forward on?
> Or do you have a better approach which we could follow to get us out of this 
> tie?

I think the current approach seems fine.

> /* generic wrapper functions for device drivers to use */
> static int bcm2835dma_spi_prepare_message(struct spi_device *spi,
> 	struct spi_message *msg)
> {
> 	if(spi->master->prepare_message) {
> 		return spi->master->prepare_message(spi,msg);
> 	} else {
> 		return 0;
> 	}
> }

Like I said earlier I'm concerned about the idea of drivers calling
these functions directly, this should all be factored out of drivers
since there's nothing really driver specific here.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
       [not found]                                                   ` <20131106094854.GF11602-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>
@ 2013-11-06 11:28                                                     ` Martin Sperl
       [not found]                                                       ` <844EDAEA-3FDC-48D0-B59E-CECC0A83761E-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Martin Sperl @ 2013-11-06 11:28 UTC (permalink / raw)
  To: Mark Brown
  Cc: Linus Walleij, linux-spi-u79uwXL29TY76Z2rM5mHXA, Stephen Warren

Hi Mark!

I am currently setting up a second RPI for my work on upstream kernels.
This will still take some more time, as I can only access it via a serial
console, since there is no upstream USB driver (which I need for the on-board
network card to work)...

>> /* generic wrapper functions for device drivers to use */
>> static int bcm2835dma_spi_prepare_message(struct spi_device *spi,
>> 	struct spi_message *msg)
>> {
>> 	if(spi->master->prepare_message) {
>> 		return spi->master->prepare_message(spi,msg);
>> 	} else {
>> 		return 0;
>> 	}
>> }
> 
> Like I said earlier I'm concerned about the idea of drivers calling
> these functions directly, this should all be factored out of drivers
> since there's nothing really driver specific here.


Actually it is VERY driver specific!

The thing is that you cannot easily detect whether the same spi_message has
been reused and changed in some way without walking the structure and comparing
it to an existing checksum - and such a computation would take, say, a third of
the time it takes to prepare the message again.

Then, if it turns out to be different, you would first need to discard the
old prepared structure before creating the new one.

So this kind of heuristic increases the CPU load, resulting in unnecessary
computations for minimal benefit.
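
Just to make concrete what such a heuristic would have to do on every
submission, a change-detection pass would look roughly like the following
sketch (purely illustrative - the choice of fields to fold into the
fingerprint is an assumption):

static u32 spi_message_fingerprint(struct spi_message *msg)
{
	struct spi_transfer *xfer;
	u32 hash = 0;

	/* walk every transfer and mix in everything that would
	 * invalidate a cached preparation: buffer pointers, length,
	 * speed and word size
	 */
	list_for_each_entry(xfer, &msg->transfers, transfer_list) {
		hash ^= (u32)(uintptr_t)xfer->tx_buf;
		hash ^= (u32)(uintptr_t)xfer->rx_buf;
		hash = hash * 31 + xfer->len;
		hash = hash * 31 + xfer->speed_hz;
		hash = hash * 31 + xfer->bits_per_word;
	}
	return hash;
}

This is exactly the per-message walk whose cost I want to avoid.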

So there is a need to identify a message in the driver itself as
"preparable" (and thus conforming to some restrictions - like the ones I
mentioned earlier).

Whether you do this via a separate boolean in the spi_message structure or
via an explicit call from the driver does not, IMO, make much of a difference -
the code needs to be optimized for this anyway.

The advantage of the explicit call is that prepared messages can be released
immediately when the driver is unloaded. Otherwise they can only be released
when the master is unloaded (or via some LRU logic).

And if you do not release on driver unload, then in the worst case you could
load the driver again and the spi_message would get created at the same
location, but the xfers and their pointers might point elsewhere.
And that would _definitely_ be a problem. (You could add cookies,...
but then you would have even more computations.)

For spi_async all the above checks could happen in an IRQ context,
so the code-path should _really_ be optimized!

Anyway: this mostly applies to drivers that need to be fast and have high
throughput. And those would be the only ones that need "special" handling.

For those that just use the "simple" sync API (variations of
spi_write_then_read and similar), the framework could already have done the
prepare for its hot-path case.
That way those "simple" drivers can get a benefit as well.

I do not think it is worth coding "complex" heuristics, like
checking X times for a message to be "identical" and only then taking 
the short-cut. That can lead to some very strange situations to debug.
So as far as I am concerned it is better to explicitly say in the driver:
yes we can.

It might be worth thinking of extending the interface for "typical" transfer
cases so that more use-cases get handled automatically (with prepared code).
But that would require reviewing/rewriting existing drivers to make use of
those interfaces.

Finally: with the prepare/unprepare calls in the master structure, the
thing you risk is that each driver writes its own:
if (master->prepare_message)
	master->prepare_message(spi, message);
so better make that wrapper available...

Ciao,
	Martin

P.S.: Just looking at __spi_async already gives me some thoughts:
we are already iterating over the structure once for the sanity checks.
That takes time already - cycles that could be saved entirely for the
"prepared" message case by short-cutting those sanity checks and calling
master->transfer(spi, message) immediately.

So integrating the "search" for prepared statements here and skipping the
tests in this case would also improve the throughput...

Otherwise you would risk that a high-throughput driver directly calls
message->spi = spi; master->transfer(spi, message);
to avoid those unnecessary CPU cycles - especially if it is optimized.

Ok, you could reject the patch, but then you would ignore the root cause
WHY people do this!

One way around this could be to provide a default "prepare" implementation
that does the sanity checks on the structure.

That way any driver that does prepare its messages would get the benefit
regardless of whether the bus driver itself implements prepared messages or
not.

It could boil down to something like this:
(this would require a flag marking the message as prepared)

static int __spi_sanity_check_message(struct spi_device *spi,
	struct spi_message *message)
{
	/* message sanity check code here */
}

/* set prepare_message to this in spi_register_master,
 * if the master did not provide one (code not included here)
 */
int __spi_prepare_message_default(struct spi_device *spi,
	struct spi_message *message)
{
	int ret=__spi_sanity_check_message(spi,message);
	if (!ret)
		message->is_prepared=1;
	return ret;
}

static int __spi_async(struct spi_device *spi, struct spi_message *message)
{
	struct spi_master *master = spi->master;
	int ret = 0;

	/* the non-prepared path */
	if (!message->is_prepared) {
		/* sanity checks here */
		ret = __spi_sanity_check_message(spi, message);
		if (ret)
			return ret;
	}
	/* and the "normal" call of transfer */
	message->spi = spi;
	message->status = -EINPROGRESS;
	return master->transfer(spi, message);
}

--
To unsubscribe from this list: send the line "unsubscribe linux-spi" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
       [not found]                                                       ` <844EDAEA-3FDC-48D0-B59E-CECC0A83761E-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>
@ 2013-11-06 11:32                                                         ` Mark Brown
       [not found]                                                           ` <20131106113219.GJ11602-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Mark Brown @ 2013-11-06 11:32 UTC (permalink / raw)
  To: Martin Sperl
  Cc: Linus Walleij, linux-spi-u79uwXL29TY76Z2rM5mHXA, Stephen Warren

[-- Attachment #1: Type: text/plain, Size: 837 bytes --]

On Wed, Nov 06, 2013 at 12:28:15PM +0100, Martin Sperl wrote:

> >> /* generic wrapper functions for device drivers to use */
> >> static int bcm2835dma_spi_prepare_message(struct spi_device *spi,

> > Like I said earlier I'm concerned about the idea of drivers calling
> > these functions directly, this should all be factored out of drivers
> > since there's nothing really driver specific here.

> Actually it is VERY driver specific!

> The thing is that you can not easily detect if the same spi_message has 
> been used and in some way changed without having to walk and compare the
> structure to an existing checksum - and such computation would take say 
> 1/3rd of the time it takes to prepare the message again.

I'm sorry but the above makes no sense to me.  Why on earth would you
need to do these checks?

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
       [not found]                                                           ` <20131106113219.GJ11602-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>
@ 2013-11-06 12:10                                                             ` Martin Sperl
       [not found]                                                               ` <C6C68042-63A0-40FD-8363-B4553ECB4774-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Martin Sperl @ 2013-11-06 12:10 UTC (permalink / raw)
  To: Mark Brown
  Cc: Linus Walleij, linux-spi-u79uwXL29TY76Z2rM5mHXA, Stephen Warren

 
> I'm sorry but the above makes no sense to me.  Why on earth would you
> need to do these checks?

How can you check that the code does not do something like this?

struct spi_message msg;
struct spi_transfer a, b;
spi_message_init(&msg);
spi_message_add_tail(&a, &msg);
spi_sync(spi, &msg);
spi_message_add_tail(&b, &msg);
spi_sync(spi, &msg);

The above is an allowed sequence, that must work - otherwise we would 
break an existing API.

So assuming that the framework is executing those prepare calls itself
- say inside the spi_async function - how would it detect such a change
without checksumming the message?

Think of 10000 interrupts/s, each scheduling the identical SPI message.
If each check takes 5us to verify that the structure is identical to the last
call, then that alone sums up to 50ms (or 5% CPU) without doing anything
"helpful".

The explicit one-time call spi_prepare_message states that the driver
delegates ownership of the message object to the SPI bus master until
it takes ownership back with spi_unprepare_message. So it may not free
or change the structures in any way, but it may still use the message
(spi_async) and read/write the data bytes (not the buffer locations).

It may even delegate it to multiple masters, but may only use it on one
SPI bus at a time (this second part should be obvious, as it is already an
implicit requirement of the current API via the queue structure member; also,
two channels writing to the same memory location does not make sense - though
it theoretically would for transmit-only messages).

Martin








--
To unsubscribe from this list: send the line "unsubscribe linux-spi" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
       [not found]                                                               ` <C6C68042-63A0-40FD-8363-B4553ECB4774-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>
@ 2013-11-06 16:24                                                                 ` Mark Brown
       [not found]                                                                   ` <20131106162410.GB2674-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Mark Brown @ 2013-11-06 16:24 UTC (permalink / raw)
  To: Martin Sperl
  Cc: Linus Walleij, linux-spi-u79uwXL29TY76Z2rM5mHXA, Stephen Warren

[-- Attachment #1: Type: text/plain, Size: 1125 bytes --]

On Wed, Nov 06, 2013 at 01:10:32PM +0100, Martin Sperl wrote:

> How can you check that the code does not do something like this?

> struct spi_message msg;
> struct spi_transfer a,b;
> spi_message_init(&msg);
> spi_message_add_tail(&msg,&a);
> spi_sync(spi,&msg);
> spi_message_add_tail(&msg,&b);
> spi_sync(spi,msg);

> The above is an allowed sequence, that must work - otherwise we would 
> break an existing API.

This would be broken anyway with existing drivers; if we want to support
that we need to take copies of both the messages and the transfers but
really that just looks crazy.

> The explicit one time call spi_prepare_message states that the driver
> delegates ownership over the message object to the SPI bus-master until 
> it takes ownership back with spi_unprepare_message. So it may not free
> or change the structures in any way, but it still may use the message
> (spi_async) and read write in the data bytes (not the location).

No, modifying an in flight transfer is not supported or sensible.
There's a whole bunch of problems there, including synchronisation with
DMA.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
       [not found]                                                                   ` <20131106162410.GB2674-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>
@ 2013-11-06 19:54                                                                     ` Martin Sperl
       [not found]                                                                       ` <3B0EDE3F-3386-4879-8D89-2E4577860073-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Martin Sperl @ 2013-11-06 19:54 UTC (permalink / raw)
  To: Mark Brown
  Cc: Linus Walleij, linux-spi-u79uwXL29TY76Z2rM5mHXA, Stephen Warren

>> The above is an allowed sequence, that must work - otherwise we would 
>> break an existing API.
> 
> This would be broken anyway with existing drivers; if we want to support
> that we need to take copies of both the messages and the transfers but
> really that just looks crazy.

The sequence shown is not broken - see that I have been using spi_sync 
not spi_async! (think also of people using this via SPIDEV, where you do
not have such control as you would in kernel space)

So it is in principle a valid use of the API - it is not a "beautiful" 
implementation, but it is something you have to be able to handle. 

Unless you _want_ to break the API.
But then you should write a more explicit explanation of what the API
allows, what it does not, and what the other assumptions are.
(And you would have to track down all the bugs in existing drivers.)

Also see a scenario like this:
* kmalloc memory for message
* add xfers to message
* call spi_sync
* kfree xfers
* kfree message
* kmalloc memory for message
** here we may have a high likelihood of getting the same pointer again
* add xfers to message
* call spi_sync
* kfree xfers
* kfree message

So we would again need a means to detect such changes:
the pointers are the same, but the data may be different.

As said, we would need some "random" token in the message to identify a
prepared message - we could do that...
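
Purely hypothetically, such a token-based check could look like the sketch
below - neither the state structure nor the cookie exists in the current API:

struct spi_prepared_state {
	struct spi_message *msg;	/* pointer identity ...         */
	u64 cookie;			/* ... plus a random token      */
					/* assigned at prepare time     */
};

static bool spi_prepared_state_matches(struct spi_prepared_state *state,
	struct spi_message *msg, u64 cookie)
{
	/* the pointer alone is not enough: freed and re-allocated memory
	 * can come back at the same address with different contents, so
	 * the token handed out at prepare time has to match as well
	 */
	return (state->msg == msg) && (state->cookie == cookie);
}

But again, that only adds more bookkeeping and more computations.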

Also, you would need to track down all drivers that might have such a
pattern.

And you would need some cleanup mechanism for your prepared statements, as
you would have to keep them somewhere (a list or similar).

So now imagine the following:
for (i = 0; i < 10000; i++)
* kmalloc memory for message
* add xfers to message
* call spi_sync
* kfree xfers
* kfree message

That would mean that you would create lots of prepared messages filling up
kernel memory (possibly allocated with GFP_ATOMIC, as we may run in
irq-context), and only after some time would a separate thread need to run
to clean them up.

Not to mention that you would have a long list to walk to find the prepared
"structure" again, killing lots of time.

So that is why I say that making it explicit in your driver should be a
possibility - especially as that way those prepares would happen at setup
time and not inside an interrupt handler...

If you find a solution to the described problems in the framework at some
point, then it is probably easier to remove the spi_prepare_message calls
from those drivers that already make use of them (or keep them as a hint and
make them no-ops...) than to figure out all the issues that such
"behind the scenes" optimizations in the framework may introduce.

Martin







--
To unsubscribe from this list: send the line "unsubscribe linux-spi" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
       [not found]                                                                       ` <3B0EDE3F-3386-4879-8D89-2E4577860073-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>
@ 2013-11-06 23:26                                                                         ` Mark Brown
       [not found]                                                                           ` <20131106232605.GC2674-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Mark Brown @ 2013-11-06 23:26 UTC (permalink / raw)
  To: Martin Sperl
  Cc: Linus Walleij, linux-spi-u79uwXL29TY76Z2rM5mHXA, Stephen Warren

[-- Attachment #1: Type: text/plain, Size: 1491 bytes --]

On Wed, Nov 06, 2013 at 08:54:10PM +0100, Martin Sperl wrote:
> >> The above is an allowed sequence, that must work - otherwise we would 
> >> break an existing API.

> > This would be broken anyway with existing drivers; if we want to support
> > that we need to take copies of both the messages and the transfers but
> > really that just looks crazy.

> The sequence shown is not broken - see that I have been using spi_sync 
> not spi_async! (think also of people using this via SPIDEV, where you do
> not have such control as you would in kernel space)

Oh, if you're using spi_sync() it's fine but then what's the issue?  The
stack isn't going to remember anything about old messages when they're
resubmitted so why would there ever be any problem?  Even if there were
a problem I can't see how having the core manage things is going to be
an issue.

> Also see the scenario like this:
> * kalloc memory for message
> * add xfers to message
> * call spi_sync
> * kfree xfers
> * kfree message
> * kalloc memory for message
> ** here we may have a high likelihood of getting the same pointer again
> * add xfers to message
> * call spi_sync
> * kfree xfers
> * kfree message

> so we would need again a means to detect such changes.
> The pointers are the same, but the data may be different.

Why would anything even notice a problem?  Like I say by the time the
transfer is returned to the user the stack isn't going to remember
anything about it.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
       [not found]                                                                           ` <20131106232605.GC2674-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>
@ 2013-11-07  0:43                                                                             ` Martin Sperl
       [not found]                                                                               ` <72D635F5-4229-4D78-8AA3-1392D5D80127-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Martin Sperl @ 2013-11-07  0:43 UTC (permalink / raw)
  To: Mark Brown; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA

Hi Mark!

The problem is that you intend to do prepare/unprepare transparently.

OK - but I see problems with heuristics for deciding when it is safe to
assume that a message has not been changed since the last call (either the
xfers or the tx/rx_buf pointers). If it has not changed, then we can use the
cached computations from the previous step (equivalent to prepared).

Figuring out whether the message has changed or not consumes too much CPU,
and that is - as far as I understand - the main reason why we want the option
to prepare/unprepare a message: to save CPU cycles.

See my "quick&dirty" modification to _spi_async how prepare could help 
improve avoid spending CPU cycles even for "normal" cases that do not 
run on HW where preparing would not help except for avoiding running the 
sanity checks on the message every time.

As said: running a single transfer 10000 times/s will run the following code:
        /* Half-duplex links include original MicroWire, and ones with
         * only one data pin like SPI_3WIRE (switches direction) or where
         * either MOSI or MISO is missing.  They can also be caused by
         * software limitations.
         */
        if ((master->flags & SPI_MASTER_HALF_DUPLEX)
                        || (spi->mode & SPI_3WIRE)) {
                unsigned flags = master->flags;

                list_for_each_entry(xfer, &message->transfers, transfer_list) {
                        if (xfer->rx_buf && xfer->tx_buf)
                                return -EINVAL;
                        if ((flags & SPI_MASTER_NO_TX) && xfer->tx_buf)
                                return -EINVAL;
                        if ((flags & SPI_MASTER_NO_RX) && xfer->rx_buf)
                                return -EINVAL;
                }
        }

        /**
         * Set transfer bits_per_word and max speed as spi device default if
         * it is not set for this transfer.
         */
        list_for_each_entry(xfer, &message->transfers, transfer_list) {
                if (!xfer->bits_per_word)
                        xfer->bits_per_word = spi->bits_per_word;
                if (!xfer->speed_hz)
                        xfer->speed_hz = spi->max_speed_hz;
                if (master->bits_per_word_mask) {
                        /* Only 32 bits fit in the mask */
                        if (xfer->bits_per_word > 32)
                                return -EINVAL;
                        if (!(master->bits_per_word_mask &
                                        BIT(xfer->bits_per_word - 1)))
                                return -EINVAL;
                }
        }

10000 times/s.

With a "prepared" message by the driver instead, this message would get prepared 
(maybe in this case better call it sanity-checked) once and from then on this
piece of code does not get run at all, which would reduce CPU load immediately.

But if you have to loop over the message to see if it has changed (by calculating 
a checksum or comparing to something else), then you have lost this opportunity 
to optimize out the above code...

Is the above described scenario of 10k spi messages/s realistic?

Yes - definitely.

The CAN driver for which I started my investigation is processing about 3200
CAN messages/second. Each generates an interrupt - so 3200 interrupts/s - and
each sends 3 spi_messages with multiple transfers in sequence (all of which
can be, and already are, prepared...).

So we are essentially at the example above with 9600 SPI messages/s and the
code getting run 9600 times unnecessarily.

Obviously this is not so important for drivers that run maybe 10 SPI messages/s, but 
it becomes an important factor at higher volumes.

That is the reason why I am so persistent about this subject and trying to get
a solution that does not require a lot of CPU cycles.

Ciao,
	Martin

P.S.: My guess is that avoiding this "sanity check" alone would reduce the
system load for my test case by 2-3% - that would be down from 77% to 74% CPU.
(And I am sure I can bring it down further by moving away from the scheduling
overhead that is inherent to the "transfer_one_message" worker interface.)
The preparing of the DMA structures reduced the system load by about 7-8%.

I can try to quantify this number if you really want by NOT using spi_async
but calling
  message->spi = spi;
  message->status = -EINPROGRESS;
  status = spi->master->transfer(spi, message);
directly.

But not tonight...

--
To unsubscribe from this list: send the line "unsubscribe linux-spi" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
       [not found]                                                                               ` <72D635F5-4229-4D78-8AA3-1392D5D80127-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>
@ 2013-11-07 20:31                                                                                 ` Mark Brown
       [not found]                                                                                   ` <20131107203127.GB2493-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Mark Brown @ 2013-11-07 20:31 UTC (permalink / raw)
  To: Martin Sperl; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA

[-- Attachment #1: Type: text/plain, Size: 1347 bytes --]

On Thu, Nov 07, 2013 at 01:43:39AM +0100, Martin Sperl wrote:

> The problem is that you intend to do prepare/unprepare transparently.

Please be more concrete and specific, I'm still really struggling to
understand what your concern is.  Please do also reply to what's being
said rather than deleting the entire mail, it makes it easier to follow
the thread of conversation.

> OK - but I see problems with heuristics on when it is safe to assume
> that a message has not been changed since the last call (either xfers
> or tx/rx_buff pointer). If it has not changed then we can use the cached 
> computations from the previous step ( equivalent to prepared).

Why would the SPI core assume that a message just passed into it is a
previously submitted message which has not been changed?  This seems
like something which would add extra complexity to the implementation
over the simpler approach of just treating each new message as a new
one.

Bear in mind that the goals here are to refactor the code so that the
DMA mapping is factored out of the drivers (and can be moved so that it
runs in parallel with transfers), factor out the handling of /CS and in
general the handling of message breaks and avoid all the drivers having
to implement the logic required to push transfers from interrupt context
where possible.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
       [not found]                                                                                   ` <20131107203127.GB2493-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>
@ 2013-11-08 14:16                                                                                     ` Martin Sperl
       [not found]                                                                                       ` <86AE15B6-05AF-4EFF-8B8F-10806A7C148B-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Martin Sperl @ 2013-11-08 14:16 UTC (permalink / raw)
  To: Mark Brown; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA

Hi Mark!

On 07.11.2013, at 21:31, Mark Brown wrote:

> On Thu, Nov 07, 2013 at 01:43:39AM +0100, Martin Sperl wrote:
> 
>> The problem is that you intend to do prepare/unprepare transparently.
> 
> Please be more concrete and specific, I'm still really struggling to
> understand what your concern is.  Please do also reply to what's being
> said rather than deleting the entire mail, it makes it easier to follow
> the thread of conversation.
...
> Bear in mind that the goals here are to refactor the code so that the
> DMA mapping is factored out of the drivers (and can be moved so that it
> runs in parallel with transfers), factor out the handling of /CS and in
> general the handling of message breaks and avoid all the drivers having
> to implement the logic required to push transfers from interrupt context
> where possible.

OK - I believe I am starting to see where our ideas about prepare_message
diverge.

You want to refactor the message-pump mechanism to reduce driver code size,
which is commendable (I like the idea of having a "generic" scheduler
triggered by an interrupt from any GPIO pin, but others come to my mind as
well). As part of this you also want to make drivers move away from the
transfer_one_message interface. And for that you need the drivers to have
prepare and unprepare functions, which are called from the spi_pump (hence
the description that those are called from threaded context).

I, on the other hand, am more concerned about the cost of running a prepare
EVERY time we schedule a message - and that becomes significant CPU
utilization if you look at my report: doing the DMA setup calculation for a
message with a 2x2 byte (write_then_read) transfer at 8MHz takes about as long
as the actual transfer over the bus (if more data gets transmitted, the ratio
shifts towards the transfer side).

So my idea was to create something like an SQL prepared-statement API that
allows the driver to say: here is a message that I will not modify from a
structural perspective - I still retain "ownership" and I am still
responsible for unpreparing it before releasing the memory itself.

So I would now call this API driver_prepared_message, to distinguish it from
your case.

This specific API call would do the following:
* run its own sanity checks once (the ones in __spi_async)
* run the prepare_message function of the bus driver (we would need to
  keep some data in the spi_message structure that the driver may assign
  for future reference: "message->prepared_data = kmalloc(...);")
* mark the message as prepared - say message->is_driver_prepared=1;

Does the description above capture both your and my points of view?
(Not in all details, but close?)

If so, then the __spi_async call could look like this (the functions are not
in the order required to keep the compiler from complaining):

struct spi_message {
	...
	/* set by spi_prepare_message() */
	unsigned                is_driver_prepared:1;
	/* an opaque structure for prepared messages 
	 * allocated on spi_prepare_message()
	 * freed in spi_unprepare_message()
	 */
	void			*prepared_data;
	...
};

static int __spi_async(struct spi_device *spi, struct spi_message *message)
{
	int ret;
	/* the non driver-prepared path */
	if (!message->is_driver_prepared) {
		/* prepare the message */
		ret = __spi_prepare_message(spi, message);
		if (ret)
			return ret;
	}
	/* setting up some fields in message */
	message->spi = spi;
	message->status = -EINPROGRESS;
	/* here we might also want to have a driver function to schedule
	 * the message immediately, say to DMA, but still use the spi_pump
	 * worker for some parts of the driver like memory cleanup,...
	 * if (spi->master->schedule_prepared_message_irqcontext)
	 *	spi->master->schedule_prepared_message_irqcontext(spi, message);
	 */
	/* and the "normal" call of transfer, which typically would call
	 * the spi message pump.
	 */
	return spi->master->transfer(spi, message);
}

static int __spi_prepare_message(struct spi_device *spi,
	struct spi_message *message)
{
        struct spi_master *master = spi->master;
        struct spi_transfer *xfer;

	/* the parts from __spi_async that handle message verification */

        /* Half-duplex links include original MicroWire, and ones with
         * only one data pin like SPI_3WIRE (switches direction) or where
         * either MOSI or MISO is missing.  They can also be caused by
         * software limitations.
         */
        if ((master->flags & SPI_MASTER_HALF_DUPLEX)
                        || (spi->mode & SPI_3WIRE)) {
                unsigned flags = master->flags;

                list_for_each_entry(xfer, &message->transfers, transfer_list) {
                        if (xfer->rx_buf && xfer->tx_buf)
                                return -EINVAL;
                        if ((flags & SPI_MASTER_NO_TX) && xfer->tx_buf)
                                return -EINVAL;
                        if ((flags & SPI_MASTER_NO_RX) && xfer->rx_buf)
                                return -EINVAL;
                }
        }

        /**
         * Set transfer bits_per_word and max speed as spi device default if
         * it is not set for this transfer.
         */
        list_for_each_entry(xfer, &message->transfers, transfer_list) {
                if (!xfer->bits_per_word)
                        xfer->bits_per_word = spi->bits_per_word;
                if (!xfer->speed_hz)
                        xfer->speed_hz = spi->max_speed_hz;
                if (master->bits_per_word_mask) {
                        /* Only 32 bits fit in the mask */
                        if (xfer->bits_per_word > 32)
                                return -EINVAL;
                        if (!(master->bits_per_word_mask &
                                        BIT(xfer->bits_per_word - 1)))
                                return -EINVAL;
                }
        }

	/* and the custom prepare call of the bus-driver */
	if (spi->master->prepare_message)
		return spi->master->prepare_message(
			spi->master,message);

	return 0;
}

/*
 * the function that allows the driver to prepare a message
 * for later use without further processing overhead
 * the driver will retain ownership of the memory,
 * but is NOT allowed to make any changes to the structure
 * of the message, in particular not to:
 *   * the list of transfers
 *   * the length of an individual transfer
 *   * bits/speed/...
 *   * tx_buf, tx_dma
 *   * rx_buf, rx_dma
 * the driver may modify the data inside
 *   tx_buf, rx_buf
 * the calling driver also has the obligation to call
 * spi_unprepare_message prior to releasing the memory or
 * erasing the message structure.
 */
static int spi_prepare_message(struct spi_device *spi,
	struct spi_message *message)
{
	int ret;
	/* run the internal prepare code */
	ret = __spi_prepare_message(spi, message);
	/* mark the message as prepared if there was no error */
	if (!ret)
		message->is_driver_prepared = 1;

	return ret;
}

static int __spi_unprepare_message(struct spi_master *master,
	struct spi_message *message)
{
	int ret = 0;
	/* if the master has an unprepare handler, call it */
	if (master->unprepare_message)
		ret = master->unprepare_message(master, message);

	/* and clear the memory structures */
	message->is_driver_prepared = 0;
	message->prepared_data = NULL;

	return ret;
}

static int spi_unprepare_message(struct spi_device *spi,
	struct spi_message *message)
{
	return __spi_unprepare_message(spi->master,message);
}

void spi_finalize_current_message(struct spi_master *master)
{
        struct spi_message *mesg;
        unsigned long flags;

        spin_lock_irqsave(&master->queue_lock, flags);
        mesg = master->cur_msg;
        master->cur_msg = NULL;

        queue_kthread_work(&master->kworker, &master->pump_messages);
        spin_unlock_irqrestore(&master->queue_lock, flags);

	/* call unprepare in case the message is not driver-prepared */
	if (!mesg->is_driver_prepared)
		spi_unprepare_message(mesg->spi, mesg);

        mesg->state = NULL;
        if (mesg->complete)
                mesg->complete(mesg->context);
}

There is only one caveat to this:
as mentioned above, you have described the prototypes for prepare/unprepare
message as being called from threaded context. That is obviously not
(necessarily) true if you handle the prepare in the __spi_async function.

But waiting for the spi_pump thread to wake up wastes valuable time on the
scheduler - especially if you could submit the message to DMA immediately.
And if you get that far, then the DMA driver would also run the completion
callbacks from the DMA IRQ handler, to avoid the scheduling latencies in the
completion path as well...

So I hope I have summarized your ideas correctly and that you now also have a
better understanding of what I want to achieve and where I see places to
reduce CPU cost and latencies... I believe we can come to a common solution
that fits both sides.

Thanks, Martin


--
To unsubscribe from this list: send the line "unsubscribe linux-spi" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
       [not found]                                                                                       ` <86AE15B6-05AF-4EFF-8B8F-10806A7C148B-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>
@ 2013-11-08 16:19                                                                                         ` Mark Brown
       [not found]                                                                                           ` <20131108161957.GP2493-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Mark Brown @ 2013-11-08 16:19 UTC (permalink / raw)
  To: Martin Sperl; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA

[-- Attachment #1: Type: text/plain, Size: 1456 bytes --]

On Fri, Nov 08, 2013 at 03:16:57PM +0100, Martin Sperl wrote:

Please fix the formatting in your mails, odd line wrapping makes things
harder to read.

> While I am more concerned about the cost of running a prepare EVERY time we
> schedule a message - and that becomes significant CPU utilization if you 
> look at my report (doing the setup calculation for the DMA of a message,
> with 2x2 Byte (write_then_read) transfer at 8MHz takes about as long as the
> actual transfer over the bus. (if more data gets transmitted the ratio 
> changes to the transfer side)

I'd be rather surprised if there's a meaningful benefit from doing such
a transfer using DMA in the first place, with most hardware you are
better off doing small transfers with PIO and using a completion
interrupt for the transfer.  Talking to the DMA controller tends to end
up being more work than it's worth.  Are you sure that you are
optimising the right thing here?

I'd want to see strong numbers from a real use case showing that the
complexity of trying to do this was worth it.

> But waiting for the spi_pump thread to wake up is wasting valuable
> time waiting for the scheduler - especially if you could submit the
> message to DMA immediately.

This is the sort of thing we can do sensibly enough once we don't have
to open code the logic in individual controller drivers, it doesn't seem
terribly connected with any external API changes though?

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
       [not found]                                                                                           ` <20131108161957.GP2493-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>
@ 2013-11-08 17:31                                                                                             ` Martin Sperl
       [not found]                                                                                               ` <5F70E708-89B9-4DCF-A31A-E688BAA0E062-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Martin Sperl @ 2013-11-08 17:31 UTC (permalink / raw)
  To: Mark Brown; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA

Hi Mark!

On 08.11.2013, at 17:19, Mark Brown wrote:
> 
> I'd want to see strong numbers from a real use case showing that the
> complexity of trying to do this was worth it.
> 

I remember having shared all sorts of values in my earlier posts regarding
absolute measurements:
* CPU utilization while receiving 3000 CAN messages/s
* latency (interrupt to SPI message)
* time spent "preparing" a message

If you go back to my mail on the 4th November you will find this:

On 04.11.2013, at 18:33, Martin Sperl wrote:
> 
> So here a link of an example showing how much the prepared spi_messages 
> really can improve the SPI thruput - even without changing anything
> (besides) preparing the messages for direct DMA use.
> 
> http://www.raspberrypi.org/phpBB3/viewtopic.php?f=44&t=19489&p=448328#p448328
> The link goes to the RPI forum and includes the measurements plus some images
> showing both cases (configured via a module parameter for now)
> 
> Quick summary:
> For my test-usecase by just enabling "prepared messages" I have reduced the 
> time  for a "simple" transfer from 390us (without prepare) to 230us (with
> prepare) and the driver is still using the threaded message_pump.
> 
> OK - the "default" spi-bcm2835.c driver currently in the mainline takes 245us
> for the same thing, but it runs at 130k interrupts/s and 95% System load, 
> which is quite impractical to do anything else.
> So getting this down to 80% System load plus shorter responses is already
> quite an improvement. 


I included a lot more "hard" data in my previous emails - if you want to 
go back to those...

The above link also includes images taken with the logic analyzer comparing
the prepared and unprepared code paths (by simply enabling/disabling prepare).
It also does not yet make use of "DMA chaining", for which I need to move back
to the transfer interface to get the most out of DMA...

Is this "strong" enough data for a "real" (worsted) case?

And these 3200 messages/s on the CAN bus are not really the worst case, as
each message carries 8 bytes with an extended ID at 500kHz.
It is more like the "best" worst case, as such a message takes between
258 and 316us (depending on bit-stuffing).
This translates to 3164 to 3875 CAN messages per second.

But if I were just using 11-bit IDs and 0 data bytes, a single message would
take 94 to 111us, which translates to between 9009 and 10638 CAN messages/s.

And then you can double those values again if you go to 1MHz CAN Bus Speed.

Also you can easily "trigger" such a situation, if you:
* have a device that sends a message (and its controller is set to resend the 
  message if there is no recipient, which would be typical)
* have the CAN controller on the RPI configured for "listen only mode"
  (that means it does not acknowledge messages, which is a prerequisite)
* and you connect those two devices together with no other device.

Just start up the "sending" device and you get 80% system CPU with the DMA
driver (95% with the PIO driver, plus packet loss)!

The effect is similar to a packet storm on a network with an older network
card whose driver cannot switch from "interrupt" to "polling" mode...

The only difference to normal traffic is that the bus would not be saturated
100% of the time over any measured interval - you would see idle gaps.

Need more evidence? If so please tell me what you would like to see...

Ciao,
	Martin

P.S.: And I cannot understand why I can read a saturated 1MHz CAN bus carrying
0-byte-length messages on a simple 8-bit AVR at 16MHz with 4KB of memory...
(with the same CAN controller). And that microcontroller still has the time
to write the whole stream to an SD card (with all the non-deterministic
latencies that an SD card introduces - but without a filesystem, I have to
admit, so similar to "dd of=/dev/sdd").

So that is why I am trying to optimize the Linux driver as well, to get
better performance and still have some cycles left for doing the "real"
work...
--
To unsubscribe from this list: send the line "unsubscribe linux-spi" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
       [not found]                                                                                               ` <5F70E708-89B9-4DCF-A31A-E688BAA0E062-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>
@ 2013-11-08 18:09                                                                                                 ` Mark Brown
       [not found]                                                                                                   ` <20131108180934.GQ2493-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Mark Brown @ 2013-11-08 18:09 UTC (permalink / raw)
  To: Martin Sperl; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA

[-- Attachment #1: Type: text/plain, Size: 995 bytes --]

On Fri, Nov 08, 2013 at 06:31:37PM +0100, Martin Sperl wrote:
> On 08.11.2013, at 17:19, Mark Brown wrote:

> > I'd want to see strong numbers from a real use case showing that the
> > complexity of trying to do this was worth it.

> I remember having shared all sorts on values in my earlier posts 
> regarding to absolute measurements.
> * from CPU utilization to receive 3000 CAN messages/s
> * to latency perspective (interrupt to SPI Message)
> * to time spent "preparing" a message.

This sounds like an artificial benchmark attempting to saturate the bus
rather than a real world use case, as does everything else you mention,
and the contributions of the individual changes aren't broken down so it
isn't clear what specifically the API change delivers.

I'd like to see both a practical use case and specific analysis showing
that changing the API is delivering a benefit as opposed to the parts
which can be done by improving the implementation of the current API.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
       [not found]                                                                                                   ` <20131108180934.GQ2493-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>
@ 2013-11-08 19:18                                                                                                     ` Martin Sperl
       [not found]                                                                                                       ` <C375DEE6-1AEC-4AFB-A9D6-583DCB4476A3-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Martin Sperl @ 2013-11-08 19:18 UTC (permalink / raw)
  To: Mark Brown; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA

On 08.11.2013, at 19:09, Mark Brown wrote:

> On Fri, Nov 08, 2013 at 06:31:37PM +0100, Martin Sperl wrote:
>> On 08.11.2013, at 17:19, Mark Brown wrote:
> 
>>> I'd want to see strong numbers from a real use case showing that the
>>> complexity of trying to do this was worth it.
> 
>> I remember having shared all sorts on values in my earlier posts 
>> regarding to absolute measurements.
>> * from CPU utilization to receive 3000 CAN messages/s
>> * to latency perspective (interrupt to SPI Message)
>> * to time spent "preparing" a message.
> 
> This sounds like an artificial benchmark attempting to saturate the bus
> rather than a real world use case, as does everything else you mention,
> and the contributions of the individual changes aren't broken down so it
> isn't clear what specifically the API change delivers.
> 
As explained - it is a reasonable use-case that you can easily trigger.

For example: updating the firmware of a 128KB Flash via the CAN bus requires
about 22528 8-byte CAN packets to transfer just the data plus some signaling.
With 3200 CAN messages/s as the upper limit for these 8-byte messages,
this requires 7.04 seconds to transfer all the Flash data.
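
(For reference, the arithmetic behind those numbers: 128KiB / 8 bytes per
frame = 16384 data frames; the remaining ~6144 frames up to the 22528 total
are the signaling overhead; 22528 frames / 3200 frames/s = 7.04 seconds.)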

As CAN is a broadcast medium every node on the bus will see this data-rate.
Even if it is NOT involved in the real communication.

So if you are just listening while this happens between 2 other devices,
then you still would be running into a performance bottleneck while the Flash
is getting written.

And if you can not keep up with that rate (packet-loss,...) then you might be 
missing out on other messages that are more important and are directed at you.

OK, there would be gaps every 44 packets while a flash page gets written.
But even then, other devices that were blocked will send their messages
while the firmware update is idle, so with more nodes the bus very likely
stays saturated for those 10 seconds.

So it is IMO a realistic load simulation to take the "automatic re-broadcast"
as a repeatable scenario.

> I'd like to see both a practical use case and specific analysis showing
> that changing the API is delivering a benefit as opposed to the parts
> which can be done by improving the implementation of the current API.

I have already shared at some point and also it shows in the forum:
Without prepare I see:
* 14.6k interrupts
* 17.2k context-switches
* 88%CPU-System load

While with prepare I see:
* 29.2k interrupts
* 34.5k context-switches
* 80%CPU-System load

The reason why we see more interrupts is because without the "prepared"
messages we are close to packet loss as 44% of all packets are fetched 
from the second buffer of the CAN controller - if this reaches 50% then 
we start to have packet-loss... So we get 1 Controller interrupt for 2
messages and that triggers some more interrupts for other parts.

While the "prepared" case has only 4% of all packets coming from the
2nd buffer - that is why we have doubled the number of interrupts, as
the number of interrupts from the GPIO device has almost doubled.

How did I measure?

The only difference between the two runs is a module parameter to the SPI bus
driver.
Basically looking like this from the code-perspective:
static bool allow_prepared = true;
module_param(allow_prepared, bool, 0);
MODULE_PARM_DESC(allow_prepared, "Run the driver with spi_prepare_message support");

int bcm2835dma_spi_prepare_message(struct spi_device *spi,
                                struct spi_message *message)
{
        struct bcm2835dma_prepared_message *prep;
        struct bcm2835dma_dma_cb *cb;
        int status = 0;

        /* return immediately if we are not supposed to support
         * prepared spi_messages */
        if (!allow_prepared)
                return 0;
...
}

and the above mentioned values have been measured like this:
cat /proc/net/dev; vmstat 10 6; cat /proc/net/dev

Does this answer your question and convince you of this being realistic?

Also my next work is moving to DMA scheduling multiple messages via "transfer".
This should bring down the CPU utilization even further and it should also
decrease the context switches as the spi_pump thread goes out of the picture...
(and that will probably decrease the number of overall interrupts as well...)

How far I can optimize this way is a good question and then we can have a look how
much "penalty" the move to DMA engine will produce. (I have just seen that someone
has just started to post an initial DMA engine for the RPI...)

Thanks,
	Martin--
To unsubscribe from this list: send the line "unsubscribe linux-spi" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
       [not found]                                                                                                       ` <C375DEE6-1AEC-4AFB-A9D6-583DCB4476A3-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>
@ 2013-11-09 18:30                                                                                                         ` Mark Brown
       [not found]                                                                                                           ` <20131109183056.GU2493-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Mark Brown @ 2013-11-09 18:30 UTC (permalink / raw)
  To: Martin Sperl; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA

[-- Attachment #1: Type: text/plain, Size: 3337 bytes --]

On Fri, Nov 08, 2013 at 08:18:25PM +0100, Martin Sperl wrote:
> On 08.11.2013, at 19:09, Mark Brown wrote:
> > On Fri, Nov 08, 2013 at 06:31:37PM +0100, Martin Sperl wrote:

> > This sounds like an artificial benchmark attempting to saturate the bus
> > rather than a real world use case, as does everything else you mention,
> > and the contributions of the individual changes aren't broken down so it
> > isn't clear what specifically the API change delivers.

> As explained - it is a reasonable use-case that you can easily trigger.

> For example: updating the firmware of a 128KB Flash via the CAN bus requires
> about: 22528 8 byte can packets to transfer just the data and some signaling.
> With 3200 CAN-messages/s as the upper limit for these 8 byte messages
> this requires 7.04seconds to transfer all the Flash data.

You mentioned that systems could be constructed but you didn't mention
examples of doing that; in general prolonged bus saturation tends to be
something that people avoid when designing systems.

> OK, there would be gaps every 44 packets while a flash page gets written.
> But even then, at that time other devices that are blocked, will send their
> messages as the Firmware update is idle. So with more nodes under such a 
> situation the bus becomes very likely saturated for 10 seconds.

> So it is IMO a realistic load simulation to take the "automatic re-broadcast"
> as a repeatable scenario.

What I'm missing with this example is the benefit of having an API for
pre-cooked messages - how will it deliver a performance improvement?
Flashing should be an infrequent operation and I'd expect it to involve
little reuse of messages which I'd expect to be the main thing that we
could gain from the API change.  I'd also not expect the controller
specific work to be especially expensive.

> > I'd like to see both a practical use case and specific analysis showing
> > that changing the API is delivering a benefit as opposed to the parts
> > which can be done by improving the implementation of the current API.

> I have already shared at some point and also it shows in the forum:

You've been doing a bunch of other performance improvements as well,
it's not clear to me how much of this is coming from the API change and
how much of this is due to changes which can also be done by improving
the implementation and without requiring drivers to be specifically
updated to take advantage of it. 

> Does this answer your question and convince you of this being realistic?

It's still not clear to me exactly how this works and hence if the
benefit comes from the API change itself.

> Also my next work is moving to DMA scheduling multiple messages via "transfer".
> This should bring down the CPU utilization even further and it should also
> decrease the context switches as the spi_pump thread goes out of the picture...
> (and that will probably decrease the number of overall interrupts as well...)

Right, and simply driving transfers from interrupt rather than task
context probably gets an awfully long way there.  This is the sort of
improvement which will benefit all users of the API - the main reason
I'm so cautious about changing the API is that I don't want to make it
more complex to implement this sort of improvement.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
       [not found]                                                                                                           ` <20131109183056.GU2493-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>
@ 2013-11-10 10:54                                                                                                             ` Martin Sperl
       [not found]                                                                                                               ` <6C7903B3-8563-490E-AD7D-BA5D65FFB9BC-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>
  2013-11-10 11:05                                                                                                             ` Mark Brown
  1 sibling, 1 reply; 57+ messages in thread
From: Martin Sperl @ 2013-11-10 10:54 UTC (permalink / raw)
  To: Mark Brown; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA

Hi Mark!

You seem to be very hard to convince!

On 09.11.2013, at 19:30, Mark Brown wrote:
> You mentioned that systems could be constructed but you didn't mention
> examples of doing that; in general prolonged bus saturation tends to be
> something that people avoid when designing systems.
> 

True, but CAN in principle can work saturated when only 2 nodes talk to each 
other and the others are quiet. Collision detection is also part of the protocol,
and the node "winning" the arbitration can continue transmitting its message 
without an abort. And in the firmware-upgrade case the CAN ID is typically
configured with the most recessive bits (so the highest possible CAN ID), so that
it will lose any arbitration and thus allow other nodes to send their data
with higher priority, effectively slowing down the firmware upgrade 
without impacting other devices.

So a saturated bus is not "ideal", but it is possible and would not impact
devices.

Also note that some CAN controllers, like the mcp2515, can have HW message 
filters configured to limit system load. But these are typically limited
to a small set of filter patterns (2 masks and 2-3 ID patterns per mask), so
for a "central" controller they do not seem to be a good match: you would
run out of filters, and the central CPU of a vehicle will want to see all the
data on which it may base its decisions - the limited number of filters does
not scale for that.

Also one can extrapolate roughly linearly: 100% CAN-bus utilization gives 90%
system load, so 50% bus utilization would mean about 45% system load, and 25%
bus utilization about 22.5% system load.

And even 22.5% CPU load is already too much in my opinion when you can get
it down further with a simple API extension and a driver optimized for speed.

Note also that the message-processing overhead is in principle (mostly)
dependent just on the number of messages per interval, while the possible
message rate on the CAN bus depends primarily on the length of the messages.

So with those 2 receive buffers in HW, I have the following time to react
to a "two buffers full" situation and still avoid an overflow:
29-bit IDs and 8 bytes/frame at 500kHz: 774us to 948us (depending on the
 data bytes, as bit-stuffing happens after 5 identical bits in sequence)
11-bit IDs and 0 bytes/frame at 500kHz: 282us to 333us

These show the best and worst case from a bit-length perspective.
So if you get 3 distinct nodes that all want to send the worst case (a single
frame of 0 bytes length, what would typically be called an RTR frame) within 
a window of 188us, you can easily trigger the situation where the RPI is not
able to keep up with the rate and will drop the packet with the lowest
priority - and that is just 3 distinct packets.
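
(For reference, the rough arithmetic behind those windows, taking the window
as about 3 frame times - 2 full HW buffers plus the frame currently arriving:
a 29-bit-ID frame with 8 data bytes is about 129 bits before stuffing and up
to about 158 bits with worst-case bit-stuffing, i.e. 258us to 316us per frame
at 500kHz, so roughly 774us to 948us for 3 frames; an 11-bit-ID frame with 0
data bytes is about 47 to 55 bits, i.e. 94us to 111us per frame, so roughly
282us to 333us for 3 frames. The bit counts are approximate.)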

Actually I have a simple case where a single node sends a sequence of 5
CAN messages of 8 bytes each and then stays quiet for 10 seconds (so an
effective bus utilization of 0.015%), and the "stock" in-kernel drivers
(mcp251x and spi-bcm2835) already run into buffer overruns and in 20% of all
cases only report 3 out of those 5 frames.
(Not to mention that the in-tree mcp251x driver stops working after such a 
situation because of a race condition: the edge interrupt gets "reactivated"
too late, after the INT line has already gone away again. This started
happening on the Foundation-based kernels somewhere between version 3.2 and
3.6, with 3.6 showing the behaviour explicitly.)


So you see that the "saturated bus" is just a prolonged version of my 5-frame
scenario, which I can easily reproduce in real life - and I doubt you can say
that this scenario counts as "prolonged bus saturation".

> What I'm missing with this example is the benefit of having an API for
> pre-cooked messages, how it will deliver a performance improvement?
> Flashing should be an infrequent operation and I'd expect it to involve
> little reuse of messages which I'd expect to be the main thing that we
> could gain from the API change.  I'd also not expect the controller
> specific work to be especially expensive.

It will deliver a performance improvement by not having to create identical 
structures or run sanity tests every time you transmit the prepared message.
And for the DMA case I have to create DMA control-block chains of length
10-20, depending on the number of transfers.

So just running this piece of code 3000 times per second, for minutes or
hours, is overhead that we could definitely save - IMO it is low-hanging fruit.

This does not mean that every driver has to implement it necessarily - the first
candidates are the ones that need very high SPI-message thru-put and/or low 
scheduling latencies.

We can even look into optimizing spi_write, spi_read and spi_write_then_read
for the "hot" path by spending a little more memory for "typical" transfer
sizes. Then the "generic" drivers would also require less CPU resources -
even if the bus driver does not implement prepare_message itself - because
the "sanity" checks in __spi_async would not get executed every time.

I would say every little bit of CPU utilization saved helps if it is easy to implement.

>> 
>> Also my next work is moving to DMA scheduling multiple messages via "transfer".
>> This should bring down the CPU utilization even further and it should also
>> decrease the context switches as the spi_pump thread goes out of the picture...
>> (and that will probably decrease the number of overall interrupts as well...)
> 
> Right, and simply driving transfers from interrupt rather than task
> context probably gets an awfully long way there.  This is the sort of
> improvement which will benefit all users of the API - the main reason
> I'm so cautious about changing the API is that I don't want to make it
> more complex to implement this sort of improvement.

Also, as an example: if you wanted the framework to provide "generic" GPIO
interrupt handling that sends a (prepared) message without the device driver
having to write that code itself, then the framework could run the preparation
call itself when the message gets registered for the interrupt - so the
drivers would not even have to care about it!

>>> I'd like to see both a practical use case and specific analysis showing
>>> that changing the API is delivering a benefit as opposed to the parts
>>> which can be done by improving the implementation of the current API.
> 
>> I have already shared at some point and also it shows in the forum:
> 
> You've been doing a bunch of other performance improvements as well,
> it's not clear to me how much of this is coming from the API change and
> how much of this is due to changes which can also be done by improving
> the implementation and without requiring drivers to be specifically
> updated to take advantage of it. 
I have told you that I created a module parameter which enables/disables 
the prepare code.

Then I loaded the modules, configured the CAN speeds, ..., ran
vmstat 10 6
and then removed the modules.

Then I ran the same again with the parameter set to disable prepare.

And we get the different values...

88% system versus 80% system load (and the 80% is actually too high, as the
system got so much faster that it does not need to handle 2 packets at a time
and can handle each packet individually).

Nothing else changed besides this enable/disable parameter.

>> Does this answer your question and convince you of this being realistic?
> 
> It's still not clear to me exactly how this works and hence if the
> benefit comes from the API change itself.
There is no change in API - there is only an optional extension...

If you are not convinced yet, then here is a proposal for measurements I can
try to gather to hopefully convince you:
I would implement the previously posted patches to modify the spi_async call
so that we avoid running the "sanity checks" in the __spi_async function.

Here are the bus drivers I would use for testing:
* spi-bcm2708 (from the Raspbian kernel, using the transfer interface)
* spi-bcm2835 (from the "stock" kernel, using the transfer_one_message interface
  with a patch to allow real-time priority in the message pump - enabled
  via a module parameter).
  * run with and without RT priority
* spi-bcm2835dma with the message_pump interface
  * run with and without RT priority (module parameter as well)
* spi-bcm2835dma with the transfer interface and implementing the message queue
  in DMA itself without any processing

And 2 device-drivers:
* A (hypothetical) ADC read driver, where we want to achieve the highest
  possible thru-put. I would implement this as "2 messages" that get scheduled
  in sequence, both of them having a callback that immediately schedules the
  same message again from within the callback.
  Why not work with a single message, you may ask?
  Well, with a single message and a callback scheduling the message again,
  the limiting factor becomes the processing of the callback and what follows,
  while with the 2-message approach, when one message is finished, the other
  message can get executed while the callback of the first message is running.
  This makes a difference for the pipelined DMA driver, as we can leverage
  the DMA/ARM concurrency. (Note: to work effectively it might in the end
  require more than 2 messages, as there might still be situations where the
  DMA driver would stall because there is no message in the pipeline.)
  A minimal sketch of this double-buffering idea follows after this list.
  Important metrics are:
  * higher means better:
    * spi-message thru-put
  * lower means better:
    * System CPU%
    * interrupts
    * context switches
* maybe also a version of the above that mimics reading a max187 ADC, which
  requires the clock to stay idle for 8.5us after CS goes down, during which
  the conversion runs - only then is the data valid and the clock may start
  for the transfer.
* the optimized mcp2515 driver that I have been writing running in loopback mode
  connected to a 100% saturated CAN-bus@500kHz due to retransmits because of 
  unacknowledged CAN-messages.
  Important metrics are:
  * higher means better:
    * total number of messages processed (there is a top limit)
  * lower means better:
    * ratio of messages processed from second buffer versus first HW buffer
    * number of buffer overruns resulting in packet-loss
    * number of overruns not even detected due to "latencies that miss a 
      second frame"
    * System CPU%
    * interrupts
    * context switches
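
To make the 2-message idea concrete, here is a minimal sketch (the structure
names, transfer length and the omitted error handling are made up for
illustration - this is not the actual test driver):

/* minimal sketch of the double-buffered ADC reader idea */
#include <linux/spi/spi.h>
#include <linux/string.h>

#define ADC_XFER_LEN 4

struct adc_msg {
	struct spi_device  *spi;
	struct spi_message  msg;
	struct spi_transfer xfer;
	u8                  rx[ADC_XFER_LEN];
};

static void adc_complete(void *context)
{
	struct adc_msg *m = context;

	/* consume m->rx here (e.g. push it into a kfifo), then reschedule
	 * this message right away - ideally the *other* message is already
	 * being transferred by the bus driver while this callback runs */
	spi_async(m->spi, &m->msg);
}

static void adc_msg_init(struct adc_msg *m, struct spi_device *spi)
{
	m->spi = spi;
	spi_message_init(&m->msg);
	memset(&m->xfer, 0, sizeof(m->xfer));
	m->xfer.rx_buf = m->rx;
	m->xfer.len    = ADC_XFER_LEN;
	spi_message_add_tail(&m->xfer, &m->msg);
	m->msg.complete = adc_complete;
	m->msg.context  = m;
	/* this is where the proposed spi_prepare_message(spi, &m->msg)
	 * would get called once - if the extension gets accepted */
}

/* called once from probe(); a and b must be kmalloc'ed (DMA-safe buffers) */
static int adc_start(struct spi_device *spi, struct adc_msg *a,
		     struct adc_msg *b)
{
	int ret;

	adc_msg_init(a, spi);
	adc_msg_init(b, spi);
	ret = spi_async(spi, &a->msg);
	if (ret)
		return ret;
	return spi_async(spi, &b->msg);
}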

Both device drivers would have a module parameter to enable/disable the
spi_prepare_message(spi, message) calls during their initial setup.

Would these scenarios be good enough to finally convince you, if the numbers
turn out favorable?

Martin

P.s: You know, it would be a LOT of work to get such values!--
To unsubscribe from this list: send the line "unsubscribe linux-spi" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
       [not found]                                                                                                           ` <20131109183056.GU2493-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>
  2013-11-10 10:54                                                                                                             ` Martin Sperl
@ 2013-11-10 11:05                                                                                                             ` Mark Brown
       [not found]                                                                                                               ` <20131110110524.GA878-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>
  1 sibling, 1 reply; 57+ messages in thread
From: Mark Brown @ 2013-11-10 11:05 UTC (permalink / raw)
  To: Martin Sperl; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA

[-- Attachment #1: Type: text/plain, Size: 885 bytes --]

On Sat, Nov 09, 2013 at 06:30:56PM +0000, Mark Brown wrote:

> You've been doing a bunch of other performance improvements as well,
> it's not clear to me how much of this is coming from the API change and
> how much of this is due to changes which can also be done by improving
> the implementation and without requiring drivers to be specifically
> updated to take advantage of it. 

An example of the sort of thing I'm concerned about: if I understand
correctly one of the things the prepare and unprepare operations in the
driver do is the DMA mapping.  This will mean that if you prepare in the
caller the system will be able to do the DMA mapping while a prior
transfer is in progress which should help performance.  That's a good
thing to do but it should be something that can be done as part of the
existing API without needing drivers using SPI to be updated.
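
As a rough sketch of what I mean (hypothetical code, not something that
exists in spi.c today; error unwinding omitted): the core could map the
buffers of the next queued message while the previous one is still on the
wire, without any client driver changes:

static int spi_map_message(struct spi_master *master, struct spi_message *msg)
{
	struct device *dev = master->dev.parent;
	struct spi_transfer *xfer;

	list_for_each_entry(xfer, &msg->transfers, transfer_list) {
		if (xfer->tx_buf) {
			xfer->tx_dma = dma_map_single(dev, (void *)xfer->tx_buf,
						      xfer->len, DMA_TO_DEVICE);
			if (dma_mapping_error(dev, xfer->tx_dma))
				return -ENOMEM;
		}
		if (xfer->rx_buf) {
			xfer->rx_dma = dma_map_single(dev, xfer->rx_buf,
						      xfer->len, DMA_FROM_DEVICE);
			if (dma_mapping_error(dev, xfer->rx_dma))
				return -ENOMEM;
		}
	}
	return 0;
}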

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
       [not found]                                                                                                               ` <20131110110524.GA878-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>
@ 2013-11-10 16:41                                                                                                                 ` Martin Sperl
       [not found]                                                                                                                   ` <3361A01A-C7D0-4689-AFBD-085D3E62A67C-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Martin Sperl @ 2013-11-10 16:41 UTC (permalink / raw)
  To: Mark Brown; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA

On 10.11.2013, at 12:05, Mark Brown wrote:
> An example of the sort of thing I'm concerned about: if I understand
> correctly one of the things the prepare and unprepare operations in the
> driver do is the DMA mapping.  This will mean that if you prepare in the
> caller the system will be able to do the DMA mapping while a prior
> transfer is in progress which should help performance.  That's a good
> thing to do but it should be something that can be done as part of the
> existing API without needing drivers using SPI to be updated.

Yes - but that comes back to my argument: this is feasible to do 
transparently within the framework, but the cost of doing so would
unnecessarily eat away most of the possible performance gains and make the
situation worse for the cases where there IS a change.

As explained - you would need to make an educated guess whether a spi_transfer 
chain has changed or not (and by that I mean checking that the chain itself,
speed, transfer length, rx_buf, tx_buf, ... have not changed).

Doing this sort of checking will still take a few usecs per message and call.
One could do some hashing (with the risk of collisions), or keep a second
copy of the data and do a byte-wise compare of the spi_message and the
spi_transfer chain, or some other scheme.

But this still requires processing that you need to run on each spi_async
call, independent of whether the message has been prepared via the framework
previously or not.

And the risk of a false positive is non-zero (in the hashing case), so you
still spend CPU cycles on it (not to mention the debugging nightmare if such
a situation occurs on a production system).

So you waste CPU cycles unnecessarily for something where you are unsure.

So assuming you need 1us per message for this kind of "has the structure
changed" test, and you run 3000 such SPI messages/s, then you are wasting
3ms every second.

And (again, to repeat myself) you have on top of that a memory-management
issue, because you would then need to garbage-collect your prepared resources
for all those cases where the message just got allocated on the heap and 
discarded afterwards - there is no means in the current API to release that
extra memory when the memory of the spi_message itself gets released.
So to make things efficient here you would need to extend the API - but this
time it would be a mandatory change for all drivers (and I doubt you would
catch EVERY instance, resulting in bugs to fix down the road).

Also, with this approach the "message" would possibly need to get prepared in
IRQ context (because spi_async can get called from an interrupt handler),
which would require allocating memory with GFP_ATOMIC just to play it safe
in all situations. Not so good either.

I am mostly seeing this "spi_prepare_message" as a method for the driver to 
give some optimization hints to the engine to minimize "parsing" of the 
structures. 

And I would also prefer - if possible - for the API to be able to allocate
memory from "blocking" memory pools, so spi_prepare_message would be allowed
to run only in threaded context and may not get called from interrupt
context. The "typical" messages can easily get prepared during the module
initialization phase anyway, which avoids memory allocation in 
interrupt handlers and callbacks...
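
To illustrate the allocation-context point (sketch only - the private struct,
its pool field and the helper are made up; only the control-block name is
from my driver):

static struct bcm2835dma_dma_cb *bcm2835dma_alloc_cb(struct bcm2835dma_spi *bs,
		bool may_sleep, dma_addr_t *cb_dma)
{
	/* in a threaded spi_prepare_message() call we may sleep ... */
	gfp_t gfp = may_sleep ? GFP_KERNEL : GFP_ATOMIC;

	/* ... but if preparation happened transparently inside spi_async(),
	 * which may run in hard-irq context, only GFP_ATOMIC would be safe -
	 * and atomic allocations are more expensive and more likely to fail
	 * under load */
	return dma_pool_alloc(bs->pool, gfp, cb_dma);
}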

Ok - I have tried to explain my concerns again.

So if you still think you can do the things requested transparently via the
framework, then here is a "quick" example of where "prepared messages" would
reduce CPU overhead immediately without any bus-driver-specific
implementation:

static int __spi_async(struct spi_device *spi, struct spi_message *message)
{
	struct spi_master *master = spi->master;
	struct spi_transfer *xfer;
+	int ret;
+
+	/* run the computations below only if our message has been changed */
+	if (__spi_message_has_changed(spi, message)) {
		/* Half-duplex links include original MicroWire, and ones with
		 * only one data pin like SPI_3WIRE (switches direction) or where
		 * either MOSI or MISO is missing.  They can also be caused by
		 * software limitations.
		 */
		if ((master->flags & SPI_MASTER_HALF_DUPLEX)
				|| (spi->mode & SPI_3WIRE)) {
			unsigned flags = master->flags;

			list_for_each_entry(xfer, &message->transfers, transfer_list) {
				if (xfer->rx_buf && xfer->tx_buf)
					return -EINVAL;
				if ((flags & SPI_MASTER_NO_TX) && xfer->tx_buf)
					return -EINVAL;
				if ((flags & SPI_MASTER_NO_RX) && xfer->rx_buf)
					return -EINVAL;
			}
		}

		/**
		 * Set transfer bits_per_word and max speed as spi device default if
		 * it is not set for this transfer.
		 */
		list_for_each_entry(xfer, &message->transfers, transfer_list) {
			if (!xfer->bits_per_word)
				xfer->bits_per_word = spi->bits_per_word;
			if (!xfer->speed_hz)
				xfer->speed_hz = spi->max_speed_hz;
			if (master->bits_per_word_mask) {
				/* Only 32 bits fit in the mask */
				if (xfer->bits_per_word > 32)
					return -EINVAL;
				if (!(master->bits_per_word_mask &
						BIT(xfer->bits_per_word - 1)))
					return -EINVAL;
			}
		}
+
+		/* call the optional bus-driver prepare hook */
+		if (master->prepare_message) {
+			ret = master->prepare_message(master, message);
+			if (ret)
+				return ret;
+		}
+	}
	message->spi = spi;
	message->status = -EINPROGRESS;
	return master->transfer(spi, message);
}

So how would you implement the __spi_message_has_changed function in a manner 
that is efficient enough to be worth calling in a "hot path"?

And how would you handle the garbage collection required for the "transparent"
case you are trying to get?

Please stop the wishful thinking and instead provide alternative ideas for how to solve the issue.

Please provide a solution that is as simple as:
/* in the driver init code - if you want to "improve" the latency of spi_async */
spi_prepare_message(spi, message);
/* in the driver release code - if you want to "improve" the latency of spi_async */
spi_unprepare_message(spi, message);

/* and the "check" itself */
static int __spi_message_has_changed(struct spi_device *spi,
		struct spi_message *message)
{
	return !message->is_driver_prepared;
}

So what about the performance measurements I have mentioned?
Do you have any concerns with the described approach and the metrics proposed?
You want me to measure and publish those results?

Martin--
To unsubscribe from this list: send the line "unsubscribe linux-spi" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
       [not found]                                                                                                                   ` <3361A01A-C7D0-4689-AFBD-085D3E62A67C-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>
@ 2013-11-11 11:18                                                                                                                     ` Mark Brown
       [not found]                                                                                                                       ` <20131111111842.GE2674-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Mark Brown @ 2013-11-11 11:18 UTC (permalink / raw)
  To: Martin Sperl; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA

[-- Attachment #1: Type: text/plain, Size: 511 bytes --]

On Sun, Nov 10, 2013 at 05:41:30PM +0100, Martin Sperl wrote:

> As explained - you would need to make an educated guess if a spi_transfer 
> chain has changed or not (and I talk about the need to check the chain 
> itself, speed, transferlength, rx_buff,tx_buff,... have not changed).

No, you're going off and talking about something else here.  You are
very focused on the idea of skipping some of the setup but that's not
the only thing being discussed.  Please try to separate the various
ideas.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
       [not found]                                                                                                                       ` <20131111111842.GE2674-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>
@ 2013-11-11 11:44                                                                                                                         ` Martin Sperl
  0 siblings, 0 replies; 57+ messages in thread
From: Martin Sperl @ 2013-11-11 11:44 UTC (permalink / raw)
  To: Mark Brown; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA

Hi Mark!

On 11.11.2013 12:18, Mark Brown wrote:
> No, you're going off and talking about something else here. You are
> very focused on the idea of skipping some of the setup but that's not
> the only thing being discussed. Please try to separate the various
> ideas. 

Well - from my perspective (which is why I started this thread) it is
_all_ about skipping setup work (besides the question of why
spi_master.transfer is deprecated, which remains unanswered).

If you want to discuss something else in the same context, then please
state explicitly what you want to get addressed here as well.

I have also seen a disconnect in this discussion before
(including running in circles), and I had tried to summarize my
understanding of this disconnect earlier, but got no response to that part.

So to summarize my understanding of the disconnect again:
* your concern is "refactoring" the message pump to reduce bus-driver code
* my concern is speed and latencies of SPI messages

And as far as I can see they primarily touch in the area of functionality
that both call "prepare", but for different purposes - one on the bus-driver
interface and the other on the device-driver interface.

So again, please state what you want to achieve in this context and how
you see the two subjects being interconnected and why/how they need to
get handled together.

Thanks, Martin








--
To unsubscribe from this list: send the line "unsubscribe linux-spi" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
       [not found]                                                                                                               ` <6C7903B3-8563-490E-AD7D-BA5D65FFB9BC-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>
@ 2013-11-12  1:19                                                                                                                 ` Mark Brown
       [not found]                                                                                                                   ` <20131112011954.GH2674-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Mark Brown @ 2013-11-12  1:19 UTC (permalink / raw)
  To: Martin Sperl; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA

[-- Attachment #1: Type: text/plain, Size: 5439 bytes --]

On Sun, Nov 10, 2013 at 11:54:16AM +0100, Martin Sperl wrote:
> On 09.11.2013, at 19:30, Mark Brown wrote:

> > What I'm missing with this example is the benefit of having an API for
> > pre-cooked messages, how it will deliver a performance improvement?
> > Flashing should be an infrequent operation and I'd expect it to involve
> > little reuse of messages which I'd expect to be the main thing that we
> > could gain from the API change.  I'd also not expect the controller
> > specific work to be especially expensive.

> It will deliver a performance improvement by not having to create identical 
> structures or running sanity thest every time you transmit the prepared message.

But is that a 0.01% improvement or a 50% improvement and is that because
we're pushing a silly implementation into a corner or because there's
actually substantial work that needs doing there?

If the error checking and so on is really that expensive this sounds
like we need to fix that anyway, even if it's just quick things like
making some of the validation debug only or pushing some of the work
out to the message update functions.

The messages and transfers ought to be totally reusable already, the
only bit that should need to be recreated is any data the driver creates
internally, most likely that's only going to be data for DMA.  We do
already have an interface for drivers to do the DMA mapping though it's
not very well loved at the minute...

> And for the DMA chains I have to create DMA chains that are of a length 10-20 
> - depending on the number of transfers.

What's the expense here - the mapping or something else?  This does also
seem like a lot of scatter/gather, why is the data so spread out and is
there room to do better in the client driver?

We ought to modernise this interface to be more in tune with what
dmaengine is doing anyway (probably taking scatterlists for example) -
you were also talking about DMA a lot in some of your earlier messages
and it is where I'd expect the cost to be so I'm wondering if this isn't
actually more to do with this interface being poor and if we don't need
dmaengine changes too to realise the benefits.

> I would say any little less CPU utilization helps if it is easy to implement.

I'm not clear on the easy to implement bit here (though obviously there
haven't been patches posted for review yet), especially the interaction
with DMA if that's a part of this, or what exactly you're expecting to
be allowed to happen to the message while it's been partially initialised
by the driver (I wouldn't use the term prepared for what you're
suggesting by the way, I'd expect preparation to be interacting with the
hardware) since that will affect how widely clients could take advantage
of this.

The other bit of easy to implement that concerns me is making sure that
it's easy to implement for the stack as a whole - one of the big
problems with the existing stack is that there is too little factoring
out of code into the core so we end up with individual drivers
implementing good ideas that are widely useful and most drivers being
basic so drivers are more complex to review and the benefits are not
widely distributed.  This makes me worry about an API that just
delegates everything to the driver and which requires both master and
client drivers to specifically support the feature.

> > You've been doing a bunch of other performance improvements as well,
> > it's not clear to me how much of this is coming from the API change and
> > how much of this is due to changes which can also be done by improving
> > the implementation and without requiring drivers to be specifically
> > updated to take advantage of it. 

> I have told you, that I have created a module parameter which enables/disables 
> the prepare code.

Right, and what I'm saying is that I'm not clear what's being done in
that code and hence why moving it makes a difference.

> >> Does this answer your question and convince you of this being realistic?

> > It's still not clear to me exactly how this works and hence if the
> > benefit comes from the API change itself.

> There is no change in API - there is only an optional extension...

Adding new interfaces is a change too, on the driver side as well.

> Would these be good enough scenarios for you to finally convince you if the
> numbers are favorable?

These seem rather artificial and I'm not sure they'd tell us much about
why the effects were being observed.  What would be more interesting
would be seeing where the time is being spent at the minute for real
loads (for example by using trace points like those that are currently
being added in mainline for examining the behaviour of the message pump)
- if we really are spending a noticeable proportion of the time in setup
then that's obviously a good case for optimising it but right now I
don't feel I understand which bits are slow enough to register in the
context of the overall transaction and hence if the suggested approach
is the best way to optimise them.

An implementation that avoided the need to have the master drivers
updated in order to get benefits would be a lot more interesting since
that would make this much more generally useful, and of course concrete
patches would be good too, but the main thing I'm missing is clarity on
how the gains are being made.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
       [not found]                                                                                                                   ` <20131112011954.GH2674-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>
@ 2013-11-12 14:42                                                                                                                     ` Martin Sperl
       [not found]                                                                                                                       ` <52823E73.503-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Martin Sperl @ 2013-11-12 14:42 UTC (permalink / raw)
  To: Mark Brown; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA

Hi Mark!

>> It will deliver a performance improvement by not having to create identical
>> structures or run sanity tests every time you transmit the prepared message.
>
> But is that a 0.01% improvement or a 50% improvement and is that because
> we're pushing a silly implementation into a corner or because there's
> actually substantial work that needs doing there?

Well - in the case of my DMA driver it is an improvement of 8 percentage
points: from 88% system load down to 80% system load (and the current code 
is still walking some lists to find a match - to avoid changing spi.h 
for now).

And that is with the transfer_one_message approach.

I will try to find some time to patch spi.c so that it toggles the
CS lines while it is doing the "verification" code. Then I can look
with my logic analyzer to see how many ns it takes (in 64ns step 
granularity).

And then I will take the stock spi-bcm2835 driver from upstream and
the enc28j60 network driver also from upstream and report on those numbers.

Note that - as far as I have seen - the enc28j60 is mostly using 
write_then_read for its SPI-transactions

>
> If the error checking and so on is really that expensive this sounds
> like we need to fix that anyway, even if it's just quick things like
> making some of the validation debug only or pushing some of the work
> out to the message update functions.
>
> The messages and transfers ought to be totally reusable already, the
> only bit that should need to be recreated is any data the driver creates
> internally, most likely that's only going to be data for DMA.  We do
> already have an interface for drivers to do the DMA mapping though it's
> not very well loved at the minute...

Well - that is what I have been asking for - a means to shortcut the 
computation, plus a set of 2 fields (one flag and one void pointer, both of 
which are defined as fully owned by the bus driver - unlike state and 
queue, which are ONLY owned by the driver/framework between the calls to
spi_async() and complete()).

What DMA interface are you talking about? At least in spi.h in 3.10 I 
find nothing that would indicate an extra DMA interface...

>> And for the DMA chains I have to create DMA chains that are of a length 10-20
>> - depending on the number of transfers.
>
> What's the expense here - the mapping or something else?  This does also
> seem like a lot of scatter/gather, why is the data so spread out and is
> there room to do better in the client driver?
Well - that is a bit long to answer, so here is a short description first:

The driver runs ALL the needed configuration of the SPI registers via DMA as
well. The main reason for this is that I want to be able to pipeline the DMAs
without any interaction from the CPU - besides the scheduling/chaining onto
the existing (possibly running) DMA chain, and, in case the message defines a
complete() callback, triggering an interrupt for that portion. (With the
driver still using the spi_message_pump and transfer_one_message, this is not
fully the case yet.)

This means that as long as you just schedule spi_async messages, the DMA is
doing everything on its own.

So (with some extension to the SPI API to allow for cyclic messages) you 
could even run an LCD display that gets updated 25 times per second, 
streaming from two frame buffers to the display - and that without any
CPU interaction at all (besides the complete() callback, if you need to get 
notified of the buffer "swap").

But that is an idea for the future...

So here is the long description of what we need to do for, say, a message 
that implements write_then_read (2-byte write, 10-byte read), showing 
the RX and TX DMA:

RX-DMA                                    TX-DMA
configure CDIV register
  (= SPI Clock Divider)
reset RX/TX-FIFOs
configure number of Bytes in TX/RX-DMA (2)
configure Control-register to start SPI
configure next Control block for TX-DMA
trigger start of TX-DMA controlblock
                                           Transfer 2 bytes to SPI-FIFO
transfer 2 bytes from SPI-FIFO (with DST_IGNORE)

 >> if delay_usec is set, then:
configure some delay cycles in DMA by
   (transfer X bytes with DMA_IDLE_CYCLES=31
    and SRC_IGNORE and DST_IGNORE)

 >> if the transfer would have been a multiple of 4 and cs_change is NOT 
set and no change in CDIV we would skip the following steps:

 >> if cs_change is set for the transfer:
configure Control_register to bring CS up
configure some delay so that CS is high for
   about half a SPI-clock cycle (otherwise
   devices might not recognize it)

 >> now the second transfer:

 >> if cdiv has changed, then:
configure CDIV register
  (= SPI Clock Divider)

reset RX/TX-FIFOs
configure number of Bytes in TX/RX-DMA (10)
configure Control-register to start SPI
configure next Control block for TX-DMA
trigger start of TX-DMA controlblock


 >> here we continue (in case the first transfer is a multiple of 4)
                                           Transfer 10 bytes to SPI-FIFO
                                             (with src_ignore, so 0)
transfer 10 bytes from SPI-FIFO

 >> if we need to trigger a DMA (because of callback())
configure next Control block for TX-DMA
trigger start of TX-DMA controlblock
					  transfer information which
					  message we have finished
					  and trigger DMA Interrupt

So this includes "all" options that we have currently implemented in SPI 
(cs_change, speed_hz, delay_usecs) and which the device supports.

And as mentioned in the "short" description the next message can get 
chained immediately to the last control block of the RX-DMA chain.
And all that with minimal delay and CPU overhead (as soon as the chain
has been set up).

This "chaining" is _obviously_ not possible with the transfer_one 
message BUS-master-API (or it would require some "ugly hacks" to make 
use of transfer_one_message API)

The code right now does not make use of dmaengine, but uses dmapool
for memory management of the control blocks plus extra management data.

And this already takes some time without even going through the 
dmaengine layer, which would possibly mean two allocations instead of a
single one.
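
To give an idea what building such a chain looks like (a very rough sketch -
the control-block layout follows the SoC documentation, the helper and the
pool handling are simplified):

struct bcm2835_dma_cb {
	u32 ti;      /* transfer information */
	u32 src;     /* source bus address */
	u32 dst;     /* destination bus address */
	u32 length;
	u32 stride;
	u32 next;    /* bus address of the next control block, 0 = end of chain */
	u32 pad[2];
};

static struct bcm2835_dma_cb *cb_chain_append(struct dma_pool *pool,
		struct bcm2835_dma_cb *prev, dma_addr_t *cb_dma)
{
	struct bcm2835_dma_cb *cb = dma_pool_alloc(pool, GFP_KERNEL, cb_dma);

	if (!cb)
		return NULL;
	memset(cb, 0, sizeof(*cb));
	if (prev)
		prev->next = *cb_dma;  /* link onto the (possibly running) chain */
	return cb;
}

For a 2-transfer message an append like this runs 10-20 times, and that is
exactly the setup cost that "prepare" would pay only once.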

And that setup cost is what I would like to avoid having to run every time
we call spi_async for a spi_message that has not changed.

> We ought to modernise this interface to be more in tune with what
> dmaengine is doing anyway (probably taking scatterlists for example) -
> you were also talking about DMA a lot in some of your earlier messages
> and it is where I'd expect the cost to be so I'm wondering if this isn't
> actually more to do with this interface being poor and if we don't need
> dmaengine changes too to realise the benefits.

You might look at my example above (and at some point at the driver code 
itself - when it becomes stable - unless you want to peek at GitHub..) 
to see how we may generalize it.

And at least in the ARM arena it should be possible to apply the same
principle to other SPI controllers+DMA as well. But you might need to 
work around some HW bugs that differ from the 4 that I have found in the 
bcm2835 SOC (DMA-Interrupt only on chain end, CLEAR-FIFO-TX will bring 
CS high in 5% of all cases, CLEAR-FIFO-XX+start transfer will trigger a 
bug that some SCK does not change while MOSI does change, ADCS needs to 
be set for "the described" way to do SPI-DMA on the SOC - which is 
totally different from what I do to get all features into the DMA 
chain), so other "scheduling plans" might be needed.

So I would not try to come up with a generic approach while we only have a 
single driver that can do this right now.

Well - as said, I am wondering if dmaengine can really help here - 
especially if you want to keep the "prepare" code fast. I am using
dmapool for memory alloc/free, but I fear we would need to run 2 allocations
instead of one if I were using dmaengine, plus more setup/tear-down overhead,
as well as more code that needs to get executed to go through dmaengine.

As said: I will keep it as-is for now and first implement the full DMA
chaining approach. Then we can see how much the move to dmaengine 
costs...

> I'm not clear on the easy to implement bit here (though obviously there
> haven't been patches posted for review yet), especially the interaction
> with DMA if that's a part of this, or what exactly you're expecting to
> be allowed to happen to the message while it's been partially initalised
> by the driver (I wouldn't use the term prepared for what you're
> suggesting by the way, I'd expect preparation to be interacting with the
> hardware) since that will affect how widely clients could take advantage
> of this.
Well, my naming comes from the analogy of prepared SQL statements, which 
also take an SQL statement and precompile an execution plan for it once.
The big difference is that comparing 2 SQL strings is cheaper than 
comparing 2 memory structures, and DBs typically run on much faster HW 
than an ARM at 700MHz, so prepared SQL is not as important any more.
It is still kept as a hint to the execution planner (and it also reduces 
the bytes to transfer over the network...).

As for "unmodified" drivers taking advantage of this:
I believe that any driver that only uses the sync interfaces spi_read, 
spi_write, spi_write_then_read can get improved (at the cost of memory)
by having optimized versions of those functions that keep the structures 
in a cache for the "hot" path.
But then all those drivers that make use of the sync interface are not 
geared towards thru-put in the first, as they are blocking, so these are
not the prime "candidates" for the improvement in the first place.
If they want to run with minimal latencies, they HAVE to run spi_async.
Anything else is not very efficient as it requires several context 
switches by the way of the current design.

> The other bit of easy to implement that concerns me is making sure that
> it's easy to implement for the stack as a whole - one of the big
> problems with the existing stack is that there is too little factoring
> out of code into the core so we end up with individual drivers
> implementing good ideas that are widely useful and most drivers being
> basic so drivers are more complex to review and the benefits are not
> widely distributed.  This makes me worry about an API that just
> delegates everything to the driver and which requires both master and
> client drivers to specifically support the feature.
If you can get away from the idea that prepare is ONLY related to the 
master, then it may provide additional "advantages". Also, when you do 
have a non-prepared message, you can still call the prepare code 
from the framework and unprepare it afterwards - like I had proposed.
That way, any "checks" the driver would need to do to see if it can 
handle the message can be done immediately in the "prepare" code.

So essentially start to think of "prepare_message" as: do any 
verification of the message that is needed to set it up correctly and 
find unsupported situations, and also optimize the message into 
whatever the driver may find helpful for its own purposes. One candidate 
here could also be to implement dma_map_single in the prepare call...

Then the change would look like this:
static int __spi_prepare_message(struct spi_device *spi,
		struct spi_message *mesg)
{
	/* sanity check message and copy the fields - see __spi_async */

	/* call the master prepare hook if it exists */
	if (spi->master->prepare_message)
		return spi->master->prepare_message(spi->master, mesg);

	return 0;
}

static int spi_prepare_message(struct spi_device *spi,
		struct spi_message *mesg)
{
	int status;

	status = __spi_prepare_message(spi, mesg);
	if (!status)
		mesg->is_prepared = 1;
	return status;
}

static int __spi_unprepare_message(struct spi_device *spi,
		struct spi_message *mesg)
{
	/* call the master unprepare hook if it exists */
	if (spi->master->unprepare_message)
		return spi->master->unprepare_message(spi->master, mesg);
	return 0;
}

static int spi_unprepare_message(struct spi_device *spi,
		struct spi_message *mesg)
{
	mesg->is_prepared = 0;
	return __spi_unprepare_message(spi, mesg);
}

static int __spi_async(struct spi_device *spi,
		struct spi_message *message)
{
	int status;

	/* prepare the message using the internal prepare if needed */
	if (!message->is_prepared) {
		status = __spi_prepare_message(spi, message);
		if (status)
			return status;
	}
	message->spi = spi;
	message->status = -EINPROGRESS;
	return spi->master->transfer(spi, message);
}

And this is what I would envision as a total change you would have to 
do... (besides the 2 fields as mentioned...)

So now, if you want to start refactoring transfer_one_message into a generic 
version - as far as I can see it could get reduced to:

static int spi_transfer_one(struct spi_master *master,
		struct spi_message *message)
{
	int status;

	/* schedule the message in the controller */
	status = master->schedule_message(master, message);
	if (status)
		goto error;

	/* wait for completion */
	if (wait_for_completion_timeout(
			&bs->done, /* wherever you put the completion */
			msecs_to_jiffies(SPI_TIMEOUT_MS)) == 0)
		status = -ETIMEDOUT;

error:
	/* undo the implicit prepare done in __spi_async */
	if (!message->is_prepared)
		__spi_unprepare_message(message->spi, message);

	spi_finalize_current_message(master);
	return status;
}

With this, the bcm2835_spi_transfer_one function could (mostly) get 
reduced to bcm2835_spi_start_transfer (with some of the additional 
xfer scheduling moved to the interrupt handler).

So separating things out immediately reduces the code of the individual 
drivers if they move to the newer schedule_message interface...
And if they do not need prepare/unprepare, it is (almost) a NOP.

> Right, and what I'm saying is that I'm not clear what's being done in
> that code and hence why moving it makes a difference.

Well - the calculation of the above DMA chain takes a few usecs, and with
the code "above" plus the prepare in the driver we avoid spending those few
usecs on every transmission.

> Adding new interfaces is a change too, on the driver side as well.
Which is optional and would at first get used only by device-drivers that 
need high throughput. And those device-drivers that need this should 
already be heavily optimized to avoid memory allocations (even if it is 
just on the heap) and message setup for every call.

> These seem rather artificial and I'm not sure they'd tell us much about
> why the effects were being observed.  What would be more interesting
> would be seeing where the time is being spent at the minute for real
> loads (for example by using trace points like those that are currently
> being added in mainline for examining the behaviour of the message pump)
well - my use case is exactly that CAN issue where I have message loss
when only sending 5 packets in sequence in 30% of all cases with stock 
drivers. That is why I started investigating the issue in the first 
place - so do not tell me it is academic!

The proposed SPI-test driver is academic to some extent, but it still 
tries to simulate "real" device traffic with optimal throughput.
Or do you think that reading an ADC as fast as possible is an academic 
exercise? (The test driver would essentially mimic that and let the 
scheduling mechanisms that allow high throughput do their work, if you 
configure multiple messages.)

> - if we really are spending a noticeable proportion of the time in setup
> then that's obviously a good case for optimising it but right now I
> don't feel I understand which bits are slow enough to register in the
> context of the overall transaction and hence if the suggested approach
> is the best way to optimise them.
well - you see the setup I have to make to transfer 2 writes and 10 reads?
That is 10+ calls to dmapool_alloc, then filling in the structures.
That takes some time (can't remember how many usec right now).
And then for tearing it down I need to do the same again - OK, mostly 
dmapool_release...

> An implementation that avoided the need to have the master drivers
> updated in order to get benefits would be a lot more interesting since
> that would make this much more generally useful, and of course concrete
> patches would be good too, but the main thing I'm missing is clarity on
> how the gains are being made.
a master driver will never be able to profit from the new interface, 
besides the "small" gain of not having to validate the messages every 
time in __spi_async (if the device-driver does call prepare)

And as said: all the spi_sync helpers are candidates for this 
optimization - at the cost of some memory used for caching multiple 
messages (say, 16 1xspi_message+1xspi_transfer structures for transfer 
lengths 1 to 16 in spi_write, similar for spi_read, or 256 
1xspi_message+2xspi_transfer structures to cover 1-16 writes combined 
with 1-16 reads - plus some memory copies for the reads and writes).
A rough sketch of how that could look is below.
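
Just as an illustration (a rough sketch only - the names are made up and
a real implementation would of course need locking or per-caller copies
of the cached messages):

#include <linux/spi/spi.h>
#include <linux/string.h>

/* one prepared single-transfer write message per length 1..16 */
struct cached_write {
	struct spi_message msg;
	struct spi_transfer xfer;
	u8 buf[16];
};
static struct cached_write write_cache[16];

static void init_write_cache(struct spi_device *spi)
{
	int i;

	for (i = 0; i < 16; i++) {
		struct cached_write *c = &write_cache[i];

		spi_message_init(&c->msg);
		c->xfer.tx_buf = c->buf;
		c->xfer.len = i + 1;
		spi_message_add_tail(&c->xfer, &c->msg);
		spi_prepare_message(spi, &c->msg);	/* proposed call */
	}
}

static int cached_spi_write(struct spi_device *spi, const void *buf, size_t len)
{
	struct cached_write *c;

	if (len == 0 || len > 16)
		return spi_write(spi, buf, len);	/* fall back to stock path */

	c = &write_cache[len - 1];
	memcpy(c->buf, buf, len);	/* the memory copy mentioned above */
	return spi_sync(spi, &c->msg);
}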

Martin



* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
       [not found]                                                                                                                       ` <52823E73.503-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>
@ 2013-11-12 17:59                                                                                                                         ` Martin Sperl
       [not found]                                                                                                                           ` <2252E63E-176C-43F7-B259-D1C3A142DAFE-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>
  2013-11-13 15:11                                                                                                                         ` Mark Brown
  2013-11-14  1:50                                                                                                                         ` Mark Brown
  2 siblings, 1 reply; 57+ messages in thread
From: Martin Sperl @ 2013-11-12 17:59 UTC (permalink / raw)
  To: Mark Brown; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA

First, sorry for the bad formatting in the last email - it looked ok when typing 
it with Thunderbird... I even used the recommended plugin to make it wrap 
correctly (still format=flowed).

OK, but here are some statistics that I had promised.
I got the following data from my logic analyzer with a patched spi.c
that brings CS down while running the sanity check in __spi_async!
(note that this kind of debugging has some potential race-conditions with HW 
- if a DMA-transfer is running -  but for the minimal cases it gives
good enough data with minimal "overhead")

This is the code that I measure:
	/* bring CS1 down by switching polarity */
        writel((1<<22),*regs);
        /* Half-duplex links include original MicroWire, and ones with
         * only one data pin like SPI_3WIRE (switches direction) or where
         * either MOSI or MISO is missing.  They can also be caused by
         * software limitations.
         */
        if ((master->flags & SPI_MASTER_HALF_DUPLEX)
                        || (spi->mode & SPI_3WIRE)) {
                unsigned flags = master->flags;

                list_for_each_entry(xfer, &message->transfers, transfer_list) {
                        if (xfer->rx_buf && xfer->tx_buf)
                                return -EINVAL;
                        if ((flags & SPI_MASTER_NO_TX) && xfer->tx_buf)
                                return -EINVAL;
                        if ((flags & SPI_MASTER_NO_RX) && xfer->rx_buf)
                                return -EINVAL;
                }
        }

        /**
         * Set transfer bits_per_word and max speed as spi device default if
         * it is not set for this transfer.
         */
        list_for_each_entry(xfer, &message->transfers, transfer_list) {
                if (!xfer->bits_per_word)
                        xfer->bits_per_word = spi->bits_per_word;
                if (!xfer->speed_hz)
                        xfer->speed_hz = spi->max_speed_hz;
                if (master->bits_per_word_mask) {
                        /* Only 32 bits fit in the mask */
                        if (xfer->bits_per_word > 32)
                                return -EINVAL;
                        if (!(master->bits_per_word_mask &
                                        BIT(xfer->bits_per_word - 1)))
                                return -EINVAL;
                }
        }
	/* bring CS1 up by switching polarity again */
        writel(0x00,*regs);

Depending on the message actually delivered I see that CS is down for about:
1 transfer/message:  0.75us
3 transfers/message: 2.5us
8 transfers/message: 4.7us

Now the preparing of the DMA structure described before takes:
1 transfers/message: 26us
2 transfers/message: 18-23us
3 transfers/message: 46-55us
8 transfers/message: 106us

There are obviously variations depending on what data is in L2 cache.

As for other interesting measurements a single example with 5 transfers:
Interrupt to __spi_async:     19us
__spi_async sanity start/end:  2us
__SPI_ASYNC to DMA_PREPARE:   99us
dma_prepare start/end:        40us
dma_prepare_end to CS DOWN:    4us
CS DOWN to CS UP:             16us (real transfer)

the values differ between interrupts, but I believe it still shows
where we see the delays that produce latencies.

If we use the prepare-message interface with my proposed change,
then the 2us for the sanity checks no longer apply,
and the 40us for dma_prepare no longer apply either.

This shows that we can reduce the necessary CPU cycles for "normal"
use by bus drivers that do NOT implement any internal "preparations"
to make DMA work, and how much more we can save if we go the DMA route.

Also, do you see how long it takes from the time we have called transfer
to the time when dma_prepare gets run inside transfer_one_message?

This is actually one of the biggest "factors" in latency on its own.
And that is the scheduling delay of the message-pump!

So the "threaded" approach - even though nice on paper - is actually
introducing long latencies. And that is why i would like to see how 
a fully DMA pipelined driver would fare - my guess is that the delays
would be much smaller. It will not change the delays interrupt to
running the handler, but the scheduling of the message pump would go
away immediately.

To summarize: 10k SPI messages/s with stock drivers produce 10ms/s of 
CPU utilization just for this per-message overhead when running "stock"
code (10,000 messages x roughly 1us of avoidable work per message).
When optimizing the device-drivers to run the same messages prepared
(by adding the spi_prepare/unprepare_message calls in the initialization
phase) we save about 1% System CPU (assuming messages with a single
transfer).

So I hope that these are hard enough facts to show that "preparing"
messages that do not "change in structure" in a device driver _does_
bring an advantage.


Martin


* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
       [not found]                                                                                                                       ` <52823E73.503-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>
  2013-11-12 17:59                                                                                                                         ` Martin Sperl
@ 2013-11-13 15:11                                                                                                                         ` Mark Brown
       [not found]                                                                                                                           ` <20131113151102.GS878-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>
  2013-11-14  1:50                                                                                                                         ` Mark Brown
  2 siblings, 1 reply; 57+ messages in thread
From: Mark Brown @ 2013-11-13 15:11 UTC (permalink / raw)
  To: Martin Sperl; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA

[-- Attachment #1: Type: text/plain, Size: 830 bytes --]

On Tue, Nov 12, 2013 at 03:42:59PM +0100, Martin Sperl wrote:

Not fully read this yet but one thing...

> >The messages and transfers ought to be totally reusable already, the
> >only bit that should need to be recreated is any data the driver creates
> >internally, most likely that's only going to be data for DMA.  We do
> >already have an interface for drivers to do the DMA mapping though it's
> >not very well loved at the minute...

> What DMA-interface are you talking about? at least in spi.h in 3.10
> I find nothing that would indicate an extra DMA interface...

Drivers can supply pre-mapped DMA buffers if they set is_dma_mapped
which is about as good as you get with dmaengine at the minute.  The
main opportunities for enhancement here are in the master interface
rather than the client, at least as things stand.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]


* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
       [not found]                                                                                                                           ` <2252E63E-176C-43F7-B259-D1C3A142DAFE-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>
@ 2013-11-13 15:43                                                                                                                             ` Mark Brown
       [not found]                                                                                                                               ` <20131113154346.GT878-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Mark Brown @ 2013-11-13 15:43 UTC (permalink / raw)
  To: Martin Sperl; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA

[-- Attachment #1: Type: text/plain, Size: 2722 bytes --]

On Tue, Nov 12, 2013 at 06:59:02PM +0100, Martin Sperl wrote:

> First, sorry for the bad formatting in the last email - it looked ok when typing 
> it with Thunderbird... I even used the recommended plugin to make it wrap 
> correctly (still format=flowed).

Like I say still reading that, a couple of initial thoughts though...

> As for other interesting measurements a single example with 5 transfers:
> Interrupt to __spi_async:     19us
> __spi_async sanity start/end:  2us
> __SPI_ASYNC to DMA_PREPARE:   99us
> dma_prepare start/end:        40us
> dma_prepare_end to CS DOWN:    4us
> CS DOWN to CS UP:             16us (real transfer)

This is making me question the use of DMA at all here, this looks like
the situation of a lot of drivers where they switch to PIO mode for
small transfers since the cost of managing DMA is too great.  I'm also
curious which parts of the DMA preparation are expensive - is it the
building of the datastructure for DMA or is it dealing with the
coherency issues for the DMA controller?  The dmaengine API currently
needs transfers rebuilding each time I believe...

Also how does this scale for larger messages?

I appreciate that you want to push the entire queue down into hardware,
I'm partly thinking of the costs for drivers that don't go and do any of
the precooking here.

> This is actually one of the biggest "factors" to latency on its own.
> And that is the scheduling delay of the message-pump!

Right, which is why I'm much more interested in the refactoring of that
code to support kicking transfers without bouncing up there - the stuff
I was talking about which made you talk about checksumming and so on.

> So the "threaded" approach - even though nice on paper - is actually
> introducing long latencies. And that is why i would like to see how 
> a fully DMA pipelined driver would fare - my guess is that the delays
> would be much smaller. It will not change the delays interrupt to
> running the handler, but the scheduling of the message pump would go
> away immediately.

Exactly, but this is largely orthogonal to having the client drivers
precook the messages.  You can't just discard the thread since some
things like clock reprogramming can be sleeping but we should be using
it less (and some of the use that is needed can be run in parallel with
the transfers of other messages if we build up a queue).

> So I hope that these are some hard enough facts for you to say, that 
> "preparing" messages, that do not "change in structure" in a device
> driver _does_ bring an advantage.

The 40us is definitely somewhat interesting though I'd be interested to
know how that compares with PIO too.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]


* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
       [not found]                                                                                                                           ` <20131113151102.GS878-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>
@ 2013-11-13 15:48                                                                                                                             ` Martin Sperl
       [not found]                                                                                                                               ` <77070979-0CE4-4C76-B12E-DA94B2577172-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Martin Sperl @ 2013-11-13 15:48 UTC (permalink / raw)
  To: Mark Brown; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA

Hi Mark!

> 
> Drivers can supply pre-mapped DMA buffers if they set is_dma_mapped
> which is about as good as you get with dmaengine at the minute.  The
> main opportunities for enhancement here are in the master interface
> rather than the client, at least as things stand.


This is part of my prerequisite for making it preparable in my case...

Because otherwise you have (again) to walk the tree to set the rx_dma
via dma_map_single and here the API explicitly states that between the map 
and unmap call the caller may NOT access the pages.

The reason is (probably) old ISA devices that may only be able to DMA to 
24-bit addresses. So map/unmap is actually supposed to copy data to a bounce 
buffer and back (depending on the direction given).

Typically not an issue on ARM though, I would say...

Hence mapping on every message becomes expensive, and creating the "whole"
thing up front works just as well.
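
For illustration, pre-mapping a single-transfer message could look roughly
like this (only a sketch; whether to map against the master's parent device
is itself driver-specific, and the 3.x fields tx_dma/rx_dma/is_dma_mapped
are assumed):

#include <linux/dma-mapping.h>
#include <linux/spi/spi.h>

static int example_premap(struct spi_device *spi, struct spi_message *msg,
			  struct spi_transfer *xfer)
{
	struct device *dev = spi->master->dev.parent;

	xfer->tx_dma = dma_map_single(dev, (void *)xfer->tx_buf,
				      xfer->len, DMA_TO_DEVICE);
	if (dma_mapping_error(dev, xfer->tx_dma))
		return -ENOMEM;

	xfer->rx_dma = dma_map_single(dev, xfer->rx_buf,
				      xfer->len, DMA_FROM_DEVICE);
	if (dma_mapping_error(dev, xfer->rx_dma)) {
		dma_unmap_single(dev, xfer->tx_dma, xfer->len, DMA_TO_DEVICE);
		return -ENOMEM;
	}

	/* tell the bus driver it does not have to map anything itself */
	msg->is_dma_mapped = 1;
	return 0;
}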

Ciao,
	Martin






* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
       [not found]                                                                                                                               ` <77070979-0CE4-4C76-B12E-DA94B2577172-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>
@ 2013-11-13 16:59                                                                                                                                 ` Mark Brown
  0 siblings, 0 replies; 57+ messages in thread
From: Mark Brown @ 2013-11-13 16:59 UTC (permalink / raw)
  To: Martin Sperl; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA

[-- Attachment #1: Type: text/plain, Size: 1519 bytes --]

On Wed, Nov 13, 2013 at 04:48:47PM +0100, Martin Sperl wrote:

> > Drivers can supply pre-mapped DMA buffers if they set is_dma_mapped
> > which is about as good as you get with dmaengine at the minute.  The
> > main opportunities for enhancement here are in the master interface
> > rather than the client, at least as things stand.

> This is part of my pre-requisite for making it prepareable in my case...

> Because otherwise you have (again) to walk the tree to set the rx_dma
> via dma_map_single and here the API explicitly states that between the map 
> and unmap call the caller may NOT access the pages.

So you expect the caller to also keep the buffers mapped separately?
The restrictions on mapping are one of the concerns I have with nailing
down exactly what you're envisaging a prepared and unprepared
transaction doing (plus masters that don't actually want DMA in the
first place).

> The reason is (probably) for old-ISA devices that may only DMA map 24 bit 
> addresses. So the map/unmap actually is supposed to copy data to a bounce 
> buffer and back (depending on the direction given).

> Typically not an issue on ARM though, I would say...

No, it's not just for ISA - other systems can have DMA access
restrictions for example due to LPAE or other things that cause limited
addressing on some internal buses and obviously many modern systems have
IOMMUs which require things to be mapped in and out before devices can
access them without needing bounce buffers.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]


* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
       [not found]                                                                                                                               ` <20131113154346.GT878-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>
@ 2013-11-13 18:35                                                                                                                                 ` Martin Sperl
       [not found]                                                                                                                                   ` <ED58E869-A9F6-4BB2-8EC6-D71F946509DC-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Martin Sperl @ 2013-11-13 18:35 UTC (permalink / raw)
  To: Mark Brown; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA

Hi Mark
>> As for other interesting measurements a single example with 5 transfers:
>> Interrupt to __spi_async:     19us
>> __spi_async sanity start/end:  2us
>> __SPI_ASYNC to DMA_PREPARE:   99us
>> dma_prepare start/end:        40us
>> dma_prepare_end to CS DOWN:    4us
>> CS DOWN to CS UP:             16us (real transfer)
> 
> This is making me question the use of DMA at all here, this looks like
> the situation of a lot of drivers where they switch to PIO mode for
> small transfers since the cost of managing DMA is too great.  I'm also
> curious which parts of the DMA preparation are expensive - is it the
> building of the datastructure for DMA or is it dealing with the
> coherency issues for the DMA controller?  The dmaengine API currently
> needs transfers rebuilding each time I believe...
> 
> Also how does this scale for larger messages?
> 
> I appreciate that you want to push the entire queue down into hardware,
> I'm partly thinking of the costs for drivers that don't go and do any of
> the precooking here.

Well - if you look at the above example:
it takes 99us to get from __spi_async (post-message check) to dma_prepare
inside transfer_one_message. And this is the time spent in the framework
AND scheduler!

The DMA prepare takes 40us, so you can still run the DMA prepare in that
time and be faster... (OK, I did not account for the "teardown" time, but
that should be faster, as it just walks the list and returns the blocks to
the dmapool.)

Also keep in mind that this message is (as said) actually comprised of 5
spi_transfers in a chain with two of those having set CHANGE_CS, so it is
already a "bigger" spi_message than you would typically see.

A write_then_read (2 transfers) has a setup time of about 23us. 
But as this transfer happens during "setup", the dmapool may not have any 
pages allocated yet, which increases the overhead for allocation and thus
adds a bias to this measurement.

You also see that this driver is already trying to keep the latencies as
short as possible by using the chaining instead of issuing each of those
CS-enable sequence separately.

So in the end we can give a rough estimate that it takes 10us/transfer to
prepare - which means the break-even in throughput/delay between DMA 
(prepared) and interrupt-driven mode (polling being the worst) lies around 
spi_messages of <=10 transfers. Which is already a pretty big spi
message...

But then you should not forget that with the interrupt-driven approach, 
you have delays between each transfer due to interrupts, when you 
reconfigure the data registers. And I am 100% certain that you will not
be able to achieve a 1us delay between 2 transfers (of arbitrary size)
in interrupt-driven mode. This is again CPU cycles (which I am not even
sure are correctly accounted for in sys_cpu). And then on top of that - 
at least the spi-bcm2835 interrupt makes a wakeup call to the workqueue
whenever it finishes a single transfer (and again the scheduler gets 
involved...). That is also why I have reported a high interrupt and 
context-switch rate for this driver.

So you think this is really cheaper from the CPU perspective to take the
Interrupt-driven approach?

(note also that I have set the spi_queue to run with RT priority, so the
spi_pump has a huge advantage in getting CPU...)

> 
> Exactly, but this is largely orthogonal to having the client drivers
> precook the messages.  You can't just discard the thread since some
> things like clock reprogramming can be sleeping but we should be using
> it less (and some of the use that is needed can be run in parallel with
> the transfers of other messages if we build up a queue).
Ok - the clock setting is possibly valid for some SPI devices, but not for
all. And we all know that a "one-size fits all" does not scale in 
performance.

Also it may be a possibility that different devices have different 
feature-sets, where in part we need to drop back to something else (thread).

At some point I had been playing with the idea of doing just that - having
a spi_pump to handle things like delays and to work around stupid "bugs"...
But still trying to do as much as possible within DMA, I found something
that covers everything the API currently offers.

I have to admit, though, that it required a lot of effort to come up with 
something that works (and it also produces a lot more code), and it also
uncovered a lot more HW issues than expected (I found 4 so far)...

> The 40us is definitely somewhat interesting though I'd be interested to
> know how that compares with PIO too.


Some basic facts ONLY about the SPI transfers themselves,
measured from the first CS down to the last CS up of the 5-transfer message:
71us with PIPO
16us with DMA
So that is ONLY measured on the SPI bus!
The time lost here is between spi_transfers, which is typically
8-12us.
But a lot of time is also lost between the last clock and 
CS up for PIPO: 19us

So the way that the driver is written, its interrupt gets called when it 
is finished with a transfer (or if the FIFO buffer needs to get filled),
and in case of no more data it will wake up the message pump -
which in this case is surprisingly quick at scheduling the next message.

You see where PIPO loses its time?

That is also the reason why I want to move back to the "transfer" interface
and see how much this improves the driver.

Ciao,
	Martin





* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
       [not found]                                                                                                                                   ` <ED58E869-A9F6-4BB2-8EC6-D71F946509DC-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>
@ 2013-11-13 19:33                                                                                                                                     ` Mark Brown
       [not found]                                                                                                                                       ` <20131113193320.GE878-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Mark Brown @ 2013-11-13 19:33 UTC (permalink / raw)
  To: Martin Sperl; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA

[-- Attachment #1: Type: text/plain, Size: 2285 bytes --]

On Wed, Nov 13, 2013 at 07:35:27PM +0100, Martin Sperl wrote:

> You also see that this driver is already trying to keep the latencies as
> short as possible by using the chaining instead of issuing each of those
> CS-enable sequence separately.

It's going to be more sensible to work to eliminate the extra overhead
for doing multiple messages than to faff around in individual client
drivers, solve the problem at the right level - this will play more
nicely when two devices are hitting the same bus as well.

> So you think this is really cheaper from the CPU perspective to take the
> Interrupt-driven approach?

Well, if we ignore the scheduling (which we ought to be able to get rid
of anyway) then the DMA is starting to look like overhead for what
appears to be a small transfer.

> > Exactly, but this is largely orthogonal to having the client drivers
> > precook the messages.  You can't just discard the thread since some
> > things like clock reprogramming can be sleeping but we should be using
> > it less (and some of the use that is needed can be run in parallel with
> > the transfers of other messages if we build up a queue).

> Ok - the clock setting is possibly valid for some SPI devices, but not for
> all. And we all know that a "one-size fits all" does not scale in 
> performance.

> Also it may be a possibility that different devices have different 
> feature-sets, where in part we need to drop back to something else (thread).

Of course, we will need to allow the driver to control when it punts
back to the thread.  What we don't want to have to do is reimplement the
mechanics of all this in every single driver since that's the situation
we have now for everything that isn't bitbanging.  Broadly speaking most
of the hardware we have now falls into a few classes and most of it
isn't being driven anywhere near as effectively as it could be.

> You see where PIPO looses its time?

Not really, I mean the data transfers wind up being slower because
they're bigger than they looked from the timings (you haven't mentioned
how big they are...) but I don't really see why PIO (assuming that's
what you mean by "PIPO" which is an acronym I don't recall hearing
before) impacts the time taken to raise /CS.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]


* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
       [not found]                                                                                                                                       ` <20131113193320.GE878-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>
@ 2013-11-13 21:31                                                                                                                                         ` Martin Sperl
  0 siblings, 0 replies; 57+ messages in thread
From: Martin Sperl @ 2013-11-13 21:31 UTC (permalink / raw)
  To: Mark Brown; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA

>> 
>> You see where PIPO looses its time?
> 
> Not really, I mean the data transfers wind up being slower because
> they're bigger than they looked from the timings (you haven't mentioned
> how big they are...) but I don't really see why PIO (assuming that's
> what you mean by "PIPO" which is an acronym I don't recall hearing
> before) impacts the time taken to raise /CS.

The Transfers are: 1+1CS+2+3CS+3 (numbers = bytes, CS=CS_CHANGE)

But what you ignore is the fact that under load
the spi-bcm2835 driver is running at 90+% system CPU,
while the DMA driver runs at 80-85% system load.
(Not to mention the number of interrupts, context switches, ...)

And DMA is not fully optimized yet - so do not compare it to something
that was used as a first basis to get the issues of DMA solved before 
optimizing all the way.

You are essentially comparing a "production" code (in kernel)
to a R&D code (the dma version).

Take the "case" of DMA as a prototype to see how it does with the 
one-message interface.
And then let us see how pipelined DMA works by comparison...

And then add the "prepare" message into the mix as further
possible optimizations to see how far this brings us away from
the original driver with its high system_CPU, interrupts and context
switch count.

Ciao,
	Martin



* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
       [not found]                                                                                                                       ` <52823E73.503-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>
  2013-11-12 17:59                                                                                                                         ` Martin Sperl
  2013-11-13 15:11                                                                                                                         ` Mark Brown
@ 2013-11-14  1:50                                                                                                                         ` Mark Brown
       [not found]                                                                                                                           ` <20131114015009.GB26614-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>
  2 siblings, 1 reply; 57+ messages in thread
From: Mark Brown @ 2013-11-14  1:50 UTC (permalink / raw)
  To: Martin Sperl; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA

[-- Attachment #1: Type: text/plain, Size: 10220 bytes --]

On Tue, Nov 12, 2013 at 03:42:59PM +0100, Martin Sperl wrote:

> This "chaining" is _obviously_ not possible with the transfer_one
> message BUS-master-API (or it would require some "ugly hacks" to
> make use of transfer_one_message API)

Meh, it doesn't seem that urgently different to be honest.  The basic
flow of interactions with the device is very much typical of SPI
controllers.

In any case while it's not entirely clearly explained here I *think* the
major thing that's being done differently here is that the register
writes to the controller are being done using DMA.  Is that the case?

That's certainly a bit of an innovation and I don't think it's actually
all that device specific - while reusing the same channel for things
other than the actual data transfer is a bit cute I can't see any
obvious reason why it shouldn't work and it's something that I'd expect
a great proportion of DMA controllers would be able to support.  The
delays are a bit dodgy for that but for untimed delays could probably be
implemented with dummy memcpy() type transfers easily enough.

I'd expect this to integrate nicely with the general stuff to split out
the pushing of transfers and configuration of transfers that I'd been
kicking around - transfer_one() is intended to be split into two for
this, one bit to configure the transfer and another bit to actually do
it.  The original idea was more around recognising and exploiting
scatter/gather patterns but pipelining the DMAs for configuration would
allow that to be extended a bit.

One thing that does concern me is the overhead of all the DMA work for
smaller transfers - once there's a queue built up it's more clearly
useful but without a queue it's not obvious that the effort is going to
be worth it.

> Well - as said I am wondering if the DMA engine can really can help
> - especially if you want to keep the "prepare" code fast. I am using
> dmapool for memory alloc/free, but I fear we would need to run 2 mallocs
> if I was using DMA-engine and more setup/tear-down overhead, as well
> as more code that needs to get executed to go via DMA-engine.

The big win from dmaengine is that it means that you don't need to
implement essentially the same DMA code in each and every driver, all
the DMA drivers end up with a uniform API.  Since by and large the
process of mapping a SPI message for DMA is the same no matter what the
controller is this ends up being a win and we don't require each and
every master driver author to jump through the same hoops.

If it doesn't do what's desired we should fix it to make it better (same
as here).

> As said: I will keep it as is for now and first implement the full DMA
> chaining approach. then we can see how much the move to DMA-Engine
> is costing...

That's fine for experimentation but for an actual merged API I think it
does need to be generally applicable so that it's worth the effort for
client drivers to support it.

> not the prime "candidates" for the improvement in the first place.
> If they want to run with minimal latencies, they HAVE to run spi_async.
> Anything else is not very efficient as it requires several context
> switches by the way of the current design.

Well, the benefits of submitting things asynchronously are as much about
giving the stack something to play with and reducing turnaround time as
anything else - it means the stack can submit the next transfer to the
hardware before it tells the caller that the current one completed which
reduces the idle time on the bus and means that there's much more scope
for reducing needless context switching.  Right now we're not doing as
good a job of that as we can in most drivers though frankly even the
current state is an improvement on what we had.

> >The other bit of easy to implement that concerns me is making sure that
> >it's easy to implement for the stack as a whole - one of the big

> If you can get away from the idea that prepare is ONLY related to
> the Master, then it may provide additional "advantages". Also when
> you do have a non-prepared message, you can still can call the
> prepare code from the framework and afterwards unprepare it - like I
> had proposed.

> That way any "checks" the driver would need to do to see if it can
> handle the message, it can do immediately in the "prepare" code.

I'm having a really hard time getting excited about the improvements
from saving the checks though, they just seem so small for the effort by
themselves.  You are right, they do exist, but they seem so small in
relation to the overall transfer time.  That said if a patch were to
appear which did that and didn't provide any master side hooks I think
that'd probably be OK.

> whatever the driver may find helpful for its own purpose. One
> candidate here could be also implement dma_map_single in the prepare
> call...

Of course this reduces the range of situations where the calling driver
can reuse a prepared message since it means if they want to access the
data they have to unprepare and then reprepare the message which rather
seems to defeat the point of having them worry about doing that.  This
is why I was asking about nailing down the semantics, I was worried about
this.

> So separating things out immediately reducing code of the individual
> drivers if they move to the newer interface of schedule_message...
> if they do not need prepare/unprepare, then it is (almost) a NOP.

Except for all the actually configuring the hardware stuff on
configuration changes if the register writes aren't DMAed (which is
going to be the common case unless it's frameworked).

> >Adding new interfaces is a change too, on the driver side as well.

> Which is optional and would get first used only by device-drivers
> that need high tru-put. And those device-drivers that need this,
> should already be heavily optimized to avoid memory allocations
> (even if it is just on the heap) and message setup for every call.

Equally well is it really worth their while if you don't get much win
without a specially optimised controller driver?  Extra advantages with
a good driver are one thing but requiring it makes it a harder sell to
drivers using the API.

> >These seem rather artificial and I'm not sure they'd tell us much about
> >why the effects were being observed.  What would be more interesting
> >would be seeing where the time is being spent at the minute for real
> >loads (for example by using trace points like those that are currently
> >being added in mainline for examining the behaviour of the message pump)

> well - my use case is exactly that CAN issue where I have message loss
> when only sending 5 packets in sequence in 30% of all cases with
> stock drivers. That is why I started investigating the issue in the
> first place - so do not tell me it is academic!

Sure, there's room for improvement but let's make that as general an
improvement as possible.  Something that flows better for existing
drivers is much more of a win there.  I'm getting alarm bells that the
gains for this one system might be hard to realise on other systems.

> The proposed SPI-test driver is academic to some extent, but it
> still tries to simulate "real" device traffic with optimal thru-put.
> Or do you think that reading an ADC as fast as possible is an
> academic exercise? (the test driver would essentially mimic that and
> allow for
> scheduling mechanism to work that allow high thruput (if you
> configure multiple messages)

I'm suspicious of benchmarks set up so that you end up running so
fast you don't have time to actually do anything with the data, you can
end up tuning things so that more realistic users suffer (spending too
much time in interrupt context for example).

> >- if we really are spending a noticeable proportion of the time in setup
> >then that's obviously a good case for optimising it but right now I
> >don't feel I understand which bits are slow enough to register in the
> >context of the overall transaction and hence if the suggested approach
> >is the best way to optimise them.

> well - you see the setup I have to make to transfer 2 writes 10 reads?
> that is 10+ calls to dmapool_alloc, then filling in the structure.
> That takes some time (cant remember how many usec right now).
> And then tearing it down I need to do the same again - ok mostly
> dmapool_release...

Yes, but then this is because the driver is doing lots of really small
DMAs when most only DMA for larger transfers and then only the data.  I
can see it can be a benefit if you have a big enough DMA queue to be busy
for extended periods but that's not always going to be the case and then
you get into wondering about how much you can keep mapped and so on.

Like I say I can see this being a technique which is useful over a range
of drivers but there's going to be a point where it's simpler to just do
the I/O.

> >An implementation that avoided the need to have the master drivers
> >updated in order to get benefits would be a lot more interesting since
> >that would make this much more generally useful, and of course concrete
> >patches would be good too, but the main thing I'm missing is clarity on
> >how the gains are being made.

> a master driver will never be able to profit from the new interface,
> besides the "small" gain of not having to validate the messages
> every time in __spi_async (if the device-driver does call prepare)

What about all the DMA pipelining stuff for the register I/O and so on?
That seems to be the biggest part of the win here.

> And as said: all the spi_sync methods are candidates to optimization
> of this - at the cost of some memory used for caching multiple
> messages.
> (say having 16 times 1xspi_message+1xspi_transfer structures for
> transfer length 1 to 16 in spi_write. Similar for spi_read. Or 256
> 1xspi_message+2xspi_transfer to transfer 1-16 writes and 1-16 reads.
> plus some memory copy for the reads and writes)

But then that's a memory plus cache management cost (per master), not to
mention complexity.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]


* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
       [not found]                                                                                                                           ` <20131114015009.GB26614-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>
@ 2013-11-14 19:47                                                                                                                             ` Martin Sperl
       [not found]                                                                                                                               ` <9640F4C7-7F82-453E-9D83-5875A1559A20-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Martin Sperl @ 2013-11-14 19:47 UTC (permalink / raw)
  To: Mark Brown; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA

Hi Mark!
> Meh, it doesn't seem that urgently different to be honest.  The basic
> flow of interactions with the device is very much typical of SPI
> controllers.
> 

Well - if it was not for the requirement to call 
spi_finalize_current_message prior to leaving transfer_one_message to get
the next message - I might have continued this route with pipelining.
But hacking my way around those limitations (by taking the next item off the 
list, etc.) seemed a bit absurd, to be honest.

> In any case while it's not entirely clearly explained here I *think* the
> major thing that's being done differently here is that the register
> writes to the controller are being done using DMA.  Is that the case?

yes, that is true

> That's certainly a bit of an innovation and I don't think it's actually
> all that device specific - while reusing the same channel for things
> other than the actual data transfer is a bit cute I can't see any
> obvious reason why it shouldn't work and it's something that I'd expect
> a great proportion of DMA controllers would be able to support.  The
> delays are a bit dodgy for that but for untimed delays could probably be
> implemented with dummy memcpy() type transfers easily enough.
> 
> I'd expect this to integrate nicely with the general stuff to split out
> the pushing of transfers and configuration of transfers that I'd been
> kicking around - transfer_one() is intended to be split into two for
> this, one bit to configure the transfer and another bit to actually do
> it.  The original idea was more around recognising and exploiting
> scatter/gather patterns but pipelining the DMAs for configuration would
> allow that to be extended a bit.
> 
> One thing that does concern me is the overhead of all the DMA work for
> smaller transfers - once there's a queue built up it's more clearly
> useful but without a queue it's not obvious that the effort is going to
> be worth it.

Well - if you think this "approach" is also applicable to other devices,
then this is one more argument to allow device drivers to prepare their 
messages (with the is_dma_mapped requirement to keep cost low). 
Note that if you do that and want to leverage "prepare_message",
then I would recommend you add an additional flags field to these calls.
That way you may pass some hints to the engine (like whether it can
use GFP_KERNEL or GFP_ATOMIC during allocations). An explicit prepare 
could get done with GFP_KERNEL, while a "prepare" during spi_async
would need to be done via GFP_ATOMIC (see the sketch below).
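
Something like this, purely as a sketch (none of this exists today, the
flag name is made up):

#include <linux/spi/spi.h>

#define SPI_PREPARE_MAY_SLEEP	(1 << 0)	/* GFP_KERNEL allowed */

/* explicit prepare from process context: may sleep */
int spi_prepare_message(struct spi_device *spi, struct spi_message *mesg,
			unsigned int flags);

/* an implicit prepare inside spi_async would then be called with
 * flags == 0 and the bus driver has to fall back to GFP_ATOMIC
 * allocations
 */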

>> Well - as said I am wondering if the DMA engine can really can help
>> - especially if you want to keep the "prepare" code fast. I am using
>> dmapool for memory alloc/free, but I fear we would need to run 2 mallocs
>> if I was using DMA-engine and more setup/tear-down overhead, as well
>> as more code that needs to get executed to go via DMA-engine.
> 
> The big win from dmaengine is that it means that you don't need to
> implement essentially the same DMA code in each and every driver, all
> the DMA drivers end up with a uniform API.  Since by and by large the
> process of mapping a SPI message for DMA is the same no matter what the
> controller is this ends up being a win and we don't require each and
> every master driver author to jump through the same hoops.
> If it doesn't do what's desired we should fix it to make it better (same
> as here).
Well, the sequencing is the trick for the most part, but I also keep extra 
data adjacent to my real control-blocks to avoid doing double allocations
and to keep data close together. For my "usage" I even have a method
where I can put 8 bytes for configs in this structure as well - which
allows me to avoid another set of memory allocations just for that...

In principle it should be possible to generalize that.

My basic "internal" prepare function looks like this:
static struct bcm2835dma_dma_cb *bcm2835dma_add_cb(
	struct bcm2835dma_spi *bs, /* contains dma pool parameters and 
					dma channels */
	struct bcm2835dma_spi_dma *dma, /* a descriptor to the DMA 
					channel */
	struct list_head *list, /* list where to chain the cbs to */
	struct spi_message * msg, /* the message we handle */
	unsigned long info, /* DMA flags for the engine */
	dma_addr_t src, /* the source DMA address, if -1 take from struct*/
	dma_addr_t dst, /* the dest. DMA address, if -1 take from struct*/
	unsigned long length, /* length of dma */
	unsigned long stride, /* stride - not really used as it is buggy*/
	u8 link_to_last /* if we should link this cb to the last on the
				same dma channel*/
);

This will allocate from pool, then fill in the structure
linking it to the last in the chain of the same channel
and return it.

So this might be pretty generic already - and putting dmaengine in there
might be a possibility...
> That's fine for experimentation but for an actual merged API I think it
> does need to be generally applicable so that it's worth the effort for
> client drivers to support it.
First of all it is a bus-driver that should be working.
(Also, to avoid the expected "short transfer issue" for some use-cases, I 
would propose to run it in parallel to the existing version.)

Then I believe we can start discussing any approach to refactor things
after we get that far and we see the performance improvements and 
drawbacks for each of those steps.


>> not the prime "candidates" for the improvement in the first place.
>> If they want to run with minimal latencies, they HAVE to run spi_async.
>> Anything else is not very efficient as it requires several context
>> switches by the way of the current design.
> 
> Well, the benefits of submitting things asynchronously are as much about
> giving the stack something to play with and reducing turnaround time as
> anything else - it means the stack can submit the next transfer to the
> hardware before it tells the caller that the current one completed which
> reduces the idle time on the bus and means that there's much more scope
> for reducing needless context switching.  Right now we're not doing as
> good a job of that as we can in most drivers though frankly even the
> current state is an improvement on what we had.

Well, actually if you do it right - like I did in my version of the
mcp2515 driver - I am scheduling 2 spi_messages back to back,
and only the first has a callback.

The first reads the status registers on which to base decisions what
to do and also pulls the interrupt high.

While the second message does a speculative transfer of the 
first inbound message buffer and reads some error counters.

So ideally while the callback of the 1st message is running the 2nd
message is running in HW.

And this callback will schedule acknowledgement of the read rx buffer
as well as schedule the read of the second hw buffer if needed and 
finally reenabling interrupts.
This 3rd message also contains a callback to push the previously read 
HW buffers to the network stack and create error frames if necessary...

So it is in itself geared towards minimizing interrupt "latencies".
If everything works as fast as possible (interrupts,...) the whole
set of messages would look like a giant chain without gaps on the 
SPI bus totally reducing the time from start to finish.

And if everything is done in interrupt context, then one such 
interrupt from the device would trigger 2 additional interrupts 
(one to trigger the 1st and the other to trigger the 3rd message)
and be done with it in a total of 3 interrupts and 0 context switches.

And if you have all the messages "prepared" then the spi_async call
is a cheap thing as well (see the sketch below):
*  message->spi = spi;
*  schedule the message in DMA by just chaining it to a currently running 
     chain.
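
In (pseudo-)code that scheduling step could look roughly like this - a
sketch only, with made-up names for the driver fields and registers, and
ignoring the race against the channel going idle while we append:

static void schedule_prepared_message(struct bcm2835dma_spi *bs,
				      struct spi_message *msg)
{
	/* head/tail of the control-block chain built during prepare,
	 * stashed e.g. in msg->state by the prepare_message callback
	 */
	struct bcm2835dma_dma_cb *head = msg->state;
	struct bcm2835dma_dma_cb *tail = head->chain_tail;	/* made up */
	unsigned long flags;

	spin_lock_irqsave(&bs->chain_lock, flags);
	if (bs->chain_tail) {
		/* DMA still running: just append to the running chain */
		bs->chain_tail->next = head->bus_addr;
	} else {
		/* channel idle: point it at the new chain and start it */
		writel(head->bus_addr, bs->dma_base + MY_DMA_CONBLK_AD);
		writel(MY_DMA_CS_ACTIVE, bs->dma_base + MY_DMA_CS);
	}
	bs->chain_tail = tail;
	spin_unlock_irqrestore(&bs->chain_lock, flags);
}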

If I would estimate such an ideal situation, then the whole transfer
including callbacks (maybe except for the last) would be handled in:
* 18us from interrupt to first spi_async end scheduling DMA
* 17us for the first message (1W1R+2W3R+3W)
* 29us for the second message (2W2R+2W/13R+1W1R)
* 25us for the third message (1W+1W/13R+3W)

in total that should be finished in 89us.

Again - the final callback is not included - SPI Bus-speed: 8MHz,
these timings have been taken from actual samples measured and just 
added up.

Also note that this example assumes both Message buffer being full
otherwise you can remove the 1W/13R from the 3rd message and thus 
reduce the time there by 17us.

By comparison the shortest possible CAN message (@500KHz, standard ID
and 0 data bytes) is between 94us and 111us (depending on the ID bit-
pattern)

This is obviously still theoretical and not proven.
But to get all those gains to work, the bus driver has to do DMA
streaming, the API has to implement prepare_message and the 
device-driver also needs to be very "efficient" in its scheduling.

And that is the tricky part - the device driver needs to be designed
to have good scheduling to reduce its latencies.

So if you get that far into the game of optimizing your driver, then
* you will prepare your messages in the init code
* dma mapping the transfers becomes easy, because you can easily 
  allocate the data you need for your transfers from DMA
* and then adding the optional "call" to spi_prepare_message is the
  simplest exercise.

For any driver that is _not_ optimized for latencies, but where one 
needs low latency, a rewrite is required anyway.

Also, just to put the prepare_message API into perspective with other
APIs that are rarely used:
extern int spi_bus_lock(struct spi_master *master);
extern int spi_bus_unlock(struct spi_master *master);

There we have 2 users of that interface in the kernel:
drivers/iio/adc/ad_sigma_delta.c
drivers/mmc/host/mmc_spi.c

So we have interfaces for rare uses - why not have another one
that is equally light-weight on the hot path of spi_async?
And which gives a slight benefit to all drivers by avoiding the 
checks every time...
> I'm having a really hard time getting excited about the improvements
> from saving the checks though, they just seem so small for the effort by
> themselves.  You are right, they do exist, but they seem so small in
> relation to the overall transfer time.  That said if a patch were to
> appear which did that and didn't provide any master side hooks I think
> that'd probably be OK.
> 
...
> 
> Sure, there's room for improvement but let's make that as general an
> improvement as possible.  Something that flows better for existing
> drivers is much more of a win there.  I'm getting alarm bells that the
> gains for this one system might be hard to realise on other systems.
> 

Unfortunately 1+us is not that small for a low-end device - it may be small
for a multi-core system, but then you just hide it behind more CPU...
And as said - if you run 10000 spi messages/s that is >10ms of time spent
just doing the API and that is >1% system load on a single core.

I believe I should finish my POC to get fully up to my expectations
and then I can show you the advantage of having this "minimal" change,
which has also an overall positive impact (even though slight).


>> The proposed SPI-test driver is academic to some extent, but it
>> still tries to simulate "real" device traffic with optimal thru-put.
>> Or do you think that reading an ADC as fast as possible is an
>> academic exercise? (the test driver would essentially mimic that and
>> allow for
>> scheduling mechanism to work that allow high thruput (if you
>> configure multiple messages)
> 
> I'm suspicous of benchmarks which limit so that you end up running so
> fast you don't have time to actually do anything with the data, you can
> end up tuning things so that more realistic users suffer (spending too
> much time in interrupt context for example).
Well - maybe we should make this a separate thread...

On my proposed list would be something like:
* the generic version I had talked about
* the mcp2515 CAN-driver that I have described above
* the enc28j60 network driver
* the spi-mmc SD card driver

For all of them we can provide well-defined performance patterns
(reads/writes per second or packets/s) plus some kind of ADC testing


>> And as said: all the spi_sync methods are candidates to optimization
>> of this - at the cost of some memory used for caching multiple
>> messages.
>> (say having 16 times 1xspi_message+1xspi_transfer structures for
>> transfer length 1 to 16 in spi_write. Similar for spi_read. Or 256
>> 1xspi_message+2xspi_transfer to transfer 1-16 writes and 1-16 reads.
>> plus some memory copy for the reads and writes)
> 
> But then that's a memory plus cache management cost (per master), not to
> mention complexity.

Well, that is something we would need to look into - but it should not be
_that_ complicated: the spi_master could allocate those via a framework
call or a bit in the spi_master structure - say, if it supports prepared
messages, set the cache up when registering the device and remove it on
unregister... That is very generic and not a lot more overhead than what
spi_sync_write/read does right now.
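
To make that a bit more concrete, a rough sketch of such a per-master
cache (all names here are invented for illustration, nothing like this
exists in the framework today; the cache is assumed to be
zero-allocated with kzalloc when the master registers):

#include <linux/spi/spi.h>

/* invented structure: one pre-validated message per transfer length,
 * so a small spi_write() only needs a memcpy and a queue operation */
struct spi_small_write_cache {
        struct spi_message      msg[16];
        struct spi_transfer     xfer[16];
        u8                      buf[16][16];
};

static void spi_small_write_cache_init(struct spi_small_write_cache *c)
{
        int i;

        for (i = 0; i < 16; i++) {
                spi_message_init(&c->msg[i]);
                c->xfer[i].tx_buf = c->buf[i];
                c->xfer[i].len = i + 1;
                spi_message_add_tail(&c->xfer[i], &c->msg[i]);
        }
}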

But let us first show how far we can go with this first bus-driver that
does everything in DMA, and then we can compare results with and without
the "optimize" step...

Ciao,
		Martin


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
       [not found]                                                                                                                               ` <9640F4C7-7F82-453E-9D83-5875A1559A20-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>
@ 2013-11-15 11:15                                                                                                                                 ` Martin Sperl
       [not found]                                                                                                                                   ` <5286026B.2090903-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>
  2013-11-15 13:33                                                                                                                                 ` Mark Brown
  1 sibling, 1 reply; 57+ messages in thread
From: Martin Sperl @ 2013-11-15 11:15 UTC (permalink / raw)
  To: Mark Brown; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA

Hi Mark!

I thought a bit about what you had said about DMA and building-blocks,
and I have come up with a possible vision of the "future".

You essentially gave me the idea that we could reduce restrictions
on the proposed spi_prepare_message by adding some flags describing in
what way the structure may change. So if you have a "fully" fixed
structure, then we could add "flags" here to indicate whether the
structure is also fixed with regards to (one possible encoding is
sketched below):
* the length of the transfers
* the pointers to the transfers
* providing DMA data or not (this could get indicated via is_dma_mapped)
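
A possible encoding of such flags (purely hypothetical - none of these
exist in the kernel, they only illustrate the idea):

/* hypothetical flag bits for the proposed prepare call */
#define SPI_PREPARE_FIXED_LENGTH   (1 << 0) /* transfer lengths will not change */
#define SPI_PREPARE_FIXED_BUFFERS  (1 << 1) /* tx_buf/rx_buf pointers will not change */
#define SPI_PREPARE_DMA_MAPPED     (1 << 2) /* caller supplies DMA addresses (is_dma_mapped) */

int spi_prepare_message(struct spi_device *spi, struct spi_message *msg,
                        unsigned long flags);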

Then the bus-driver (or framework) could start to use different
construction schemes to build the DMA chain based on different flag
variations and thus optimize the time it takes to build the DMA chain
(which is currently the worst case in my driver).

This would definitely "improve" the performance of those sync calls
(if we make a prepare in the sync call somehow for the hot-path)

Also the driver can start to use (pre-built/cached) "building-blocks"
that make up a lot of the DMA control blocks that are needed
(configuring the SPI controller and such).

For more complex drivers we may also defer the DMA-preparation (and
scheduling thereof) of messages to a worker thread which would produce
DMA chains to schedule.

But if the message is prepared it would get scheduled from spi_async
context immediately, unless there is something in the prepare queue, in
which case it would have to wait so that no reordering occurs.

This comes with a potential "trap": We would need to define if a
"reorder" of SPI messages between two different SPI Bus devices with
different CS is a valid operation or not.
(the question is mostly: would this "queuing" happen on a per bus-driver
basis or on a per CS basis)

The timing sequence would be like this (for scheduling spi_async
messages at the same time):
driver A:
  spi_async(spi0,msg0_unprepared); /* getting put into the "queue" */
driver B:
  spi_async(spi1,msg1_prepared); /* getting scheduled immediately */

On the SPI Bus you would see the following:
  msg1_prepared;
  msg0_unprepared;

Here we have the requirements of a "definition" of scheduling policies.

As far as I can tell (and I may be wrong) the issue described would only
matter if a single device-driver required two or more different CS to
make a device work properly.

So I would say that "reordering" is essentially allowed if we talk about
two different drivers talking to different devices (or multiple
instances of a single driver talking to distinct devices of the same type).

Note that this could show some strange "concurrency behavior" in some
drivers due to implicit assumptions which are not defined until now and
which would never show up on the current setup.

That is unless you find a board with 2 SPI buses and you connect an
identical device on each SPI Bus. But if one looks at this from this
direction, then such an assumption in the driver would be a bug in the
device-driver that would need to get fixed in the driver itself.

But if a driver exists for a device that requires two (or more) chip
selects with the additional "requirement" of "proper" ordering, then
either:
* the driver needs to make both messages "prepared",
* we provide an extra flag to the "prepare" to force "global ordering"
  on the SPI bus, or
* we make it a requirement that - in such cases - the spi_bus_lock
  interface is used to block the whole bus.

But the first question is: are there any drivers currently that require
multiple CS to do their work?

All of these ideas will also help us in the future and may then get
used for other bus-drivers as well - we could then even start using a
"DMA-sequencing" engine inside the framework to reduce the code in the
driver (how to define those would be tricky though - we would need to
see how "flexible" other devices are compared to the bcm2835).

But first we need to get one working driver with "optimal" performance
so that we can confirm the approach works in principle, then continue to
see where optimizations can be made, and finally refactor this into a
"generic" solution sharing code in the SPI framework - it may eventually
move into dmaengine...

Another thing that came to my mind is a means to give spi_async a hint
whether it runs in interrupt context or not.

This would make it possible for spi_async (and the transfer code) to
allocate memory with GFP_KERNEL instead of GFP_ATOMIC in the
non-interrupt cases.

We could make this a flag in the spi_message structure or use different
functions: spi_async_threaded and spi_async_interrupt, while having
spi_async map to spi_async_interrupt for compatibility with older
drivers.

Both are valid approaches - I believe it is a matter of taste which one
gets used...
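
As a sketch of the two-function variant (hypothetical - neither
function exists today, this only illustrates the compatibility
mapping):

/* hypothetical interface sketch */
int spi_async_interrupt(struct spi_device *spi,
                        struct spi_message *msg); /* GFP_ATOMIC, IRQ-safe */
int spi_async_threaded(struct spi_device *spi,
                       struct spi_message *msg);  /* may sleep, GFP_KERNEL */

/* older drivers keep working: spi_async() keeps its current
 * (atomic-safe) semantics by mapping to the interrupt variant */
static inline int spi_async(struct spi_device *spi, struct spi_message *msg)
{
        return spi_async_interrupt(spi, msg);
}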

Another thing that you may have mentioned could be a message type like
"read an (8|16)-bit length, then read X bytes": you first read 1 or 2
bytes and then read X bytes, where X comes from those first one or two
bytes. The problem is that such a "scheme" would not (necessarily) work
with a fully DMA-pipelined approach - especially as endianness might
differ (one could possibly work around this with 1-byte DMA transfers,
if the DMA engine supports them). For some HW-devices the length might
also be mixed with other control bits, which would require some
"and+shift" logic that is impossible to implement in DMA for most
DMA-engines (the mcp2515 would be such a case). The question here is
really: how many such devices exist, and are there enough to make it
worth doing this in a generic way?

Finally I have a proposal to avoid the "naming" conflict as we have seen
in our discussions so far:

why not call the new interface:
int spi_(pre)build_message(struct spi_device*,struct spi_message,
	unsigned long flags);
int spi_unbuild_message(struct spi_device*,struct spi_message);

with the corresponding equivalent methods in spi_master?

Other naming ideas are welcome (I just would not say DMA, as it might
get used for other things as well)

So I think all the above ideas give us a lot to ponder how/where we may
want to move in the future...

In the meantime I will try to move the SPI bus-driver to the transfer
interface, and I will probably add a module parameter to allow selecting
which interface is used.

This way we then have many more scenarios to compare, to see how they
fare in different situations/use-cases/...

But in the end I would guess that going with the "direct" DMA chaining
approach would give the best results...

Martin

P.s: maybe we should start renaming the subject of this thread?

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
       [not found]                                                                                                                               ` <9640F4C7-7F82-453E-9D83-5875A1559A20-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>
  2013-11-15 11:15                                                                                                                                 ` Martin Sperl
@ 2013-11-15 13:33                                                                                                                                 ` Mark Brown
       [not found]                                                                                                                                   ` <20131115133312.GE26614-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>
  1 sibling, 1 reply; 57+ messages in thread
From: Mark Brown @ 2013-11-15 13:33 UTC (permalink / raw)
  To: Martin Sperl; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA

[-- Attachment #1: Type: text/plain, Size: 4102 bytes --]

On Thu, Nov 14, 2013 at 08:47:35PM +0100, Martin Sperl wrote:
> > Meh, it doesn't seem that urgently different to be honest.  The basic
> > flow of interactions with the device is very much typical of SPI
> > controllers.

> Well - if it was not for the requirement to call 
> spi_finalize_current_message prior to leaving transfer_one message to get
> the next one - I might have continued this route with pipelining.
> But hacking my way around those limitations (by taking the next item of the 
> list, etc.) seemed a bit absurd to be honest.

The thing to do there is extend the interface, you're treating things
like you're working on a proprietary OS where the APIs are fixed.

> Well - if you think this "approach" is also applicable to other devices,
> then this is one more argument to allow device drivers to prepare its 
> messages (with the is_dma_mapped requirement to keep cost low). 

No, this isn't something which should be being open coded in individual
drivers - it's far too much redundant work especially once you get into
doing the tradeoffs for no queue vs queue cases.  Using regmap MMIO may
be helpful but we need to do that in a way that allows us to integrate
it into the DMA pipeline...

This is part of what I keep saying about doing things in a way that
makes them generally applicable.

> Note that if you do that and want to leverage the "prepare_message",
> then I would recommend you add an additional flag field to these calls.
> Because then you may pass some hints to the engine (like if it can
> use GFP_KERNEL or GFP_ATOMIC during allocations). And prepare 
> could get done with GFP_KERNEL, while a "prepare" during spi_async
> would need to be done via GFP_ATOMIC.

Or in the async case we just punt the prepare to the thread.

> > That's fine for experimentation but for an actual merged API I think it
> > does need to be generally applicable so that it's worth the effort for
> > client drivers to support it.

> First of all it is a bus-driver that should be working.
> (also to avoid the expected "short transfer issue" for some use-cases I 
> would propose to run it in parallel to the existing version)

> Then I believe we can start discussing any approach to refactor things
> after we get that far and we see the performance improvements and 
> drawbacks for each of those steps.

It'll definitely need some experimentation and tuning.

> Well, actually if you do it right - like I did in my version of the
> mcp2515 driver, I am actually scheduling 2 spi_messages back to back.
> and only the first has a callback. 

> The first reads the status registers on which to base decisions what
> to do and also pulls the interrupt high.

> While the second message does a speculative transfer of the 
> first inbound message buffer and reads some error counters.

> So ideally while the callback of the 1st message is running the 2nd
> message is running in HW.

Yes, this is exactly the sort of thing I'm talking about - it's cutting
out dead time on the bus.
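
The pattern being described would look roughly like the fragment below
in a client driver (a sketch only; the mydev_* names are invented and
both messages are assumed to have been set up at probe time - only the
kernel calls are real):

/* sketch of the "two messages back to back" pattern described above */
static irqreturn_t mydev_irq(int irq, void *dev_id)
{
        struct mydev_priv *priv = dev_id;

        /* message 1: read the status registers and de-assert the
         * interrupt line; only this message carries a callback */
        priv->msg_status.complete = mydev_status_done;
        priv->msg_status.context  = priv;
        spi_async(priv->spi, &priv->msg_status);

        /* message 2: speculative read of the first RX buffer and the
         * error counters - it runs on the bus while mydev_status_done()
         * runs on the CPU, so there is no dead time between transfers */
        spi_async(priv->spi, &priv->msg_rx);

        return IRQ_HANDLED;
}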

> For any driver that is _not_ optimized for latencies, a rewrite is
> required if one needs low latency.

Right, and the big concern with the way you're currently proposing to
implement the optimisation is that it's leaving essentially the whole
thing up to the individual driver.  This is going to result in limited
adoption and is going to mean there's no sharing of learning and fixes
between drivers.

> I believe I should finish my POC to get fully up to my expectations
> and then I can show you the advantage of having this "minimal" change,
> which has also an overall positive impact (even though slight).

To repeat yet again, something that is just a tunnel through to this one
heavily optimised master driver isn't good.  The problem with the
proposed implementation is that it is essentially just tunnelling
through to a heavily optimised master, it's not generic enough.  If
other devices can make use of it it's more interesting but then those
uses ought to be being factored out.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
       [not found]                                                                                                                                   ` <20131115133312.GE26614-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>
@ 2013-11-15 14:52                                                                                                                                     ` Martin Sperl
       [not found]                                                                                                                                       ` <0BA2243C-2F22-492A-B517-76E243535549-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Martin Sperl @ 2013-11-15 14:52 UTC (permalink / raw)
  To: Mark Brown; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA


On 15.11.2013, at 14:33, Mark Brown wrote:
> No, this isn't something which should be being open coded in individual
> drivers - it's far too much redundant work especially once you get into
> doing the tradeoffs for no queue vs queue cases.  Using regmap MMIO may
> be helpful but we need to do that in a way that allows us to integrate
> it into the DMA pipeline...
> 
> This is part of what I keep saying about doing things in a way that
> makes them generally applicable.

Well - we first need to see what the real "common denominator" is among
boards that can use DMA to configure SPI itself. So we would need at
least 2 or 3 to figure that out (and also all the bugs we may expect to
see, and find workarounds for those).

but in principle it should be possible to factor processing out like
this:
  main transfer loop:
  while (!transfer_last) {
      prepareSPIRegisters()
      prepareAdditionalTXDMA()
      do {
          append RX/TX transfers
      } while (do_inner)
      append wait_for_clock
      append delay_us
      if (cs_change) {
          append CS_UP
          append wait_for_clock
      }
  }
  if (message->callback)
      append DMAInterrupt_and_position

I believe that might be a generic enough approach that it should work
if one would factor out the "appends" plus the do_inner loop exit 
condition into the master structure.
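
One way those "appends" could be factored out - a rough sketch only,
all of the callback names below are invented:

/* invented callback table: a framework-side DMA-chain builder could
 * delegate the hardware-specific "append" steps to the master driver */
struct spi_dma_chain_ops {
        int  (*append_setup)(struct spi_master *master,
                             struct spi_device *spi,
                             struct spi_transfer *xfer); /* prepareSPIRegisters */
        int  (*append_tx_extra)(struct spi_master *master,
                                struct spi_transfer *xfer); /* prepareAdditionalTXDMA */
        int  (*append_transfer)(struct spi_master *master,
                                struct spi_transfer *xfer); /* RX/TX data blocks */
        bool (*do_inner)(struct spi_master *master,
                         struct spi_transfer *xfer); /* inner-loop exit condition */
        int  (*append_wait_for_clock)(struct spi_master *master);
        int  (*append_delay_us)(struct spi_master *master, u16 usecs);
        int  (*append_cs_up)(struct spi_master *master);
        int  (*append_complete_irq)(struct spi_master *master,
                                    struct spi_message *msg); /* DMAInterrupt_and_position */
};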

But my guess is that you would not even allow something as generic as
this to get merged unless there are multiple drivers using that
interface.

So we would be in a chicken/egg situation from this direction as well...

> 
>> Note that if you do that and want to leverage the "prepare_message",
>> then I would recommend you add an additional flag field to these calls.
>> Because then you may pass some hints to the engine (like if it can
>> use GFP_KERNEL or GFP_ATOMIC during allocations). And prepare 
>> could get done with GFP_KERNEL, while a "prepare" during spi_async
>> would need to be done via GFP_ATOMIC.
> 
> Or in the async case we just punt the prepare to the thread.
Again: that means an unnecessary delay - especially when you consider
that running the message_pump without real-time priority introduces an
additional 10us of delay...

> Yes, this is exactly the sort of thing I'm talking about - it's cutting
> out dead time on the bus.

> Right, and the big concern with the way you're currently proposing to
> implement the optimisation is that it's leaving essentially the whole
> thing up to the individual driver.  This is going to result in limited
> adoption and is going to mean there's no sharing of learning and fixes
> between drivers.

Well, any driver that is using spi_sync cannot be improved without a
rewrite anyway - the context switching becomes prohibitive.

So what some drivers seem to do is create a separate thread, woken from
their interrupt handler, which then does the actual interrupt handling
and uses spi_sync communication from that thread.
That means at least 2 context switches: one from the SPI interrupt
(transfer finished) to wake the message pump, which then calls complete
and wakes up the real thread.
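
As an illustration of that pattern (the mydev_* names are invented;
the kernel calls are real):

/* hard IRQ only wakes a work item - context switch #1 */
static irqreturn_t mydev_hard_irq(int irq, void *dev_id)
{
        struct mydev_priv *priv = dev_id;

        queue_work(priv->wq, &priv->work);
        return IRQ_HANDLED;
}

/* the work item does the real handling with spi_sync(): the message is
 * handed to the message pump and the work item sleeps until complete()
 * is called - context switch #2 (and back again) */
static void mydev_work(struct work_struct *work)
{
        struct mydev_priv *priv = container_of(work, struct mydev_priv, work);

        spi_sync(priv->spi, &priv->status_msg);
        mydev_handle_status(priv);
}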

Whatever you do on the framework side you cannot improve the performance
of such drivers without touching the driver itself and converting them
to something else.

It is a whole other story if the driver is already doing async
scheduling of its messages. Then the spi_async path streamlines the
"next" message to some extent and does not let the message_pump go to
sleep, which gives it better response times.
But you still waste time between when the interrupt handler runs and
when the message_pump wakes up - especially as the callback is allowed
to run from interrupt context.

As for learning & fixing:
Any driver will go unfixed until someone finds a bug or a performance 
issue and spends time on improving it.

What about writing some "documentation" to show how a driver can be
improved to get higher throughput?

And then you can also make the framework do more work for the driver
and thus avoid replication of code. But then you would again need to
create a new API just for that, covering all those cases - maybe by
using a code generator, like the Samba team did for the SMB messages.

That is a huge separate project with lots of R&D required, which is
way outside the scope of writing a DMA-pipelined bus-driver and
proposing things that may improve the performance of optimized drivers.

> To repeat yet again, something that is just a tunnel through to this one
> heavily optimised master driver isn't good.  The problem with the
> proposed implementation is that it is essentially just tunnelling
> through to a heavily optimised master, it's not generic enough.  If
> other devices can make use of it it's more interesting but then those
> uses ought to be being factored out.
Well - as I understand this, it means:
if there were more bus-drivers that could drive the whole SPI transfer
of a message via DMA, then it would become acceptable to get that API in
place; until then, with only a single bus-driver and device-driver, it
is not acceptable.

So please define: what would be your "critical mass" at which you would
accept that an API extension is worth the effort to get integrated?
(Besides a metric that shows how much it improves the situation, which
would also need to get defined - we do not want performance regressions
either.)

Martin

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
       [not found]                                                                                                                                       ` <0BA2243C-2F22-492A-B517-76E243535549-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>
@ 2013-11-16 12:56                                                                                                                                         ` Mark Brown
  0 siblings, 0 replies; 57+ messages in thread
From: Mark Brown @ 2013-11-16 12:56 UTC (permalink / raw)
  To: Martin Sperl; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA

[-- Attachment #1: Type: text/plain, Size: 3081 bytes --]

On Fri, Nov 15, 2013 at 03:52:42PM +0100, Martin Sperl wrote:

> Well - we first need to see what the real "common denominator" is among
> boards that can use DMA to configure SPI itself. So we would need at
> least 2 or 3 to figure that out (and also all the bugs we may expect to
> see, and find workarounds for those)

There's a bunch of fairly clear patterns for drivers here - the main
constraints are going to be having to punt to task context for sleeping
operations and DMA controllers that just can't have transfers to or from
the register space mixed in with the command stream (all of which fits
into the general patterns we've got already).

> But my guess is that you would not even allow to get something generic as
> this merged unless there are multiple drivers that would be using that
> interface.

> So we would be in a chicken/egg from this direction as well...

We're already going in this direction, that's why I'm going through
splitting up the operations that actually talk to the hardware from the
bits that implement generic logic and queue driving.  The major blocker
previously was that I didn't know what you were proposing to do in your
driver.

> > Or in the async case we just punt the prepare to the thread.

> again: that means an uneccessary delay
> especially when you think about running the message_pump without
> real-time priority introduces 10us additional delay...

For the cases you're concerned about this shouldn't be an issue, the
thread should be behind the I/O or the client driver will have prepared.
I'd also expect to be able to fall back to PIO (which I suspect we'll
want to keep as an option for idle cases anyway).

> > Right, and the big concern with the way you're currently proposing to
> > implement the optimisation is that it's leaving essentially the whole
> > thing up to the individual driver.  This is going to result in limited
> > adoption and is going to mean there's no sharing of learning and fixes
> > between drivers.

> Well, any driver that is using spi_sync can not get improved without a 
> rewrite anyway - the context switching becomes prohibitive.

No, I'm talking about requiring things to be done on the master side not
the client side.  That's key for making it worthwhile for client drivers
to use things like preparing messages and makes it much easier to do the
adaptions required in the master drivers.

> Well - as I understand this, it means:
> if there were more bus-drivers that could drive the whole SPI transfer
> of a message via DMA, then it would become acceptable to get that API in
> place; until then, with only a single bus-driver and device-driver, it
> is not acceptable.

No, like I said in an earlier e-mail now that I understand how it can be
generally useful rather than just being a driver specific thing I'm OK
with adding the user visible prepare API (but not the calls into the
master driver).  It's the master driver interface that still needs work
so the logic isn't open coded in individual drivers.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
       [not found]                                                                                                                                   ` <5286026B.2090903-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>
@ 2013-11-16 14:23                                                                                                                                     ` Mark Brown
       [not found]                                                                                                                                       ` <20131116142356.GY15393-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Mark Brown @ 2013-11-16 14:23 UTC (permalink / raw)
  To: Martin Sperl; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA

[-- Attachment #1: Type: text/plain, Size: 2235 bytes --]

On Fri, Nov 15, 2013 at 12:15:55PM +0100, Martin Sperl wrote:

> You essentially gave me the idea that we could reduce restrictions
> on the proposed spi_prepare_message by adding some flags describing in
> what way the structure may change. So if you have a "fully" fixed
> structure, then we could add "flags" here to indicate whether the
> structure is also fixed with regards to:

Adding flags seems like a good idea, probably easiest to start with
everything is fixed and then add the ability to relax things later.

> This comes with a potential "trap": We would need to define if a
> "reorder" of SPI messages between two different SPI Bus devices with
> different CS is a valid operation or not.
> (the question is mostly: would this "queuing" happen on a per bus-driver
> basis or on a per CS basis)

I'd expect us to continue to do things per bus, at least for the time
being.  This is definitely a stage two consideration though, getting
something working comes first.

> But the first question is: are there any drivers currently that require
> multiple CS to do its work?

No.

> All of these ideas will also help us in the future and may then get
> used for other bus-drivers as well - we could then even start using a
> "DMA-sequencing" engine inside the framework to reduce the code in the
> driver (how to define those would be tricky though - we would need to
> see how "flexible" other devices are compared to the bcm2835).

The only thing that you're talking about that sounds unusual is the
ability to add timed delays in the hardware, otherwise this all sounds
very standard assuming DMA controllers will play nice with mixing in
writes to registers (which I expect to be common).

> We could make this a flag in the spi_message structure or use different
> functions: spi_async_threaded and spi_async_interrupt while having the
> spi_async map to spi_async_interrupt for compatibility reasons with
> older drivers.

Separate functions would be a bit more idiomatic.

> why not call the new interface:
> int spi_(pre)build_message(struct spi_device*,struct spi_message,
> 	unsigned long flags);
> int spi_unbuild_message(struct spi_device*,struct spi_message);

Seems fine.  Or init/done.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
       [not found]                                                                                                                                       ` <20131116142356.GY15393-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>
@ 2013-11-19 13:11                                                                                                                                         ` Martin Sperl
       [not found]                                                                                                                                           ` <528B6370.9000903-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Martin Sperl @ 2013-11-19 13:11 UTC (permalink / raw)
  To: Mark Brown; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA

Hi Mark!

>> why not call the new interface:
>> int spi_(pre)build_message(struct spi_device*,struct spi_message,
>> 	unsigned long flags);
>> int spi_unbuild_message(struct spi_device*,struct spi_message);
> 
> Seems fine.  Or init/done.
> 

So I will call it:
int spi_init_message(struct spi_device*,struct spi_message,
 	unsigned long flags);
int spi_done_message(struct spi_device*,struct spi_message);

(or you want to call it spi_message_init/done instead?)

I will post a patch for this soon.

Martin

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
       [not found]                                                                                                                                           ` <528B6370.9000903-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>
@ 2013-11-19 15:02                                                                                                                                             ` Mark Brown
       [not found]                                                                                                                                               ` <20131119150204.GA2674-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Mark Brown @ 2013-11-19 15:02 UTC (permalink / raw)
  To: Martin Sperl; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA

[-- Attachment #1: Type: text/plain, Size: 429 bytes --]

On Tue, Nov 19, 2013 at 02:11:12PM +0100, Martin Sperl wrote:

> So I will call it:
> int spi_init_message(struct spi_device*,struct spi_message,
>  	unsigned long flags);
> int spi_done_message(struct spi_device*,struct spi_message);

> (or you want to call it spi_message_init/done instead?)

The latter might be better for consistency with spi_message_add_tail()
and similar.

> I will post a patch for this soon.

Excellent.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
       [not found]                                                                                                                                               ` <20131119150204.GA2674-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>
@ 2013-11-19 15:13                                                                                                                                                 ` Martin Sperl
  0 siblings, 0 replies; 57+ messages in thread
From: Martin Sperl @ 2013-11-19 15:13 UTC (permalink / raw)
  To: Mark Brown; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA

Actually I found that there is some ambiguity with spi_message_init
- it exists and is used by every spi-device-driver...

So I have started to think: why not call it
spi_message_(un)compile
or
spi_message_(un)optimize,

to make it totally clear that this is an optimization?

So for now I have used the first.

I will give measurements for the timing "differences" when using the
stock spi-bcm2835 driver for my use-case with reduced traffic (moving to
only 125kHz CAN bus speed), to show there is no regression.

Martin

On 19.11.2013, at 16:02, Mark Brown wrote:

> On Tue, Nov 19, 2013 at 02:11:12PM +0100, Martin Sperl wrote:
> 
>> So I will call it:
>> int spi_init_message(struct spi_device*,struct spi_message,
>> 	unsigned long flags);
>> int spi_done_message(struct spi_device*,struct spi_message);
> 
>> (or you want to call it spi_message_init/done instead?)
> 
> The latter might be better for consistency with spi_message_add_tail()
> and similar.
> 
>> I will post a patch for this soon.
> 
> Excellent.


^ permalink raw reply	[flat|nested] 57+ messages in thread
