* Fwd: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
@ 2013-10-29 16:59 ` Linus Walleij

From: Linus Walleij @ 2013-10-29 16:59 UTC (permalink / raw)
To: linux-spi-u79uwXL29TY76Z2rM5mHXA, Mark Brown, Martin Sperl

Let us discuss this on the new mailing list, BTW.

On Mon, Oct 28, 2013 at 11:42 AM, Martin Sperl
<martin-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> wrote:

> (...) I thought of moving away from spi_transfer_one_message and back to
> the simpler transfer interface, where the preprocessing would get done
> (DMA control block-chain generation) and then appending it to the
> existing (possibly running) DMA chain.

OK, quite a cool idea. But I hope that you have the necessary
infrastructure using the dmaengine subsystem for this, or that the changes
required will be proposed to that first or together with these changes.

As you will be using dmaengine (I guess?) maybe a lot of this can actually
be handled directly in the core, since that code should be pretty generic,
or in a separate file like spi-dmaengine-chain.c?

> But just yesterday I was looking through the code and came to the message:
> "master is unqueued, this is deprecated" (drivers/spi/spi.c line 1167).
> This came in with commit ffbbdd21329f3e15eeca6df2d4bc11c04d9d91c0 and got
> included in 3.4.
>
> So I am wondering why you would deprecate this interface

Simply because none of the in-kernel users was doing what you are trying
to do now, and no one said anything about such future use cases, so how
could we know?

> Now this brings me to a different question:
> Could we implement some additional "functions" for preparing
> an SPI message (...)
> The interface could look something like this:
>
>     int spi_prepare_message(struct spi_device *, struct spi_message *);
>     int spi_unprepare_message(struct spi_device *, struct spi_message *);

Maybe? I cannot tell from the above how this would look, so I think it is
better if you send a patch showing how this improves efficiency.

Yours,
Linus Walleij
--
To unsubscribe from this list: send the line "unsubscribe linux-spi" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: Fwd: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver

From: Mark Brown @ 2013-10-29 17:56 UTC (permalink / raw)
To: Linus Walleij; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA, Martin Sperl

On Tue, Oct 29, 2013 at 09:59:23AM -0700, Linus Walleij wrote:

> Let us discuss this on the new mailing list, BTW.

And with the maintainer... I have zero context for this so I'm not 100%
sure what the original proposal was.

> On Mon, Oct 28, 2013 at 11:42 AM, Martin Sperl
> <martin-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> wrote:
> > (...) I thought of moving away from spi_transfer_one_message and back to
> > the simpler transfer interface, where the preprocessing would get done
> > (DMA control block-chain generation) and then appending it to the
> > existing (possibly running) DMA chain.
> OK quite a cool idea.

How is this going to interact with /CS manipulation? We need /CS to rise
and fall as requested by the client; if we coalesce adjacent messages
that's going to be an issue...

> As you will be using dmaengine (I guess?) maybe a lot of this can
> actually be handled directly in the core since that code should be
> pretty generic, or in a separate file like spi-dmaengine-chain.c?

FWIW this is one of the things I've been aiming to tackle in my ongoing
refactoring. The initial two things were making the core able to do the
DMA mapping (probably adding a transfer API which passes in an sg_list; it
can then handle things like blocks that aren't physically contiguous and
coalesce scatter/gather for the controller) and trying to overlap the
cache flushing of the next transfer with the I/O for the current transfer.
> > Now this brings me to a different question:
> > Could we implement some additional "functions" for preparing
> > an SPI message (...)
> > The interface could look something like this:
> >     int spi_prepare_message(struct spi_device *, struct spi_message *);
> >     int spi_unprepare_message(struct spi_device *, struct spi_message *);

> Maybe? I cannot tell from the above how this would look so
> I think it is better if you send a patch showing how this improves
> efficiency.

The current SPI API has exactly those functions...
* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver

From: Martin Sperl @ 2013-10-29 21:18 UTC (permalink / raw)
To: Linus Walleij; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA, Mark Brown

Hi!

> But I hope that you have the necessary infrastructure using the dmaengine
> subsystem for this, or that the changes required will be proposed to that
> first or together with these changes.
>
> As you will be using dmaengine (I guess?) maybe a lot of this can
> actually be handled directly in the core since that code should be
> pretty generic, or in a separate file like spi-dmaengine-chain.c?

I have to admit I have not been using that infrastructure so far - I am a
bit uncertain how to make it work yet, and I first wanted the prototype
working.

Also, the kernel on the Raspberry Pi (on which I do the development) is
not fully upstream (I know) and some parts are not really implemented. I
know of people who are running only the upstream kernel on an RPI, but
there are other limitations there with regard to some drivers. Also, there
would be a bit of experimentation required to get it into working order -
time I do not want to spend at the moment...

Finally, I am also not sure how dmaengine can work when two DMAs are
required to work in parallel (one for RX, the other for TX). All this
means that I have to schedule the RX DMA from the TX DMA and vice versa -
in addition to configuring the SPI registers via DMA and working around
those HW bugs. Whether the DMA engine would be able to support all of this
I have no idea yet.
But with all that said, I can do a single spi_message with 5 transfers, 2
of which have cs_change set, purely with DMA and without any interrupts -
besides the final interrupt to trigger the wakeup of the spi_message_pump
thread.

So obviously, from there it is not that much more complicated to coalesce
multiple spi_messages into a single running DMA chain - it would just be a
matter of adding the additional transfers to the DMA chain and making sure
the DMA is still running after we have added them.

The complication then comes mostly in the form of memory management
(especially releasing DMA control blocks), locking, ... - which one does
not have to take care of as much with the transfer_one_message interface.

So to put it into perspective: my main goal is to get an efficient CAN
driver for the MCP2515 chip, which sits on the SPI bus of the RPI. Right
now I can receive about 3200 messages per second from the CAN bus (close
to the absolute maximum for messages 8 bytes in length - I could more than
double the packet count by reducing the packet size to 0) with this
driver, plus a version of the mcp2515 driver that uses the async interface
and advanced scheduling of messages to "leverage" concurrency - also
self-written.

With the "stock" spi-bcm2835 driver that is upstream, it uses around 50k
interrupts and a similar number of context switches, and still loses
packets. With the current incarnation of the spi-bcm2835dma driver (using
the transfer_one_message interface) I run at around 16500 interrupts/s and
22000 context switches.

Still, what is biting me the most from the transfer perspective is that
there are still too many interrupts and context switches, which introduces
too much latency unnecessarily.

So with spi-bcm2835dma using the transfer interface, I would estimate that
I could get down to 6400 interrupts and 0 context switches.
All this should also have a positive impact on CPU utilization - no longer
80% system load due to scheduling/DMA overhead...

As for prepare_spi_message: I was asking for something different than
prepare_transfer_hardware (unless there is new code there in 3.12 that
already includes that - or it is in a separate tree). The latter prepares
the HW in some way - say, waking up a separate thread, ...

What I would like to see would be similar to prepared statements in SQL:
prepare the DMA control blocks once for a message (you may not change the
structure/addresses, but you may change the data you are transferring),
and then, when the message gets submitted via spi_async and then via
transfer to the driver, it would make use of the prepared DMA chain and
get attached to the DMA queue. This would remove the need to calculate
those DMA control blocks every time - in my case above I run the same
computations 3200 times/s, including dmapool_alloc/dmapool_free/...
Obviously it ONLY makes sense for SPI transfers that have is_dma_mapped=1 -
otherwise I would have to go through the loops of dma_map_single_attrs
every time...

So this is to fill in the context for my questions regarding why the
"transfer" interface is deprecated, and to provide a rationale for why I
would want to use it.
Ciao,
Martin

P.S.: For completeness: yes, I can do speed_hz and delay_usecs on a
per-spi_transfer basis in the DMA loop as well - probably coming closer to
the "real" requested delay than the steps interrupt -> interrupt handler ->
wakeup pump thread (inside the transfer_one_message handler) -> process
the other arguments of xfer that are applied after the transfer ->
udelay(xfer->delay_usecs).

On the RPI something like this takes about 0.1ms from one message to the
next getting delivered - OK, it includes some overhead (like calculating
the DMA control blocks), but it still shows the order of magnitude you can
expect when you have to wait for the message pump to get scheduled (even
with realtime priority). So a delay_usecs=100 would already be in the
range of what we would see naturally from the design on a low-power
single-core device - and that would result in waiting way too long. With
DMA, on the other hand, I can get the timing correct to within +-5% of the
requested value (jitter due to other memory transfers that block the
memory bus - but this can possibly get improved).

I will not even mention that I believe this SPI-DMA driver opens up the
possibility to read two 24-bit ADCs at a rate of 200k samples/s with
minimal jitter on this simple hardware at 20MHz SPI bus speed (some
calibration may be required...).
* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver

From: Mark Brown @ 2013-10-29 22:53 UTC (permalink / raw)
To: Martin Sperl; +Cc: Linus Walleij, linux-spi-u79uwXL29TY76Z2rM5mHXA

On Tue, Oct 29, 2013 at 10:18:55PM +0100, Martin Sperl wrote:

Please fix your mailer to word wrap within paragraphs, it will greatly
improve the readability of your messages.

> Finally I am also not sure how dmaengine can work when 2 DMAs are
> required working in parallel (one for RX the other for TX). All this
> means that I have to schedule the RX DMA from the TX-DMA and vice
> versa - in addition to configuring the SPI registers via DMA and
> working around those HW-bugs.

There are plenty of examples of this in mainline, for SPI and other
subsystems; mostly this is fairly simple to implement - for example, just
enable the DMAs prior to enabling the SPI controller, then the SPI
controller triggers the actual transfers.

> But with all that said, I can do a single SPI_MESSAGE with 5 transfers,
> 2 of which have CS_CHANGE, only with DMA without any interrupts -
> besides the final interrupt to trigger the wakeup of the
> spi_message_pump thread.

How exactly does this work in concrete terms? I'm particularly interested
in how the /CS manipulation is done. The basic idea doesn't sound terribly
different from things a lot of devices can do by driving the transfers
from the IRQ handler (possibly the hard IRQ handler) without waking the
tasklet; I'd hope we can make this work as a special case of that.
> As for the prepare_spi_message - I was asking for something different
> than prepare_transfer_hardware (unless there is new code there in 3.12
> that already includes that - or it is in a separate tree).

You should be working against the latest development code; we're almost at
the merge window for v3.13, and the v3.12 code is about 3 months old at
this point. Also bear in mind that this is open source - you can extend
the frameworks if they don't do what you want.
* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver

From: Martin Sperl @ 2013-10-30 17:11 UTC (permalink / raw)
To: Mark Brown; +Cc: Linus Walleij, linux-spi-u79uwXL29TY76Z2rM5mHXA

>> But with all that said, I can do a single SPI_MESSAGE with 5 transfers,
>> 2 of which have CS_CHANGE, only with DMA without any interrupts -
>> besides the final interrupt to trigger the wakeup of the
>> spi_message_pump thread.
>
> How exactly does this work in concrete terms? I'm particularly
> interested in how the /CS manipulation is done. The basic idea doesn't
> sound terribly different to things a lot of devices can do with driving
> the transfers from the IRQ handler (possibly hard IRQ handler) without
> waking the tasklet, I'd hope we can make this work as a special case of
> that.
Essentially what the DMA chain looks like is:

* If the clock divider changed: write 4 bytes to the CS register,
  resetting TX/RX.
* Write 4 bytes to the CS register, resetting TX/RX.
* Write 4 bytes to the LEN register, configuring the length of the next
  transfer.
* Write 4 bytes to the CS register, setting up the transfer via DMA (for
  the length above).
* Write the next "target" control block into the RX DMA control block.
* Write to the RX DMA control register to start the DMA reading the next
  control block.
* Add the following transfers in an inner loop, appending similar
  transfers to the RX/TX DMA queues until we reach the end of the xfers or
  we reach a reconfigured xfer (cs_change, delay_usecs; speed_hz is
  slightly different - we need to do a look-ahead in that case and abort
  as well). Note that the TX queue stops after the last transfer - we
  continue on the RX DMA chain.
* Schedule some "idle" DMA cycles on the RX queue to wait until the clock
  is finished.
* If delay_usecs is set, schedule some more idle cycles on the RX queue.
* If cs_change is set, write 4 bytes to the CS register resetting the
  "transfer" flags; also add some idle cycles so that CS stays high for
  half a clock cycle.
* Write the next "target" control block into the TX DMA control block.
* Write to the TX DMA control register to start the DMA reading the next
  control block.
* Create a transfer that keeps us informed which message, which transfer
  and how many bytes we have transferred.
* If not finished and there are more xfers, go back to the first step.
* The last message gets the "send IRQ" flag as a final step (if a callback
  is set on the message).

So it is a bit complicated, but this is the way it works in principle, and
it is why I need some CPU cycles to set it up - for some of the slightly
more complicated messages I need a total of about 60 control blocks. But -
hey: it works! And it is the reason why I am asking for a "prepared"
SPI-message option, to reduce this kind of overhead and get down to the
"real" transfer costs that are handled by HW...
I believe that in most drivers, spi_messages are potentially "prerendered"
if speed is important, and are not created and torn down every time -
otherwise setting up the SPI message itself becomes very computationally
intensive... So adding this additional spi_message_prepare call would come
cheap...

>> As for the prepare_spi_message - I was asking for something different
>> than prepare_transfer_hardware (unless there is new code there in 3.12
>> that already includes that - or it is in a separate tree).
>
> You should be working against the latest development code, we're almost
> at the merge window for v3.13, the v3.12 code is about 3 months old at
> this point. Also bear in mind that this is open source, you can extend
> the frameworks if they don't do what you want.

That is a bit more effort for me - if the RPI ran upstream code already it
would be much easier, but it does not... Device tree came in just before
the RPI tree could get integrated, so I am left somewhat behind, because
my main project relies on things that are not upstream and might take a
long time (RPI camera, ...).

Still, I want to get these things upstream, because I hope that it will
eventually happen... So I will try to write this spi_message_prepare
interface and provide a patch for inclusion, so that it may filter down
again to the RPI in due time...

Ciao,
Martin
* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver

From: Mark Brown @ 2013-10-30 21:51 UTC (permalink / raw)
To: Martin Sperl; +Cc: Linus Walleij, linux-spi-u79uwXL29TY76Z2rM5mHXA

On Wed, Oct 30, 2013 at 06:11:45PM +0100, Martin Sperl wrote:

> * Schedule some "idle" DMA cycles on the RX queue to wait until the
>   clock is finished.

I don't follow this, sorry. What does 'schedule some "idle" DMA cycles on
the RX queue' involve? Please remember that I have never seen anything
about this specific hardware before...

> * If cs_change is set, write 4 bytes to the CS register resetting the
>   "transfer" flags; also add some idle cycles so that CS stays high for
>   half a clock cycle.

If you're writing to registers, doesn't that mean you need to do this
after the transfer has completed, rather than having the hardware do
things autonomously (which seemed to be what you were describing)? With
the exception of these 'idle cycles', this all sounds like fairly standard
hardware with a DMA controller capable of chaining transfers for
scatter/gather.

> > You should be working against the latest development code, we're almost
> > at the merge window for v3.13, the v3.12 code is about 3 months old at
> > this point. Also bear in mind that this is open source, you can extend
> > the frameworks if they don't do what you want.

> That is a bit more effort for me - if the RPI came with upstream code
> already, then it would be much easier, but it does not...
> Device-tree came just before the RPI tree could get integrated, so I am
> left somewhat behind, because my main project relies on things that are
> not upstream and might take a long time (RPI camera, ...)
> Still I want to get these things upstream, because I hope that it will
> eventually happen...

Device tree is totally orthogonal to any of this; this is all about the
SPI core. I can't think of anything in the SPI core that is even
particularly dependent on recent changes in the rest of the kernel - you
could probably just drop a new version into an old kernel with minimal
fixups for integration with the old kernel.
* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver

From: Martin Sperl @ 2013-10-30 8:40 UTC (permalink / raw)
To: Linus Walleij; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA, Mark Brown

Maybe one more argument for the spi_prepare_message interface. For my use
case, it takes about 0.053ms to get from CS disabled + interrupt +
interrupt handler + wakeup to the point where I start doing the "real
work" - in my case, preparing DMA chains. That then takes - depending on
the number of transfers and features requested for the message - another
0.026ms to compute the DMA chain, and finally another 0.001ms to the point
where CS really gets driven low by the DMA.

So if we can optimize away this "CPU task" by preparing the messages, then
we are faster and there is a definite saving in CPU cycles...

Some quick analysis shows that dma_pool_alloc takes quite a bit of that
time (now 0.029ms - due to an additional CS up/down):

* Total time spent creating the chain of 48 DMA control blocks: 29us.
* Time spent in dma_pool_alloc: 9.1us.
* The same, but also including the "initialization" of the control
  blocks: 18.6us.

So the setup of DMA is definitely expensive in itself. But if I compare
"apples to apples" with spi-bcm2835, I get:

* spi-bcm2835dma (from start of DMA calculation until final CS): 58us
  total.
* spi-bcm2835 (from CS down to CS up): 71us total.

So there is an advantage to the DMA driver from the bus-efficiency
perspective.
And it would be much bigger if we had the chains "prepared" already - in
my example above it should go down to 27us, which is _almost_ optimal from
the CPU perspective...

I can share some images from the logic analyzer for future reference if
required.

Ciao,
Martin

P.S.: The spi_message presented in the above consists of 4 spi_transfers:

* 2 bytes write
* 2 bytes read + CS_CHANGE
* 2 bytes write
* 13 bytes read (implicit CS change because of end of message)

Also, my test workload produces the following statistics in vmstat:

* spi-bcm2835: 79000 interrupts/s, 22000 context switches/s and 70%
  system load
* spi-bcm2835dma: 15700 interrupts/s, 20000 context switches/s and 75%
  system load
* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver

From: Mark Brown @ 2013-10-30 17:19 UTC (permalink / raw)
To: Martin Sperl; +Cc: Linus Walleij, linux-spi-u79uwXL29TY76Z2rM5mHXA

On Wed, Oct 30, 2013 at 09:40:21AM +0100, Martin Sperl wrote:

> Maybe one more argument for the spi_prepare_message interface:

Did you see my replies yesterday? I mentioned that we already have this
interface in mainline.
* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver

From: Martin Sperl @ 2013-10-30 18:33 UTC (permalink / raw)
To: Mark Brown; +Cc: Linus Walleij, linux-spi-u79uwXL29TY76Z2rM5mHXA

Hi Mark!

I have seen your email and I have checked 3.12-rc7:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/include/linux/spi/spi.h?id=refs/tags/v3.12-rc7
(assuming this is the latest, but I would say that Linus' version is
authoritative).

What I have found is:

    int (*prepare_transfer_hardware)(struct spi_master *master);

This one I have mentioned before, but it prepares the hardware
(spi_master) every time we start after an empty queue, which does NOT do
what I was asking for. It is useful for starting up clocks and other
power-management tasks, but it does not prepare an spi_message once,
allowing it to be processed faster on calls to spi_async(...)!

So can you tell me which interface you are talking about that is supposed
to be in mainline and would solve my needs?

I would need an interface addition similar in intent to this. The
following would go into struct spi_master:

    int (*prepare_message)(struct spi_device *spi, struct spi_message *mesg);
    int (*unprepare_message)(struct spi_device *spi, struct spi_message *mesg);

    static int spi_message_prepare(struct spi_device *spi,
                                   struct spi_message *msg)
    {
        return (spi->master->prepare_message) ?
            spi->master->prepare_message(spi, msg) : 0;
    }

    static int spi_message_unprepare(struct spi_device *spi,
                                     struct spi_message *msg)
    {
        return (spi->master->unprepare_message) ?
dev->master->unprepare_message(spi,msg):0; } Plus some additional information in spi_message similar to this: void *prepared; struct spi_driver *prepared_for_device; The second more to avoid "problems" when drivers prepare for one spi-device and then use it on a different SPI device - mostly to croak about misuse of the prepared message... If this field is in there, then the "unprepare" call could even do without the argument for SPI_device, as it would be already stored in the message itself... Thanks, Martin -- To unsubscribe from this list: send the line "unsubscribe linux-spi" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 57+ messages in thread
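The wrapper pattern proposed in this message can be exercised outside the kernel. The sketch below is an illustrative userspace model of the proposal - the structures, fields, and function names are hypothetical stand-ins mirroring the email above, not mainline kernel API:

```c
#include <stddef.h>

/* Minimal stand-ins for the kernel structures discussed above;
 * field names follow the proposal but are hypothetical. */
struct spi_message;
struct spi_device;

struct spi_master {
	int (*prepare_message)(struct spi_device *dev, struct spi_message *msg);
	int (*unprepare_message)(struct spi_device *dev, struct spi_message *msg);
};

struct spi_device {
	struct spi_master *master;
};

struct spi_message {
	void *prepared;                  /* driver-private prepared state */
	struct spi_device *prepared_for; /* guards cross-device reuse */
};

/* Wrappers as proposed: a no-op when the bus driver offers no hook. */
static int spi_message_prepare(struct spi_device *spi, struct spi_message *msg)
{
	return spi->master->prepare_message ?
		spi->master->prepare_message(spi, msg) : 0;
}

static int spi_message_unprepare(struct spi_device *spi, struct spi_message *msg)
{
	return spi->master->unprepare_message ?
		spi->master->unprepare_message(spi, msg) : 0;
}

/* Example bus-driver hook: records that the message was prepared
 * and for which device (a stand-in for building a DMA chain). */
static int demo_prepare(struct spi_device *spi, struct spi_message *msg)
{
	msg->prepared = msg;
	msg->prepared_for = spi;
	return 0;
}
```

A client driver would call spi_message_prepare() once at setup and spi_message_unprepare() at teardown; masters without the hooks simply see no change in behavior.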
[parent not found: <8D8B0BAD-0E00-4147-B4C8-FBB18F060C96-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>]
* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver [not found] ` <8D8B0BAD-0E00-4147-B4C8-FBB18F060C96-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> @ 2013-10-30 21:57 ` Mark Brown [not found] ` <20131030215716.GV2493-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> 0 siblings, 1 reply; 57+ messages in thread From: Mark Brown @ 2013-10-30 21:57 UTC (permalink / raw) To: Martin Sperl; +Cc: Linus Walleij, linux-spi-u79uwXL29TY76Z2rM5mHXA [-- Attachment #1: Type: text/plain, Size: 764 bytes --] On Wed, Oct 30, 2013 at 07:33:30PM +0100, Martin Sperl wrote: > I have seen your email and I have checked 3.12-rc7: > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/include/linux/spi/spi.h?id=refs/tags/v3.12-rc7 > (Assuming this is the latest, but I would say that the version of Linus would > be authoritative) No, look in -next (or the SPI tree which you can see in MAINTAINERS): git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi.git > The following would go into struct spi_master: > int (*prepare_message)(struct spi_device *dev,struct spi_message *mesg); > int (*unprepare_message)(struct spi_device *dev,struct spi_message *mesg); Which, like I say, is exactly what's there in terms of a driver API. [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 836 bytes --] ^ permalink raw reply [flat|nested] 57+ messages in thread
[parent not found: <20131030215716.GV2493-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>]
* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver [not found] ` <20131030215716.GV2493-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> @ 2013-10-30 22:52 ` Martin Sperl [not found] ` <3342FD19-4438-463B-89B2-A83D3475AC22-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> 0 siblings, 1 reply; 57+ messages in thread From: Martin Sperl @ 2013-10-30 22:52 UTC (permalink / raw) To: Mark Brown; +Cc: Linus Walleij, linux-spi-u79uwXL29TY76Z2rM5mHXA

Hi Mark!

After a lot of digging I found the API you are referring to in:
http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/tree/include/linux/spi/spi.h

So when will this go in?

How would you recommend connecting your prepared structure to a specific message? There is no flag/field in spi_message to link that. Would this mean that I would keep an internal list with the address of the SPI message and my "prepared data", and I need to iterate over that list every time I need it? This search might - worst case - become a scaling issue if too many prepared messages are created... OK, there are other data structures that would make lookup faster, but does this not make the driver more complex than necessary?

It could lead to all drivers starting "simple", and then at some point it becomes an issue as one driver makes heavy use of this, showing it has become a bottleneck - and then all spi-bus-drivers would need to get touched to optimize things...

Why not provide a lookup function for the prepared data immediately and make sure the drivers use it? Then we only have to change it later in one place and reduce code replication across drivers. (Obviously you would also need put/delete methods to make that interface complete.)

Also those function pointers do not have wrapper functions for ease of use, so that they can get used in the individual drivers that know that they can prepare the spi_message, as they do not change the transfer structure...
Thanks, Martin On 30.10.2013, at 22:57, Mark Brown wrote: > On Wed, Oct 30, 2013 at 07:33:30PM +0100, Martin Sperl wrote: > >> I have seen your email and I have checked 3.12-rc7: >> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/include/linux/spi/spi.h?id=refs/tags/v3.12-rc7 >> (Assuming this is the latest, but I would say that the version of Linus would >> be authoritative) > > No, look in -next (or the SPI tree which you can see in MAINTAINERS): > > git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi.git > >> The following would go into struct spi_master: >> int (*prepare_message)(struct spi_device *dev,struct spi_message *mesg); >> int (*unprepare_message)(struct spi_device *dev,struct spi_message *mesg); > > Which, like I say, is exactly what's there in terms of a driver API. -- To unsubscribe from this list: send the line "unsubscribe linux-spi" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 57+ messages in thread
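The "internal list" Martin describes - a bus driver mapping a (device, message) pair to its prepared data with a linear scan - can be sketched in a few lines. This is an illustrative userspace model; the structure and names are hypothetical, not kernel API:

```c
#include <stddef.h>

/* Without a field in spi_message itself, a bus driver must keep its own
 * (spi_device, spi_message) -> prepared-data mapping. The simplest form
 * is a singly linked list with a linear scan - O(n) per lookup, which
 * is the scaling concern raised above. */
struct prepared_entry {
	const void *spi;     /* the spi_device the message was prepared for */
	const void *message; /* the spi_message itself */
	void *data;          /* e.g. a precomputed DMA control-block chain */
	struct prepared_entry *next;
};

/* Walk the list; NULL means "not prepared, take the slow path". */
static void *find_prepared(struct prepared_entry *head,
			   const void *spi, const void *message)
{
	for (; head; head = head->next)
		if (head->spi == spi && head->message == message)
			return head->data;
	return NULL;
}
```

For a handful of prepared messages per bus the scan is cheap; the thread's point is that once one driver prepares many messages, this per-call lookup becomes the hot path and wants core support.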
[parent not found: <3342FD19-4438-463B-89B2-A83D3475AC22-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>]
* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver [not found] ` <3342FD19-4438-463B-89B2-A83D3475AC22-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> @ 2013-10-31 0:10 ` Mark Brown [not found] ` <20131031001004.GW2493-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> 0 siblings, 1 reply; 57+ messages in thread From: Mark Brown @ 2013-10-31 0:10 UTC (permalink / raw) To: Martin Sperl; +Cc: Linus Walleij, linux-spi-u79uwXL29TY76Z2rM5mHXA [-- Attachment #1: Type: text/plain, Size: 3521 bytes --] On Wed, Oct 30, 2013 at 11:52:33PM +0100, Martin Sperl wrote: Please fix your mailer to word wrap within paragraphs, it makes things much more legible. > After a lot of digging I found the API you are referring to in: > http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/tree/include/linux/spi/spi.h > So when will this go in? During the v3.13 merge window. > How would you recommend to connect your prepared structure to a specific message? > There is no flag/field in spi_message to link that. I'm working on the basis that we should add what we need to the structure as we add code using it, right now there's no real usage so nothing to do. > Would this mean that I would keep an internal list with the address of > the SPI message any my "prepared data" and I need to iterate over that > list every time I need it? This search might - worsted case - become > a scaling issue if too many prepared > statements are created... Like I say just add stuff to the core as needed. If we're getting scaling problems due to too many transfers in flight we can probably arrange to limit how many we do. Hopefully that won't be too much of an issue, though - we're mostly concerned with the last few entries and with the ones currently in flight with the hardware. 
I'd like to see all this stuff done in the framework as much as possible; like I say, a lot of this is very common, and even where we can't offload everything to hardware we can push a lot down into hard IRQ context, which will help avoid scheduling time. We should build things up piece by piece, though, rather than trying to do everything in one fell swoop.

My general plan had been to start off by factoring out the prepare and unprepare of DMA transfers for dmaengine based devices, including things like the cache flushing, then add support for coalescing scatter/gather operations - right now I think only one driver can do any form of gather, though a large proportion of the hardware with DMA has that support.

> It could lead to all drivers starting "simple" and then at some point it becomes
> an issue as one driver makes heavy use of this showing it becomes a bottle-neck
> and then all spi-bus-drivers would need to get touched to optimize things...

That's why I want to get as much into the core as possible, like was done with Linus' work on the queue. dmaengine helps a lot here since it means all the DMA controllers look the same, and hopefully we can also model the operations we want to do at the hardware interaction level so all the fancy queue/transfer management stuff is shared in the core.

> Why not provide a lookup function for the prepared data immediately and make
> sure the drivers use this one. Then we only have to change it later in one place
> and reduce code-replication across drivers.
> (obviously you would also need put/delete methods to make that interface complete.)

That's the sort of stuff I'm talking about, yes (though I'm not 100% sure what the "prepared data" is here).

> Also those function pointers do not have wrapper functions for ease of use,
> so that they can get used in the individual drivers that know that they can
> prepare the spi_message, as they do not change the transfer structure....
This is part of pushing things into the core - the idea is that we should have little if any code using these things in the driver and instead the core could have the code for working out when to do things. [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 836 bytes --] ^ permalink raw reply [flat|nested] 57+ messages in thread
[parent not found: <20131031001004.GW2493-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>]
* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver [not found] ` <20131031001004.GW2493-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> @ 2013-11-04 17:33 ` Martin Sperl [not found] ` <18639D9A-630E-44F3-AA7A-ADFF5D5E8B56-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> 0 siblings, 1 reply; 57+ messages in thread From: Martin Sperl @ 2013-11-04 17:33 UTC (permalink / raw) To: Mark Brown, Linus Walleij Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA, Stephen Warren

Hi!

(After some emails that went offline.) I have created the "prepared" message interface for the spi-bcm2835dma driver on the 3.10 kernel I am working off right now (so I had to do a bit of "linking" trickery).

So here is a link to an example showing how much prepared spi_messages really can improve SPI throughput - even without changing anything besides preparing the messages for direct DMA use:
http://www.raspberrypi.org/phpBB3/viewtopic.php?f=44&t=19489&p=448328#p448328

The link goes to the RPI forum and includes the measurements plus some images showing both cases (configured via a module parameter for now).

Quick summary: for my test use case, just enabling "prepared messages" reduced the time for a "simple" transfer from 390us (without prepare) to 230us (with prepare), and the driver is still using the threaded message pump. OK - the "default" spi-bcm2835.c driver currently in the mainline takes 245us for the same thing, but it runs at 130k interrupts/s and 95% system load, which makes it quite impractical to do anything else. So getting this down to 80% system load plus shorter responses is already quite an improvement.

It is a drawback that the DMA driver needs more overhead for the unprepared spi-message case. But then, if a performance-critical driver does not make use of prepared messages yet and there is a performance issue, it needs to get modified to call spi_prepare_message() during setup.

Anyway, if a driver cares about throughput, it had better avoid allocating new memory to create (and release) an SPI message in the interrupt handler or callback in the first place. Then adding the single extra statement to prepare and unprepare the message comes cheap.

There is one hard assumption: each and every xfer has to be configured with a DMA address for source and destination (unless NULL). Obviously the message is "static" as soon as prepare has been called. Otherwise the overhead for iterating the messages and calling dma_(un)map_single becomes the limiting factor, and the difference in code (compared to creating the whole chain from scratch) is minimal besides the additional allocations - we have to "walk", we have to parse, ... (Also think about how it works if memory itself is fragmented on the bus address side: depending on the alignment of data, a different number of DMA transfers would be needed - thus it seems quite impractical to implement.)

So from my experience I would recommend adding something like this to spi.h, so that it can get used for real and not in the export-linked way I had to use for this proof of concept:

static int bcm2835dma_spi_prepare_message(struct spi_device *spi,
		struct spi_message *msg)
{
	if (spi->master->prepare_message)
		return spi->master->prepare_message(spi, msg);
	else
		return 0;
}

static int bcm2835dma_spi_unprepare_message(struct spi_device *spi,
		struct spi_message *msg)
{
	if (spi->master->unprepare_message)
		return spi->master->unprepare_message(spi, msg);
	else
		return 0;
}

And finally also some management functionality for "finding" those prepared messages - better to implement something and make it opaque than every driver implementing its own thing and then having to track that.

My mcp2515a driver is already allocating something like 14 prepared messages, so walking that list is not that expensive, but at some point a binary tree might be better from the performance perspective.

struct spi_prepared_message {
	/* the list in which we store the message */
	struct list_head prepared_list;
	/* the identification data for matching */
	struct spi_device *spi;
	struct spi_message *message;
};

something like this:

struct list_head prepared_messages_list;

plus a spinlock protecting the above in spi_master, and the following prototypes:

static struct spi_prepared_message *spi_find_prepared_message_nolock(
		struct spi_device *spi, struct spi_message *message);
static struct spi_prepared_message *spi_find_prepared_message(
		struct spi_device *spi, struct spi_message *message);
static int spi_add_prepared_message(
		struct spi_prepared_message *prepared);
static struct spi_prepared_message *spi_remove_prepared_message(
		struct spi_device *spi, struct spi_message *message);

I got the above prototypes implemented (but with the data sitting in the master-private data structure, so I am not sure if it is worth sharing them).

Maybe we can get those into the "next" tree so that they can get pulled into 3.13?

Thanks, Martin

P.s.: Note: I have not switched to the dmaengine interface for the driver. There are too many things I would need to learn to make that switch now. Also (as far as Mark said during our offline discussions) there seem to be some gaps that would require extending the DMA engine, which would make getting the driver to work hard at first. Finally I fear (from my limited knowledge) that scheduling via the DMA engine itself would require a prepare, and also I would have to keep up with double allocations to link everything in place, further reducing the throughput for the case of non-prepared drivers.

The driver itself can get found at: https://github.com/msperl/spi-bcm2835

-- To unsubscribe from this list: send the line "unsubscribe linux-spi" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 57+ messages in thread
[parent not found: <18639D9A-630E-44F3-AA7A-ADFF5D5E8B56-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>]
* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver [not found] ` <18639D9A-630E-44F3-AA7A-ADFF5D5E8B56-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> @ 2013-11-04 18:45 ` Mark Brown [not found] ` <20131104184511.GR2493-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> 0 siblings, 1 reply; 57+ messages in thread From: Mark Brown @ 2013-11-04 18:45 UTC (permalink / raw) To: Martin Sperl Cc: Linus Walleij, linux-spi-u79uwXL29TY76Z2rM5mHXA, Stephen Warren [-- Attachment #1: Type: text/plain, Size: 302 bytes --] On Mon, Nov 04, 2013 at 06:33:19PM +0100, Martin Sperl wrote: > The driver itself can get found at: https://github.com/msperl/spi-bcm2835 Please post code/patches normally rather than linking to git, it's much less likely that people will look at some random gitweb thing and much harder to respond. [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 836 bytes --] ^ permalink raw reply [flat|nested] 57+ messages in thread
[parent not found: <20131104184511.GR2493-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>]
* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver [not found] ` <20131104184511.GR2493-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> @ 2013-11-04 21:43 ` Martin Sperl [not found] ` <5A55A832-5313-499C-A483-BF5A6649D69D-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> 0 siblings, 1 reply; 57+ messages in thread From: Martin Sperl @ 2013-11-04 21:43 UTC (permalink / raw) To: Mark Brown Cc: Linus Walleij, linux-spi-u79uwXL29TY76Z2rM5mHXA, Stephen Warren Hi Mark! I just found out that I am in a chicken or egg situation: There is no DMA support whatsoever in the upstream RPI support. So sending any patches for spi with DMA becomes impossible right now. Setting up a system that runs the upstream kernel without patches will be some effort on my side so that may take quite some time for me to start. Until then I will need to "work" with the tainted kernel and APIs from the foundation using their special APIs, which you obviously can not accept for moving them upstream. That is why I just wanted to share with you what may get achieved with your new interface and what is missing from my perspective to make it work in a practical manner. And that is why I have added only the link to the driver. I am in a sort of situation that makes it hard to move in either direction. I believe we all got our interests: * you want your prepared interface tested in a driver * I want to get the spi-dma driver in the main-line * some other people want the dma engine work to get started for the RPI, so that they can get to real productive work on USB and others. All I can give you right now is the feedback from my experience with this out of kernel driver, which is not a candidate for merging in its current state due to limitations mentioned above. The results look promising and will improve further! 
The only thing I can think of that would help me (and others) in the long run is getting those missing parts of your prepare API into the kernel early to avoid code duplication (see earlier email with the basics). By then I may have the driver up and running quite well on the Foundation RPI kernel and hopefully some people should also have tested it with more devices. As soon as that is stable and I get a "vanilla" kernel running with the improvements I had recommended from my experience, I can start some basics of dmaengine for the RPI (allocation and release DMA to start keeping everything else in the driver for now) and then we can really start talking about patches for the spi-driver to send. again: first the dma-engine part would need to be upstream before I assume you would accept to merge the driver, as the functionality would NOT work otherwise. So we talk about 3.15 at the very earliest... So is the presented idea an approach to move forward on? Or do you have a better approach which we could follow to get us out of this tie? Thanks, Martin P.s: here the pieces of code for which I have earlier only shown the prototypes - these are untested and I am not sure they will even compile as is because I had to modify the driver specific version here in this email to make them work with the list/lock variables in the master-structure - the driver specific code works. Also no documentation for the functions. That is the best I can do for now. 
/* generic wrapper functions for device drivers to use */
static int bcm2835dma_spi_prepare_message(struct spi_device *spi,
		struct spi_message *msg)
{
	if (spi->master->prepare_message)
		return spi->master->prepare_message(spi, msg);
	else
		return 0;
}

static int bcm2835dma_spi_unprepare_message(struct spi_device *spi,
		struct spi_message *msg)
{
	if (spi->master->unprepare_message)
		return spi->master->unprepare_message(spi, msg);
	else
		return 0;
}

In struct spi_master (note that this is missing initialization for those two fields):

	struct list_head prepared_messages_list;
	spinlock_t prepared_messages_lock;

/* functions used to store/find/remove prepared objects -
 * for now putting them in a simple list.
 * only bus drivers should really call these in their (un)prepare code.
 * later on it may be worthwhile changing this structure to
 * be a binary tree - the big question is: how many prepared messages
 * may we expect on a single SPI bus?
 * alternatively we could also move it off to the spi_device structure.
 */

/* the structure to embed in the bus specific implementation */
struct spi_prepared_message {
	/* the list in which we store the prepared messages */
	struct list_head prepared_list;
	/* the identification data for searching the list */
	struct spi_device *spi;
	struct spi_message *message;
	/* maybe some search statistics, so that we can reorder/optimize
	 * the list to make the searches for the most commonly used
	 * messages faster by being in "front"
	 */
};

static struct spi_prepared_message *spi_find_prepared_message_nolock(
		struct spi_device *spi, struct spi_message *message)
{
	/* ideally this list would be in master */
	struct spi_master *master = spi->master;
	struct spi_prepared_message *prepared;

	/* now loop the entries */
	list_for_each_entry(prepared, &master->prepared_messages_list,
			prepared_list) {
		/* if we match, then return */
		if ((prepared->spi == spi) && (prepared->message == message))
			return prepared;
	}
	/* return not found */
	return NULL;
}

static struct spi_prepared_message *spi_find_prepared_message(
		struct spi_device *spi, struct spi_message *message)
{
	struct spi_prepared_message *prepared;
	struct spi_master *master = spi->master;
	unsigned long flags;

	/* lock */
	spin_lock_irqsave(&master->prepared_messages_lock, flags);
	/* try to find it */
	prepared = spi_find_prepared_message_nolock(spi, message);
	/* and unlock and return */
	spin_unlock_irqrestore(&master->prepared_messages_lock, flags);
	return prepared;
}

static int spi_add_prepared_message(struct spi_prepared_message *prepared)
{
	struct spi_device *spi = prepared->spi;
	struct spi_master *master = spi->master;
	struct spi_message *message = prepared->message;
	unsigned long flags;

	/* lock */
	spin_lock_irqsave(&master->prepared_messages_lock, flags);
	/* find the entry and croak if it has been prepared before */
	if (spi_find_prepared_message_nolock(spi, message)) {
		spin_unlock_irqrestore(&master->prepared_messages_lock, flags);
		dev_err(&spi->dev,
			"SPI message has already been prepared once\n");
		return -EPERM;
	}
	/* now add it to the list at the tail */
	list_add_tail(&prepared->prepared_list,
		&master->prepared_messages_list);
	/* unlock and return */
	spin_unlock_irqrestore(&master->prepared_messages_lock, flags);
	return 0;
}

static struct spi_prepared_message *spi_remove_prepared_message(
		struct spi_device *spi, struct spi_message *message)
{
	struct spi_prepared_message *prep = NULL;
	struct spi_master *master = spi->master;
	unsigned long flags;

	/* lock */
	spin_lock_irqsave(&master->prepared_messages_lock, flags);
	/* find the entry */
	prep = spi_find_prepared_message_nolock(spi, message);
	/* now unlink the prepared item - if found - we could croak otherwise */
	if (prep)
		list_del(&prep->prepared_list);
	else
		dev_err(&spi->dev, "SPI message has not been prepared\n");
	/* unlock and return the version removed from the list */
	spin_unlock_irqrestore(&master->prepared_messages_lock, flags);
	return prep;
}

On 04.11.2013, at 19:45, Mark Brown wrote:

> On Mon, Nov 04, 2013 at 06:33:19PM +0100, Martin Sperl wrote:
>
>> The driver itself can get found at: https://github.com/msperl/spi-bcm2835
>
> Please post code/patches normally rather than linking to git, it's much
> less likely that people will look at some random gitweb thing and much
> harder to respond.

-- To unsubscribe from this list: send the line "unsubscribe linux-spi" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 57+ messages in thread
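The "search statistics" comment in Martin's struct hints at a cheap way to keep the linear list fast without a tree: move an entry to the front of the list on every hit, so the most recently used prepared messages are found first. The sketch below is an illustrative userspace model of that idea (hypothetical names; a kernel version would use list.h and hold the spinlock exactly as in the functions above):

```c
#include <stddef.h>

/* Simplified stand-in for spi_prepared_message: identification
 * data plus the next pointer of a singly linked list. */
struct pentry {
	const void *spi;
	const void *message;
	struct pentry *next;
};

/* Linear search with move-to-front: on a hit, unlink the entry and
 * relink it at the head, so hot prepared messages stay near the front
 * and subsequent lookups terminate early. */
static struct pentry *find_move_to_front(struct pentry **head,
					 const void *spi, const void *message)
{
	struct pentry **pp, *e;

	for (pp = head; (e = *pp) != NULL; pp = &e->next) {
		if (e->spi == spi && e->message == message) {
			*pp = e->next;   /* unlink from current position */
			e->next = *head; /* relink at the head */
			*head = e;
			return e;
		}
	}
	return NULL; /* not prepared */
}
```

This keeps the common case close to O(1) for drivers that reuse a few messages heavily, while the worst case stays the full scan Martin is worried about.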
[parent not found: <5A55A832-5313-499C-A483-BF5A6649D69D-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>]
* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver [not found] ` <5A55A832-5313-499C-A483-BF5A6649D69D-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> @ 2013-11-05 1:03 ` Linus Walleij 2013-11-06 9:48 ` Mark Brown 1 sibling, 0 replies; 57+ messages in thread From: Linus Walleij @ 2013-11-05 1:03 UTC (permalink / raw) To: Martin Sperl; +Cc: Mark Brown, linux-spi-u79uwXL29TY76Z2rM5mHXA, Stephen Warren On Mon, Nov 4, 2013 at 10:43 PM, Martin Sperl <martin-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> wrote: > I just found out that I am in a chicken or egg situation: > There is no DMA support whatsoever in the upstream RPI support. > So sending any patches for spi with DMA becomes impossible right now. I wrote in my first reply: "But I hope that you have the necessary infrastructure using the dmaengine subsystem for this, or that changes requires will be proposed to that first or together with these changes." So now you are back at this point :-) Anyway, doing dmaengine support is a venture on its own and as usual it falls upward to anyone who first needs it to do that. I suggest you look at some of the newer dmaengines in drivers/dma for inspiration, and get memcpy in place first so you can test the driver stand-alone with drivers/dma/dmatest.c Yours, Linus Walleij -- To unsubscribe from this list: send the line "unsubscribe linux-spi" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver [not found] ` <5A55A832-5313-499C-A483-BF5A6649D69D-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> 2013-11-05 1:03 ` Linus Walleij @ 2013-11-06 9:48 ` Mark Brown [not found] ` <20131106094854.GF11602-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> 1 sibling, 1 reply; 57+ messages in thread From: Mark Brown @ 2013-11-06 9:48 UTC (permalink / raw) To: Martin Sperl Cc: Linus Walleij, linux-spi-u79uwXL29TY76Z2rM5mHXA, Stephen Warren [-- Attachment #1: Type: text/plain, Size: 1080 bytes --] On Mon, Nov 04, 2013 at 10:43:03PM +0100, Martin Sperl wrote: > again: first the dma-engine part would need to be upstream before I assume > you would accept to merge the driver, as the functionality would NOT work > otherwise. So we talk about 3.15 at the very earliest... Well, v3.14 should be possible - we're just coming up to the merge window for v3.13 so v3.14 is still wide open. > So is the presented idea an approach to move forward on? > Or do you have a better approach which we could follow to get us out of this > tie? I think the current approach seems fine. > /* generic wrapper functions for device drivers to use */ > static int bcm2835dma_spi_prepare_message(struct spi_device *spi, > struct spi_message *msg) > { > if(spi->master->prepare_message) { > return spi->master->prepare_message(spi,msg); > } else { > return 0; > } > } Like I said earlier I'm concerned about the idea of drivers calling these functions directly, this should all be factored out of drivers since there's nothing really driver specific here. [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 836 bytes --] ^ permalink raw reply [flat|nested] 57+ messages in thread
[parent not found: <20131106094854.GF11602-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>]
* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver [not found] ` <20131106094854.GF11602-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> @ 2013-11-06 11:28 ` Martin Sperl [not found] ` <844EDAEA-3FDC-48D0-B59E-CECC0A83761E-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> 0 siblings, 1 reply; 57+ messages in thread From: Martin Sperl @ 2013-11-06 11:28 UTC (permalink / raw) To: Mark Brown Cc: Linus Walleij, linux-spi-u79uwXL29TY76Z2rM5mHXA, Stephen Warren Hi Mark! I am currently setting up a second RPI for my work on upstream kernels. This still will take some more time, as I can only access it via a serial console, as there is no upstream USB driver (which I need for the on-board network card to work)... >> /* generic wrapper functions for device drivers to use */ >> static int bcm2835dma_spi_prepare_message(struct spi_device *spi, >> struct spi_message *msg) >> { >> if(spi->master->prepare_message) { >> return spi->master->prepare_message(spi,msg); >> } else { >> return 0; >> } >> } > > Like I said earlier I'm concerned about the idea of drivers calling > these functions directly, this should all be factored out of drivers > since there's nothing really driver specific here. Actually it is VERY driver specific! The thing is that you can not easily detect if the same spi_message has been used and in some way changed without having to walk and compare the structure to an existing checksum - and such computation would take say 1/3rd of the time it takes to prepare the message again. Then if it has been found to be different, you would first need to erase the old structure before creating the new one. So this kind of heuristics increases the CPU load resulting in unnecessary computations at minimal benefits. So there is the need to "identify" a message somehow as "prepare-able" (and thus conforming to some restrictions - like the one I had mentioned) in the driver itself. 
Whether you do this via a separate boolean in the spi_message structure or by an explicit call from the driver does not, IMO, make much of a difference - the code needs to get optimized for this anyway. The advantage of the explicit call is that those prepared messages can get released immediately when the driver is unloaded. Otherwise they can only get released when the master is unloaded (or via some LRU logic). And if you do not release on driver unload, then in the worst case you could load the driver again and the spi_message would get created at the same address, but the xfers and their pointers might point elsewhere. And that would _definitely_ be a problem. (You could add cookies, ... but then you would have more computations still.) For spi_async all the above checks could happen in an IRQ context, so the code path should _really_ be optimized! Anyway: this mostly applies to drivers that need to be fast and have high throughput. And those would be the only ones that need "special" handling. For those that just use the "simple" sync API (variations of spi_write_then_read and similar), the framework may already have done the prepare for its "hot-path" situation. That way those "simple" drivers can get a benefit as well. I do not think it is worth coding "complex" heuristics, like checking X times for a message to be "identical" and only then taking the short-cut. That can lead to some very strange situations to debug. So as far as I am concerned it is better to explicitly say in the driver: yes we can. It might be worth thinking of extending the interface for "typical" transfer cases so that more use cases get handled automatically (with prepared code). But that would require reviewing/rewriting existing drivers to make use of those interfaces. 
Finally: with the prepare/unprepare calls in the master structure, the thing you risk is that each driver writes its own:

	if (master->prepare_message)
		master->prepare_message(spi, message);

so better make that wrapper available...

Ciao, Martin

P.s.: Just looking at __spi_async already gives me some thoughts: we are already iterating over the structure once for the sanity checks. That takes time - cycles that could be saved for the "prepared" message case entirely by short-cutting those sanity checks and calling master->transfer(spi, message) immediately. So integrating the "search" for prepared messages here and skipping the tests in this case would also improve the throughput...

Otherwise you risk that a high-throughput driver directly calls

	message->spi = spi;
	master->transfer(spi, message);

to avoid those unnecessary CPU cycles - especially if it is optimized. OK, you could reject such a patch, but then you would ignore the root cause WHY people do this!

One way around this could be to provide a default "prepare" implementation that only does the checks on the structure. That way any driver that does prepare its messages would get the benefit independent of whether the bus driver itself implements prepared messages or not. It could boil down to something like this (this would require a flag marking the message as prepared):

static int __spi_sanity_check_message(struct spi_device *spi,
		struct spi_message *message)
{
	/* message sanity check code here */
	return 0;
}

/* set prepare_message to this in spi_register_master,
 * if prepare_message is NULL (not included)
 */
int __spi_prepare_message_default(struct spi_device *spi,
		struct spi_message *message)
{
	int ret = __spi_sanity_check_message(spi, message);

	if (!ret)
		message->is_prepared = 1;
	return ret;
}

static int __spi_async(struct spi_device *spi, struct spi_message *message)
{
	int ret = 0;

	/* the non-prepared path */
	if (!message->is_prepared) {
		/* sanity checks here */
		ret = __spi_sanity_check_message(spi, message);
		if (ret)
			return ret;
	}
	/* and the "normal" call of transfer */
	message->spi = spi;
	message->status = -EINPROGRESS;
	return spi->master->transfer(spi, message);
}

-- To unsubscribe from this list: send the line "unsubscribe linux-spi" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 57+ messages in thread
[parent not found: <844EDAEA-3FDC-48D0-B59E-CECC0A83761E-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>]
* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver [not found] ` <844EDAEA-3FDC-48D0-B59E-CECC0A83761E-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> @ 2013-11-06 11:32 ` Mark Brown [not found] ` <20131106113219.GJ11602-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> 0 siblings, 1 reply; 57+ messages in thread From: Mark Brown @ 2013-11-06 11:32 UTC (permalink / raw) To: Martin Sperl Cc: Linus Walleij, linux-spi-u79uwXL29TY76Z2rM5mHXA, Stephen Warren [-- Attachment #1: Type: text/plain, Size: 837 bytes --] On Wed, Nov 06, 2013 at 12:28:15PM +0100, Martin Sperl wrote: > >> /* generic wrapper functions for device drivers to use */ > >> static int bcm2835dma_spi_prepare_message(struct spi_device *spi, > > Like I said earlier I'm concerned about the idea of drivers calling > > these functions directly, this should all be factored out of drivers > > since there's nothing really driver specific here. > Actually it is VERY driver specific! > The thing is that you can not easily detect if the same spi_message has > been used and in some way changed without having to walk and compare the > structure to an existing checksum - and such computation would take say > 1/3rd of the time it takes to prepare the message again. I'm sorry but the above makes no sense to me. Why on earth would you need to do these checks? [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 836 bytes --] ^ permalink raw reply [flat|nested] 57+ messages in thread
[parent not found: <20131106113219.GJ11602-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>]
* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver [not found] ` <20131106113219.GJ11602-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> @ 2013-11-06 12:10 ` Martin Sperl [not found] ` <C6C68042-63A0-40FD-8363-B4553ECB4774-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> 0 siblings, 1 reply; 57+ messages in thread From: Martin Sperl @ 2013-11-06 12:10 UTC (permalink / raw) To: Mark Brown Cc: Linus Walleij, linux-spi-u79uwXL29TY76Z2rM5mHXA, Stephen Warren

> I'm sorry but the above makes no sense to me. Why on earth would you
> need to do these checks?

How can you check that the code does not do something like this?

    struct spi_message msg;
    struct spi_transfer a, b;

    spi_message_init(&msg);
    spi_message_add_tail(&a, &msg);
    spi_sync(spi, &msg);
    spi_message_add_tail(&b, &msg);
    spi_sync(spi, &msg);

The above is an allowed sequence that must work - otherwise we would break an existing API. So assuming that the framework executes those prepare calls itself - say inside the spi_async function - how would it detect such a change without checksumming the message?

Think of 10000 interrupts/s, each scheduling an identical SPI message. If each check takes 5us to verify the structure is identical to the last call, then that alone sums up to 50ms per second (or 5% CPU) without doing anything "helpful".

The explicit one-time call spi_prepare_message states that the driver delegates ownership of the message object to the SPI bus master until it takes ownership back with spi_unprepare_message. So it may not free or change the structures in any way, but it may still use the message (spi_async) and read/write the data bytes (not their location).

It may even delegate it to multiple masters, but may only use it on one SPI bus at a time. (This second part should be obvious, as it is already an implicit requirement of the current API via the queue structure member; also, two channels writing to the same memory location does not make sense - though theoretically it would for transmit-only transfers.)

Martin

^ permalink raw reply [flat|nested] 57+ messages in thread
[parent not found: <C6C68042-63A0-40FD-8363-B4553ECB4774-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>]
* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver [not found] ` <C6C68042-63A0-40FD-8363-B4553ECB4774-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> @ 2013-11-06 16:24 ` Mark Brown [not found] ` <20131106162410.GB2674-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> 0 siblings, 1 reply; 57+ messages in thread From: Mark Brown @ 2013-11-06 16:24 UTC (permalink / raw) To: Martin Sperl Cc: Linus Walleij, linux-spi-u79uwXL29TY76Z2rM5mHXA, Stephen Warren [-- Attachment #1: Type: text/plain, Size: 1125 bytes --] On Wed, Nov 06, 2013 at 01:10:32PM +0100, Martin Sperl wrote: > How can you check that the code does not do something like this? > struct spi_message msg; > struct spi_transfer a,b; > spi_message_init(&msg); > spi_message_add_tail(&msg,&a); > spi_sync(spi,&msg); > spi_message_add_tail(&msg,&b); > spi_sync(spi,msg); > The above is an allowed sequence, that must work - otherwise we would > break an existing API. This would be broken anyway with existing drivers; if we want to support that we need to take copies of both the messages and the transfers but really that just looks crazy. > The explicit one time call spi_prepare_message states that the driver > delegates ownership over the message object to the SPI bus-master until > it takes ownership back with spi_unprepare_message. So it may not free > or change the structures in any way, but it still may use the message > (spi_async) and read write in the data bytes (not the location). No, modifying an in flight transfer is not supported or sensible. There's a whole bunch of problems there, including synchronisation with DMA. [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 836 bytes --] ^ permalink raw reply [flat|nested] 57+ messages in thread
[parent not found: <20131106162410.GB2674-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>]
* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver [not found] ` <20131106162410.GB2674-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> @ 2013-11-06 19:54 ` Martin Sperl [not found] ` <3B0EDE3F-3386-4879-8D89-2E4577860073-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> 0 siblings, 1 reply; 57+ messages in thread From: Martin Sperl @ 2013-11-06 19:54 UTC (permalink / raw) To: Mark Brown Cc: Linus Walleij, linux-spi-u79uwXL29TY76Z2rM5mHXA, Stephen Warren

>> The above is an allowed sequence, that must work - otherwise we would
>> break an existing API.
>
> This would be broken anyway with existing drivers; if we want to support
> that we need to take copies of both the messages and the transfers but
> really that just looks crazy.

The sequence shown is not broken - note that I have been using spi_sync, not spi_async! (Think also of people using this via SPIDEV, where you do not have the control you would have in kernel space.) So it is in principle a valid use of the API - not a "beautiful" implementation, but something you have to be able to handle. Unless you _want_ to break the API. But then you should write a more explicit description of what the API allows, what it does not, and what the other assumptions are. (And you would have to track down all the bugs in existing drivers.)

Also consider a scenario like this:

* kmalloc memory for message
* add xfers to message
* call spi_sync
* kfree xfers
* kfree message
* kmalloc memory for message
** here we have a high likelihood of getting the same pointer again
* add xfers to message
* call spi_sync
* kfree xfers
* kfree message

So we would again need a means to detect such changes. The pointers are the same, but the data may be different. As said: we would need some "random" token in the message to identify a prepared message - we can do that... Also, you would need to track down all drivers that might have such a pattern.

But you need some cleanup mechanism for your prepared statements, as you would need to keep them somewhere (a list or similar). So now imagine the following:

    for (i = 0; i < 10000; i++)

* kmalloc memory for message
* add xfers to message
* call spi_sync
* kfree xfers
* kfree message

That would mean you create lots of prepared messages filling up kernel memory (possibly allocated with GFP_ATOMIC, as we may run in IRQ context), and only after some time a separate thread would need to run to clean them up. Not to mention that you would have a long list to walk to find the prepared "structure" again, killing lots of time.

So that is why I say that making it explicit in the driver should be a possibility - especially as those prepares would then happen at setup time and not inside an interrupt handler... If you find a solution to the described problems in the framework at some point, then it is probably easier to remove the spi_prepare_message calls from those drivers that already make use of them (or keep them as a hint and make them no-ops...) than to figure out all the issues that such "behind the scenes" optimizations in the framework may introduce.

Martin

^ permalink raw reply [flat|nested] 57+ messages in thread
[parent not found: <3B0EDE3F-3386-4879-8D89-2E4577860073-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>]
* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver [not found] ` <3B0EDE3F-3386-4879-8D89-2E4577860073-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> @ 2013-11-06 23:26 ` Mark Brown [not found] ` <20131106232605.GC2674-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> 0 siblings, 1 reply; 57+ messages in thread From: Mark Brown @ 2013-11-06 23:26 UTC (permalink / raw) To: Martin Sperl Cc: Linus Walleij, linux-spi-u79uwXL29TY76Z2rM5mHXA, Stephen Warren [-- Attachment #1: Type: text/plain, Size: 1491 bytes --] On Wed, Nov 06, 2013 at 08:54:10PM +0100, Martin Sperl wrote: > >> The above is an allowed sequence, that must work - otherwise we would > >> break an existing API. > > This would be broken anyway with existing drivers; if we want to support > > that we need to take copies of both the messages and the transfers but > > really that just looks crazy. > The sequence shown is not broken - see that I have been using spi_sync > not spi_async! (think also of people using this via SPIDEV, where you do > not have such control as you would in kernel space) Oh, if you're using spi_sync() it's fine but then what's the issue? The stack isn't going to remember anything about old messages when they're resubmitted so why would there ever be any problem? Even if there were a problem I can't see how having the core manage things is going to be an issue. > Also see the scenario like this: > * kalloc memory for message > * add xfers to message > * call spi_sync > * kfree xfers > * kfree message > * kalloc memory for message > ** here we may have a high likelihood of getting the same pointer again > * add xfers to message > * call spi_sync > * kfree xfers > * kfree message > so we would need again a means to detect such changes. > The pointers are the same, but the data may be different. Why would anything even notice a problem? Like I say by the time the transfer is returned to the user the stack isn't going to remember anything about it. 
^ permalink raw reply [flat|nested] 57+ messages in thread
[parent not found: <20131106232605.GC2674-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>]
* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver [not found] ` <20131106232605.GC2674-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> @ 2013-11-07 0:43 ` Martin Sperl [not found] ` <72D635F5-4229-4D78-8AA3-1392D5D80127-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> 0 siblings, 1 reply; 57+ messages in thread From: Martin Sperl @ 2013-11-07 0:43 UTC (permalink / raw) To: Mark Brown; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA

Hi Mark!

The problem is that you intend to do prepare/unprepare transparently. OK - but I see problems with heuristics for deciding when it is safe to assume that a message has not been changed since the last call (either the xfers or the tx/rx_buf pointers). If it has not changed, then we can use the cached computations from the previous step (equivalent to prepared). Figuring out whether the message has changed consumes too much CPU, and that is - as far as I understand it - the main reason why we want the option to prepare/unprepare a message: to save CPU cycles.

See my "quick&dirty" modification to __spi_async for how prepare could avoid spending CPU cycles even in "normal" cases - even on HW where preparing would otherwise not help, it avoids running the sanity checks on the message every time. As said: running a single transfer 10000 times/s will run the following code:

    /* Half-duplex links include original MicroWire, and ones with
     * only one data pin like SPI_3WIRE (switches direction) or where
     * either MOSI or MISO is missing. They can also be caused by
     * software limitations.
     */
    if ((master->flags & SPI_MASTER_HALF_DUPLEX)
            || (spi->mode & SPI_3WIRE)) {
        unsigned flags = master->flags;

        list_for_each_entry(xfer, &message->transfers, transfer_list) {
            if (xfer->rx_buf && xfer->tx_buf)
                return -EINVAL;
            if ((flags & SPI_MASTER_NO_TX) && xfer->tx_buf)
                return -EINVAL;
            if ((flags & SPI_MASTER_NO_RX) && xfer->rx_buf)
                return -EINVAL;
        }
    }

    /**
     * Set transfer bits_per_word and max speed as spi device default if
     * it is not set for this transfer.
     */
    list_for_each_entry(xfer, &message->transfers, transfer_list) {
        if (!xfer->bits_per_word)
            xfer->bits_per_word = spi->bits_per_word;
        if (!xfer->speed_hz)
            xfer->speed_hz = spi->max_speed_hz;
        if (master->bits_per_word_mask) {
            /* Only 32 bits fit in the mask */
            if (xfer->bits_per_word > 32)
                return -EINVAL;
            if (!(master->bits_per_word_mask &
                    BIT(xfer->bits_per_word - 1)))
                return -EINVAL;
        }
    }

10000 times/s. With a message "prepared" by the driver instead, the message would get prepared (maybe in this case better called sanity-checked) once, and from then on this piece of code does not run at all, which would reduce CPU load immediately. But if you have to loop over the message to see whether it has changed (by calculating a checksum or comparing against something else), then you have lost this opportunity to optimize the above code away...

Is the described scenario of 10k SPI messages/s realistic? Yes - definitely. The CAN driver for which I started my investigation processes about 3200 CAN messages/second. Each generates an interrupt - so 3200 interrupts, and each sends 3 spi_messages with multiple transfers in sequence (all of which can get, and already are, prepared...). So we are essentially at the example above, with 9600 SPI messages/s and the code running 9600 times unnecessarily. Obviously this is not so important for drivers that run maybe 10 SPI messages/s, but it becomes an important factor at higher volumes.

That is the reason why I am so persistent about this subject and why I am trying to get to a solution that does not require a lot of CPU cycles.

Ciao, Martin

P.s: My guess is that avoiding this "sanity check" alone would reduce the system load for my test case by 2-3% - down from 77% to 74% CPU. (And I am sure I can bring it down further by moving away from the scheduling overhead that is inherent to the "transfer_one_message" worker interface.) The preparing of the DMA structure reduced the system load by about 7-8%. I can try to quantify this number if you really want, by NOT using spi_async but calling

    message->spi = spi;
    message->status = -EINPROGRESS;
    status = spi->master->transfer(spi, message);

directly. But not tonight...

^ permalink raw reply [flat|nested] 57+ messages in thread
[parent not found: <72D635F5-4229-4D78-8AA3-1392D5D80127-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>]
* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver [not found] ` <72D635F5-4229-4D78-8AA3-1392D5D80127-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> @ 2013-11-07 20:31 ` Mark Brown [not found] ` <20131107203127.GB2493-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> 0 siblings, 1 reply; 57+ messages in thread From: Mark Brown @ 2013-11-07 20:31 UTC (permalink / raw) To: Martin Sperl; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA [-- Attachment #1: Type: text/plain, Size: 1347 bytes --] On Thu, Nov 07, 2013 at 01:43:39AM +0100, Martin Sperl wrote: > The problem is that you intend to do prepare/unprepare transparently. Please be more concrete and specific, I'm still really struggling to understand what your concern is. Please do also reply to what's being said rather than deleting the entire mail, it makes it easier to follow the thread of conversation. > OK - but I see problems with heuristics on when it is safe to assume > that a message has not been changed since the last call (either xfers > or tx/rx_buff pointer). If it has not changed then we can use the cached > computations from the previous step ( equivalent to prepared). Why would the SPI core assume that a message just passed into it is a previously submitted message which has not been changed? This seems like something which would add extra complexity to the implementation over the simpler approach of just treating each new message as a new one. Bear in mind that the goals here are to refactor the code so that the DMA mapping is factored out of the drivers (and can be moved so that it runs in parallel with transfers), factor out the handling of /CS and in general the handling of message breaks and avoid all the drivers having to implement the logic required to push transfers from interrupt context where possible. [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 836 bytes --] ^ permalink raw reply [flat|nested] 57+ messages in thread
[parent not found: <20131107203127.GB2493-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>]
* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver [not found] ` <20131107203127.GB2493-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> @ 2013-11-08 14:16 ` Martin Sperl [not found] ` <86AE15B6-05AF-4EFF-8B8F-10806A7C148B-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> 0 siblings, 1 reply; 57+ messages in thread From: Martin Sperl @ 2013-11-08 14:16 UTC (permalink / raw) To: Mark Brown; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA

Hi Mark!

On 07.11.2013, at 21:31, Mark Brown wrote:
> On Thu, Nov 07, 2013 at 01:43:39AM +0100, Martin Sperl wrote:
>
>> The problem is that you intend to do prepare/unprepare transparently.
>
> Please be more concrete and specific, I'm still really struggling to
> understand what your concern is. Please do also reply to what's being
> said rather than deleting the entire mail, it makes it easier to follow
> the thread of conversation.
...
> Bear in mind that the goals here are to refactor the code so that the
> DMA mapping is factored out of the drivers (and can be moved so that it
> runs in parallel with transfers), factor out the handling of /CS and in
> general the handling of message breaks and avoid all the drivers having
> to implement the logic required to push transfers from interrupt context
> where possible.

OK - I believe I am starting to see where our ideas about prepare_message diverge. You want to refactor the message pump mechanism to reduce driver code size, which is commendable (I like the idea of having a "generic" scheduler on interrupt from any GPIO pin, but others come to mind as well). As part of this you want to move drivers away from the transfer_one_message interface, and for that you need the drivers to have prepare and unprepare functions, which are called from the spi_pump (hence the description that those are called from threaded context).
While I am more concerned about the cost of running a prepare EVERY time we schedule a message - and that becomes significant CPU utilization if you look at my report: doing the setup calculation for the DMA of a message with a 2x2 byte (write_then_read) transfer at 8MHz takes about as long as the actual transfer over the bus (if more data gets transmitted, the ratio shifts towards the transfer side).

So my idea was to create something like the SQL prepared-statement API, which allows the driver to say: hey, here is a message whose structure I will not modify - I still retain "ownership" and I am still responsible for releasing it before releasing the memory itself. This API - I would now call it driver_prepared_message to make the distinction to your case - would do the following:

* run its own sanity checks once (the ones in __spi_async)
* run the prepare_message function of the bus driver (we would need to keep some data in the spi_message structure that the driver may assign for future reference: "message->prepared_data = kmalloc(...);")
* mark the message as prepared - say message->is_driver_prepared = 1;

Would this describe your and my "points of view"? (Not in all details, but close?)

If so, the __spi_async call could look like this (the functions are not in the correct order for the compiler not to complain):

    struct spi_message {
        ...
        /* set by spi_prepare_message() */
        unsigned is_driver_prepared:1;
        /* an opaque structure for prepared messages,
         * allocated on spi_prepare_message(),
         * freed in spi_unprepare_message()
         */
        void *prepared_data;
        ...
    };

    static int __spi_async(struct spi_device *spi,
                           struct spi_message *message)
    {
        int ret;

        /* the non driver-prepared path */
        if (!message->is_driver_prepared) {
            /* prepare the message */
            ret = __spi_prepare_message(spi, message);
            if (ret)
                return ret;
        }

        /* setting up some fields in message */
        message->spi = spi;
        message->status = -EINPROGRESS;

        /* here we might also want to have a driver function to schedule
         * the message immediately, say to DMA, but still use the spi_pump
         * worker for some parts of the driver like memory cleanup, ...
         * if (spi->master->schedule_prepared_message_irqcontext)
         *     return spi->master->schedule_prepared_message_irqcontext(
         *         spi, message);
         */

        /* and the "normal" call of transfer, which typically would call
         * the spi message pump.
         */
        return spi->master->transfer(spi, message);
    }

    static int __spi_prepare_message(struct spi_device *spi,
                                     struct spi_message *message)
    {
        struct spi_master *master = spi->master;
        struct spi_transfer *xfer;

        /* the parts from __spi_async that handle message verification */

        /* Half-duplex links include original MicroWire, and ones with
         * only one data pin like SPI_3WIRE (switches direction) or where
         * either MOSI or MISO is missing. They can also be caused by
         * software limitations.
         */
        if ((master->flags & SPI_MASTER_HALF_DUPLEX)
                || (spi->mode & SPI_3WIRE)) {
            unsigned flags = master->flags;

            list_for_each_entry(xfer, &message->transfers, transfer_list) {
                if (xfer->rx_buf && xfer->tx_buf)
                    return -EINVAL;
                if ((flags & SPI_MASTER_NO_TX) && xfer->tx_buf)
                    return -EINVAL;
                if ((flags & SPI_MASTER_NO_RX) && xfer->rx_buf)
                    return -EINVAL;
            }
        }

        /**
         * Set transfer bits_per_word and max speed as spi device default
         * if it is not set for this transfer.
         */
        list_for_each_entry(xfer, &message->transfers, transfer_list) {
            if (!xfer->bits_per_word)
                xfer->bits_per_word = spi->bits_per_word;
            if (!xfer->speed_hz)
                xfer->speed_hz = spi->max_speed_hz;
            if (master->bits_per_word_mask) {
                /* Only 32 bits fit in the mask */
                if (xfer->bits_per_word > 32)
                    return -EINVAL;
                if (!(master->bits_per_word_mask &
                        BIT(xfer->bits_per_word - 1)))
                    return -EINVAL;
            }
        }

        /* and the custom prepare call of the bus driver */
        if (master->prepare_message)
            return master->prepare_message(master, message);

        return 0;
    }

    /*
     * the function that allows the driver to prepare a message
     * for later use without further processing overhead;
     * the driver retains ownership of the memory,
     * but is NOT allowed to make any changes to the structure
     * of the message, in particular not to:
     * * the list of transfers
     * * the length of an individual transfer
     * * bits/speed/...
     * * tx_buf, tx_dma
     * * rx_buf, rx_dma
     * the driver may modify the data inside tx_buf and rx_buf;
     * the calling driver also has the obligation to call
     * spi_unprepare_message prior to releasing the memory of
     * the message structure.
     */
    static int spi_prepare_message(struct spi_device *spi,
                                   struct spi_message *message)
    {
        int ret;

        /* call the internal code */
        ret = __spi_prepare_message(spi, message);

        /* mark the message as prepared if there was no error */
        if (!ret)
            message->is_driver_prepared = 1;

        return ret;
    }

    static int __spi_unprepare_message(struct spi_master *master,
                                       struct spi_message *message)
    {
        int ret = 0;

        /* if the master has an unprepare handler, then call it */
        if (master->unprepare_message)
            ret = master->unprepare_message(master, message);

        /* and clear the memory structures */
        message->is_driver_prepared = 0;
        message->prepared_data = NULL;

        return ret;
    }

    static int spi_unprepare_message(struct spi_device *spi,
                                     struct spi_message *message)
    {
        return __spi_unprepare_message(spi->master, message);
    }

    void spi_finalize_current_message(struct spi_master *master)
    {
        struct spi_message *mesg;
        unsigned long flags;

        spin_lock_irqsave(&master->queue_lock, flags);
        mesg = master->cur_msg;
        master->cur_msg = NULL;
        queue_kthread_work(&master->kworker, &master->pump_messages);
        spin_unlock_irqrestore(&master->queue_lock, flags);

        /* call unprepare in case the message is not driver-prepared */
        if (!mesg->is_driver_prepared)
            spi_unprepare_message(mesg->spi, mesg);

        mesg->state = NULL;
        if (mesg->complete)
            mesg->complete(mesg->context);
    }

There is only one thing to this: as mentioned above, you have described the prototypes for prepare/unprepare_message as getting called from threaded context. That is obviously not (necessarily) true if you handle the prepare in the __spi_async function. But waiting for the spi_pump thread to wake up wastes valuable time on the scheduler - especially if you could submit the message to DMA immediately. And if you get that far, then the DMA driver would also want to run the complete calls from the DMA IRQ handler, to avoid the "scheduling" latencies in the completion path as well...
So I hope I have summarized your ideas, and that you now also have a better understanding of what I want to achieve and where I see opportunities to reduce CPU cost and latencies... I believe we can come to a common solution that fits both sides.

Thanks, Martin

^ permalink raw reply [flat|nested] 57+ messages in thread
[parent not found: <86AE15B6-05AF-4EFF-8B8F-10806A7C148B-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>]
* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver [not found] ` <86AE15B6-05AF-4EFF-8B8F-10806A7C148B-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> @ 2013-11-08 16:19 ` Mark Brown [not found] ` <20131108161957.GP2493-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> 0 siblings, 1 reply; 57+ messages in thread From: Mark Brown @ 2013-11-08 16:19 UTC (permalink / raw) To: Martin Sperl; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA [-- Attachment #1: Type: text/plain, Size: 1456 bytes --] On Fri, Nov 08, 2013 at 03:16:57PM +0100, Martin Sperl wrote: Please fix the formatting in your mails, odd line wrapping makes things harder to read. > While I am more concerned about the cost of running a prepare EVERY time we > schedule a message - and that becomes significant CPU utilization if you > look at my report (doing the setup calculation for the DMA of a message, > with 2x2 Byte (write_then_read) transfer at 8MHz takes about as long as the > actual transfer over the bus. (if more data gets transmitted the ratio > changes to the transfer side) I'd be rather surprised if there's a meaningful benefit from doing such a transfer using DMA in the first place, with most hardware you are better off doing small transfers with PIO and using a completion interrupt for the transfer. Talking to the DMA controller tends to end up being more work than it's worth. Are you sure that you are optimising the right thing here? I'd want to see strong numbers from a real use case showing that the complexity of trying to do this was worth it. > But waiting for the spi_pump thread to wake up is wasting valuable > time waiting for the scheduler - especially if you could submit the > message to DMA immediately. This is the sort of thing we can do sensibly enough once we don't have to open code the logic in individual controller drivers, it doesn't seem terribly connected with any external API changes though? 
^ permalink raw reply [flat|nested] 57+ messages in thread
[parent not found: <20131108161957.GP2493-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>]
* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver [not found] ` <20131108161957.GP2493-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> @ 2013-11-08 17:31 ` Martin Sperl [not found] ` <5F70E708-89B9-4DCF-A31A-E688BAA0E062-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> 0 siblings, 1 reply; 57+ messages in thread From: Martin Sperl @ 2013-11-08 17:31 UTC (permalink / raw) To: Mark Brown; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA

Hi Mark!

On 08.11.2013, at 17:19, Mark Brown wrote:
>
> I'd want to see strong numbers from a real use case showing that the
> complexity of trying to do this was worth it.
>

I remember having shared all sorts of values in my earlier posts regarding absolute measurements:

* CPU utilization to receive 3000 CAN messages/s
* the latency perspective (interrupt to SPI message)
* the time spent "preparing" a message

If you go back to my mail of the 4th of November you will find this:

On 04.11.2013, at 18:33, Martin Sperl wrote:
>
> So here a link to an example showing how much prepared spi_messages
> really can improve the SPI throughput - even without changing anything
> (besides) preparing the messages for direct DMA use.
>
> http://www.raspberrypi.org/phpBB3/viewtopic.php?f=44&t=19489&p=448328#p448328
> The link goes to the RPI forum and includes the measurements plus some images
> showing both cases (configured via a module parameter for now)
>
> Quick summary:
> For my test use-case, just enabling "prepared messages" reduced the
> time for a "simple" transfer from 390us (without prepare) to 230us (with
> prepare) - and the driver is still using the threaded message pump.
>
> OK - the "default" spi-bcm2835.c driver currently in mainline takes 245us
> for the same thing, but it runs at 130k interrupts/s and 95% system load,
> which is quite impractical if you want to do anything else.
> So getting this down to 80% system load plus shorter responses is already
> quite an improvement.
I included a lot more "hard" data in my previous emails - if you want to go back to those... The above link also includes images taken with the logic analyzer comparing the prepared and unprepared code-base (by simply enabling/disabling prepare). It also does not yet make use of "DMA chaining", for which I need to move back to the transfer interface to get the most out of DMA...

Is this "strong" enough data for a "real" (worst) case?

And those 3200 messages/s on the CAN bus are not really the worst case, as the message is 8 bytes with an extended ID at 500kHz. It is more like the "best" worst case, as such a message takes between 258 and 316us (depending on bit-stuffing). This translates to 3164 to 3875 CAN messages per second. But if I were just using 11-bit IDs and 0 data bytes, a single message would take 94 to 111us, which translates to between 9009 and 10638 CAN messages/s. And then you can double those values again if you go to 1MHz CAN bus speed.

Also, you can easily trigger such a situation if you:
* have a device that sends a message (and its controller is set to resend the message if there is no recipient, which would be typical)
* have the CAN controller on the RPI configured for "listen only" mode (which means it does not acknowledge messages - a prerequisite)
* connect those two devices together with no other device on the bus.

Just start up the "sending" device and you get 80% system CPU with the DMA driver (95% with the PIO driver, plus packet loss)! The effect is similar to a packet storm on a network with an older network card whose driver cannot switch from "interrupt" to "polling" mode... The only difference to normal traffic is that the bus would not be saturated 100% of the time over any measured interval - you would see idle gaps.

Need more evidence? If so please tell me what you would like to see...
Ciao, Martin

P.S.: And I cannot understand why I can read a saturated 1MHz CAN bus full of 0-byte messages on a simple 8-bit AVR at 16MHz with 4KB of memory... (with the same CAN controller). And that microcontroller still has time to write the whole stream to an SD card (with all the non-deterministic latencies that an SD card introduces - but without a filesystem, I have to admit, so similar to "dd of=/dev/sdd"). So that is why I am trying to optimize the Linux driver as well: to get "better" performance and still have some cycles left for doing the "real" work...
^ permalink raw reply [flat|nested] 57+ messages in thread
[parent not found: <5F70E708-89B9-4DCF-A31A-E688BAA0E062-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>]
* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver [not found] ` <5F70E708-89B9-4DCF-A31A-E688BAA0E062-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> @ 2013-11-08 18:09 ` Mark Brown [not found] ` <20131108180934.GQ2493-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> 0 siblings, 1 reply; 57+ messages in thread From: Mark Brown @ 2013-11-08 18:09 UTC (permalink / raw) To: Martin Sperl; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA [-- Attachment #1: Type: text/plain, Size: 995 bytes --]

On Fri, Nov 08, 2013 at 06:31:37PM +0100, Martin Sperl wrote:
> On 08.11.2013, at 17:19, Mark Brown wrote:
> > I'd want to see strong numbers from a real use case showing that the
> > complexity of trying to do this was worth it.
> I remember having shared all sorts of values in my earlier posts
> regarding absolute measurements:
> * CPU utilization required to receive 3000 CAN messages/s
> * latency (interrupt to SPI message)
> * time spent "preparing" a message.

This sounds like an artificial benchmark attempting to saturate the bus rather than a real world use case, as does everything else you mention, and the contributions of the individual changes aren't broken down so it isn't clear what specifically the API change delivers.

I'd like to see both a practical use case and specific analysis showing that changing the API is delivering a benefit as opposed to the parts which can be done by improving the implementation of the current API.

[-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 836 bytes --] ^ permalink raw reply [flat|nested] 57+ messages in thread
[parent not found: <20131108180934.GQ2493-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>]
* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver [not found] ` <20131108180934.GQ2493-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> @ 2013-11-08 19:18 ` Martin Sperl [not found] ` <C375DEE6-1AEC-4AFB-A9D6-583DCB4476A3-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> 0 siblings, 1 reply; 57+ messages in thread From: Martin Sperl @ 2013-11-08 19:18 UTC (permalink / raw) To: Mark Brown; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA

On 08.11.2013, at 19:09, Mark Brown wrote:
> This sounds like an artificial benchmark attempting to saturate the bus
> rather than a real world use case, as does everything else you mention,
> and the contributions of the individual changes aren't broken down so it
> isn't clear what specifically the API change delivers.

As explained - it is a reasonable use case that you can easily trigger. For example: updating the firmware of a 128KB Flash via the CAN bus requires about 22528 8-byte CAN packets to transfer just the data and some signaling. With 3200 CAN messages/s as the upper limit for these 8-byte messages, this requires 7.04 seconds to transfer all the Flash data. As CAN is a broadcast medium, every node on the bus will see this data rate - even if it is NOT involved in the actual communication. So if you are just listening while this happens between 2 other devices, then you would still run into a performance bottleneck while the Flash is getting written.
And if you cannot keep up with that rate (packet loss, ...) then you might be missing out on other, more important messages that are directed at you. OK, there would be gaps every 44 packets while a flash page gets written. But even then, other devices that were blocked will send their messages during those gaps while the firmware update is idle. So with more nodes, under such a situation the bus very likely becomes saturated for 10 seconds. So it is IMO a realistic load simulation to take the "automatic re-broadcast" as a repeatable scenario.

> I'd like to see both a practical use case and specific analysis showing
> that changing the API is delivering a benefit as opposed to the parts
> which can be done by improving the implementation of the current API.

I have already shared this at some point, and it also shows in the forum:

Without prepare I see:
* 14.6k interrupts
* 17.2k context switches
* 88% CPU system load

While with prepare I see:
* 29.2k interrupts
* 34.5k context switches
* 80% CPU system load

The reason why we see more interrupts is that without the "prepared" messages we are close to packet loss, as 44% of all packets are fetched from the second buffer of the CAN controller - if this reaches 50% then we start to have packet loss... So we get 1 controller interrupt for 2 messages, and that triggers some more interrupts for other parts. The "prepared" case has only 4% of all packets coming from the 2nd buffer - that is why the total number of interrupts has doubled: the number of interrupts from the GPIO device has almost doubled.

How did I measure?
The difference is a module parameter to the SPI bus driver. From the code perspective it basically looks like this:

static bool allow_prepared = 1;
module_param(allow_prepared, bool, 0);
MODULE_PARM_DESC(allow_prepared,
	"Run the driver with spi_prepare_message support");

int bcm2835dma_spi_prepare_message(struct spi_device *spi,
				struct spi_message *message)
{
	struct bcm2835dma_prepared_message *prep;
	struct bcm2835dma_dma_cb *cb;
	int status = 0;
	/* return immediately if we are not supposed to support
	 * prepared spi_messages */
	if (!allow_prepared)
		return 0;
	...
}

and the above-mentioned values have been measured like this:

cat /proc/net/dev; vmstat 10 6; cat /proc/net/dev

Does this answer your question and convince you that this is realistic?

Also, my next work is moving to DMA-scheduling multiple messages via "transfer". This should bring down the CPU utilization even further, and it should also decrease the context switches as the spi_pump thread goes out of the picture... (and that will probably decrease the number of overall interrupts as well...) How far I can optimize this way is a good question, and then we can have a look at how much "penalty" the move to DMA engine will produce. (I have just seen that someone has started to post an initial DMA engine driver for the RPI...)

Thanks, Martin
^ permalink raw reply [flat|nested] 57+ messages in thread
[parent not found: <C375DEE6-1AEC-4AFB-A9D6-583DCB4476A3-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>]
* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver [not found] ` <C375DEE6-1AEC-4AFB-A9D6-583DCB4476A3-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> @ 2013-11-09 18:30 ` Mark Brown [not found] ` <20131109183056.GU2493-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> 0 siblings, 1 reply; 57+ messages in thread From: Mark Brown @ 2013-11-09 18:30 UTC (permalink / raw) To: Martin Sperl; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA [-- Attachment #1: Type: text/plain, Size: 3337 bytes --] On Fri, Nov 08, 2013 at 08:18:25PM +0100, Martin Sperl wrote: > On 08.11.2013, at 19:09, Mark Brown wrote: > > On Fri, Nov 08, 2013 at 06:31:37PM +0100, Martin Sperl wrote: > > This sounds like an artificial benchmark attempting to saturate the bus > > rather than a real world use case, as does everything else you mention, > > and the contributions of the individual changes aren't broken down so it > > isn't clear what specifically the API change delivers. > As explained - it is a reasonable use-case that you can easily trigger. > For example: updating the firmware of a 128KB Flash via the CAN bus requires > about: 22528 8 byte can packets to transfer just the data and some signaling. > With 3200 CAN-messages/s as the upper limit for these 8 byte messages > this requires 7.04seconds to transfer all the Flash data. You mentioned that systems could be constructed but you didn't mention examples of doing that; in general prolonged bus saturation tends to be something that people avoid when designing systems. > OK, there would be gaps every 44 packets while a flash page gets written. > But even then, at that time other devices that are blocked, will send their > messages as the Firmware update is idle. So with more nodes under such a > situation the bus becomes very likely saturated for 10 seconds. > So it is IMO a realistic load simulation to take the "automatic re-broadcast" > as a repeatable scenario. 
What I'm missing with this example is the benefit of having an API for pre-cooked messages: how will it deliver a performance improvement? Flashing should be an infrequent operation and I'd expect it to involve little reuse of messages, which I'd expect to be the main thing that we could gain from the API change. I'd also not expect the controller-specific work to be especially expensive.

> > I'd like to see both a practical use case and specific analysis showing
> > that changing the API is delivering a benefit as opposed to the parts
> > which can be done by improving the implementation of the current API.
> I have already shared at some point and also it shows in the forum:

You've been doing a bunch of other performance improvements as well; it's not clear to me how much of this is coming from the API change and how much of this is due to changes which can also be done by improving the implementation and without requiring drivers to be specifically updated to take advantage of it.

> Does this answer your question and convince you of this being realistic?

It's still not clear to me exactly how this works and hence if the benefit comes from the API change itself.

> Also my next work is moving to DMA scheduling multiple messages via "transfer".
> This should bring down the CPU utilization even further and it should also
> decrease the context switches as the spi_pump thread goes out of the picture...
> (and that will probably decrease the number of overall interrupts as well...)

Right, and simply driving transfers from interrupt rather than task context probably gets an awfully long way there. This is the sort of improvement which will benefit all users of the API - the main reason I'm so cautious about changing the API is that I don't want to make it more complex to implement this sort of improvement.
[parent not found: <20131109183056.GU2493-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>]
* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver [not found] ` <20131109183056.GU2493-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> @ 2013-11-10 10:54 ` Martin Sperl [not found] ` <6C7903B3-8563-490E-AD7D-BA5D65FFB9BC-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> 2013-11-10 11:05 ` Mark Brown 1 sibling, 1 reply; 57+ messages in thread From: Martin Sperl @ 2013-11-10 10:54 UTC (permalink / raw) To: Mark Brown; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA

Hi Mark! You seem to be very hard to convince!

On 09.11.2013, at 19:30, Mark Brown wrote:
> You mentioned that systems could be constructed but you didn't mention
> examples of doing that; in general prolonged bus saturation tends to be
> something that people avoid when designing systems.

True, but CAN can in principle work saturated when only 2 nodes talk to each other and the others are quiet. Collision detection is also part of the protocol, and the node winning arbitration can continue transmitting its message without an abort. And in the firmware upgrade case the CAN ID is typically configured with the most recessive bits (so the highest possible CAN ID), so that it will lose any arbitration, allowing other nodes to send their data with higher priority and thus effectively slowing down the firmware upgrade without impacting other devices. So a saturated bus is not "ideal", but it is possible and would not impact other devices.

Also note that some CAN controllers, like the mcp2515, have HW message filters that can be configured to limit system load. But these are typically limited to a certain set of filter patterns (2 masks and 2-3 bit patterns per mask), so for a "central" controller they do not seem to be a good match: you would run out of filters, and a central CPU of a vehicle will want to get all the data on which it may base its decisions - the limited number of filters does not scale for this.

Also one can extrapolate from 100% CAN bus utilization at 90% system load
to 50% bus utilization at 45% system load, or 25% bus utilization at 22.5% system load. And even 22.5% CPU load is already too much in my opinion, when you can get it down further with a simple API extension and a driver optimized for speed.

Note also that the message-processing overhead is in principle (mostly) dependent just on the number of messages per interval, while the possible message rate on the CAN bus depends primarily on the length of the messages. So with those 2 receive buffers in HW, I have the following time to react to a "two buffers full" situation and avoid an overflow:
* 29-bit IDs and 8 bytes/frame at 500kHz: 774us to 948us (depending on the data bytes, as there is bit-stuffing after 5 identical bits in sequence)
* 11-bit IDs and 0 bytes/frame at 500kHz: 282us to 333us

These show the distinct best and worst cases from a bit-length perspective. So if you have 3 distinct nodes that all want to send the worst case (a single frame of 0 bytes length, what would typically be called an RTR frame) within a window of 188us, you can easily trigger the situation where the RPI cannot keep up with the request rate and will drop the packet with the lowest priority - and that is just 3 distinct packets.

Actually I have a simple case where a single node sends a sequence of 5 CAN messages of 8 bytes size and then stays quiet for 10 seconds (so an effective bus utilization of 0.015%), where the "stock" in-kernel drivers (mcp251x and spi-bcm2835) already run into buffer overruns and in 20% of all cases only report 3 out of those 5 frames. (Not to mention that the in-tree mcp251x driver stops working after such a situation because of a race condition due to edge interrupts getting "reactivated" too late, after the INT line has already gone again.
This happens on a Foundation-based kernel somewhere between versions 3.2 and 3.6, with 3.6 showing this behaviour explicitly.) So you see that "bus saturated" is just a "prolonged" version of my 5-frame scenario, which I can easily reproduce in real life - and I doubt you can say that this scenario counts as "prolonged bus saturation".

> What I'm missing with this example is the benefit of having an API for
> pre-cooked messages, how it will deliver a performance improvement?
> Flashing should be an infrequent operation and I'd expect it to involve
> little reuse of messages which I'd expect to be the main thing that we
> could gain from the API change. I'd also not expect the controller
> specific work to be especially expensive.

It will deliver a performance improvement by not having to create identical structures or run sanity tests every time you transmit the prepared message. And for the DMA chains I have to create chains of length 10-20, depending on the number of transfers. So just running this piece of code 3000 times per second for minutes or hours is overhead that we could definitely save - IMO it is a low-hanging fruit.

This does not mean that every driver necessarily has to implement it - the first candidates are the ones that need very high SPI message throughput and/or low scheduling latencies. We can even look into optimizing spi_write, spi_read and spi_write_then_read for the "hot" path by spending a little more memory for "typical" transfer sizes. Then the "generic" drivers would also require fewer CPU resources - even if the bus driver does not implement prepare_message itself, the "sanity" checks in __spi_async would not get executed every time. I would say any little bit less CPU utilization helps if it is easy to implement.

>> Also my next work is moving to DMA scheduling multiple messages via "transfer".
>> This should bring down the CPU utilization even further and it should also
>> decrease the context switches as the spi_pump thread goes out of the picture...
>> (and that will probably decrease the number of overall interrupts as well...)
>
> Right, and simply driving transfers from interrupt rather than task
> context probably gets an awfully long way there. This is the sort of
> improvement which will benefit all users of the API - the main reason
> I'm so cautious about changing the API is that I don't want to make it
> more complex to implement this sort of improvement.

Also, as an example: if you wanted the framework to provide "generic" GPIO interrupt handling that sends a (prepared) message without the driver having to write the code itself, then the framework could already run the preparation during the "registration" call of the message to send on a triggered interrupt - so the drivers would not even have to care about that!

>>> I'd like to see both a practical use case and specific analysis showing
>>> that changing the API is delivering a benefit as opposed to the parts
>>> which can be done by improving the implementation of the current API.
>
>> I have already shared at some point and also it shows in the forum:
>
> You've been doing a bunch of other performance improvements as well,
> it's not clear to me how much of this is coming from the API change and
> how much of this is due to changes which can also be done by improving
> the implementation and without requiring drivers to be specifically
> updated to take advantage of it.

I have told you that I created a module parameter which enables/disables the prepare code. Then I loaded the modules, configured the CAN speeds, ..., and ran "vmstat 10 6"; then I removed the modules. And then I ran the same again with the parameter set to disable prepare. And we get the different values...
88% system versus 80% system (and the 80% is actually still too high, as the system got so much faster that it does not need to handle 2 packets at a time and can handle each packet individually). Nothing else changed besides this enable/disable parameter.

>> Does this answer your question and convince you of this being realistic?
>
> It's still not clear to me exactly how this works and hence if the
> benefit comes from the API change itself.

There is no change in the API - there is only an optional extension... If you are not convinced yet, then here is a proposal for measurements I can try to gather to hopefully convince you: I would implement the previously posted patches to modify the spi_async call to allow us to avoid running the "sanity checks" in the __spi_async function.

Here are the bus drivers I would use for testing:
* spi-bcm2708 (from the Raspbian kernel, using the transfer interface)
* spi-bcm2835 (from the "stock" kernel, using the transfer_one_message interface with a patch to allow real-time priority in the message pump, enabled via a module parameter)
  * run with and without RT priority
* spi-bcm2835dma with the message-pump interface
  * run with and without RT priority (module parameter as well)
* spi-bcm2835dma with the transfer interface, implementing the message queue in DMA itself without any processing

And 2 device drivers:
* A (hypothetical) ADC read driver, where we want to achieve the highest possible throughput. I would implement this as 2 messages that get scheduled in sequence, both of them having a callback that immediately reschedules the message. Why not work with a single message, you may ask? Well, with a single message and a callback scheduling the message again, the "limiting" factor becomes the processing of the callback and the subsequent resubmission, while with the 2-message approach, when one message is finished the other message can get executed while the callback of the first message is still running.
This makes a difference for the pipelined DMA driver, as we can leverage the DMA/ARM concurrency. (To work effectively it might in the end require more than 2 messages, as there might still be a situation where the DMA driver would stall because there is no message in the pipeline.)

Important metrics are:
* higher means better:
  * spi-message throughput
* lower means better:
  * system CPU%
  * interrupts
  * context switches

Maybe also a version of the above that would mimic reading a max187 ADC, which requires the clock to stay idle for 8.5us after CS goes down while the conversion runs - only then is the data valid and may the clock start for the transfer.

* the optimized mcp2515 driver that I have been writing, running in loopback mode connected to a CAN bus at 500kHz that is 100% saturated due to retransmits of unacknowledged CAN messages.

Important metrics are:
* higher means better:
  * total number of messages processed (there is a top limit)
* lower means better:
  * ratio of messages processed from the second versus the first HW buffer
  * number of buffer overruns resulting in packet loss
  * number of overruns not even detected due to latencies that miss a second frame
  * system CPU%
  * interrupts
  * context switches

Both drivers would have a module parameter to enable/disable the spi_prepare_message(spi, message) calls during their initial setup.

Would these be good enough scenarios to finally convince you, if the numbers are favorable?

Martin

P.S.: You know, it would be a LOT of work to get such values!
^ permalink raw reply [flat|nested] 57+ messages in thread
[parent not found: <6C7903B3-8563-490E-AD7D-BA5D65FFB9BC-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>]
* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver [not found] ` <6C7903B3-8563-490E-AD7D-BA5D65FFB9BC-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> @ 2013-11-12 1:19 ` Mark Brown [not found] ` <20131112011954.GH2674-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> 0 siblings, 1 reply; 57+ messages in thread From: Mark Brown @ 2013-11-12 1:19 UTC (permalink / raw) To: Martin Sperl; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA [-- Attachment #1: Type: text/plain, Size: 5439 bytes --] On Sun, Nov 10, 2013 at 11:54:16AM +0100, Martin Sperl wrote: > On 09.11.2013, at 19:30, Mark Brown wrote: > > What I'm missing with this example is the benefit of having an API for > > pre-cooked messages, how it will deliver a performance improvement? > > Flashing should be an infrequent operation and I'd expect it to involve > > little reuse of messages which I'd expect to be the main thing that we > > could gain from the API change. I'd also not expect the controller > > specific work to be especially expensive. > It will deliver a performance improvement by not having to create identical > structures or running sanity thest every time you transmit the prepared message. But is that a 0.01% improvement or a 50% improvement and is that because we're pushing a silly implementation into a corner or because there's actually substantial work that needs doing there? If the error checking and so on is really that expensive this sounds like we need to fix that anyway, even if it's just quick things like making some of the validation debug only or pushing some of the work out to the message update functions. The messages and transfers ought to be totally reusable already, the only bit that should need to be recreated is any data the driver creates internally, most likely that's only going to be data for DMA. We do already have an interface for drivers to do the DMA mapping though it's not very well loved at the minute... 
> And for the DMA chains I have to create DMA chains that are of a length 10-20
> - depending on the number of transfers.

What's the expense here - the mapping or something else? This does also seem like a lot of scatter/gather, why is the data so spread out and is there room to do better in the client driver? We ought to modernise this interface to be more in tune with what dmaengine is doing anyway (probably taking scatterlists for example) - you were also talking about DMA a lot in some of your earlier messages and it is where I'd expect the cost to be, so I'm wondering if this isn't actually more to do with this interface being poor and if we don't need dmaengine changes too to realise the benefits.

> I would say any little less CPU utilization helps if it is easy to implement.

I'm not clear on the easy to implement bit here (though obviously there haven't been patches posted for review yet), especially the interaction with DMA if that's a part of this, or what exactly you're expecting to be allowed to happen to the message while it's been partially initialised by the driver (I wouldn't use the term prepared for what you're suggesting by the way, I'd expect preparation to be interacting with the hardware) since that will affect how widely clients could take advantage of this.

The other bit of easy to implement that concerns me is making sure that it's easy to implement for the stack as a whole - one of the big problems with the existing stack is that there is too little factoring out of code into the core, so we end up with individual drivers implementing good ideas that are widely useful and most drivers being basic, so drivers are more complex to review and the benefits are not widely distributed. This makes me worry about an API that just delegates everything to the driver and which requires both master and client drivers to specifically support the feature.
> > You've been doing a bunch of other performance improvements as well, > > it's not clear to me how much of this is coming from the API change and > > how much of this is due to changes which can also be done by improving > > the implementation and without requiring drivers to be specifically > > updated to take advantage of it. > I have told you, that I have created a module parameter which enables/disables > the prepare code. Right, and what I'm saying is that I'm not clear what's being done in that code and hence why moving it makes a difference. > >> Does this answer your question and convince you of this being realistic? > > It's still not clear to me exactly how this works and hence if the > > benefit comes from the API change itself. > There is no change in API - there is only an optional extension... Adding new interfaces is a change too, on the driver side as well. > Would these be good enough scenarios for you to finally convince you if the > numbers are favorable? These seem rather artificial and I'm not sure they'd tell us much about why the effects were being observed. What would be more interesting would be seeing where the time is being spent at the minute for real loads (for example by using trace points like those that are currently being added in mainline for examining the behaviour of the message pump) - if we really are spending a noticeable proportion of the time in setup then that's obviously a good case for optimising it but right now I don't feel I understand which bits are slow enough to register in the context of the overall transaction and hence if the suggested approach is the best way to optimise them. An implementation that avoided the need to have the master drivers updated in order to get benefits would be a lot more interesting since that would make this much more generally useful, and of course concrete patches would be good too, but the main thing I'm missing is clarity on how the gains are being made. 
[-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 836 bytes --] ^ permalink raw reply [flat|nested] 57+ messages in thread
[parent not found: <20131112011954.GH2674-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>]
* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver [not found] ` <20131112011954.GH2674-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> @ 2013-11-12 14:42 ` Martin Sperl [not found] ` <52823E73.503-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> 0 siblings, 1 reply; 57+ messages in thread From: Martin Sperl @ 2013-11-12 14:42 UTC (permalink / raw) To: Mark Brown; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA

Hi Mark!

>> It will deliver a performance improvement by not having to create identical
>> structures or run sanity tests every time you transmit the prepared message.
>
> But is that a 0.01% improvement or a 50% improvement and is that because
> we're pushing a silly implementation into a corner or because there's
> actually substantial work that needs doing there?

Well - in the case of my DMA driver it is an 8% improvement, so from 88% system load down to 80% system load. (And the current code is still walking some lists to find a match - to avoid changing spi.h for now.) And that is with the transfer_one_message approach.

I will try to find some time to patch spi.c so that it toggles the CS lines while it is doing the "verification" code. Then I can look with my logic analyzer to see how many ns it takes (in 64ns step granularity). And then I will take the stock spi-bcm2835 driver from upstream and the enc28j60 network driver, also from upstream, and report on those numbers. Note that - as far as I have seen - the enc28j60 driver mostly uses write_then_read for its SPI transactions.

> If the error checking and so on is really that expensive this sounds
> like we need to fix that anyway, even if it's just quick things like
> making some of the validation debug only or pushing some of the work
> out to the message update functions.
> The messages and transfers ought to be totally reusable already, the
> only bit that should need to be recreated is any data the driver creates
> internally, most likely that's only going to be data for DMA. We do
> already have an interface for drivers to do the DMA mapping though it's
> not very well loved at the minute...

Well - that is what I have been asking for: a means to shortcut the
computation, and a set of 2 fields (one flag and one void pointer), both
of which are defined as fully owned by the bus driver - unlike state and
queue, which are ONLY in the ownership of the driver/framework between
calls to spi_async() and complete().

What DMA interface are you talking about? At least in spi.h in 3.10 I
find nothing that would indicate an extra DMA interface...

>> And for the DMA chains I have to create DMA chains that are of a length 10-20
>> - depending on the number of transfers.
>
> What's the expense here - the mapping or something else? This does also
> seem like a lot of scatter/gather, why is the data so spread out and is
> there room to do better in the client driver?

Well - that is a bit long to answer, so here a short description first:
the driver runs ALL configuration of the SPI registers via DMA as well.
The main reason for this is that I want to be able to pipeline the DMAs
without any interaction from the CPU - besides the scheduling/chaining
to the existing DMA and, in case the message defines a complete()
callback, triggering an interrupt for that portion. (With the driver
still using the spi_message_pump and transfer_one_message, this means
that as long as you just schedule spi_async messages, the DMA is doing
everything on its own.)

So (with some extension to the SPI API to allow for cyclic messages) you
may even run an LCD display that gets updated 25 times per second,
streaming from two frame buffers to the display - and that without any
CPU interaction at all (besides the complete() if you need to get
notified of the buffer "swap").
But that is an idea for the future...

So here the long description of what we need to do for, say, a message
that implements write_then_read (2 bytes write, 10 bytes read), showing
the steps on the RX and TX DMA channels:

* configure CDIV register (= SPI clock divider)
* reset RX/TX FIFOs
* configure number of bytes in TX/RX DMA (2)
* configure control register to start SPI
* configure next control block for TX-DMA and trigger start of the
  TX-DMA control block
* TX-DMA: transfer 2 bytes to SPI FIFO
* RX-DMA: transfer 2 bytes from SPI FIFO (with DST_IGNORE)
* if delay_usec is set: configure some delay cycles in DMA (transfer X
  bytes with DMA_IDLE_CYCLES=31 and SRC_IGNORE and DST_IGNORE)
* if the transfer would have been a multiple of 4 and cs_change is NOT
  set and there is no change in CDIV, we would skip the following steps:
* if cs_change is set for the transfer:
  * configure control register to bring CS up
  * configure some delay so that CS is high for about half an SPI clock
    cycle (otherwise devices might not recognize it)

Now the second transfer:

* if CDIV has changed: configure CDIV register (= SPI clock divider)
* reset RX/TX FIFOs
* configure number of bytes in TX/RX DMA (10)
* configure control register to start SPI
* configure next control block for TX-DMA and trigger start of the
  TX-DMA control block
* (here we continue in case the first transfer was a multiple of 4)
* TX-DMA: transfer 10 bytes to SPI FIFO (with SRC_IGNORE, so 0)
* RX-DMA: transfer 10 bytes from SPI FIFO
* if we need to trigger a DMA interrupt (because of a callback()):
  * configure next control block for TX-DMA and trigger start of the
    TX-DMA control block
  * transfer the information about which message we have finished and
    trigger the DMA interrupt

So this includes "all" options that we currently have implemented in SPI
(cs_change, speed_hz, delay_usec) and which the device supports. And as
mentioned in the "short" description, the next message can get chained
immediately to the last control block of the RX-DMA chain. And all that
with minimal delay and CPU overhead (as soon as the chain has been set
up).
This "chaining" is _obviously_ not possible with the
transfer_one_message bus-master API (or it would require some "ugly
hacks" to make use of the transfer_one_message API).

The code right now does not make use of dmaengine, but is using dmapool
for memory management of the control blocks plus extra management data.
And this is taking some time already without even going thru the
dmaengine layer, which possibly would mean two allocations instead of a
single one. And that setup cost is what I would like to avoid paying
every time we call spi_async for a spi_message that has not changed.

> We ought to modernise this interface to be more in tune with what
> dmaengine is doing anyway (probably taking scatterlists for example) -
> you were also talking about DMA a lot in some of your earlier messages
> and it is where I'd expect the cost to be so I'm wondering if this isn't
> actually more to do with this interface being poor and if we don't need
> dmaengine changes too to realise the benefits.

You might look at my example above (and at some point at the driver code
itself - when it becomes stable - unless you want to peek at GitHub...)
to see how we may generalize it. And at least in the ARM arena it should
be possible to apply the same principle to other SPI controllers+DMA as
well. But you might need to work around some HW bugs that differ from
the 4 that I have found in the bcm2835 SOC (DMA interrupt only on chain
end; CLEAR-FIFO-TX will bring CS high in 5% of all cases; CLEAR-FIFO-XX
plus starting a transfer will trigger a bug where SCK sometimes does not
change while MOSI does; ADCS needs to be set for "the described" way to
do SPI-DMA on the SOC - which is totally different from what I do to get
all features into the DMA chain), so other "scheduling plans" might be
needed.

So I would not try to come up with a generic approach when we have a
single driver that can do that right now.
Well - as said, I am wondering if the DMA engine can really help -
especially if you want to keep the "prepare" code fast. I am using
dmapool for memory alloc/free, but I fear we would need to run 2 mallocs
if I was using dmaengine, with more setup/tear-down overhead, as well as
more code that needs to get executed to go via dmaengine. As said: I
will keep it as is for now and first implement the full DMA chaining
approach. Then we can see how much the move to dmaengine is costing...

> I'm not clear on the easy to implement bit here (though obviously there
> haven't been patches posted for review yet), especially the interaction
> with DMA if that's a part of this, or what exactly you're expecting to
> be allowed to happen to the message while it's been partially initialised
> by the driver (I wouldn't use the term prepared for what you're
> suggesting by the way, I'd expect preparation to be interacting with the
> hardware) since that will affect how widely clients could take advantage
> of this.

Well, my naming comes from the analogy of prepared SQL statements, where
the database takes an SQL string once and precompiles an execution plan
for it. The big difference is that comparing 2 SQL strings is cheaper
than comparing 2 memory structures, and DBs typically run on much faster
HW than an ARM at 700MHz, so the prepared SQL is not so important any
more. It is still kept as a hint to the execution planner (and it also
reduces the bytes to transfer over the network...).

As for "unmodified" drivers taking advantage of this: I believe that any
driver that only uses the sync interfaces spi_read, spi_write and
spi_write_then_read can get improved (at the cost of memory) by having
optimized versions of those functions that keep the structures in a
cache for the "hot" path. But then all those drivers that make use of
the sync interface are not geared towards throughput in the first place,
as they are blocking, so they are not the prime "candidates" for the
improvement anyway.
If they want to run with minimal latencies, they HAVE to run spi_async.
Anything else is not very efficient, as it requires several context
switches by way of the current design.

> The other bit of easy to implement that concerns me is making sure that
> it's easy to implement for the stack as a whole - one of the big
> problems with the existing stack is that there is too little factoring
> out of code into the core so we end up with individual drivers
> implementing good ideas that are widely useful and most drivers being
> basic so drivers are more complex to review and the benefits are not
> widely distributed. This makes me worry about an API that just
> delegates everything to the driver and which requires both master and
> client drivers to specifically support the feature.

If you can get away from the idea that prepare is ONLY related to the
master, then it may provide additional "advantages". Also, when you do
have a non-prepared message, you can still call the prepare code from
the framework and afterwards unprepare it - like I had proposed. That
way, any "checks" the driver would need to do to see if it can handle
the message can happen immediately in the "prepare" code.

So essentially start to think of "prepare_message" as: do any
verification on the message that is needed to set it up correctly and
find unsupported situations, as well as optimize the message into
whatever the driver may find helpful for its own purpose. One candidate
here could also be to implement dma_map_single in the prepare call...
Then the change would look like this:

static int __spi_prepare_message(struct spi_device *spi,
				 struct spi_message *mesg)
{
	/* sanity check message and copy the fields - see __spi_async */
	/* call master prepare if it exists */
	if (spi->master->prepare_message)
		return spi->master->prepare_message(spi->master, mesg);
	return 0;
}

static int spi_prepare_message(struct spi_device *spi,
			       struct spi_message *mesg)
{
	int status = __spi_prepare_message(spi, mesg);
	if (!status)
		mesg->is_prepared = 1;
	return status;
}

static int __spi_unprepare_message(struct spi_device *spi,
				   struct spi_message *mesg)
{
	/* call master unprepare if it exists */
	if (spi->master->unprepare_message)
		return spi->master->unprepare_message(spi->master, mesg);
	return 0;
}

static int spi_unprepare_message(struct spi_device *spi,
				 struct spi_message *mesg)
{
	mesg->is_prepared = 0;
	return __spi_unprepare_message(spi, mesg);
}

static int __spi_async(struct spi_device *spi, struct spi_message *message)
{
	int status;

	/* prepare the message using the internal prepare */
	if (!message->is_prepared) {
		status = __spi_prepare_message(spi, message);
		if (status)
			return status;
	}
	message->spi = spi;
	message->status = -EINPROGRESS;
	return spi->master->transfer(spi, message);
}

And this is what I would envision as the total change you would have to
do... (besides the 2 fields as mentioned...)
So now if you want to start refactoring the transfer_one into a generic
version - as far as I know it could get reduced to:

static int spi_transfer_one(struct spi_master *master,
			    struct spi_message *message)
{
	int status;

	/* schedule the message in the controller */
	status = master->schedule_message(master, message);
	if (status)
		goto error;

	/* wait for completion */
	if (wait_for_completion_timeout(
			&bs->done, /* wherever you put complete */
			msecs_to_jiffies(SPI_TIMEOUT_MS)) == 0)
		status = -ETIMEDOUT;

error:
	if (!message->is_prepared)
		__spi_unprepare_message(message->spi, message);
	spi_finalize_current_message(master);
	return status;
}

With this the bcm2835_spi_transfer_one function could (mostly) get
reduced to bcm2835_spi_start_transfer (with some of the additional
xfer scheduling moved to the interrupt handler). So separating things
out immediately reduces the code of the individual drivers if they move
to the newer schedule_message interface... and if they do not need
prepare/unprepare, then it is (almost) a NOP.

> Right, and what I'm saying is that I'm not clear what's being done in
> that code and hence why moving it makes a difference.

Well - the calculation of the above DMA chain takes a few usecs, and
with the code "above" plus the prepare in the driver we avoid those few
usecs that would otherwise get run every time.

> Adding new interfaces is a change too, on the driver side as well.

Which is optional and would at first get used only by device drivers
that need high throughput. And those device drivers that need this
should already be heavily optimized to avoid memory allocations (even if
it is just on the heap) and message setup for every call.

> These seem rather artificial and I'm not sure they'd tell us much about
> why the effects were being observed.
> What would be more interesting
> would be seeing where the time is being spent at the minute for real
> loads (for example by using trace points like those that are currently
> being added in mainline for examining the behaviour of the message pump)

Well - my use case is exactly that CAN issue, where I see message loss
when sending only 5 packets in sequence, in 30% of all cases with stock
drivers. That is why I started investigating the issue in the first
place - so do not tell me it is academic!

The proposed SPI test driver is academic to some extent, but it still
tries to simulate "real" device traffic with optimal throughput. Or do
you think that reading an ADC as fast as possible is an academic
exercise? (The test driver would essentially mimic that and allow the
scheduling mechanisms that enable high throughput to work - if you
configure multiple messages.)

> - if we really are spending a noticeable proportion of the time in setup
> then that's obviously a good case for optimising it but right now I
> don't feel I understand which bits are slow enough to register in the
> context of the overall transaction and hence if the suggested approach
> is the best way to optimise them.

Well - you see the setup I have to make to transfer the 2-byte write
plus 10-byte read? That is 10+ calls to dmapool_alloc, then filling in
the structures. That takes some time (can't remember how many usec right
now). And then for tearing it down I need to do the same again - OK,
mostly dmapool_release...

> An implementation that avoided the need to have the master drivers
> updated in order to get benefits would be a lot more interesting since
> that would make this much more generally useful, and of course concrete
> patches would be good too, but the main thing I'm missing is clarity on
> how the gains are being made.
A master driver will never be able to profit from the new interface,
besides the "small" gain of not having to validate the messages every
time in __spi_async (if the device driver does call prepare).

And as said: all the spi_sync methods are candidates for this
optimization - at the cost of some memory used for caching multiple
messages. (Say having 16 sets of 1 spi_message + 1 spi_transfer for
transfer lengths 1 to 16 in spi_write, similar for spi_read. Or 256 sets
of 1 spi_message + 2 spi_transfer to transfer 1-16 writes and 1-16
reads. Plus some memory copying for the reads and writes.)

Martin
--
To unsubscribe from this list: send the line "unsubscribe linux-spi" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 57+ messages in thread
[parent not found: <52823E73.503-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>]
* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
[not found] ` <52823E73.503-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>
@ 2013-11-12 17:59 ` Martin Sperl
[not found] ` <2252E63E-176C-43F7-B259-D1C3A142DAFE-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>
2013-11-13 15:11 ` Mark Brown
2013-11-14 1:50 ` Mark Brown
2 siblings, 1 reply; 57+ messages in thread
From: Martin Sperl @ 2013-11-12 17:59 UTC (permalink / raw)
To: Mark Brown; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA

First, sorry for the bad formatting in the last email - it looked OK
when typing it with Thunderbird... I even used the recommended plugin to
make it wrap correctly (still format=flowed).

OK, but here are some statistics that I had promised. I got the
following data from my logic analyzer with a patched spi.c that brings
CS down while running the sanity check in __spi_async. (Note that this
kind of debugging has some potential race conditions with HW - if a DMA
transfer is running - but for the minimal cases it gives good enough
data with minimal "overhead".)

This is the code that I measure:

	/* bring CS1 down by switching polarity */
	writel((1 << 22), *regs);

	/* Half-duplex links include original MicroWire, and ones with
	 * only one data pin like SPI_3WIRE (switches direction) or where
	 * either MOSI or MISO is missing.  They can also be caused by
	 * software limitations.
	 */
	if ((master->flags & SPI_MASTER_HALF_DUPLEX)
			|| (spi->mode & SPI_3WIRE)) {
		unsigned flags = master->flags;

		list_for_each_entry(xfer, &message->transfers, transfer_list) {
			if (xfer->rx_buf && xfer->tx_buf)
				return -EINVAL;
			if ((flags & SPI_MASTER_NO_TX) && xfer->tx_buf)
				return -EINVAL;
			if ((flags & SPI_MASTER_NO_RX) && xfer->rx_buf)
				return -EINVAL;
		}
	}

	/**
	 * Set transfer bits_per_word and max speed as spi device default if
	 * it is not set for this transfer.
	 */
	list_for_each_entry(xfer, &message->transfers, transfer_list) {
		if (!xfer->bits_per_word)
			xfer->bits_per_word = spi->bits_per_word;
		if (!xfer->speed_hz)
			xfer->speed_hz = spi->max_speed_hz;
		if (master->bits_per_word_mask) {
			/* Only 32 bits fit in the mask */
			if (xfer->bits_per_word > 32)
				return -EINVAL;
			if (!(master->bits_per_word_mask &
					BIT(xfer->bits_per_word - 1)))
				return -EINVAL;
		}
	}

	/* bring CS1 up by switching polarity again */
	writel(0x00, *regs);

Depending on the message actually delivered, I see that CS is down for
about:

  1 transfer/message:  0.75us
  3 transfers/message: 2.5us
  8 transfers/message: 4.7us

Now the preparing of the DMA structure described before takes:

  1 transfer/message:  26us
  2 transfers/message: 18-23us
  3 transfers/message: 46-55us
  8 transfers/message: 106us

There are obviously variations depending on what data is in the L2
cache.

As for other interesting measurements, a single example with 5
transfers:

  Interrupt to __spi_async:     19us
  __spi_async sanity start/end: 2us
  __spi_async to dma_prepare:   99us
  dma_prepare start/end:        40us
  dma_prepare end to CS DOWN:   4us
  CS DOWN to CS UP:             16us (real transfer)

The values differ between interrupts, but I believe it still shows where
we see the delays that produce latencies. If we use the prepare-message
interface with my proposed change, then the 2us for sanity checks are
not applied, and the 40us for dma_prepare are not applied either. This
shows how much we can reduce the necessary CPU cycles, both for "normal"
use by bus drivers that do NOT implement any internal "preparations" to
make DMA work, and how much we can save if we go the DMA route.

Also, you see how long it takes from the time we have called transfer to
the time when dma_prepare gets run inside transfer_one_message? This is
actually one of the biggest "factors" contributing to latency on its
own. And that is the scheduling delay of the message pump! So the
"threaded" approach - even though nice on paper - is actually
introducing long latencies.
And that is why I would like to see how a fully DMA-pipelined driver
would fare - my guess is that the delays would be much smaller. It will
not change the delay from interrupt to running the handler, but the
scheduling of the message pump would go away immediately.

To summarize: 10k SPI messages/s produce about 10ms/s of CPU utilization
when running with "stock" code. When optimizing the device drivers to
run the same messages prepared (by adding spi_prepare/unprepare_message
in the initialization phase) we save about 1% system CPU (assuming
messages with a single transfer).

So I hope that these are hard enough facts for you to say that
"preparing" messages that do not "change in structure" in a device
driver _does_ bring an advantage.

Martin
--
To unsubscribe from this list: send the line "unsubscribe linux-spi" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 57+ messages in thread
[parent not found: <2252E63E-176C-43F7-B259-D1C3A142DAFE-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>]
* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver [not found] ` <2252E63E-176C-43F7-B259-D1C3A142DAFE-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> @ 2013-11-13 15:43 ` Mark Brown [not found] ` <20131113154346.GT878-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> 0 siblings, 1 reply; 57+ messages in thread From: Mark Brown @ 2013-11-13 15:43 UTC (permalink / raw) To: Martin Sperl; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA [-- Attachment #1: Type: text/plain, Size: 2722 bytes --] On Tue, Nov 12, 2013 at 06:59:02PM +0100, Martin Sperl wrote: > First, sorry for the bad formatting in the last email - it looked ok when typing > it with Thunderbird... I even used the recommended plugin to make it wrap > correctly (still format=flowed). Like I say still reading that, a couple of initial thoughts though... > As for other interesting measurements a single example with 5 transfers: > Interrupt to __spi_async: 19us > __spi_async sanity start/end: 2us > __SPI_ASYNC to DMA_PREPARE: 99us > dma_prepare start/end: 40us > dma_prepare_end to CS DOWN: 4us > CS DOWN to CS UP: 16us (real transfer) This is making me question the use of DMA at all here, this looks like the situation of a lot of drivers where they switch to PIO mode for small transfers since the cost of managing DMA is too great. I'm also curious which parts of the DMA preparation are expensive - is it the building of the datastructure for DMA or is it dealing with the coherency issues for the DMA controller? The dmaengine API currently needs transfers rebuilding each time I believe... Also how does this scale for larger messages? I appreciate that you want to push the entire queue down into hardware, I'm partly thinking of the costs for drivers that don't go and do any of the precooking here. > This is actually one of the biggest "factors" to latency on its own. > And that is the scheduling delay of the message-pump! 
Right, which is why I'm much more interested in the refactoring of that code to support kicking transfers without bouncing up there - the stuff I was talking about which made you talk about checksumming and so on. > So the "threaded" approach - even though nice on paper - is actually > introducing long latencies. And that is why i would like to see how > a fully DMA pipelined driver would fare - my guess is that the delays > would be much smaller. It will not change the delays interrupt to > running the handler, but the scheduling of the message pump would go > away immediately. Exactly, but this is largely orthogonal to having the client drivers precook the messages. You can't just discard the thread since some things like clock reprogramming can be sleeping but we should be using it less (and some of the use that is needed can be run in parallel with the transfers of other messages if we build up a queue). > So I hope that these are some hard enough facts for you to say, that > "preparing" messages, that do not "change in structure" in a device > driver _does_ bring an advantage. The 40us is definitely somewhat interesting though I'd be interested to know how that compares with PIO too. [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 836 bytes --] ^ permalink raw reply [flat|nested] 57+ messages in thread
[parent not found: <20131113154346.GT878-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>]
* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
[not found] ` <20131113154346.GT878-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>
@ 2013-11-13 18:35 ` Martin Sperl
[not found] ` <ED58E869-A9F6-4BB2-8EC6-D71F946509DC-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>
0 siblings, 1 reply; 57+ messages in thread
From: Martin Sperl @ 2013-11-13 18:35 UTC (permalink / raw)
To: Mark Brown; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA

Hi Mark

>> As for other interesting measurements a single example with 5 transfers:
>> Interrupt to __spi_async: 19us
>> __spi_async sanity start/end: 2us
>> __spi_async to dma_prepare: 99us
>> dma_prepare start/end: 40us
>> dma_prepare end to CS DOWN: 4us
>> CS DOWN to CS UP: 16us (real transfer)
>
> This is making me question the use of DMA at all here, this looks like
> the situation of a lot of drivers where they switch to PIO mode for
> small transfers since the cost of managing DMA is too great. I'm also
> curious which parts of the DMA preparation are expensive - is it the
> building of the datastructure for DMA or is it dealing with the
> coherency issues for the DMA controller? The dmaengine API currently
> needs transfers rebuilding each time I believe...
>
> Also how does this scale for larger messages?
>
> I appreciate that you want to push the entire queue down into hardware,
> I'm partly thinking of the costs for drivers that don't go and do any of
> the precooking here.

Well - if you look at the above example: it takes 99us to get from
__spi_async (post-message check) to dma_prepare inside
transfer_one_message, and this is time spent in the framework AND the
scheduler! The DMA prepare takes 40us, so you can still run the DMA
prepare in that same window and come out faster. (OK, I did not account
for the "teardown" time, but that should be faster, as it just walks the
list and returns it to the dmapool.)
Also keep in mind that this message is (as said) actually comprised of 5
spi_transfers in a chain, with two of those having CS_CHANGE set, so it
is already a "bigger" spi_message than you would typically see. A
write_then_read (2 transfers) has a "setup time" of about 23us. But as
this transfer happens during "setup", the dmapool may not have any pages
allocated yet, which increases the overhead for allocation and thus adds
a bias to this measurement.

You also see that this driver is already trying to keep the latencies as
short as possible by using the chaining instead of issuing each of those
CS-enable sequences separately.

So in the end we can give a rough estimate that it takes 10us/transfer
to process - which means we get the same throughput delay for DMA
(prepared) versus interrupt-driven (polling being the worst) for
spi_messages of <= 10 transfers. Which is already a pretty big
spi_message...

But then you should not forget that with the interrupt-driven approach
you have delays between each transfer due to interrupts, when you
reconfigure the data registers. And I am 100% certain that you will not
be able to achieve a 1us delay between 2 transfers (of arbitrary size)
in interrupt-driven mode. This is again CPU cycles (which I am not even
sure are correctly accounted for in sys_cpu). And then on top of that -
at least the spi-bcm2835 interrupt handler makes a wakeup call to the
workqueue whenever it finishes a single transfer (and again the
scheduler gets involved...). That is also why I have reported high
interrupt and context-switch rates for this driver.

So you think it is really cheaper from the CPU perspective to take the
interrupt-driven approach? (Note also that I have set the spi_queue to
run with RT priority, so the spi_pump has a huge advantage getting
CPU...)

> Exactly, but this is largely orthogonal to having the client drivers
> precook the messages.
> You can't just discard the thread since some
> things like clock reprogramming can be sleeping but we should be using
> it less (and some of the use that is needed can be run in parallel with
> the transfers of other messages if we build up a queue).

OK - the clock setting is possibly valid for some SPI devices, but not
for all. And we all know that a "one-size-fits-all" does not scale in
performance. Also it may be a possibility that different devices have
different feature sets, where in part we need to drop back to something
else (a thread). At some point I had been playing with the idea of doing
just that - having a spi_pump to handle things like delays and to work
around stupid "bugs"... But still trying to do as much as possible
within the DMA, I found something that solves everything that we have as
abilities in the API right now. I have to admit, though, that it
required a lot of effort coming up with something that works (and it
also produces a lot more code), and it also uncovered a lot more HW
issues than expected (I found 4 so far)...

> The 40us is definitely somewhat interesting though I'd be interested to
> know how that compares with PIO too.

Some basic facts ONLY about the SPI transfers themselves. CS down to
last CS up for the 5-transfer message:

  PIPO: 71us
  DMA:  16us

So that is ONLY measured on the SPI bus! The time lost here is between
spi_transfers, which is typically 8-12us. But also a lot of time is lost
between the last clock and CS up for PIPO: 19us.

The way the driver is written, its interrupt gets called when it is
finished with a transfer (or when the FIFO buffer needs to get filled),
and in case of no more data it will wake up the message pump - which is
in this case surprisingly fast - which then schedules the next message.
You see where PIPO looses its time?

That is also the reason why I want to move back to the "transfer"
interface and see how much this improves the driver.
Ciao, Martin -- To unsubscribe from this list: send the line "unsubscribe linux-spi" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 57+ messages in thread
[parent not found: <ED58E869-A9F6-4BB2-8EC6-D71F946509DC-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>]
* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver [not found] ` <ED58E869-A9F6-4BB2-8EC6-D71F946509DC-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> @ 2013-11-13 19:33 ` Mark Brown [not found] ` <20131113193320.GE878-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> 0 siblings, 1 reply; 57+ messages in thread From: Mark Brown @ 2013-11-13 19:33 UTC (permalink / raw) To: Martin Sperl; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA [-- Attachment #1: Type: text/plain, Size: 2285 bytes --] On Wed, Nov 13, 2013 at 07:35:27PM +0100, Martin Sperl wrote: > You also see that this driver is already trying to keep the latencies as > short as possible by using the chaining instead of issuing each of those > CS-enable sequence separately. It's going to be more sensible to work to eliminate the extra overhead for doing multiple messages than to faff around in individual client drivers, solve the problem at the right level - this will play more nicely when two devices are hitting the same bus as well. > So you think this is really cheaper from the CPU perspective to take the > Interrupt-driven approach? Well, if we ignore the scheduling (which we ought to be able to get rid of anyway) then the DMA is starting to look like overhead for what appears to be a small transfer. > > Exactly, but this is largely orthogonal to having the client drivers > > precook the messages. You can't just discard the thread since some > > things like clock reprogramming can be sleeping but we should be using > > it less (and some of the use that is needed can be run in parallel with > > the transfers of other messages if we build up a queue). > Ok - the clock setting is possibly valid for some SPI devices, but not for > all. And we all know that a "one-size fits all" does not scale in > performance. > Also it may be a possibility that different devices have different > feature-sets, where in part we need to drop back to something else (thread). 
Of course, we will need to allow the driver to control when it punts back to the thread. What we don't want to have to do is reimplement the mechanics of all this in every single driver since that's the situation we have now for everything that isn't bitbanging. Broadly speaking most of the hardware we have now falls into a few classes and most of it isn't being driven anywhere near as effectively as it could be. > You see where PIPO looses its time? Not really, I mean the data transfers wind up being slower because they're bigger than they looked from the timings (you haven't mentioned how big they are...) but I don't really see why PIO (assuming that's what you mean by "PIPO" which is an acronym I don't recall hearing before) impacts the time taken to raise /CS. [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 836 bytes --] ^ permalink raw reply [flat|nested] 57+ messages in thread
[parent not found: <20131113193320.GE878-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>]
* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver [not found] ` <20131113193320.GE878-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> @ 2013-11-13 21:31 ` Martin Sperl 0 siblings, 0 replies; 57+ messages in thread From: Martin Sperl @ 2013-11-13 21:31 UTC (permalink / raw) To: Mark Brown; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA >> >> You see where PIPO loses its time? > > Not really, I mean the data transfers wind up being slower because > they're bigger than they looked from the timings (you haven't mentioned > how big they are...) but I don't really see why PIO (assuming that's > what you mean by "PIPO" which is an acronym I don't recall hearing > before) impacts the time taken to raise /CS. The transfers are: 1+1CS+2+3CS+3 (numbers = bytes, CS=CS_CHANGE) But what you ignore is the fact that under load the spi-bcm2835 driver is running at 90+% system CPU, while the DMA driver runs at 80-85% system load (not to mention the number of interrupts, context switches,...). And DMA is not fully optimized yet - so do not compare it to something that was used as a first basis to get the issues of DMA solved before optimizing the whole way. You are essentially comparing production code (in kernel) to R&D code (the DMA version). Take the "case" of DMA as a prototype to see how it does with the one-message interface. And then let us see how pipelined DMA works by comparison... And then add the "prepare" message into the mix as a further possible optimization to see how far this brings us away from the original driver with its high system CPU, interrupt and context switch count. Ciao, Martin ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver [not found] ` <52823E73.503-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> 2013-11-12 17:59 ` Martin Sperl @ 2013-11-13 15:11 ` Mark Brown [not found] ` <20131113151102.GS878-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> 2013-11-14 1:50 ` Mark Brown 2 siblings, 1 reply; 57+ messages in thread From: Mark Brown @ 2013-11-13 15:11 UTC (permalink / raw) To: Martin Sperl; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA [-- Attachment #1: Type: text/plain, Size: 830 bytes --] On Tue, Nov 12, 2013 at 03:42:59PM +0100, Martin Sperl wrote: Not fully read this yet but one thing... > >The messages and transfers ought to be totally reusable already, the > >only bit that should need to be recreated is any data the driver creates > >internally, most likely that's only going to be data for DMA. We do > >already have an interface for drivers to do the DMA mapping though it's > >not very well loved at the minute... > What DMA-interface are you talking about? at least in spi.h in 3.10 > I find nothing that would indicate an extra DMA interface... Drivers can supply pre-mapped DMA buffers if they set is_dma_mapped which is about as good as you get with dmaengine at the minute. The main opportunities for enhancement here are in the master interface rather than the client, at least as things stand. [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 836 bytes --] ^ permalink raw reply [flat|nested] 57+ messages in thread
[parent not found: <20131113151102.GS878-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>]
* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver [not found] ` <20131113151102.GS878-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> @ 2013-11-13 15:48 ` Martin Sperl [not found] ` <77070979-0CE4-4C76-B12E-DA94B2577172-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> 0 siblings, 1 reply; 57+ messages in thread From: Martin Sperl @ 2013-11-13 15:48 UTC (permalink / raw) To: Mark Brown; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA Hi Mark! > > Drivers can supply pre-mapped DMA buffers if they set is_dma_mapped > which is about as good as you get with dmaengine at the minute. The > main opportunities for enhancement here are in the master interface > rather than the client, at least as things stand. This is part of my prerequisite for making it preparable in my case... Because otherwise you have (again) to walk the tree to set the rx_dma via dma_map_single, and here the API explicitly states that between the map and unmap calls the caller may NOT access the pages. The reason is (probably) for old ISA devices that may only DMA-map 24-bit addresses, so the map/unmap is actually supposed to copy data to a bounce buffer and back (depending on the direction given). Typically not an issue on ARM though, I would say... Hence mapping becomes expensive, and creating the "whole" thing up front works just as well. Ciao, Martin ^ permalink raw reply [flat|nested] 57+ messages in thread
[parent not found: <77070979-0CE4-4C76-B12E-DA94B2577172-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>]
* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver [not found] ` <77070979-0CE4-4C76-B12E-DA94B2577172-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> @ 2013-11-13 16:59 ` Mark Brown 0 siblings, 0 replies; 57+ messages in thread From: Mark Brown @ 2013-11-13 16:59 UTC (permalink / raw) To: Martin Sperl; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA [-- Attachment #1: Type: text/plain, Size: 1519 bytes --] On Wed, Nov 13, 2013 at 04:48:47PM +0100, Martin Sperl wrote: > > Drivers can supply pre-mapped DMA buffers if they set is_dma_mapped > > which is about as good as you get with dmaengine at the minute. The > > main opportunities for enhancement here are in the master interface > > rather than the client, at least as things stand. > This is part of my pre-requisite for making it prepareable in my case... > Because otherwise you have (again) to walk the tree to set the rx_dma > via dma_map_single and here the API explicitly states that between the map > and unmap call the caller may NOT access the pages. So you expect the caller to also keep the buffers mapped separately? The restrictions on mapping are one of the concerns I have with nailing down exactly what you're envisaging a prepared and unprepared transaction doing (plus masters that don't actually want DMA in the first place). > The reason is (probably) for old-ISA devices that may only DMA map 24 bit > addresses. So the map/unmap actually is supposed to copy data to a bounce > buffer and back (depending on the direction given). > Typically not an issue on ARM though, I would say... No, it's not just for ISA - other systems can have DMA access restrictions for example due to LPAE or other things that cause limited addressing on some internal buses and obviously many modern systems have IOMMUs which require things to be mapped in and out before devices can access them without needing bounce buffers. 
[-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 836 bytes --] ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver [not found] ` <52823E73.503-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> 2013-11-12 17:59 ` Martin Sperl 2013-11-13 15:11 ` Mark Brown @ 2013-11-14 1:50 ` Mark Brown [not found] ` <20131114015009.GB26614-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> 2 siblings, 1 reply; 57+ messages in thread From: Mark Brown @ 2013-11-14 1:50 UTC (permalink / raw) To: Martin Sperl; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA [-- Attachment #1: Type: text/plain, Size: 10220 bytes --] On Tue, Nov 12, 2013 at 03:42:59PM +0100, Martin Sperl wrote: > This "chaining" is _obviously_ not possible with the transfer_one > message BUS-master-API (or it would require some "ugly hacks" to > make use of transfer_one_message API) Meh, it doesn't seem that urgently different to be honest. The basic flow of interactions with the device is very much typical of SPI controllers. In any case while it's not entirely clearly explained here I *think* the major thing that's being done differently here is that the register writes to the controller are being done using DMA. Is that the case? That's certainly a bit of an innovation and I don't think it's actually all that device specific - while reusing the same channel for things other than the actual data transfer is a bit cute I can't see any obvious reason why it shouldn't work and it's something that I'd expect a great proportion of DMA controllers would be able to support. The delays are a bit dodgy for that but for untimed delays could probably be implemented with dummy memcpy() type transfers easily enough. I'd expect this to integrate nicely with the general stuff to split out the pushing of transfers and configuration of transfers that I'd been kicking around - transfer_one() is intended to be split into two for this, one bit to configure the transfer and another bit to actually do it. 
The original idea was more around recognising and exploiting scatter/gather patterns but pipelining the DMAs for configuration would allow that to be extended a bit. One thing that does concern me is the overhead of all the DMA work for smaller transfers - once there's a queue built up it's more clearly useful but without a queue it's not obvious that the effort is going to be worth it. > Well - as said I am wondering if the DMA engine can really help > - especially if you want to keep the "prepare" code fast. I am using > dmapool for memory alloc/free, but I fear we would need to run 2 mallocs > if I was using DMA-engine and more setup/tear-down overhead, as well > as more code that needs to get executed to go via DMA-engine. The big win from dmaengine is that it means that you don't need to implement essentially the same DMA code in each and every driver; all the DMA drivers end up with a uniform API. Since by and large the process of mapping a SPI message for DMA is the same no matter what the controller is, this ends up being a win and we don't require each and every master driver author to jump through the same hoops. If it doesn't do what's desired we should fix it to make it better (same as here). > As said: I will keep it as is for now and first implement the full DMA > chaining approach. Then we can see how much the move to DMA-Engine > is costing... That's fine for experimentation but for an actual merged API I think it does need to be generally applicable so that it's worth the effort for client drivers to support it. > not the prime "candidates" for the improvement in the first place. > If they want to run with minimal latencies, they HAVE to run spi_async. > Anything else is not very efficient as it requires several context > switches by way of the current design.
Well, the benefits of submitting things asynchronously are as much about giving the stack something to play with and reducing turnaround time as anything else - it means the stack can submit the next transfer to the hardware before it tells the caller that the current one completed which reduces the idle time on the bus and means that there's much more scope for reducing needless context switching. Right now we're not doing as good a job of that as we can in most drivers though frankly even the current state is an improvement on what we had. > >The other bit of "easy to implement" that concerns me is making sure that > >it's easy to implement for the stack as a whole - one of the big > If you can get away from the idea that prepare is ONLY related to > the Master, then it may provide additional "advantages". Also when > you do have a non-prepared message, you can still call the > prepare code from the framework and afterwards unprepare it - like I > had proposed. > That way any "checks" the driver would need to do to see if it can > handle the message, it can do immediately in the "prepare" code. I'm having a really hard time getting excited about the improvements from saving the checks though, they just seem so small for the effort by themselves. You are right, they do exist, but they seem so small in relation to the overall transfer time. That said if a patch were to appear which did that and didn't provide any master side hooks I think that'd probably be OK. > whatever the driver may find helpful for its own purpose. One > candidate here could be to also implement dma_map_single in the prepare > call... Of course this reduces the range of situations where the calling driver can reuse a prepared message since it means if they want to access the data they have to unprepare and then reprepare the message which rather seems to defeat the point of having them worry about doing that. This is why I was asking about nailing down the semantics; I was worried about this.
> So separating things out immediately reduces code of the individual > drivers if they move to the newer interface of schedule_message... > if they do not need prepare/unprepare, then it is (almost) a NOP. Except for all the actually configuring the hardware stuff on configuration changes if the register writes aren't DMAed (which is going to be the common case unless it's frameworked). > >Adding new interfaces is a change too, on the driver side as well. > Which is optional and would get first used only by device-drivers > that need high throughput. And those device-drivers that need this, > should already be heavily optimized to avoid memory allocations > (even if it is just on the heap) and message setup for every call. Equally well is it really worth their while if you don't get much win without a specially optimised controller driver? Extra advantages with a good driver are one thing but requiring it makes it a harder sell to drivers using the API. > >These seem rather artificial and I'm not sure they'd tell us much about > >why the effects were being observed. What would be more interesting > >would be seeing where the time is being spent at the minute for real > >loads (for example by using trace points like those that are currently > >being added in mainline for examining the behaviour of the message pump) > well - my use case is exactly that CAN issue where I have message loss > when only sending 5 packets in sequence in 30% of all cases with > stock drivers. That is why I started investigating the issue in the > first place - so do not tell me it is academic! Sure, there's room for improvement but let's make that as general an improvement as possible. Something that flows better for existing drivers is much more of a win there. I'm getting alarm bells that the gains for this one system might be hard to realise on other systems.
> The proposed SPI-test driver is academic to some extent, but it > still tries to simulate "real" device traffic with optimal throughput. > Or do you think that reading an ADC as fast as possible is an > academic exercise? (the test driver would essentially mimic that and > allow for a scheduling mechanism to work that allows high throughput, if you > configure multiple messages) I'm suspicious of benchmarks which limit so that you end up running so fast you don't have time to actually do anything with the data; you can end up tuning things so that more realistic users suffer (spending too much time in interrupt context for example). > >- if we really are spending a noticeable proportion of the time in setup > >then that's obviously a good case for optimising it but right now I > >don't feel I understand which bits are slow enough to register in the > >context of the overall transaction and hence if the suggested approach > >is the best way to optimise them. > well - you see the setup I have to make to transfer 2 writes 10 reads? > that is 10+ calls to dmapool_alloc, then filling in the structure. > That takes some time (can't remember how many usec right now). > And then tearing it down I need to do the same again - ok mostly > dmapool_release... Yes, but then this is because the driver is doing lots of really small DMAs when most only DMA for larger transfers and then only the data. I can see there can be a benefit if you have a big enough DMA queue to be busy for extended periods but that's not always going to be the case and then you get into wondering about how much you can keep mapped and so on. Like I say I can see this being a technique which is useful over a range of drivers but there's going to be a point where it's simpler to just do the I/O.
> >An implementation that avoided the need to have the master drivers > >updated in order to get benefits would be a lot more interesting since > >that would make this much more generally useful, and of course concrete > >patches would be good too, but the main thing I'm missing is clarity on > >how the gains are being made. > a master driver will never be able to profit from the new interface, > besides the "small" gain of not having to validate the messages > every time in __spi_async (if the device-driver does call prepare) What about all the DMA pipelining stuff for the register I/O and so on? That seems to be the biggest part of the win here. > And as said: all the spi_sync methods are candidates to optimization > of this - at the cost of some memory used for caching multiple > messages. > (say having 16 times 1xspi_message+1xspi_transfer structures for > transfer length 1 to 16 in spi_write. Similar for spi_read. Or 256 > 1xspi_message+2xspi_transfer to transfer 1-16 writes and 1-16 reads. > plus some memory copy for the reads and writes) But then that's a memory plus cache management cost (per master), not to mention complexity. [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 836 bytes --] ^ permalink raw reply [flat|nested] 57+ messages in thread
[parent not found: <20131114015009.GB26614-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>]
* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver [not found] ` <20131114015009.GB26614-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> @ 2013-11-14 19:47 ` Martin Sperl [not found] ` <9640F4C7-7F82-453E-9D83-5875A1559A20-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> 0 siblings, 1 reply; 57+ messages in thread From: Martin Sperl @ 2013-11-14 19:47 UTC (permalink / raw) To: Mark Brown; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA Hi Mark! > Meh, it doesn't seem that urgently different to be honest. The basic > flow of interactions with the device is very much typical of SPI > controllers. Well - if it was not for the requirement to call spi_finalize_current_message prior to leaving transfer_one_message to get the next one - I might have continued this route with pipelining. But hacking my way around those limitations (by taking the next item off the list, etc.) seemed a bit absurd to be honest. > In any case while it's not entirely clearly explained here I *think* the > major thing that's being done differently here is that the register > writes to the controller are being done using DMA. Is that the case? Yes, that is true. > That's certainly a bit of an innovation and I don't think it's actually > all that device specific - while reusing the same channel for things > other than the actual data transfer is a bit cute I can't see any > obvious reason why it shouldn't work and it's something that I'd expect > a great proportion of DMA controllers would be able to support. The > delays are a bit dodgy for that but for untimed delays could probably be > implemented with dummy memcpy() type transfers easily enough. > > I'd expect this to integrate nicely with the general stuff to split out > the pushing of transfers and configuration of transfers that I'd been > kicking around - transfer_one() is intended to be split into two for > this, one bit to configure the transfer and another bit to actually do > it.
> The original idea was more around recognising and exploiting > scatter/gather patterns but pipelining the DMAs for configuration would > allow that to be extended a bit. > > One thing that does concern me is the overhead of all the DMA work for > smaller transfers - once there's a queue built up it's more clearly > useful but without a queue it's not obvious that the effort is going to > be worth it. Well - if you think this "approach" is also applicable to other devices, then this is one more argument to allow device drivers to prepare their messages (with the is_dma_mapped requirement to keep cost low). Note that if you do that and want to leverage the "prepare_message", then I would recommend you add an additional flag field to these calls. Because then you may pass some hints to the engine (like if it can use GFP_KERNEL or GFP_ATOMIC during allocations). And prepare could get done with GFP_KERNEL, while a "prepare" during spi_async would need to be done via GFP_ATOMIC. >> Well - as said I am wondering if the DMA engine can really help >> - especially if you want to keep the "prepare" code fast. I am using >> dmapool for memory alloc/free, but I fear we would need to run 2 mallocs >> if I was using DMA-engine and more setup/tear-down overhead, as well >> as more code that needs to get executed to go via DMA-engine. > > The big win from dmaengine is that it means that you don't need to > implement essentially the same DMA code in each and every driver, all > the DMA drivers end up with a uniform API. Since by and large the > process of mapping a SPI message for DMA is the same no matter what the > controller is this ends up being a win and we don't require each and > every master driver author to jump through the same hoops. > If it doesn't do what's desired we should fix it to make it better (same > as here).
Well, the sequencing is the trick for most parts, but I also keep extra data adjacent to my real control-blocks to avoid doing double allocations, keeping data close together. For my "usage" I even have a method where I can put 8 bytes for configs in this structure as well - which allows me to avoid another set of memory allocations for just that... In principle it should be possible to generalize that. My basic "internal" prepare function looks like this:

static struct bcm2835dma_dma_cb *bcm2835dma_add_cb(
	struct bcm2835dma_spi *bs,      /* contains dma pool parameters and dma channels */
	struct bcm2835dma_spi_dma *dma, /* a descriptor to the DMA channel */
	struct list_head *list,         /* list where to chain the cbs to */
	struct spi_message *msg,        /* the message we handle */
	unsigned long info,             /* DMA flags for the engine */
	dma_addr_t src,                 /* the source DMA address, if -1 take from struct */
	dma_addr_t dst,                 /* the dest. DMA address, if -1 take from struct */
	unsigned long length,           /* length of dma */
	unsigned long stride,           /* stride - not really used as it is buggy */
	u8 link_to_last                 /* if we should link this cb to the last on the same dma channel */
	);

This will allocate from the pool, then fill in the structure, linking it to the last in the chain of the same channel, and return it. So this might be pretty generic already - and putting dmaengine in there might be a possibility... > That's fine for experimentation but for an actual merged API I think it > does need to be generally applicable so that it's worth the effort for > client drivers to support it. First of all it is a bus-driver that should be working (also, to avoid the expected "short transfer issue" for some use-cases I would propose to run it in parallel to the existing version). Then I believe we can start discussing any approach to refactor things after we get that far and we see the performance improvements and drawbacks for each of those steps.
>> not the prime "candidates" for the improvement in the first place. >> If they want to run with minimal latencies, they HAVE to run spi_async. >> Anything else is not very efficient as it requires several context >> switches by the way of the current design. > > Well, the benefits of submitting things asynchronously are as much about > giving the stack something to play with and reducing turnaround time as > anything else - it means the stack can submit the next transfer to the > hardware before it tells the caller that the current one completed which > reduces the idle time on the bus and means that there's much more scope > for reducing needless context switching. Right now we're not doing as > good a job of that as we can in most drivers though frankly even the > current state is an improvement on what we had. Well, actually if you do it right - like I did in my version of the mcp2515 driver - I am scheduling 2 spi_messages back to back, and only the first has a callback. The first reads the status registers on which to base decisions what to do and also pulls the interrupt high, while the second message does a speculative transfer of the first inbound message buffer and reads some error counters. So ideally while the callback of the 1st message is running the 2nd message is running in HW. And this callback will schedule acknowledgement of the read rx buffer as well as schedule the read of the second hw buffer if needed and finally re-enable interrupts. This 3rd message also contains a callback to push the previously read HW buffers to the network stack and create error frames if necessary... So it is in itself geared towards minimizing interrupt "latencies". If everything works as fast as possible (interrupts,...) the whole set of messages would look like a giant chain without gaps on the SPI bus, totally reducing the time from start to finish.
And if everything is done in interrupt context, then one such interrupt from the device would trigger 2 additional interrupts (one to trigger the 1st and the other to trigger the 3rd message) and be done with it in a total of 3 interrupts and 0 context switches. And if you have all the messages "prepared" then the spi_async call is a cheap thing as well:
* message->spi=spi;
* schedule the message in DMA by just chaining it to a currently running chain.
If I would estimate such an ideal situation, then the whole transfer including callbacks (maybe except for the last) would be handled in:
* 18us from interrupt to first spi_async and scheduling the DMA
* 17us for the first message (1W1R+2W3R+3W)
* 29us for the second message (2W2R+2W/13R+1W1R)
* 25us for the third message (1W+1W/13R+3W)
In total that should be finished in 89us. Again - the final callback is not included - SPI bus speed: 8MHz; these timings have been taken from actual samples measured and just added up. Also note that this example assumes both message buffers being full, otherwise you can remove the 1W/13R from the 3rd message and thus reduce the time there by 17us. By comparison the shortest possible CAN message (@500KHz, standard ID and 0 data bytes) is between 94us and 111us (depending on the ID bit-pattern). This is obviously still theoretical and not proven. But to get all those gains to work, the bus driver has to do streaming DMA, the API has to implement prepare_message and the device-driver also needs to be very "efficient" in its scheduling. And that is the tricky part - the device driver needs to be designed to have good scheduling to reduce its latencies. So if you get that far into the game of optimizing your driver, then:
* you will prepare your messages in the init code
* DMA-mapping the transfers becomes easy, because you can easily allocate the data you need for your transfers from DMA
* and then adding the optional "call" to spi_prepare_message is the simplest exercise.
For any driver that is _not_ optimized for latencies but where one needs low latency, a rewrite is required. Also just put the prepare_message API into perspective with other APIs that are rarely used:

extern int spi_bus_lock(struct spi_master *master);
extern int spi_bus_unlock(struct spi_master *master);

There we have 2 users of that interface in the kernel: drivers/iio/adc/ad_sigma_delta.c and drivers/mmc/host/mmc_spi.c. So we have interfaces for rare uses - why not have another one that is equally light-weight on the hot path of spi_async and which gives a slight benefit to all drivers by avoiding the checks every time? > I'm having a really hard time getting excited about the improvements > from saving the checks though, they just seem so small for the effort by > themselves. You are right, they do exist, but they seem so small in > relation to the overall transfer time. That said if a patch were to > appear which did that and didn't provide any master side hooks I think > that'd probably be OK. > ... > > Sure, there's room for improvement but let's make that as general an > improvement as possible. Something that flows better for existing > drivers is much more of a win there. I'm getting alarm bells that the > gains for this one system might be hard to realise on other systems. Unfortunately 1+us is not that small for a low-end device - it may be small for a multi-core system, but then you just hide it behind more CPU... And as said - if you run 10000 spi messages/s that is >10ms of CPU time per second spent just doing the API, which is >1% system load on a single core. I believe I should finish my POC to get it fully up to my expectations and then I can show you the advantage of having this "minimal" change, which also has an overall positive impact (even though slight). >> The proposed SPI-test driver is academic to some extent, but it >> still tries to simulate "real" device traffic with optimal throughput.
>> Or do you think that reading an ADC as fast as possible is an >> academic exercise? (the test driver would essentially mimic that and >> allow for a scheduling mechanism to work that allows high throughput, if you >> configure multiple messages) > > I'm suspicious of benchmarks which limit so that you end up running so > fast you don't have time to actually do anything with the data, you can > end up tuning things so that more realistic users suffer (spending too > much time in interrupt context for example). Well - maybe we should make this a separate thread... On my proposed list would be:
* the generic version I had talked about
* the mcp2515 CAN-driver that I have described above
* the enc28j60 network driver
* the spi-mmc SD card driver
For all of them we can provide well-defined performance patterns (reads/writes per second or packets per second) plus some kind of ADC testing. >> And as said: all the spi_sync methods are candidates for optimization >> of this - at the cost of some memory used for caching multiple >> messages. >> (say having 16 times 1xspi_message+1xspi_transfer structures for >> transfer lengths 1 to 16 in spi_write. Similar for spi_read. Or 256 >> 1xspi_message+2xspi_transfer to transfer 1-16 writes and 1-16 reads. >> plus some memory copy for the reads and writes) > > But then that's a memory plus cache management cost (per master), not to > mention complexity. Well, that is something we would need to look into - but it should not be _that_ complicated, and the spi_master could allocate those via a framework call or a bit in the spi_master structure - say, if it supports messages, then set it up when registering the device and remove it afterwards... That is very generic and not a lot more overhead than spi_sync_write/read has right now. But let us first show how far we can go with this first bus-driver that does everything in DMA and then we can compare results with and without the "optimize"...
Ciao, Martin ^ permalink raw reply [flat|nested] 57+ messages in thread
[parent not found: <9640F4C7-7F82-453E-9D83-5875A1559A20-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>]
* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver [not found] ` <9640F4C7-7F82-453E-9D83-5875A1559A20-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> @ 2013-11-15 11:15 ` Martin Sperl [not found] ` <5286026B.2090903-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> 2013-11-15 13:33 ` Mark Brown 1 sibling, 1 reply; 57+ messages in thread
From: Martin Sperl @ 2013-11-15 11:15 UTC (permalink / raw)
To: Mark Brown; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA

Hi Mark!

I thought a bit about what you had said about DMA and building blocks, and I have come up with a possible vision of the "future". You essentially gave me the idea that we could reduce restrictions on the proposed spi_prepare_message by adding some flags indicating in what way the structure may change. So if you have a "fully" fixed structure, we could add "flags" here to indicate whether the structure is also fixed with regards to:
* the length of the transfers
* the pointers to the transfers
* providing DMA data or not (this could get indicated via is_dma_mapped)

Then the bus driver (or framework) could use different construction schemes to build the DMA chain based on the flag variations, optimizing the time it takes to build the DMA chain (which is currently the worst case in my driver). This would definitely improve the performance of those sync calls (if we somehow run a prepare in the sync call for the hot path). Also the driver can start to use pre-built/cached "building blocks" that make up a lot of the DMA control blocks that are needed (configuring the SPI controller and such).

For more complex drivers we may also defer the DMA preparation (and scheduling thereof) of messages to a worker thread which would produce DMA chains to schedule. But if the message is prepared, it would get scheduled from spi_async context immediately - unless there is something in the prepare queue, in which case it would have to wait so that no reordering occurs.
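The flags idea above could look something like the following. Everything here is hypothetical - these flag names and the helper were only discussed in this thread, they are not an existing kernel API. A driver would OR together the properties of the message that are guaranteed not to change between submissions, so the bus driver knows how much of the DMA control-block chain it may cache:

```c
/* Hypothetical flag bits for the proposed spi_prepare_message() -
 * illustrative names, not from any kernel release. */
#define SPI_PREPARE_FIXED_LEN   (1u << 0) /* transfer lengths will not change */
#define SPI_PREPARE_FIXED_BUFS  (1u << 1) /* rx_buf/tx_buf pointers are fixed */
#define SPI_PREPARE_DMA_MAPPED  (1u << 2) /* caller supplies DMA addresses */

/* The more that is fixed, the more of the chain can be reused: with a
 * fully fixed message the driver can cache the entire control-block
 * chain and only the payload changes between submissions. */
static int can_cache_full_chain(unsigned flags)
{
    unsigned needed = SPI_PREPARE_FIXED_LEN | SPI_PREPARE_FIXED_BUFS;

    return (flags & needed) == needed;
}
```

A prepare call would then pick a construction scheme based on these bits, falling back to rebuilding the chain from scratch when nothing is guaranteed.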
This comes with a potential "trap": we would need to define whether a "reorder" of SPI messages between two different SPI bus devices with different CS is a valid operation or not (the question is mostly: does this "queuing" happen per bus driver or per CS?).

The timing sequence would be like this (for scheduling spi_async messages at the same time):
driver A: spi_async(spi0, msg0_unprepared); /* gets put into the "queue" */
driver B: spi_async(spi1, msg1_prepared);   /* gets scheduled immediately */
On the SPI bus you would then see: msg1_prepared; msg0_unprepared;

So we need a "definition" of scheduling policies here. As far as I can tell (and I may be wrong) the issue described would only matter if a single device driver required two or more different CS to make a device work properly. So I would say that "reordering" is essentially allowed if we talk about two different drivers talking to different devices (or multiple instances of a single driver talking to distinct devices of the same type).

Note that this could show some strange "concurrency behavior" in some drivers due to implicit assumptions which are not defined until now and which would never show up on the current setup - that is, unless you find a board with two SPI buses and connect an identical device to each. But looked at from this direction, such an assumption in the driver would be a bug that needs to get fixed in the device driver itself.

But if a driver exists for a device that requires two (or more) chip selects with the additional requirement of "proper" ordering, then: either the driver needs to make both messages "prepared"; or we provide an extra flag to the "prepare" to force "global ordering" on the SPI bus; or we make it a requirement that - in such cases - the spi_lock_bus interface is used to block the whole bus.
But the first question is: are there any drivers currently that require multiple CS to do their work?

All of these ideas will also help us in the future and may then get used for other bus drivers as well - we could even start using a "DMA-sequencing" engine inside the framework to reduce the code in each driver (how to define those sequences would be tricky though - we would need to see how "flexible" other devices are compared to the bcm2835). But first we need to get one working driver with "optimal" performance so that we can confirm the approach works in principle, then continue to see where optimizations can get made, and finally refactor this into a "generic" solution sharing code in the SPI framework - it may eventually move into dmaengine...

Another thing that came to my mind is a means to give spi_async a hint whether it runs in interrupt context or not. This would give the opportunity to allow memory allocations in spi_async and transfer using GFP_KERNEL instead of GFP_ATOMIC for callers that do not run in interrupt context. We could make this a flag in the spi_message structure, or use different functions - spi_async_threaded and spi_async_interrupt - with spi_async mapping to spi_async_interrupt for compatibility with older drivers. Both are valid approaches; I believe it is a matter of taste which one gets used...

Another thing that you may have mentioned could be a message like "read(8|16)bitReadXbyte": you first read one or two bytes and then read X bytes, where X comes from those first bytes. The problem here is that such a "scheme" would not necessarily work with a fully DMA-pipelined approach - especially as the endianness might differ (one could possibly work around this with 1-byte DMA transfers, if the DMA engine supports them).
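The two-entry-point variant proposed above could be sketched as follows. Note that spi_async_threaded()/spi_async_interrupt() are only proposals from this thread, not an existing kernel API, and the gfp enum is a stand-in; the only difference between the entry points is the allocation class handed to the prepare step:

```c
/* Stand-in for the kernel's gfp_t allocation classes. */
enum gfp_hint { HINT_GFP_ATOMIC, HINT_GFP_KERNEL };

/* Records which allocation class the common path was told to use, so the
 * behaviour of the wrappers can be observed in this sketch. */
static enum gfp_hint last_hint;

/* Common submission path: a real implementation would prepare the
 * message here, allocating with the given class, then queue it. */
static int do_spi_async(enum gfp_hint hint)
{
    last_hint = hint;
    return 0;
}

/* Caller guarantees thread context: sleeping allocations are allowed. */
static int spi_async_threaded(void)  { return do_spi_async(HINT_GFP_KERNEL); }

/* Caller may be in interrupt context: atomic allocations only. */
static int spi_async_interrupt(void) { return do_spi_async(HINT_GFP_ATOMIC); }

/* spi_async() keeps the conservative behaviour for older drivers. */
static int spi_async_compat(void)    { return spi_async_interrupt(); }
```

The flag-in-spi_message alternative would carry the same hint as a bit in the message instead of encoding it in the function name.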
But for some HW devices the length might be mixed with other control bits, which would require some "and+shift" logic - impossible to implement in DMA for most DMA engines (the mcp2515 would be such a case). The question here is really: how many such devices exist, and enough of them to make it worth doing this in a generic way?

Finally I have a proposal to avoid the "naming" conflict we have seen in our discussions so far - why not call the new interface:
int spi_(pre)build_message(struct spi_device *, struct spi_message *, unsigned long flags);
int spi_unbuild_message(struct spi_device *, struct spi_message *);
with the corresponding equivalent methods in spi_master? Other naming ideas are welcome (I just would not say DMA, as it might get used for other things as well).

So I think all the above ideas give us a lot to ponder about how/where we may want to move in the future... In the meantime I will try to move the SPI bus driver to the transfer interface, and I will probably add a module parameter to allow selecting which interface is used. This way we have many more scenarios to compare, to see how they fare in different situations/use cases/... But in the end I would guess that going with the "direct" DMA chaining approach will give the best results...

Martin

P.s: maybe we should start renaming the subject of this thread?

^ permalink raw reply [flat|nested] 57+ messages in thread
[parent not found: <5286026B.2090903-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>]
* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver [not found] ` <5286026B.2090903-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> @ 2013-11-16 14:23 ` Mark Brown [not found] ` <20131116142356.GY15393-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> 0 siblings, 1 reply; 57+ messages in thread
From: Mark Brown @ 2013-11-16 14:23 UTC (permalink / raw)
To: Martin Sperl; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA

[-- Attachment #1: Type: text/plain, Size: 2235 bytes --]

On Fri, Nov 15, 2013 at 12:15:55PM +0100, Martin Sperl wrote:

> You essentially gave me the idea that we could reduce restrictions
> on the proposed spi_prepare_message by adding some flags indicating in
> what way the structure may change. So if you have a "fully" fixed
> structure, we could add "flags" here to indicate whether the structure
> is also fixed with regards to:

Adding flags seems like a good idea; probably easiest to start with everything fixed and then add the ability to relax things later.

> This comes with a potential "trap": we would need to define whether a
> "reorder" of SPI messages between two different SPI bus devices with
> different CS is a valid operation or not.
> (the question is mostly: does this "queuing" happen per bus driver
> or per CS?)

I'd expect us to continue to do things per bus, at least for the time being. This is definitely a stage two consideration though - getting something working comes first.

> But the first question is: are there any drivers currently that require
> multiple CS to do their work?

No.

> All of these ideas will also help us in the future and may then get
> used for other bus drivers as well - we could even start using a
> "DMA-sequencing" engine inside the framework to reduce the code in each
> driver (how to define those sequences would be tricky though - we would
> need to see how "flexible" other devices are compared to the bcm2835).
The only thing that you're talking about that sounds unusual is the ability to add timed delays in the hardware; otherwise this all sounds very standard, assuming DMA controllers will play nice with mixing in writes to registers (which I expect to be common).

> We could make this a flag in the spi_message structure or use different
> functions: spi_async_threaded and spi_async_interrupt, with spi_async
> mapping to spi_async_interrupt for compatibility with older drivers.

Separate functions would be a bit more idiomatic.

> why not call the new interface:
> int spi_(pre)build_message(struct spi_device *, struct spi_message *,
>                            unsigned long flags);
> int spi_unbuild_message(struct spi_device *, struct spi_message *);

Seems fine. Or init/done.

[-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 836 bytes --] ^ permalink raw reply [flat|nested] 57+ messages in thread
[parent not found: <20131116142356.GY15393-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>]
* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver [not found] ` <20131116142356.GY15393-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> @ 2013-11-19 13:11 ` Martin Sperl [not found] ` <528B6370.9000903-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> 0 siblings, 1 reply; 57+ messages in thread
From: Martin Sperl @ 2013-11-19 13:11 UTC (permalink / raw)
To: Mark Brown; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA

Hi Mark!

>> why not call the new interface:
>> int spi_(pre)build_message(struct spi_device *, struct spi_message *,
>>                            unsigned long flags);
>> int spi_unbuild_message(struct spi_device *, struct spi_message *);
>
> Seems fine. Or init/done.

So I will call it:
int spi_init_message(struct spi_device *, struct spi_message *, unsigned long flags);
int spi_done_message(struct spi_device *, struct spi_message *);
(or do you want to call it spi_message_init/done instead?)

I will post a patch for this soon.

Martin

^ permalink raw reply [flat|nested] 57+ messages in thread
[parent not found: <528B6370.9000903-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>]
* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver [not found] ` <528B6370.9000903-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> @ 2013-11-19 15:02 ` Mark Brown [not found] ` <20131119150204.GA2674-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> 0 siblings, 1 reply; 57+ messages in thread
From: Mark Brown @ 2013-11-19 15:02 UTC (permalink / raw)
To: Martin Sperl; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA

[-- Attachment #1: Type: text/plain, Size: 429 bytes --]

On Tue, Nov 19, 2013 at 02:11:12PM +0100, Martin Sperl wrote:

> So I will call it:
> int spi_init_message(struct spi_device *, struct spi_message *,
>                      unsigned long flags);
> int spi_done_message(struct spi_device *, struct spi_message *);

> (or do you want to call it spi_message_init/done instead?)

The latter might be better for consistency with spi_message_add_tail() and similar.

> I will post a patch for this soon.

Excellent.

[-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 836 bytes --] ^ permalink raw reply [flat|nested] 57+ messages in thread
[parent not found: <20131119150204.GA2674-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>]
* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver [not found] ` <20131119150204.GA2674-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> @ 2013-11-19 15:13 ` Martin Sperl 0 siblings, 0 replies; 57+ messages in thread
From: Martin Sperl @ 2013-11-19 15:13 UTC (permalink / raw)
To: Mark Brown; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA

Actually I found that there is some ambiguity with spi_message_init - it already exists and is used by every SPI device driver... So I have started to wonder: why not call it spi_message_(un)compile or spi_message_(un)optimize, to make it totally clear that this is an optimization? For now I have used the first.

I will give measurements of the timing differences when using the stock spi-bcm2835 on my use case with reduced traffic (moving down to only 125 kHz CAN bus speed), to show there is no regression.

Martin

On 19.11.2013, at 16:02, Mark Brown wrote:

> On Tue, Nov 19, 2013 at 02:11:12PM +0100, Martin Sperl wrote:
>
>> So I will call it:
>> int spi_init_message(struct spi_device *, struct spi_message *,
>>                     unsigned long flags);
>> int spi_done_message(struct spi_device *, struct spi_message *);
>
>> (or do you want to call it spi_message_init/done instead?)
>
> The latter might be better for consistency with spi_message_add_tail()
> and similar.
>
>> I will post a patch for this soon.
>
> Excellent.

^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver [not found] ` <9640F4C7-7F82-453E-9D83-5875A1559A20-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> 2013-11-15 11:15 ` Martin Sperl @ 2013-11-15 13:33 ` Mark Brown [not found] ` <20131115133312.GE26614-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> 1 sibling, 1 reply; 57+ messages in thread
From: Mark Brown @ 2013-11-15 13:33 UTC (permalink / raw)
To: Martin Sperl; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA

[-- Attachment #1: Type: text/plain, Size: 4102 bytes --]

On Thu, Nov 14, 2013 at 08:47:35PM +0100, Martin Sperl wrote:

> > Meh, it doesn't seem that urgently different to be honest. The basic
> > flow of interactions with the device is very much typical of SPI
> > controllers.

> Well - if it was not for the requirement to call
> spi_finalize_current_message before leaving transfer_one_message to get
> the next one - I might have continued this route with pipelining.
> But hacking my way around those limitations (by taking the next item off
> the list, etc.) seemed a bit absurd, to be honest.

The thing to do there is extend the interface; you're treating things like you're working on a proprietary OS where the APIs are fixed.

> Well - if you think this "approach" is also applicable to other devices,
> then this is one more argument to allow device drivers to prepare their
> messages (with the is_dma_mapped requirement to keep cost low).

No, this isn't something which should be open coded in individual drivers - it's far too much redundant work, especially once you get into doing the tradeoffs for the no-queue vs queue cases. Using regmap MMIO may be helpful, but we need to do that in a way that allows us to integrate it into the DMA pipeline... This is part of what I keep saying about doing things in a way that makes them generally applicable.
> Note that if you do that and want to leverage the "prepare_message",
> then I would recommend you add an additional flag field to these calls.
> Because then you may pass some hints to the engine (like whether it can
> use GFP_KERNEL or GFP_ATOMIC during allocations). A prepare could get
> done with GFP_KERNEL, while a "prepare" during spi_async would need to
> be done via GFP_ATOMIC.

Or in the async case we just punt the prepare to the thread.

> > That's fine for experimentation but for an actual merged API I think it
> > does need to be generally applicable so that it's worth the effort for
> > client drivers to support it.

> First of all it is a bus driver that should be working.
> (also, to avoid the expected "short transfer issue" for some use cases I
> would propose to run it in parallel to the existing version)
> Then I believe we can start discussing any approach to refactor things
> after we get that far and we see the performance improvements and
> drawbacks for each of those steps.

It'll definitely need some experimentation and tuning.

> Well, actually if you do it right - like I did in my version of the
> mcp2515 driver - I am actually scheduling 2 spi_messages back to back,
> and only the first has a callback.
> The first reads the status registers on which to base decisions about
> what to do, and also pulls the interrupt high.
> Meanwhile the second message does a speculative transfer of the
> first inbound message buffer and reads some error counters.
> So ideally, while the callback of the 1st message is running, the 2nd
> message is running in HW.

Yes, this is exactly the sort of thing I'm talking about - it's cutting out dead time on the bus.

> For any driver that is _not_ optimized for latencies and where one
> needs low latency, a rewrite is required.

Right, and the big concern with the way you're currently proposing to implement the optimisation is that it's leaving essentially the whole thing up to the individual driver.
This is going to result in limited adoption and is going to mean there's no sharing of learning and fixes between drivers.

> I believe I should finish my POC to get fully up to my expectations
> and then I can show you the advantage of having this "minimal" change,
> which also has an overall positive impact (even though slight).

To repeat yet again: the problem with the proposed implementation is that it is essentially just a tunnel through to this one heavily optimised master driver, and that's not generic enough. If other devices can make use of it, it's more interesting - but then those uses ought to be factored out.

[-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 836 bytes --] ^ permalink raw reply [flat|nested] 57+ messages in thread
[parent not found: <20131115133312.GE26614-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>]
* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver [not found] ` <20131115133312.GE26614-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> @ 2013-11-15 14:52 ` Martin Sperl [not found] ` <0BA2243C-2F22-492A-B517-76E243535549-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> 0 siblings, 1 reply; 57+ messages in thread
From: Martin Sperl @ 2013-11-15 14:52 UTC (permalink / raw)
To: Mark Brown; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA

On 15.11.2013, at 14:33, Mark Brown wrote:

> No, this isn't something which should be open coded in individual
> drivers - it's far too much redundant work, especially once you get into
> doing the tradeoffs for the no-queue vs queue cases. Using regmap MMIO
> may be helpful, but we need to do that in a way that allows us to
> integrate it into the DMA pipeline...
>
> This is part of what I keep saying about doing things in a way that
> makes them generally applicable.

Well - we first need to see what the real "common denominator" is among boards that can do DMA to configure SPI itself. We would need at least 2 or 3 of them to figure that out (and also to see all the bugs we may expect and find workarounds for those). But in principle it should be possible to factor processing out like this:

* main transfer loop
* while (!transfer_last) {
*   prepareSPIRegisters()
*   prepareAdditionalTXDMA
*   do {
*     append RX/TX transfers
*   } while (do_inner)
*   append wait_for_clock
*   append delay_us
*   if cs_change
*     append CS_UP
*     append wait_for_clock
* }
* if message->callback
*   append DMAInterrupt_and_position

I believe that might be a generic enough approach that it should work, if one factored out the "appends" plus the do_inner loop exit condition into the master structure. But my guess is that you would not even allow something as generic as this to get merged unless there were multiple drivers using that interface. So we would be in a chicken/egg situation from this direction as well...
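The "factor out the appends" idea in the loop above can be made concrete with a small sketch: a generic walker owns the loop structure, and everything hardware-specific lives behind a (hypothetical) ops table supplied by the master driver. The struct names and callbacks here are illustrative, not any existing kernel interface; the counting backend stands in for a real driver so the walker can be exercised without hardware:

```c
#include <stddef.h>

/* Minimal stand-in for a chain of spi_transfers. */
struct xfer { size_t len; int cs_change; struct xfer *next; };

/* Hypothetical per-master "append" operations - the hardware-specific
 * pieces from the pseudocode loop above. */
struct chain_ops {
    void (*prepare_spi_regs)(void *ctx);
    void (*append_rxtx)(void *ctx, struct xfer *x);
    void (*append_wait_clock)(void *ctx);
    void (*append_cs_up)(void *ctx);
    void (*append_irq)(void *ctx);
};

/* Generic walker: owns the loop, delegates every DMA control block to
 * the master driver's callbacks. */
static void build_chain(const struct chain_ops *ops, void *ctx,
                        struct xfer *first, int want_irq)
{
    for (struct xfer *x = first; x; x = x->next) {
        ops->prepare_spi_regs(ctx);
        ops->append_rxtx(ctx, x);
        ops->append_wait_clock(ctx);
        if (x->cs_change) {
            ops->append_cs_up(ctx);
            ops->append_wait_clock(ctx);
        }
    }
    if (want_irq)
        ops->append_irq(ctx);
}

/* A counting backend standing in for a real master driver. */
struct counters { int regs, rxtx, waits, cs_up, irqs; };
static void c_regs(void *c) { ((struct counters *)c)->regs++; }
static void c_rxtx(void *c, struct xfer *x) { (void)x; ((struct counters *)c)->rxtx++; }
static void c_wait(void *c) { ((struct counters *)c)->waits++; }
static void c_cs(void *c)   { ((struct counters *)c)->cs_up++; }
static void c_irq(void *c)  { ((struct counters *)c)->irqs++; }

static const struct chain_ops counting_ops = {
    c_regs, c_rxtx, c_wait, c_cs, c_irq
};
```

With such a split, the generic walker could live in the framework while each bus driver only implements the ops table for its DMA engine.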
> >> Note that if you do that and want to leverage the "prepare_message",
> >> then I would recommend you add an additional flag field to these calls.
> >> Because then you may pass some hints to the engine (like whether it
> >> can use GFP_KERNEL or GFP_ATOMIC during allocations). A prepare
> >> could get done with GFP_KERNEL, while a "prepare" during spi_async
> >> would need to be done via GFP_ATOMIC.
> >
> > Or in the async case we just punt the prepare to the thread.

Again: that means an unnecessary delay - especially when you consider that running the message pump without real-time priority introduces 10us of additional delay...

> Yes, this is exactly the sort of thing I'm talking about - it's cutting
> out dead time on the bus.

> Right, and the big concern with the way you're currently proposing to
> implement the optimisation is that it's leaving essentially the whole
> thing up to the individual driver. This is going to result in limited
> adoption and is going to mean there's no sharing of learning and fixes
> between drivers.

Well, any driver that is using spi_sync cannot get improved without a rewrite anyway - the context switching becomes prohibitive. What some drivers do is create a separate thread, woken from their interrupt handler, which then runs the interrupt handling and uses spi_sync communication from that thread. That means at least two context switches: one from the SPI interrupt (finished) to wake the message pump, which then calls complete and wakes up the real thread. Whatever you do on the framework side, you cannot improve the performance of such drivers without touching the driver itself and converting it to something else.

It is a whole other story if the driver is already doing async scheduling of its messages. Then the spi_async path streamlines the "next" message to some extent and does not allow the message pump to sleep, which gives it better response times.
But you still waste time between when the interrupt handler runs and when it wakes up the message pump - especially as the callback is allowed to run from interrupt context.

As for learning and fixing: any driver will go unfixed until someone finds a bug or a performance issue and spends time improving it. What about writing some "documentation" to show how a driver can get improved for higher throughput? And then you can also make the framework do more work for the driver and thus avoid replication of code. But then you would again need to create a new API for just that, covering all those cases - maybe by using a code generator like the Samba team did for the SMB messages. That is a huge separate project with lots of R&D required, way outside the scope of writing a DMA-pipelined bus driver and proposing things that may improve the performance of optimized drivers.

> To repeat yet again, something that is just a tunnel through to this one
> heavily optimised master driver isn't good. The problem with the
> proposed implementation is that it is essentially just tunnelling
> through to a heavily optimised master, it's not generic enough. If
> other devices can make use of it it's more interesting but then those
> uses ought to be factored out.

Well - as I understand it, this means: if there were more bus drivers that could drive the whole SPI transfer of a message via DMA, then it would become acceptable to get that API in place; until then, with only a single bus driver and a single device driver, it is not acceptable. So please define what your "critical mass" would be, at which you would accept that an API extension is worth the effort of getting integrated?
(Besides a metric that shows how much it improves the situation, which would also need to get defined - we do not want to have performance regressions either.)

Martin

^ permalink raw reply [flat|nested] 57+ messages in thread
[parent not found: <0BA2243C-2F22-492A-B517-76E243535549-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>]
* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver [not found] ` <0BA2243C-2F22-492A-B517-76E243535549-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> @ 2013-11-16 12:56 ` Mark Brown 0 siblings, 0 replies; 57+ messages in thread
From: Mark Brown @ 2013-11-16 12:56 UTC (permalink / raw)
To: Martin Sperl; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA

[-- Attachment #1: Type: text/plain, Size: 3081 bytes --]

On Fri, Nov 15, 2013 at 03:52:42PM +0100, Martin Sperl wrote:

> Well - we first need to see what the real "common denominator" is among
> boards that can do DMA to configure SPI itself. So we would need at
> least 2 or 3 of them to figure that out (and also to see all the bugs we
> may expect and find workarounds for those)

There's a bunch of fairly clear patterns for drivers here - the main constraints are going to be having to punt to task context for sleeping operations, and DMA controllers that just can't have transfers to or from the register space mixed in with the command stream (all of which fits into the general patterns we've got already).

> But my guess is that you would not even allow something as generic as
> this to get merged unless there were multiple drivers using that
> interface. So we would be in a chicken/egg situation from this
> direction as well...

We're already going in this direction; that's why I'm going through splitting up the operations that actually talk to the hardware from the bits that implement generic logic and queue driving. The major blocker previously was that I didn't know what you were proposing to do in your driver.

> > Or in the async case we just punt the prepare to the thread.

> again: that means an unnecessary delay
> especially when you think about running the message pump without
> real-time priority introducing 10us of additional delay...

For the cases you're concerned about this shouldn't be an issue; the thread should be behind the I/O, or the client driver will have prepared.
I'd also expect to be able to fall back to PIO (which I suspect we'll want to keep as an option for idle cases anyway).

> > Right, and the big concern with the way you're currently proposing to
> > implement the optimisation is that it's leaving essentially the whole
> > thing up to the individual driver. This is going to result in limited
> > adoption and is going to mean there's no sharing of learning and fixes
> > between drivers.

> Well, any driver that is using spi_sync cannot get improved without a
> rewrite anyway - the context switching becomes prohibitive.

No, I'm talking about requiring things to be done on the master side, not the client side. That's key for making it worthwhile for client drivers to use things like preparing messages, and it makes it much easier to do the adaptations required in the master drivers.

> Well - as I understand it, this means: if there were more bus drivers
> that could drive the whole SPI transfer of a message via DMA, then it
> would become acceptable to get that API in place; until then, with only
> a single bus driver and a single device driver, it is not acceptable.

No, like I said in an earlier e-mail: now that I understand how it can be generally useful rather than just being a driver-specific thing, I'm OK with adding the user-visible prepare API (but not the calls into the master driver). It's the master driver interface that still needs work, so the logic isn't open coded in individual drivers.

[-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 836 bytes --] ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver [not found] ` <20131109183056.GU2493-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> 2013-11-10 10:54 ` Martin Sperl @ 2013-11-10 11:05 ` Mark Brown [not found] ` <20131110110524.GA878-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> 1 sibling, 1 reply; 57+ messages in thread From: Mark Brown @ 2013-11-10 11:05 UTC (permalink / raw) To: Martin Sperl; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA [-- Attachment #1: Type: text/plain, Size: 885 bytes --] On Sat, Nov 09, 2013 at 06:30:56PM +0000, Mark Brown wrote: > You've been doing a bunch of other performance improvements as well, > it's not clear to me how much of this is coming from the API change and > how much of this is due to changes which can also be done by improving > the implementation and without requiring drivers to be specifically > updated to take advantage of it. An example of the sort of thing I'm concerned about: if I understand correctly one of the things the prepare and unprepare operations in the driver do is the DMA mapping. This will mean that if you prepare in the caller the system will be able to do the DMA mapping while a prior transfer is in progress which should help performance. That's a good thing to do but it should be something that can be done as part of the existing API without needing drivers using SPI to be updated. [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 836 bytes --] ^ permalink raw reply [flat|nested] 57+ messages in thread
[parent not found: <20131110110524.GA878-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>]
* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver [not found] ` <20131110110524.GA878-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> @ 2013-11-10 16:41 ` Martin Sperl [not found] ` <3361A01A-C7D0-4689-AFBD-085D3E62A67C-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> 0 siblings, 1 reply; 57+ messages in thread
From: Martin Sperl @ 2013-11-10 16:41 UTC (permalink / raw)
To: Mark Brown; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA

On 10.11.2013, at 12:05, Mark Brown wrote:

> An example of the sort of thing I'm concerned about: if I understand
> correctly one of the things the prepare and unprepare operations in the
> driver do is the DMA mapping. This will mean that if you prepare in the
> caller the system will be able to do the DMA mapping while a prior
> transfer is in progress which should help performance. That's a good
> thing to do but it should be something that can be done as part of the
> existing API without needing drivers using SPI to be updated.

Yes - but that comes back to my argument that while this is feasible to do transparently within the framework, the cost of doing so would eat away most of the possible performance gains unnecessarily, and make the situation worse for situations where there IS a change.

As explained, you would need to make an educated guess as to whether a spi_transfer chain has changed or not (and I mean checking that the chain itself, the speed, the transfer lengths, rx_buf, tx_buf, ... have not changed). Doing this sort of checking will still take a few usecs per message and call. One could do some hashing (with the risk of collisions), or keep a second copy of the data and do a byte-wise compare of the spi_message and the spi_transfer chain, or some other scheme. But still, this requires processing that you need to run on each spi_async call, independent of whether the message has been prepared via the framework previously or not.
And the risk of a false positive is non-zero (in the case of hashing), and then you still spend CPU cycles spinning on it (not to mention the debugging nightmare if such a situation occurs on a production system). So you waste CPU cycles unnecessarily for something you cannot be sure about. Assuming you need 1us per message for this kind of "has the structure changed?" test, and you run 3000 such SPI messages/s, then you are wasting 3ms of CPU time every second.

And (again, to repeat myself) on top of that you have a memory management issue, because you would then need to garbage-collect your prepared resources for all those cases where the message just got allocated on the heap and discarded afterwards - there is no means in the current API to release that memory when releasing the memory of the spi_message. So to make things efficient here you would need to extend the API - but this time it would be a mandatory change for all drivers (and I doubt you will find EVERY instance, resulting in bugs to fix down the road).

Also with this approach the "message" would possibly need to get prepared in IRQ context (because spi_async can get called from an interrupt handler), which would require allocating memory with GFP_ATOMIC just to play it safe in all situations. Not so good either.

I mostly see this "spi_prepare_message" as a method for the driver to give some optimization hints to the engine to minimize "parsing" of the structures. And I would also prefer - if possible - for the API to be able to allocate memory from "blocking" memory pools, so spi_prepare_message would be allowed to run only in threaded context and may not get called from interrupt context. Anyway, those "typical" calls can easily get prepared during the module initialization phase, avoiding memory allocation in interrupt handlers and callbacks...

Ok - I have tried to explain my concerns again.
So if you still think you can do the things requested via the framework, then here is a "quick" example where "prepared messages" would reduce CPU overhead immediately, without any bus-driver-specific implementation:

static int __spi_async(struct spi_device *spi, struct spi_message *message)
{
	struct spi_master *master = spi->master;
	struct spi_transfer *xfer;
+	int ret;

+	/* run the computations below only if our message has changed */
+	if (__spi_message_has_changed(spi, message)) {

	/* Half-duplex links include original MicroWire, and ones with
	 * only one data pin like SPI_3WIRE (switches direction) or where
	 * either MOSI or MISO is missing. They can also be caused by
	 * software limitations.
	 */
	if ((master->flags & SPI_MASTER_HALF_DUPLEX)
			|| (spi->mode & SPI_3WIRE)) {
		unsigned flags = master->flags;

		list_for_each_entry(xfer, &message->transfers, transfer_list) {
			if (xfer->rx_buf && xfer->tx_buf)
				return -EINVAL;
			if ((flags & SPI_MASTER_NO_TX) && xfer->tx_buf)
				return -EINVAL;
			if ((flags & SPI_MASTER_NO_RX) && xfer->rx_buf)
				return -EINVAL;
		}
	}

	/**
	 * Set transfer bits_per_word and max speed as spi device default if
	 * it is not set for this transfer.
	 */
	list_for_each_entry(xfer, &message->transfers, transfer_list) {
		if (!xfer->bits_per_word)
			xfer->bits_per_word = spi->bits_per_word;
		if (!xfer->speed_hz)
			xfer->speed_hz = spi->max_speed_hz;
		if (master->bits_per_word_mask) {
			/* Only 32 bits fit in the mask */
			if (xfer->bits_per_word > 32)
				return -EINVAL;
			if (!(master->bits_per_word_mask &
					BIT(xfer->bits_per_word - 1)))
				return -EINVAL;
		}
	}

+	/* call spi_prepare_message */
+	if (master->prepare_message) {
+		ret = master->prepare_message(master, message);
+		if (ret)
+			return ret;
+	}
+	}

	message->spi = spi;
	message->status = -EINPROGRESS;
	return master->transfer(spi, message);
}

So how would you implement the __spi_message_has_changed function in a manner that is performance-efficient and worth calling in a "hot path"?
And how would you handle the garbage collection required for the "transparent case" you are trying to get?

Please stop wishful thinking; instead, provide alternative ideas for how to solve the issue. Please provide a solution that works as simply as:

/* in the driver init code - if you want to "improve" the latency of spi_async */
spi_prepare_message(spi, message);

/* in the driver release code - if you want to "improve" the latency of spi_async */
spi_unprepare_message(spi, message);

/* and the "check" itself */
static int __spi_message_has_changed(spi, message)
{
	return (!message->is_driver_prepared);
}

So what about the performance measurements I have mentioned? Do you have any concerns with the described approach and the metrics proposed? Do you want me to measure and publish those results?

Martin
--
To unsubscribe from this list: send the line "unsubscribe linux-spi" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply [flat|nested] 57+ messages in thread
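The trade-off Martin is arguing here - hashing or byte-comparing the transfer chain on every spi_async() submission versus a simple driver-set flag - can be modelled in plain C. The structures below are hypothetical flattened stand-ins, not the kernel's spi_message/spi_transfer, and the FNV-1a-style hash is just one example of the per-call cost he wants to avoid.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flattened transfer descriptor. */
struct xfer {
	const void *tx_buf, *rx_buf;
	size_t      len;
	uint32_t    speed_hz;
	uint8_t     bits_per_word;
};

struct message {
	struct xfer xfers[4];
	int         nxfers;
	int         is_driver_prepared; /* the proposed flag */
};

/* One FNV-1a-style mixing step (xor then multiply by an odd prime). */
static uint64_t mix(uint64_t h, uint64_t v)
{
	h ^= v;
	h *= 1099511628211ULL;
	return h;
}

/* Option 1: hash every descriptor field on EVERY submission.
 * Costs time per call and, in general, hashes can collide -
 * the "false positive" risk raised in the thread. */
static uint64_t hash_message(const struct message *m)
{
	uint64_t h = 1469598103934665603ULL;

	for (int i = 0; i < m->nxfers; i++) {
		const struct xfer *x = &m->xfers[i];

		h = mix(h, (uint64_t)(uintptr_t)x->tx_buf);
		h = mix(h, (uint64_t)(uintptr_t)x->rx_buf);
		h = mix(h, (uint64_t)x->len);
		h = mix(h, x->speed_hz);
		h = mix(h, x->bits_per_word);
	}
	return h;
}

/* Option 2: trust the driver. O(1) and collision-free, but it
 * needs an explicit spi_prepare_message()-style API contract. */
static int message_has_changed(const struct message *m)
{
	return !m->is_driver_prepared;
}
```

The flag check is a single load and test; the hash walks every transfer on every call, which is the few-microseconds-per-message overhead the thread is trying to eliminate.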
[parent not found: <3361A01A-C7D0-4689-AFBD-085D3E62A67C-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>]
* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver [not found] ` <3361A01A-C7D0-4689-AFBD-085D3E62A67C-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> @ 2013-11-11 11:18 ` Mark Brown [not found] ` <20131111111842.GE2674-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> 0 siblings, 1 reply; 57+ messages in thread From: Mark Brown @ 2013-11-11 11:18 UTC (permalink / raw) To: Martin Sperl; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA [-- Attachment #1: Type: text/plain, Size: 511 bytes --] On Sun, Nov 10, 2013 at 05:41:30PM +0100, Martin Sperl wrote: > As explained - you would need to make an educated guess if a spi_transfer > chain has changed or not (and I talk about the need to check the chain > itself, speed, transferlength, rx_buff,tx_buff,... have not changed). No, you're going off and talking about something else here. You are very focused on the idea of skipping some of the setup but that's not the only thing being discussed. Please try to separate the various ideas. [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 836 bytes --] ^ permalink raw reply [flat|nested] 57+ messages in thread
[parent not found: <20131111111842.GE2674-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>]
* Re: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver [not found] ` <20131111111842.GE2674-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> @ 2013-11-11 11:44 ` Martin Sperl 0 siblings, 0 replies; 57+ messages in thread From: Martin Sperl @ 2013-11-11 11:44 UTC (permalink / raw) To: Mark Brown; +Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA

Hi Mark!

On 11.11.2013 12:18, Mark Brown wrote:
> No, you're going off and talking about something else here. You are
> very focused on the idea of skipping some of the setup but that's not
> the only thing being discussed. Please try to separate the various
> ideas.

Well - from my perspective (which is why I started this thread) it is _all_ about skipping setup work (besides the question of why spi_master.transfer is deprecated, which remains unanswered).

If you want to discuss something else in the same context, then please start by stating explicitly what you want to get addressed here as well. Because I have also seen a disconnect in this discussion before (including running in circles), and I had tried to summarize my understanding of this disconnect earlier, but got no response to that portion.

So to summarize my understanding of the disconnect again:
* your concern is "refactoring" the message pump to reduce bus-driver code
* my concern is the speed and latencies of SPI messages

And as far as I can see they primarily overlap in that both involve a "prepare" step, but for different purposes - one for the bus-driver interface and the other for the device-driver interface. So again, please state what you want to achieve in this context, how you see the two subjects as interconnected, and why/how they need to get handled together.
Thanks, Martin

^ permalink raw reply [flat|nested] 57+ messages in thread
[parent not found: <F70C4469-6325-48D5-A7CA-52CC1FC404BD-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org>]
* Fwd: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver [not found] ` <F70C4469-6325-48D5-A7CA-52CC1FC404BD-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> @ 2013-10-29 19:09 ` Linus Walleij 0 siblings, 0 replies; 57+ messages in thread From: Linus Walleij @ 2013-10-29 19:09 UTC (permalink / raw) To: linux-spi-u79uwXL29TY76Z2rM5mHXA, Mark Brown; +Cc: Martin Sperl

Here is Martin's original message.

Yours, Linus Walleij

---------- Forwarded message ---------- From: Martin Sperl <martin-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> Date: Mon, Oct 28, 2013 at 11:42 AM Subject: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver To: spi-devel-general-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org Cc: grant.likely-s3s/WqlpOiPyB63q8FvJNQ@public.gmane.org, broonie-yzvPICuk2AATkU/dhu1WVueM+bqZidxxQQ4Iyu8u01E@public.gmane.org, linus.walleij-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org

Hi!

I am currently writing an SPI driver for the Raspberry Pi that relies only on DMA for the whole transfer. It can already handle a FULL spi_message with multiple transfers, CS_CHANGE, and speed_hz changes using only DMA, so it is using the DMA to the max...

Right now (for ease of "development") the driver is still using the transfer_one_message interface. But unfortunately this interface introduces latencies/gaps between individual messages because of the chain: interrupt -> wakeup_spinlock -> schedule message_pump_thread. So as the DMA driver is already handling a single SPI transfer very efficiently, there is also the option to chain multiple SPI transfers in the DMA engine and only take interrupts on completion. This is why I thought of moving away from spi_transfer_one_message and back to the simpler transfer interface, where the preprocessing would get done (DMA control block-chain generation) and then appending it to the existing (possibly running) DMA chain.
But just yesterday I was looking thru the code and came to the message: "master is unqueued, this is depreciated" (drivers/spi/spi.c Line 1167). This came in with commit ffbbdd21329f3e15eeca6df2d4bc11c04d9d91c0 and got included in 3.4.

So I am wondering why you would depreciate this interface - I agree that for most drivers moving to the transfer_one_message interface is the way forward and reduces code size and bugs. And this deprecation message will hopefully motivate driver authors to rewrite their code to make use of the new interface. But there may be some drivers that would really benefit from using the more complex interface without it getting deprecated and at some point possibly removed. And I believe that the pipelined DMA driver is a good candidate to show that under some circumstances the transfer interface may still be the better solution...

Obviously I could work around the inner workings of the message pump by massaging the data before some of the calls, which I would say is even more deprecated!

Now this brings me to a different question: Could we implement some additional "functions" for preparing an SPI message, to avoid the overhead of repeatedly recomputing the same spi_message, by keeping a "prepared" transaction which we then just need to schedule? Changing the data itself would not be a problem, but changing the data block addresses would be a change that requires recalculating the prepared data.

The interface could looks something like this:

int spi_prepare_message(struct_spi_dev*, struct spi_message *);
int spi_unprepare_message(struct_spi_dev*, struct spi_message *);

together with an additional object (void *prepared) in the spi_message structure for keeping such prepared data...
OK - abusing the existing queue and state fields could work, but - if I read the comments correctly - this data is only guaranteed to be available to the driver between spi_async and the corresponding callback (outside of that window it is used by the message pump and gets cleared in the spi_finalize_current_message function).

Thanks, Martin

^ permalink raw reply [flat|nested] 57+ messages in thread
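Martin's proposed pairing - an spi_prepare_message()/spi_unprepare_message() API plus a driver-owned "void *prepared" cookie carried in the message - might look like the following stripped-down user-space model. Every name and structure here is a hypothetical stand-in; the real proposal would hang the cookie off struct spi_message and let the bus driver build its DMA control-block chain once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for spi_message with a "prepared" cookie. */
struct message {
	int   nblocks;  /* how many "control blocks" it would need */
	void *prepared; /* driver-owned preprocessed state */
};

/* Modelled prepare step: build the control-block chain exactly once. */
static int prepare_message(struct message *m)
{
	if (m->prepared)
		return 0; /* already prepared - nothing to do */
	m->prepared = calloc((size_t)m->nblocks, 64); /* fake CB chain */
	return m->prepared ? 0 : -1;
}

/* Modelled release step: the caller, not the core, frees the state,
 * which avoids the garbage-collection problem raised in the thread. */
static void unprepare_message(struct message *m)
{
	free(m->prepared);
	m->prepared = NULL;
}

/* Modelled submission: only unprepared messages pay the
 * per-call preprocessing cost (counted here for illustration). */
static int submit_message(struct message *m, int *preprocess_runs)
{
	if (!m->prepared)
		(*preprocess_runs)++; /* slow path: rebuild every time */
	return 0;
}
```

A device driver would call prepare_message() once at init, submit the same message many times from its hot path, and unprepare_message() at release - exactly the usage pattern sketched earlier in the thread.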
end of thread, other threads:[~2013-11-19 15:13 UTC | newest] Thread overview: 57+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- [not found] <F70C4469-6325-48D5-A7CA-52CC1FC404BD@sperl.org> [not found] ` <CACRpkdb6y=o4__snBs2DR1f=xW_u7KdkHg3fb7XN5e2gicBJeg@mail.gmail.com> [not found] ` <CACRpkdb6y=o4__snBs2DR1f=xW_u7KdkHg3fb7XN5e2gicBJeg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2013-10-29 16:59 ` Fwd: Depreciated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver Linus Walleij [not found] ` <CACRpkdb7y88oq7XyVFc_0Nx4pXtaebPe7KB2yizBRJGwWLqJig-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2013-10-29 17:56 ` Mark Brown 2013-10-29 21:18 ` Martin Sperl [not found] ` <06C7F4D3-EC91-46CF-90BE-FC24D54F2389-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> 2013-10-29 22:53 ` Mark Brown [not found] ` <20131029225353.GB11424-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> 2013-10-30 17:11 ` Martin Sperl [not found] ` <F64AD25A-C7EC-4A0D-9169-850C12F4D8A3-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> 2013-10-30 21:51 ` Mark Brown 2013-10-30 8:40 ` Martin Sperl [not found] ` <02BFF0F6-3836-4DEC-AA53-FF100E037DE9-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> 2013-10-30 17:19 ` Mark Brown [not found] ` <20131030171902.GL2493-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> 2013-10-30 18:33 ` Martin Sperl [not found] ` <8D8B0BAD-0E00-4147-B4C8-FBB18F060C96-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> 2013-10-30 21:57 ` Mark Brown [not found] ` <20131030215716.GV2493-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> 2013-10-30 22:52 ` Martin Sperl [not found] ` <3342FD19-4438-463B-89B2-A83D3475AC22-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> 2013-10-31 0:10 ` Mark Brown [not found] ` <20131031001004.GW2493-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> 2013-11-04 17:33 ` Martin Sperl [not found] ` <18639D9A-630E-44F3-AA7A-ADFF5D5E8B56-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> 2013-11-04 18:45 ` Mark Brown [not found] ` 
<20131104184511.GR2493-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> 2013-11-04 21:43 ` Martin Sperl [not found] ` <5A55A832-5313-499C-A483-BF5A6649D69D-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> 2013-11-05 1:03 ` Linus Walleij 2013-11-06 9:48 ` Mark Brown [not found] ` <20131106094854.GF11602-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> 2013-11-06 11:28 ` Martin Sperl [not found] ` <844EDAEA-3FDC-48D0-B59E-CECC0A83761E-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> 2013-11-06 11:32 ` Mark Brown [not found] ` <20131106113219.GJ11602-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> 2013-11-06 12:10 ` Martin Sperl [not found] ` <C6C68042-63A0-40FD-8363-B4553ECB4774-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> 2013-11-06 16:24 ` Mark Brown [not found] ` <20131106162410.GB2674-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> 2013-11-06 19:54 ` Martin Sperl [not found] ` <3B0EDE3F-3386-4879-8D89-2E4577860073-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> 2013-11-06 23:26 ` Mark Brown [not found] ` <20131106232605.GC2674-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> 2013-11-07 0:43 ` Martin Sperl [not found] ` <72D635F5-4229-4D78-8AA3-1392D5D80127-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> 2013-11-07 20:31 ` Mark Brown [not found] ` <20131107203127.GB2493-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> 2013-11-08 14:16 ` Martin Sperl [not found] ` <86AE15B6-05AF-4EFF-8B8F-10806A7C148B-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> 2013-11-08 16:19 ` Mark Brown [not found] ` <20131108161957.GP2493-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> 2013-11-08 17:31 ` Martin Sperl [not found] ` <5F70E708-89B9-4DCF-A31A-E688BAA0E062-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> 2013-11-08 18:09 ` Mark Brown [not found] ` <20131108180934.GQ2493-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> 2013-11-08 19:18 ` Martin Sperl [not found] ` <C375DEE6-1AEC-4AFB-A9D6-583DCB4476A3-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> 2013-11-09 18:30 ` Mark Brown [not found] ` <20131109183056.GU2493-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> 2013-11-10 10:54 ` Martin Sperl [not 
found] ` <6C7903B3-8563-490E-AD7D-BA5D65FFB9BC-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> 2013-11-12 1:19 ` Mark Brown [not found] ` <20131112011954.GH2674-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> 2013-11-12 14:42 ` Martin Sperl [not found] ` <52823E73.503-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> 2013-11-12 17:59 ` Martin Sperl [not found] ` <2252E63E-176C-43F7-B259-D1C3A142DAFE-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> 2013-11-13 15:43 ` Mark Brown [not found] ` <20131113154346.GT878-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> 2013-11-13 18:35 ` Martin Sperl [not found] ` <ED58E869-A9F6-4BB2-8EC6-D71F946509DC-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> 2013-11-13 19:33 ` Mark Brown [not found] ` <20131113193320.GE878-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> 2013-11-13 21:31 ` Martin Sperl 2013-11-13 15:11 ` Mark Brown [not found] ` <20131113151102.GS878-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> 2013-11-13 15:48 ` Martin Sperl [not found] ` <77070979-0CE4-4C76-B12E-DA94B2577172-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> 2013-11-13 16:59 ` Mark Brown 2013-11-14 1:50 ` Mark Brown [not found] ` <20131114015009.GB26614-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> 2013-11-14 19:47 ` Martin Sperl [not found] ` <9640F4C7-7F82-453E-9D83-5875A1559A20-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> 2013-11-15 11:15 ` Martin Sperl [not found] ` <5286026B.2090903-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> 2013-11-16 14:23 ` Mark Brown [not found] ` <20131116142356.GY15393-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> 2013-11-19 13:11 ` Martin Sperl [not found] ` <528B6370.9000903-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> 2013-11-19 15:02 ` Mark Brown [not found] ` <20131119150204.GA2674-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> 2013-11-19 15:13 ` Martin Sperl 2013-11-15 13:33 ` Mark Brown [not found] ` <20131115133312.GE26614-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> 2013-11-15 14:52 ` Martin Sperl [not found] ` <0BA2243C-2F22-492A-B517-76E243535549-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> 2013-11-16 12:56 ` 
Mark Brown 2013-11-10 11:05 ` Mark Brown [not found] ` <20131110110524.GA878-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> 2013-11-10 16:41 ` Martin Sperl [not found] ` <3361A01A-C7D0-4689-AFBD-085D3E62A67C-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> 2013-11-11 11:18 ` Mark Brown [not found] ` <20131111111842.GE2674-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> 2013-11-11 11:44 ` Martin Sperl [not found] ` <F70C4469-6325-48D5-A7CA-52CC1FC404BD-d5rIkyn9cnPYtjvyW6yDsg@public.gmane.org> 2013-10-29 19:09 ` Fwd: " Linus Walleij