* ARM: bcm2835: DMA driver + spi_optimize_message - some questions/info @ 2014-04-02 16:06 Martin Sperl [not found] ` <43389276-E591-4E09-AB84-491C2CB2D9A7-TqfNSX0MhmxHKSADF0wUEw@public.gmane.org> 0 siblings, 1 reply; 8+ messages in thread From: Martin Sperl @ 2014-04-02 16:06 UTC (permalink / raw) To: linux-spi-u79uwXL29TY76Z2rM5mHXA, linux-rpi-kernel Cc: Mark Brown, Stephen Warren Hi! I have finished a first fully working revision of the DMA-only driven SPI bus driver for the BCM2835 which is now quite efficient with regards to CPU-resources. But to make the most out of it it will require that the real device drivers are also optimized for thru-put to avoid unnecessarily rerun some computations/transformations. These "optimizations" are mostly: * use spi_messages that are created on the heap and do not need to get setup/torn down after each invocation (typically on the stack) * use the (previously discussed) optimize interface to "fix" spi_messages in such a way that minimal of processing needs to get done prior to submission to DMA (translation to dma, verify if all parameters are valid) - patch against 3.13 below * thinking of how to schedule transfers in such a way that while a complete callback is being processed more transfers are already scheduled and can run making the best use of the time available. And they apply mostly to drivers that potentially generate a high number of SPI messages in a short period of time. With this we can "reduce" a sequence of typically 7-10 commands to the device down to 3 spi_messages of which the first 2 are scheduled one after the other (note that this requires locking, so that no race condition can occur on callback, which I found out the hard way may happen even in a GPIO irq handler getting interrupted by a DMA irq handling the callback) and then optimize them once avoiding the need to verify/transform the messages 1000 times per second. This means that there are (typically) no visible gaps between the the 3 spi_messages and that the SPI clock is running with minimal "gaps". For an image of this optimized case as seen by a logic analyser, please see here: http://www.raspberrypi.org/forums/download/file.php?id=6335. For comparisson here also an image showing the PIO in kernel spi_bcm2835 driver: http://www.raspberrypi.org/forums/download/file.php?id=6332 For example a sequence with 10 transfers with a total of 53 bytes to/from the device at 8MHz bus speed takes 84us, which is fairly close to the 53us that is be the ideal transfer time for 53 bytes (without taking CS (de)asserting into account) The way the drivers have been designed we are also minimizing the number of interrupts - for such a "complex" transfer we have a total of only 4 interrupts per CAN-message (2 related to GPIO (generic+pin-specific) and 2 to trigger the complete callbacks for 2 of the 3 spi_messages) As the original spi-bcm2835 PIO-driver is known to trigger a lot of interrupts and consumes a lot of CPU (which is unfortunately not accounted for in vmstat) I also have done some tests to see how much time is really left for other work by user processes. So I have taken the following scenario: * CAN bus at 250khz * running at 100% utilization with extended frames and 8 bytes data (note that would mean an equivalent of 25% utilization on 1MHz CAN bus) * resulting in 1532 CAN messages/second which means one message every 0.616ms * running a compile of the spi-bcm2835 driver several times measuring with "time" focusing on real/elapsed time * also observing via vmstat 10 This have been tested against the following scenarios: * no SPI drivers loaded * MCP2515 optimize disabled + spi-bcm2835 currently in kernel * MCP2515 optimize enabled + spi-bcm2835 currently in kernel * MCP2515 optimize disabled + spi-bcm2835dma * MCP2515 optimize enabled + spi-bcm2835dma The difference between 2 and 3 shows the "gain" of avoiding running the _spi_async validations 4800 times per second. while the difference between 3 and 4 shows the advantage of the use of DMA scheduling - even with the overhead for the "creation" of the DMA sequences on every call and the difference between 4 and 5 shows how much can be the net-gain when optimizing/preparing those SPI messages once. And you can also compare all of them against an idle system (option 1) So here the results: driver optimize real_compile interrupts/s irq2EOT none NA 50s 300 N/A spi-bcm2835(=PIO) off 120s 45000 293us spi-bcm2835(=PIO) on 115s 45000 290us spi-bcm2835dma off 76s 6700 172us spi-bcm2835dma on 60s 6700 82us (irq2EOT = time between IRQ line goes down and the CS goes high after the last transfer) So you can see that the optimize makes a slight difference for PIO. You can also see that the costly DMA setup (which adds latencies to the initial scheduling of the first byte) is still a lot more CPU friendly than all those interrupts when using PIO. And finally the optimize with DMA gives us closer to "idle" time measurements - the difference is in parts the time spent in the IRQ handlers (GPIO and DMA), but also regarding the fact that the DMAs are producing load on the memory bus reading those control-blocks and have an impact on L2 Cache evictions. Obviously this "optimization" is most important for drivers that are running spi_messages at high rates, as there the gain becomes most significant. But it also applies to spi_messages where we transfer say 4096 bytes. On the BCM2835 an SPI-interrupt gets triggered in PIO mode every 12 bytes so that the interrupt handler can read and fill the FIFO before it there is an overrun - so for those 4096 bytes we would need to handle 341 interrupts. Assuming 8MHz BUS rate we would ideally transfer this in 0.04 seconds. So transferring 1 MByte of data would take exactly one second and trigger 83252 interrupts - that is lots of interrupts! By comparison the DMA driver would (in the unoptimized case) maybe take 30us at the start of the transfer to create the DMAs to issue and then run with a single interrupt at the end of the transfer (but on the bcm2835 it would need to limit itself to 32 transfers of 32K each, but still a single IRQ). In the optimized case we would run the transfer within 9us from issuing the spi_async call. Note that those usec measurements have been taken by "poking" GPIOs to up/down via writel and measuring the pin-changes via a logic-analyzer as logging via kprint kills performance and does not allow for insight over longer periods of time. So I hope these measurements convince you that the optimize patch below is "worth" adding - it also includes a means for the device driver to "vary" some parts of the spi message (source, destination, length, speed, delay_usec), which [wc]ould change between invocations of spi_async. As for the spi-bcm2835dma driver itself: it is still work in progress, but I am having some questions with regards to submitting it - some of it has to do with designing it in such a way that parts of it may at some point may in the future get incorporated into the spi-framework. So some facts: * driver still does not fully follow coding styles * not all vary cases are not all handled correctly and fixing those requires * it allows the use of ANY GPIO as CS not the limit of effective 2 CS exposed by the hardware. this actually started as a workaround for a buggy CS implementation in the HW block which might take CS up for a few 100th of a usec creating hickups with some devices (which only shows up when programming the SPI-registers via DMA transfers), but the step from making 2 GPIO work to make all GPIO work as CS is not really complicated... * right now I am not really using dma-engine, as it is too CPU intensive setting a complex SPI message up (a typical read+write SPI message takes typically 21 DMA transfers) * instead I started a new dma-fragment interface that has: ** dma_links, which are probably similar to some structures in dma_engine ** dma_fragments, which are chains of dma_links and/or dma_fragments the structure of which does not change and the data inside only changes in a few places ** dma_fragment_caches, which are caches of fragments which can get reused quickly without heavy setup overhead (this avoids mallocs, dma_pool_fetch) these caches also are responsible for creating fragments if needed and tearing them down there are also fetch and merge methods to create compound dma_fragments for final scheduling * there are also some "common" spi_dma_fragment parts that could easily get merged into the spi-core. * there is some provision to see if an spi message contains is of a single/double transfer type and in those cases take an "optimized" spi_message (done by the driver) instead and fill it with the data from the submitted spi_message and make use of all those "vary" cases. this would reduce the DMA creation cost for those synchronous transfers (spi_write_then_read, spi_write, spi_read), but this is not fully in place * per HW-document it requires 2 DMAs to handle the full transfer, but as per interrupt latencies unfortunately we require a 3rd DMA channel which only triggers the interrupt to schedule complete (the DMA needs to be "idle" long enough for the kernel to detect which IRQ source was responsible for the wakeup, so using TX-DMA is not an option, as the kernel is sometimes too slow to handle the interrupt before the next TX transfer is scheduled in the pipelined DMA case) * currently tested with a focus on the CAN controller mcp2515 (minor testing with other devices) In the end I have the following "distinct" fragments and caches for each of those: * spi_merged_fragment: an empty fragment used from cache into which all the below fragments get merged (avoids calls to malloc,... to be fast) * spi_setup: setup SPI registers for the next transfer includes: ** setting clock ** enabling SPI-HW+resetting FIFOs ** CS assert ** preparing TX-DMA channel - on bcm2835 a total of 6 DMAs * spi_transfer: do a single transfer (in the BCM2835 if the transfer is NOT a multiple of 4, then after the transfer the FIFOS will have to be reset) - on bcm2835 a total of 2 DMAs (1 RX, 1 TX) * spi_delay: if a single delay is configured then this does the delay in DMA - on bcm2835 a total of 1 DMAs * spi_csdelay: if there is a cs_change then this is responsible for it - - it will also do a delay prior to CS-up and keeps CS deasserted for half a SPI clock cycle. - on bcm2835 a total of 3 DMAs (delay, CS-deassert,delay) * trigger_interrupt: trigger an interrupt - typically to call complete - on bcm2835 a total of 3 DMAs (2 RX, 1IRQ DMA essentially just triggering the interrupt, but doing nothing else) * marks the spi_message as finished (this actually copies the 1MHz clock counter of the BCM2835 to a location to mark it as finished - we would even have absolute timings when the transfer finished (same could be applied for start of spi_message) if the spi core would want to expose such information. - on bcm2835 a total of 1 DMAs So a single read/write transfer is actually made up of the following: * spi_merged_fragment containing: ** spi_setup (6CBs) ** spi_transfer (2CBs) ** spi_csdelay (3CBs) ** trigger_irq (3CBs) ** finished (1CBs) so a total of 15 DMA transfers for a single SPI_message with one spi_transfer. Each additional transfer (without cs_change) adds an additional 6 CBs If the number of transfers so far is a multiple of 4 (full words), then this is reduced to an additional 2 CBs(=Control Block/dma_link) I hope these look "generic enough" to be reusable... There is also one question here with regards to semantics: When do we need to call the "complete" method? When we have received all the data or when the CS has been de-asserted (after the delay_usec)? The reason why I ask is because for optimizations some of the irq scheduling latencies could get hidden by triggering the complete while CS is still asserted. Note that I decided to go with the approach of fetch from cache and merge (where the cache functionality is hiding some of the work about releasing the structures again) instead of having separate link/unlink methods that need to get called explicitly (more method pointers in the structures,...) The central pieces of bus_driver specific code for creating/scheduling the DMA which corresponds to a spi_message with the above framework(s) looks like this: ----------------------------------------------------------------------------- struct spi_merged_dma_fragment *bcm2835dma_spi_message_to_dma_fragment( struct spi_message *msg, int flags, gfp_t gfpflags) { struct spi_device *spi = msg->spi; struct spi_master *master = spi->master; struct bcm2835dma_spi *bs = spi_master_get_devdata(master); struct spi_merged_dma_fragment *merged; struct spi_transfer *xfer; int err=0; int is_last; /* some optimizations - it might help if we knew the length... */ /* check if we got a frame that is of a single transfer */ if ( list_is_singular(&msg->transfers) ) { /* check if we got something in the structure we could use */ } /* fetch a merged fragment */ merged = (typeof(merged)) dma_fragment_cache_fetch( &bs->fragment_merged, gfpflags); if (! merged) return NULL; /* initialize some fields */ merged->message = msg; merged->transfer = NULL; merged->last_transfer = NULL; merged->dma_fragment.link_head = NULL; merged->dma_fragment.link_tail = NULL; merged->complete_data = NULL; merged->needs_spi_setup = 1; /* now start iterating the transfers */ list_for_each_entry(xfer, &msg->transfers, transfer_list) { /* check if we are the last in the list */ is_last = list_is_last(&xfer->transfer_list, &msg->transfers); /* assign the current transfer */ merged->transfer = xfer; /* do we need to reconfigure spi compared to the last transfer */ if (! merged->needs_spi_setup) { if (merged->last_transfer->speed_hz != xfer->speed_hz) merged->needs_spi_setup = 1; else if (merged->last_transfer->tx_nbits != xfer->tx_nbits) merged->needs_spi_setup = 1; else if (merged->last_transfer->rx_nbits != xfer->rx_nbits) merged->needs_spi_setup = 1; else if (merged->last_transfer->bits_per_word != xfer->bits_per_word) merged->needs_spi_setup = 1; } /* if we have no last_transfer, then we need to setup spi */ if (merged->needs_spi_setup) { err = spi_merged_dma_fragment_merge_fragment_cache( &bs->fragment_setup_spi, merged, gfpflags); if (err) goto error; merged->needs_spi_setup = 0; } /* add transfer if the transfer length is not 0 or if we vary length */ if ( (xfer->len) /* || (xfer->vary & SPI_OPTIMIZE_VARY_LENGTH) */) { /* schedule transfer */ err = spi_merged_dma_fragment_merge_fragment_cache( &bs->fragment_transfer, merged, gfpflags); if (err) goto error; /* set last transfer */ merged->last_transfer=xfer; } /* add cs_change with optional extra delay if requested or last in sequence */ if ((xfer->cs_change)||(is_last)) { err = spi_merged_dma_fragment_merge_fragment_cache( &bs->fragment_cs_deselect, merged, gfpflags); } else if ( (xfer->delay_usecs) /* || (xfer->vary & SPI_OPTIMIZE_VARY_DELAY) */) { /* or add a delay if requested */ err = spi_merged_dma_fragment_merge_fragment_cache( &bs->fragment_delay, merged, gfpflags); } if (err) goto error; } /* and add an interrupt if we got a callback to handle * if there is no callback, then we do not need to release it * immediately - even for prepared messages */ if (msg->complete) { err=spi_merged_dma_fragment_merge_fragment_cache( &bs->fragment_trigger_irq, merged, gfpflags); if (err) goto error; } /* and the "end of transfer flag" * - must be the last to avoid races... */ err=spi_merged_dma_fragment_merge_fragment_cache( &bs->fragment_message_finished, merged, gfpflags); if (err) goto error; /* reset transfers, as these are invalid by the time * we run the transforms */ merged->transfer = NULL; merged->last_transfer = NULL; /* and return it */ return merged; error: dev_printk(KERN_ERR,&spi->dev, "bcm2835dma_spi_message_to_dma_fragment: err=%i\n", err); spi_merged_dma_fragment_dump( merged, &msg->spi->dev, 0,0, &bcm2835_dma_link_dump ); return NULL; } static int bcm2835dma_spi_message_optimize(struct spi_message *message) { message->state = bcm2835dma_spi_message_to_dma_fragment( message, 0, GFP_ATOMIC); if (!message->state) return -ENOMEM; if (unlikely(debug_dma & DEBUG_DMA_OPTIMIZE)) { dev_printk(KERN_INFO,&message->spi->dev, "Optimizing %pf %pf\n",message,message->state); if (debug_dma & DEBUG_DMA_OPTIMIZE_DUMP_FRAGMENT) { spi_merged_dma_fragment_dump(message->state, &message->spi->dev, 0,0, &bcm2835_dma_link_dump ); } } return 0; } static void bcm2835dma_spi_message_unoptimize(struct spi_message *msg) { dma_fragment_release(msg->state); msg->state=NULL; if (unlikely(debug_dma&DEBUG_DMA_OPTIMIZE)) { dev_printk(KERN_INFO,&msg->spi->dev, "Unoptimizing %pf\n",msg); } } static int bcm2835dma_spi_transfer(struct spi_device *spi, struct spi_message *message) { int err=0; struct spi_merged_dma_fragment *merged; struct spi_master *master = spi->master; struct bcm2835dma_spi *bs = spi_master_get_devdata(master); unsigned long flags; /* fetch DMA fragment */ #ifdef SPI_HAVE_OPTIMIZE if ((message->is_optimized) ) { merged = message->state; } else #endif { merged = bcm2835dma_spi_message_to_dma_fragment( message, 0, GFP_ATOMIC); message->state = merged; } if (!merged) return -ENOMEM; /* assign some values */ message->actual_length = 0; /* and execute the pre-transforms */ err = spi_merged_dma_fragment_execute_pre_dma_transforms( merged,merged,GFP_ATOMIC); if (err) goto error; if (unlikely(debug_dma & DEBUG_DMA_ASYNC)) { dev_printk(KERN_INFO,&message->spi->dev, "Scheduling Message %pf fragment %pf\n", message,merged); if (debug_dma & DEBUG_DMA_ASYNC_DUMP_FRAGMENT) { spi_merged_dma_fragment_dump(merged, &message->spi->dev, 0,0, &bcm2835_dma_link_dump ); } } /* statistics on message submission */ spin_lock_irqsave(&master->queue_lock,flags); bs->count_spi_messages++; #ifdef SPI_HAVE_OPTIMIZE if ((message->is_optimized) ) bs->count_spi_optimized_messages++; #endif spin_unlock_irqrestore(&master->queue_lock,flags); /* and schedule it */ if (merged) { bcm2835dma_schedule_dma_fragment(message); } /* and return */ return 0; error: dev_printk(KERN_ERR,&spi->dev,"spi_transfer_failed: %i",err); dma_fragment_release(&merged->dma_fragment); return -EPERM; } static int bcm2835dma_spi_probe(struct platform_device *pdev) { ... master->transfer = bcm2835dma_spi_transfer; #ifdef SPI_HAVE_OPTIMIZE if (use_optimize) { master->optimize_message = bcm2835dma_spi_message_optimize; master->unoptimize_message = bcm2835dma_spi_message_unoptimize; } #endif ... } ----------------------------------------------------------------------------- The diff (against 3.13) to the framework to allow spi_optimize to work: diff --git a/drivers/spi/spi.c b/drivers/spi/spi.c index d745f95..24a54aa 100644 --- a/drivers/spi/spi.c +++ b/drivers/spi/spi.c @@ -1598,15 +1598,12 @@ int spi_setup(struct spi_device *spi) } EXPORT_SYMBOL_GPL(spi_setup); -static int __spi_async(struct spi_device *spi, struct spi_message *message) +static inline int __spi_verify(struct spi_device *spi, + struct spi_message *message) { struct spi_master *master = spi->master; struct spi_transfer *xfer; - message->spi = spi; - - trace_spi_message_submit(message); - if (list_empty(&message->transfers)) return -EINVAL; if (!message->complete) @@ -1705,9 +1702,28 @@ static int __spi_async(struct spi_device *spi, struct spi_message *message) return -EINVAL; } } + return 0; +} + +static int __spi_async(struct spi_device *spi, struct spi_message *message) +{ + struct spi_master *master = spi->master; + int ret = 0; + + trace_spi_message_submit(message); + + if (message->is_optimized) { + if (spi != message->spi) + return -EINVAL; + } else { + message->spi = spi; + ret = __spi_verify(spi,message); + if (ret) + return ret; + } message->status = -EINPROGRESS; - return master->transfer(spi, message); + return spi->master->transfer(spi, message); } /** @@ -1804,6 +1820,48 @@ int spi_async_locked(struct spi_device *spi, struct spi_message *message) } EXPORT_SYMBOL_GPL(spi_async_locked); +/** + * spi_message_optimize - optimize a message for repeated use minimizing + * processing overhead + * + * @spi: device with which data will be exchanged + * @message: describes the data transfers, including completion callback + * Context: can sleep + */ +int spi_message_optimize(struct spi_device *spi, + struct spi_message *message) +{ + int ret = 0; + if (message->is_optimized) + spi_message_unoptimize(message); + + message->spi = spi; + ret = __spi_verify(spi,message); + if (ret) + return ret; + + if (spi->master->optimize_message) + ret = spi->master->optimize_message(message); + if (ret) + return ret; + + message->is_optimized = 1; + + return 0; +} +EXPORT_SYMBOL_GPL(spi_message_optimize); + +void spi_message_unoptimize(struct spi_message *message) +{ + if (!message->is_optimized) + return; + + if (message->spi->master->unoptimize_message) + message->spi->master->unoptimize_message(message); + + message->is_optimized = 0; +} +EXPORT_SYMBOL_GPL(spi_message_unoptimize); /*-------------------------------------------------------------------------*/ diff --git a/include/linux/spi/spi.h b/include/linux/spi/spi.h index 8c62ba7..5206038 100644 --- a/include/linux/spi/spi.h +++ b/include/linux/spi/spi.h @@ -287,6 +287,12 @@ static inline void spi_unregister_driver(struct spi_driver *sdrv) * spi_finalize_current_transfer() so the subsystem can issue * the next transfer * @unprepare_message: undo any work done by prepare_message(). + * @optimize_message: allow computation of optimizations of a spi message + * this may set up the corresponding DMA transfers + * or do other work that need not get computed every + * time a message gets executed + * Not called from interrupt context. + * @unoptimize_message: undo any work done by @optimize_message(). * @cs_gpios: Array of GPIOs to use as chip select lines; one per CS * number. Any individual value may be -ENOENT for CS lines that * are not GPIOs (driven by the SPI controller itself). @@ -412,7 +418,8 @@ struct spi_master { struct spi_message *message); int (*unprepare_message)(struct spi_master *master, struct spi_message *message); - + int (*optimize_message)(struct spi_message *message); + void (*unoptimize_message)(struct spi_message *message); /* * These hooks are for drivers that use a generic implementation * of transfer_one_message() provied by the core. @@ -506,6 +513,8 @@ extern struct spi_master *spi_busnum_to_master(u16 busnum); * @delay_usecs: microseconds to delay after this transfer before * (optionally) changing the chipselect status, then starting * the next transfer or completing this @spi_message. + * @vary: allows a driver to mark a SPI transfer as "modifyable" on the + * specific pieces of information * @transfer_list: transfers are sequenced through @spi_message.transfers * * SPI transfers always write the same number of bytes as they read. @@ -584,6 +593,12 @@ struct spi_transfer { u8 bits_per_word; u16 delay_usecs; u32 speed_hz; +#define SPI_OPTIMIZE_VARY_TX_BUF (1<<0) +#define SPI_OPTIMIZE_VARY_RX_BUF (1<<1) +#define SPI_OPTIMIZE_VARY_SPEED_HZ (1<<2) +#define SPI_OPTIMIZE_VARY_DELAY_USECS (1<<3) +#define SPI_OPTIMIZE_VARY_LENGTH (1<<4) + u32 vary; struct list_head transfer_list; }; @@ -594,6 +609,9 @@ struct spi_transfer { * @spi: SPI device to which the transaction is queued * @is_dma_mapped: if true, the caller provided both dma and cpu virtual * addresses for each transfer buffer + * @is_optimized: if true, then the spi_message has been processed with + * spi_message_optimize() - @state belongs to the spi-driver now + * and may not get set by the driver * @complete: called to report transaction completions * @context: the argument to complete() when it's called * @actual_length: the total number of bytes that were transferred in all @@ -622,6 +640,8 @@ struct spi_message { struct spi_device *spi; unsigned is_dma_mapped:1; +#define SPI_HAVE_OPTIMIZE + unsigned is_optimized:1; /* REVISIT: we might want a flag affecting the behavior of the * last transfer ... allowing things like "read 16 bit length L" @@ -655,6 +675,9 @@ static inline void spi_message_init(struct spi_message *m) INIT_LIST_HEAD(&m->transfers); } +int spi_message_optimize(struct spi_device *s,struct spi_message *m); +void spi_message_unoptimize(struct spi_message *m); + static inline void spi_message_add_tail(struct spi_transfer *t, struct spi_message *m) { ----------------------------------------------------------------------------- So the questions are: Are there any major questions on this design? Does it look acceptable from a design perspective? Is it generic enough that it might get used with other devices as well? Do we need all those "VARY" cases or would we need to vary by more fields? I still think that the optimize/unoptimize_message methods are quite similar to prepare/unprepare_message so the question is if these can not get merged into a single pair of methods. Some of those influence the back-end parts that do the setup of the initial and link-time transfers (which I have not shared yet and which need most cleanup) Also it influences the final design on vary in the which is still a bit more complicated (and as mentioned not fully working at this moment). As soon I have some guidance which way to go I will do a final cleanup and and more documentation and share it in the form of patches... Also the question is how I should initially submit those "generic" components. Should I create a helper.h and helper.c with those components, so that they can get split out later or should I keep them separate from the start. Thanks, Martin P.s: people interrested in the still "unfit" code as of now please look at the out of tree driver at: https://github.com/msperl/spi-bcm2835 p.p.s: I hope the formatting of the email is fine - if not, then apologies... -- To unsubscribe from this list: send the line "unsubscribe linux-spi" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 8+ messages in thread
[parent not found: <43389276-E591-4E09-AB84-491C2CB2D9A7-TqfNSX0MhmxHKSADF0wUEw@public.gmane.org>]
* Re: ARM: bcm2835: DMA driver + spi_optimize_message - some questions/info [not found] ` <43389276-E591-4E09-AB84-491C2CB2D9A7-TqfNSX0MhmxHKSADF0wUEw@public.gmane.org> @ 2014-04-02 18:15 ` Mark Brown [not found] ` <20140402181547.GH2269-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> 0 siblings, 1 reply; 8+ messages in thread From: Mark Brown @ 2014-04-02 18:15 UTC (permalink / raw) To: Martin Sperl Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA, linux-rpi-kernel, Stephen Warren [-- Attachment #1: Type: text/plain, Size: 3681 bytes --] On Wed, Apr 02, 2014 at 06:06:11PM +0200, Martin Sperl wrote: Just a general meta comment: your mail is over 700 lines long, that's way too long. Please split things up into smaller chunks or take a higher level view as something this long is just asking far too much of the reader especially without work to structure the text. Things like splitting out the device specifics help a lot. I've just skimmed through this for now, I'll try to find time to read in more detail but there's a backlog of patches I need to get to. > * right now I am not really using dma-engine, as it is too CPU intensive > setting a complex SPI message up (a typical read+write SPI message takes > typically 21 DMA transfers) > * instead I started a new dma-fragment interface that has: This needs to be resolved for mainlining, if dmaengine is not up to the job we need to either decide to replace it wholesale or extend it (more likely the latter) rather than just doing something custom. I do see the need to improve the DMA APIs here but that needs to be done in a way that's joined up with other work on DMA. > * there is some provision to see if an spi message contains is of a > single/double transfer type and in those cases take an "optimized" > spi_message (done by the driver) instead and fill it with the data > from the submitted spi_message and make use of all those "vary" cases. > this would reduce the DMA creation cost for those synchronous transfers > (spi_write_then_read, spi_write, spi_read), but this is not fully in place I don't know what a "single/double transfer type" is. > There is also one question here with regards to semantics: > When do we need to call the "complete" method? > When we have received all the data or when the CS has been de-asserted > (after the delay_usec)? Users might reasonably assume that the transfer has finished at the time any callback runs and that has to include the /CS change otherwise the device might not have reacted. > So the questions are: > Are there any major questions on this design? > Does it look acceptable from a design perspective? > Is it generic enough that it might get used with other devices as well? I was having a bit of a wood from the trees problem following the design here but broadly there does seem to be a reasonable amount of usefulness and reusability here. A lot of systems are going to struggle to use the full feature set due to restrictions on things like minimum DMA transfer sizes and not being able to DMA to control registers and like I say the DMA API issues need to be addressed. Trying to do chip selects with DMA is also going to be a lot of fun in the general case, especially with GPIOs. I also suspect that there is a cutoff point where PIO makes sense even if we can DMA. > Do we need all those "VARY" cases or would we need to vary by more fields? The main things are probably replacing and changing the data. > I still think that the optimize/unoptimize_message methods are quite similar > to prepare/unprepare_message so the question is if these can not get merged > into a single pair of methods. Drivers are currently using prepare_message() for actual hardware setup specific to the message that's done around actual transmission whereas the optimisation can be done very distant to the actual use of the message. > Also the question is how I should initially submit those "generic" components. > Should I create a helper.h and helper.c with those components, so that they > can get split out later or should I keep them separate from the start. I'm not sure what those helper bits are but in general keeping things so that they can be reused is best. [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 836 bytes --] ^ permalink raw reply [flat|nested] 8+ messages in thread
[parent not found: <20140402181547.GH2269-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>]
* Re: ARM: bcm2835: DMA driver + spi_optimize_message - some questions/info [not found] ` <20140402181547.GH2269-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> @ 2014-04-02 20:40 ` Martin Sperl [not found] ` <1AA37E97-BDD7-4B53-B092-18D5D7439F8B-TqfNSX0MhmxHKSADF0wUEw@public.gmane.org> 0 siblings, 1 reply; 8+ messages in thread From: Martin Sperl @ 2014-04-02 20:40 UTC (permalink / raw) To: Mark Brown Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA, linux-rpi-kernel, Stephen Warren Thanks for the feedback. Will try to keep emails shorter - I wanted to get everything out in one and document it also trying to answer questions I thought would come up. So I will take it that I can continue from where I am right now. Just to summarize the current "modules" and their respective files: spi-optimize - essentially the patch shared with my original email touching: include/linux/spi/spi.h driver/spi/spi.c maybe adding some documentation on how to "optimize" high-volume drivers to make optimal use of the interface. basic generic dma-fragment support (structures/methods): include/linux/dma-fragment.h drivers/dma/dma-fragment.c (maybe add: Documentation/dma-fragment.txt) bcm2835 specific fragment support (dumping, scheduling, linking): include/linux/dma/bcm2835-dma.h drivers/dma/bcm2835-dma.c generic spi specific dma-fragment code: include/linux/spi/spi-dmafragment.h drivers/spi/spi-dmafragment.c mostly about extensions pre/post dma transforms for vary support - also handling DMA-mapping, and data for the state-machine used in transformation from spi-message to dma_fragment) this could get integrated with spi.h/spi.c if it is deemed sensible spi-bcm2835dma driver: include/linux/spi/bcm2835.h (SPI DMA registers and Bits) drivers/spi/spi-bcm2835dma.h (datastructures and defines) drivers/spi/spi-bcm2835dma_drv.c (driver itself) drivers/spi/spi-bcm2835dma_frag.c (the code filling/creating the fragments) Does this look "ok" from an approach perspective? I will share one module after the other for initial review. I will brush up the code for the generic DMA fragment code, so that it becomes presentable and will share that part in the hope that it live in parallel to dmaengine and/or get somehow integrated - possibly as another target besides cyclic. It is really mostly management of those dma blocks so that they can get joined together easily as well as the management of a group of those blocks. Also it is designed to be generic that is why there is the bcm2835 specific code as well as a separate module. As such drivers using this specific interface will need to know enough about the HW itself, my assumption is that the driver will need to know about the DMA engine used to work properly - so I left all sorts of abstractions out of the design - this can get added later if needed... Anyway this should come with deeper integration with dma-engine. With regards to VARY: I have used the "options" provided to cover most of the bases that I thought I would need to "auto-optimize" those simple cases of spi_read, spi_write, spi_write_then_read. The VARY_DELAY_USEC was just added as an after-thought, as it is not so "complex" to enable this as a feature... As for the measurements on effective CPU utilization shared originally - is the data shared understood/convincing? Does it need more explanation? Ciao, Martin P.s: as an afterthought: I actually think that I could implement a DMA driven bit-bang SPI bus driver with up to 8 data-lines using the above dma_fragment approach - not sure about the effective clock speed that this could run... But right now it is not worth pursuing that further. -- To unsubscribe from this list: send the line "unsubscribe linux-spi" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 8+ messages in thread
[parent not found: <1AA37E97-BDD7-4B53-B092-18D5D7439F8B-TqfNSX0MhmxHKSADF0wUEw@public.gmane.org>]
* Re: ARM: bcm2835: DMA driver + spi_optimize_message - some questions/info [not found] ` <1AA37E97-BDD7-4B53-B092-18D5D7439F8B-TqfNSX0MhmxHKSADF0wUEw@public.gmane.org> @ 2014-04-03 22:02 ` Mark Brown [not found] ` <20140403220232.GE14763-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> 0 siblings, 1 reply; 8+ messages in thread From: Mark Brown @ 2014-04-03 22:02 UTC (permalink / raw) To: Martin Sperl Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA, linux-rpi-kernel, Stephen Warren [-- Attachment #1: Type: text/plain, Size: 2676 bytes --] On Wed, Apr 02, 2014 at 10:40:41PM +0200, Martin Sperl wrote: > Will try to keep emails shorter - I wanted to get everything out in one > and document it also trying to answer questions I thought would come up. Sending several mails might've helped too - the way a patch series is structured is actually a fairly good way of breaking things up for comprehnsibility. > Just to summarize the current "modules" and their respective files: > spi-optimize - essentially the patch shared with my original email touching: > include/linux/spi/spi.h > driver/spi/spi.c > maybe adding some documentation on how to "optimize" high-volume drivers > to make optimal use of the interface. There should be some win from this purely from the framework too even without drivers doing anything. > Does this look "ok" from an approach perspective? Broadly. Like I say the DMA stuff is the biggest alarm bell - if it's not playing nicely with dmaengine that'll need to be dealt with. > I will share one module after the other for initial review. It's OK to post the lot at once as part of a series but it needs to be split up into separate logical patches. Hopefully we can get the externally visible optimisation stuff merged relatively quickly so that drivers can start building on it. > I will brush up the code for the generic DMA fragment code, so that it > becomes presentable and will share that part in the hope that it live in > parallel to dmaengine and/or get somehow integrated - possibly as another target > besides cyclic. Yes, something like that. The most important thing is to make it usable with any driver. > As such drivers using this specific interface will need to know enough about > the HW itself, my assumption is that the driver will need to know about the DMA > engine used to work properly - so I left all sorts of abstractions out of the > design - this can get added later if needed... > Anyway this should come with deeper integration with dma-engine. That would seem very surprising - I'd really have expected that we'd be able to expose enough capability information from the DMA controllers to allow fairly generic code; there's several controllers that have to work over multiple SoCs. > P.s: as an afterthought: I actually think that I could implement a DMA driven > bit-bang SPI bus driver with up to 8 data-lines using the above dma_fragment > approach - not sure about the effective clock speed that this could run... > But right now it is not worth pursuing that further. Right, and it does depend on being able to DMA to set GPIOs which is challenging in the general case. [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 836 bytes --] ^ permalink raw reply [flat|nested] 8+ messages in thread
[parent not found: <20140403220232.GE14763-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>]
* Re: ARM: bcm2835: DMA driver + spi_optimize_message - some questions/info [not found] ` <20140403220232.GE14763-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> @ 2014-04-04 14:17 ` martin sperl [not found] ` <5AAD4FEA-2887-4A9D-9FE3-588210BFD1A6-TqfNSX0MhmxHKSADF0wUEw@public.gmane.org> 0 siblings, 1 reply; 8+ messages in thread From: martin sperl @ 2014-04-04 14:17 UTC (permalink / raw) To: Mark Brown Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA, linux-rpi-kernel, Stephen Warren Hi Mark! This is again a long email - trying to answer your questions/concerns. On 04.04.2014 00:02, Mark Brown wrote: > There should be some win from this purely from the framework too even > without drivers doing anything. > If the device-driver does not do anything, then there is no cost involved with framework (ok - besides an additional if "(!msg->is_optimized) ...") If the bus driver does not support optimization there is still some win. I have shared the "optimization" data already, but here again the overview: Running a compile of the driver several times (measuring elapsed time) with different driver/optimize combinations: driver optimize real_compile interrupts/s irq2EOT none NA 50s 300 N/A spi-bcm2835 off 120s 45000 293us spi-bcm2835 on 115s 45000 290us spi-bcm2835dma off 76s 6700 172us spi-bcm2835dma on 60s 6700 82us For the "default" driver essentially the CPU cycles available to userspace went up from 41.6% to 43.5% (=50s/115s). It is not much, but it is still something. This is achived by cutting out "_spi_verify", which is mostly "_spi_async()" in the current code. But if you take now the optimize + DMA-driver case: we have 83.3% (=50s/60s) CPU available for userspace. And without optimize: 65.8% (=50s/76s). And both of those numbers are big wins! Note that the first version of the driver did not implement caching for fragments, but was rebuilding the full DMA-chain on the fly each time, the available CPU cycles was somewhere in the 45-50% range, so better than the stock driver, but not by much. Merging only 5 Fragments is way more efficient than building 19 dma controlblocks from scratch including the time for allocation, filling with data, deallocation,... As for "generic/existing unoptimized" device-drivers - as mentioned - there is the idea of providing an auto-optimize option for common spi_read, spi_write, spi_write_then_read type cases (by making use of VARY and optimize on some driver-prepared messages) For the framework there might also be the chance to do some optimizations of its own when "spi_optimize" gets called for a message. There the framework might want to call the spi_prepare methods only once. But I do not fully know the use-cases and semantics for prepare inside the framework - you say it is different from the optimize I vision. A side-effect of optimize means that ownership of state and queue members is transferred to the framework/bus_driver and only those fields flagged by vary may change. There may be some optimizations possible for the framework based on this "transfer of ownership"... > That would seem very surprising - I'd really have expected that we'd be > able to expose enough capability information from the DMA controllers to > allow fairly generic code; there's several controllers that have to work > over multiple SoCs. > It is mostly related to knowing the specific registers which you need to set... How to make it more abstract I have not yet figured it out. But it might boil down to something like that: * create Fragment * add Poke(frag,Data,bus_address(register)) * add Poke ... As of now I am more explicit yet, which is also due to the fact that I want to be able to handle a few transfer cases together (write only, read only, read-write), which require slightly different DMA parameters - and the VARY interface should allow me to handle all together with minimal setup overhead. But for this to know you need to "know" the DMA capabilities to make the most of it - maybe some abstraction is possible there as well... But it is still complicated by the fact that the driver needs to use 3 DMA channels to drive SPI. As mentioned actually 2, but the 3rd is needed to stably trigger a completion interrupt without any race conditions, that would inhibit the DMA interrupt to really get called (irq-flag might have been cleared already). So this is quite specific to the DMA + SPI implementation. >> P.s: as an afterthought: I actually think that I could implement a DMA driven >> bit-bang SPI bus driver with up to 8 data-lines using the above dma_fragment >> approach - not sure about the effective clock speed that this could run... >> But right now it is not worth pursuing that further. >> > Right, and it does depend on being able to DMA to set GPIOs which is > challenging in the general case. "pulling" GPIO up/down - on the BCM2835 it is fairly simple: to set a GPIO: write to GPIOSET registers with the corresponding bits (1<<GPIOPIN) set. To clear it: write to GPIOCLEAR registers again with the same mask. So DMA can set all or 0 GPIO pins together. One can set/clear up to 32 GPIO with a single writel or DMA. Drawback is that it needs two writes to set an exact value for multiple GPIOs and under some circumstances you need to be aware of what you do. This feature is probably due to the "dual" CPU design ARM + VC4/GPU, which allows to work the GPIO pins from both sides without any concurrency issues (as long as the ownership of the specific pin is clear). The concurrency is serialized between ARM, GPU and DMA via the common AXI bus. Unfortunately the same is NOT possible for changing GPIO directions / alternate functions (but this is supposed to be rarer, so it can get arbitrated between components...) > Broadly. Like I say the DMA stuff is the biggest alarm bell - if it's > not playing nicely with dmaengine that'll need to be dealt with. > As for DMA-engine: The driver should (for the moment) also work with minimal changes also on the foundation kernel - there is a much bigger user base there, that use it for LCD displays, CAN controllers, ADCs and more - so it gets more exposure to different devices than I can access myself. But still: I believe I must get the basics right first before I can start addressing DMAengine. And one of the issues I have with DMA-engine is that you always have to set up tear down the DMA-transfers (at least the way I understood it) and that is why I created this generic DMA-fragment interface which can also cache some of those DMA artifacts and allows chaining them in individual order. So the idea is to take that to build the DMA-control block chain and then pass it on to the dma-engine. Still a lot of things are missing - for example if the DMA is already running and there is another DMA fragment to execute the driver chains those fragments together in the hope that the DMA will continue and pick it up. Here the stats for 53M received CAN messages: root@raspberrypi:~/spi-bcm2835# cat /sys/class/spi_master/spi0/stats bcm2835dma_stats_info - 0.1 total spi_messages: 160690872 optimized spi_messages: 160690870 started dma: 53768175 linked to running dma: 106922697 last dma_schedule type: linked dma interrupts: 107127237 queued messages: 0 As explained, my highly optimized device driver schedules 3 spi_messages. the first 2 together the 3rd in the complete function of the 1st message. And the counters for "linked to running dma" is about double the counter of "started DMA". The first spi_message will need to get stated normally (as it is typically idle) while the 2nd and 3rd are typically linked. If you do the math this is linking happens in 66.54% of all spi_messages. Under ideal circumstances this value should be 66.666666% (=2/3). So there are times when the ARM is slightly too slow and typically the 3rd message is really scheduled only when the DMA has already stopped. Running for more than 2 days with 500M CAN-messages did not show any further races (but the scheduling needs to make heavy use of dsb() that this does not happen....) This kind of thing is something that DMA-engine does not support as of now. But prior to getting something like this accepted it first needs a proof that it works... And this is the POC that shows that it is possible and gives huge gains (at least on some platforms)... Hope this answers your questions. Ciao, Martin-- To unsubscribe from this list: send the line "unsubscribe linux-spi" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 8+ messages in thread
[parent not found: <5AAD4FEA-2887-4A9D-9FE3-588210BFD1A6-TqfNSX0MhmxHKSADF0wUEw@public.gmane.org>]
* Re: ARM: bcm2835: DMA driver + spi_optimize_message - some questions/info [not found] ` <5AAD4FEA-2887-4A9D-9FE3-588210BFD1A6-TqfNSX0MhmxHKSADF0wUEw@public.gmane.org> @ 2014-04-10 22:35 ` Mark Brown [not found] ` <20140410223531.GF6518-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> 0 siblings, 1 reply; 8+ messages in thread From: Mark Brown @ 2014-04-10 22:35 UTC (permalink / raw) To: martin sperl Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA, linux-rpi-kernel, Stephen Warren [-- Attachment #1: Type: text/plain, Size: 5947 bytes --] On Fri, Apr 04, 2014 at 04:17:24PM +0200, martin sperl wrote: > On 04.04.2014 00:02, Mark Brown wrote: > > There should be some win from this purely from the framework too even > > without drivers doing anything. > If the device-driver does not do anything, then there is no cost involved with > framework (ok - besides an additional if "(!msg->is_optimized) ...") > If the bus driver does not support optimization there is still some win. Right, I was more pointing out that you were underselling here - you were talking only about the benefits when everything is updated. This also means we can usefully merge partial support, addressing only some of the improvements is still going to give some wins and makes it easier to digest the code overall. > For the framework there might also be the chance to do some optimizations of > its own when "spi_optimize" gets called for a message. > There the framework might want to call the spi_prepare methods only once. > But I do not fully know the use-cases and semantics for prepare inside the > framework - you say it is different from the optimize I vision. The main thing I'd like to do is actually coalesce adjacent transfers within a message if they are compatible and we're doing DMA so we end up with fewer hardware interactions overall. There's a fairly common pattern of writing multiple registers (for firmware download or setting coefficients) where you have a small transfer for a register then a larger one for a data block which would benefit. For prepare I'd expect we can get to skipping reconfiguring the hardware entirely when not needed even without an optimisation step, though the tradeoff might not be worth it. This may be required more than once in a message though. > > That would seem very surprising - I'd really have expected that we'd be > > able to expose enough capability information from the DMA controllers to > > allow fairly generic code; there's several controllers that have to work > > over multiple SoCs. > It is mostly related to knowing the specific registers which you need to set... > How to make it more abstract I have not yet figured it out. > But it might boil down to something like that: > * create Fragment > * add Poke(frag,Data,bus_address(register)) > * add Poke ... Registers where? If it's in the DMA controller we can just add ops for the DMA controller can't we? > But it is still complicated by the fact that the driver needs to use 3 DMA > channels to drive SPI. As mentioned actually 2, but the 3rd is needed to > stably trigger a completion interrupt without any race conditions, that > would inhibit the DMA interrupt to really get called (irq-flag might have > been cleared already). Right, that's a peculiarity of your system (hopefully) which makes life difficult for that. The Samsung controller has a vaugely similar thing where the transmit interrupt goes off too early for what we want so we need to ensure there's always a read if we're doing interrupts but that is a lot easier to work around. > > Right, and it does depend on being able to DMA to set GPIOs which is > > challenging in the general case. > "pulling" GPIO up/down - on the BCM2835 it is fairly simple: > to set a GPIO: write to GPIOSET registers with the corresponding bits > (1<<GPIOPIN) set. > To clear it: write to GPIOCLEAR registers again with the same mask. > So DMA can set all or 0 GPIO pins together. That's quite convenient for this sort of use - lots of controllers just have a register where you need to set the values of some GPIOs so you end up doing a read/modify/write which is obviously hard with DMA. We would need gpiolib to export an interface to allow this operation; it should be doable but I expect it's like the regular bulk set and tends to get bogged down. > > Broadly. Like I say the DMA stuff is the biggest alarm bell - if it's > > not playing nicely with dmaengine that'll need to be dealt with. > As for DMA-engine: The driver should (for the moment) also work with minimal > changes also on the foundation kernel - there is a much bigger user base > there, that use it for LCD displays, CAN controllers, ADCs and more - so it > gets more exposure to different devices than I can access myself. It's good to talk early so the ideas are in people's heads if nothing else, it also helps ensure everyone is travelling in the same direction. > But still: I believe I must get the basics right first before I can start > addressing DMAengine. Sounds like you already have a pretty good prototype? > And one of the issues I have with DMA-engine is that you always have to set up > tear down the DMA-transfers (at least the way I understood it) and that is why > I created this generic DMA-fragment interface which can also cache some of > those DMA artifacts and allows chaining them in individual order. > So the idea is to take that to build the DMA-control block chain and then pass > it on to the dma-engine. Right, that's the biggest thing I can see in what you're suggesting for dmaengine and it sounds like something other users could also benefit from so I'd not expect massive pushback. > This kind of thing is something that DMA-engine does not support as of now. > But prior to getting something like this accepted it first needs a proof > that it works... On the other hand you may find that what you've got already is enough for people, or get a feel for the sort of things people are looking for to show a win. Sometimes you'll even get "oh, that's a really good idea I'll just add the feature" (if you're very lucky). BTW I have acquired a RPi for some other use - is the CAN board you are using something I can order? I'm not sure if I'd ever have much time to look at it but I'm generally trying to expand the range of SPI hardware I have available for test. [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 836 bytes --] ^ permalink raw reply [flat|nested] 8+ messages in thread
[parent not found: <20140410223531.GF6518-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>]
* Re: ARM: bcm2835: DMA driver + spi_optimize_message - some questions/info [not found] ` <20140410223531.GF6518-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> @ 2014-04-11 12:40 ` martin sperl [not found] ` <25BB67E3-2929-49D2-BDDE-2D6B3D43534E-TqfNSX0MhmxHKSADF0wUEw@public.gmane.org> 0 siblings, 1 reply; 8+ messages in thread From: martin sperl @ 2014-04-11 12:40 UTC (permalink / raw) To: Mark Brown Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA, linux-rpi-kernel, Stephen Warren On 11.04.2014 00:35, Mark Brown wrote: > Right, I was more pointing out that you were underselling here - you were > talking only about the benefits when everything is updated. This also means > we can usefully merge partial support, addressing only some of the > improvements is still going to give some wins and makes it easier to digest > the code overall. Well - I thought I had shown this with my shared timing-data already... (it was just a 5s gain for the PIO driver, but still...) > The main thing I'd like to do is actually coalesce adjacent transfers within > a message if they are compatible and we're doing DMA so we end up with fewer > hardware interactions overall. There's a fairly common pattern of writing > multiple registers (for firmware download or setting coefficients) where you > have a small transfer for a register then a larger one for a data block which > would benefit. I just found out something similar just a few days ago: doing a write then read with 2+1 is more effort to handle than a read/write of 3 bytes and some moving of data arround. Also it actually saves memory, as an spi_transfer is bigger than those 3 extra bytes needed. With the DMA Driver this saved me 4us on overall transfer time - It dropped from 52us to 47us. For comparisson: the real data transfer for those 36 bytes (without 7x CS) takes 35us - so we are pretty close to ideal... There is not so much of a difference with the PIO drivers as they take 200us for the same, but it sees a drop on the number of interrupts (typically 1 IRQ/transfer), which increases CPU availability for userspace... Redesigning the interrupt-handler to push bytes spanning multiple spi_transfers into the fifo together would help, but then it becomes much more complex - at least the spi-bcm2835 does call "complete(&bs->done);" when it has finished the transfer. I believe I will add all those findings to the documentation with the spi_optimize patch... > Registers where? If it's in the DMA controller we can just add ops for > the DMA controller can't we? Possibly - still working on cleaning up the code and making it as generic as possible > Right, that's a peculiarity of your system (hopefully) which makes life > difficult for that. The Samsung controller has a vaugely similar thing > where the transmit interrupt goes off too early for what we want so we > need to ensure there's always a read if we're doing interrupts but that > is a lot easier to work around. Specs says to use 2 Interrupts one for feeding data to FIFO the other to read data from FIFO. Probably one could use multiple interleaved DMAs for reads and writes, but then you would be limited to 32 byte transfers and would have some "gaps" between the fifos filling/emptying - this way at least 65535 bytes can get sent with a single transfer... The interrupt DMA is more on the tricky side - if we had to have an interrupt every time an spi_message finished, it would be possible with 2 DMAs, but as we have the possibility of chaining multiple spi_messages together for higher thruput, then this "waiting" for interrupt is not possible (IRQ flag gets sometimes cleared by the next DMA controlblock), so we need to have another "stable" IRQ source and hence DMA3. (actually I was using the TX DMA for that, which gets started by the RX-DMA, so we have a race-time between DMA for IRQ and the interrupt handler on the CPU reading the registers quickly enough to find out the source of the IRQ) > That's quite convenient for this sort of use - lots of controllers > just have a register where you need to set the values of some GPIOs so > you end up doing a read/modify/write which is obviously hard with DMA. > We would need gpiolib to export an interface to allow this operation; > it should be doable but I expect it's like the regular bulk set and > tends to get bogged down. yes - it is most convenient - but on the other hand it is also a bit of a pain if you need to do that on the ARM, as you have to do 2 writes for a single change in gpiolib and ordering could become a problem with bitbanged SPI, but it is possible (zero first, set last... > Sounds like you already have a pretty good prototype? > As said: it mostly works, but i need to cleanup the code before I can post it, but before I do that I had to finish the CAN driver that makes use of the features to see what pitfalls we still have. One of those found is the fact that an optimized message can only reliably get used with spi_async. For spi_sync use it has to have the complete function defined when optimizing, as it would otherwise not implement a callback IRQ in DMA. And then when spi_sync sets the complete function the DMA would still issue the optimized message without interrupts. Also there is one other race-condition, that happens sometimes quickly sometimes after 48 hours, but I think I might have an idea but for that I need more info on the event itself... > Right, that's the biggest thing I can see in what you're suggesting for > dmaengine and it sounds like something other users could also benefit > from so I'd not expect massive pushback. > > >> This kind of thing is something that DMA-engine does not support as of now. >> But prior to getting something like this accepted it first needs a proof >> that it works... >> > On the other hand you may find that what you've got already is enough > for people, or get a feel for the sort of things people are looking for > to show a win. Sometimes you'll even get "oh, that's a really good idea > I'll just add the feature" (if you're very lucky). > So who else should I involve as soon as I have presentable code? (coding standard issues I have already fixed - mostly...) > BTW I have acquired a RPi for some other use - is the CAN board you are > using something I can order? I'm not sure if I'd ever have much time to > look at it but I'm generally trying to expand the range of SPI hardware > I have available for test. Could be as simple as a bread-board + mcp2515 + mcp2551 + 16MHz Oszillator + 4 Resistors + 2-3 Capacitors for the chips... Other people have bought the pican module, which essentially implements the same, but is specific to the PI... http://skpang.co.uk/catalog/pican-canbus-board-for-raspberry-pi-p-1196.html And in both cases you probably will need a peer which accepts messages or sends them... It is a bit tricky - especially the setting up of the correct Baud rate on the CAN-Bus side - so people are complaining that it is hard to make it work... Ciao, Martin -- To unsubscribe from this list: send the line "unsubscribe linux-spi" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 8+ messages in thread
[parent not found: <25BB67E3-2929-49D2-BDDE-2D6B3D43534E-TqfNSX0MhmxHKSADF0wUEw@public.gmane.org>]
* Re: ARM: bcm2835: DMA driver + spi_optimize_message - some questions/info [not found] ` <25BB67E3-2929-49D2-BDDE-2D6B3D43534E-TqfNSX0MhmxHKSADF0wUEw@public.gmane.org> @ 2014-04-21 22:20 ` Mark Brown 0 siblings, 0 replies; 8+ messages in thread From: Mark Brown @ 2014-04-21 22:20 UTC (permalink / raw) To: martin sperl Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA, linux-rpi-kernel, Stephen Warren [-- Attachment #1: Type: text/plain, Size: 6240 bytes --] On Fri, Apr 11, 2014 at 02:40:07PM +0200, martin sperl wrote: > On 11.04.2014 00:35, Mark Brown wrote: > > The main thing I'd like to do is actually coalesce adjacent transfers within > > a message if they are compatible and we're doing DMA so we end up with fewer > > hardware interactions overall. There's a fairly common pattern of writing > > multiple registers (for firmware download or setting coefficients) where you > > have a small transfer for a register then a larger one for a data block which > > would benefit. > > I just found out something similar just a few days ago: > doing a write then read with 2+1 is more effort to handle than a read/write > of 3 bytes and some moving of data arround. Also it actually saves memory, > as an spi_transfer is bigger than those 3 extra bytes needed. > With the DMA Driver this saved me 4us on overall transfer time - It dropped > from 52us to 47us. For comparisson: the real data transfer for those 36 > bytes (without 7x CS) takes 35us - so we are pretty close to ideal... It can be even nicer for drivers where there's some minimum limit on the size of transfers (quite a lot of them), it can sometimes allow short writes to be converted from being PIO to being combined with an adjacent DMAs. Possibly even two PIOs making a DMA, though that case is more on the edge. > There is not so much of a difference with the PIO drivers as they take 200us > for the same, but it sees a drop on the number of interrupts (typically > 1 IRQ/transfer), which increases CPU availability for userspace... > Redesigning the interrupt-handler to push bytes spanning multiple > spi_transfers into the fifo together would help, but then it becomes much > more complex - at least the spi-bcm2835 does call "complete(&bs->done);" > when it has finished the transfer. It's probably not worth doing unless it's factored out into the framework and affects lots of drivers; it seems most likely to get done as part of a general improvement in the ability to drive the queue from interrupt context. Most systems won't be able to do the fully DMAed pipeline the Pi can. > > Right, that's a peculiarity of your system (hopefully) which makes life > > difficult for that. The Samsung controller has a vaugely similar thing > > where the transmit interrupt goes off too early for what we want so we > > need to ensure there's always a read if we're doing interrupts but that > > is a lot easier to work around. > Specs says to use 2 Interrupts one for feeding data to FIFO the other to > read data from FIFO. Probably one could use multiple interleaved DMAs for > reads and writes, but then you would be limited to 32 byte transfers and > would have some "gaps" between the fifos filling/emptying - this way at > least 65535 bytes can get sent with a single transfer... Sure, that's standard though looking at the code in drivers it seems there's a fairly common pattern of just ignoring one of the completions, usually the transmit one, at the SPI level since we can guarantee that the other will always come later. > The interrupt DMA is more on the tricky side - if we had to have an > interrupt every time an spi_message finished, it would be possible with > 2 DMAs, but as we have the possibility of chaining multiple spi_messages > together for higher thruput, then this "waiting" for interrupt is not > possible (IRQ flag gets sometimes cleared by the next DMA controlblock), > so we need to have another "stable" IRQ source and hence DMA3. > (actually I was using the TX DMA for that, which gets started by the > RX-DMA, so we have a race-time between DMA for IRQ and the interrupt > handler on the CPU reading the registers quickly enough to find out the > source of the IRQ) I think it's worth implementing the three DMA solution as an optimisation on the two DMA version partly to get the big part of the win from the two DMAs integrated but also because that's going to be the more generally useful pattern. It also lets the tricky bit with the extra DMA be considered separately. > One of those found is the fact that an optimized message can only reliably > get used with spi_async. For spi_sync use it has to have the complete > function defined when optimizing, as it would otherwise not implement a > callback IRQ in DMA. And then when spi_sync sets the complete function > the DMA would still issue the optimized message without interrupts. Hrm. We could provide a flag saying this will be used synchronously, or more likely given driver use patterns the other way around. Doesn't seem quite elegant though. Or insert a dummy transfer (eg, some register read) which does interrupt into the chain after the real one to do the callback. > > On the other hand you may find that what you've got already is enough > > for people, or get a feel for the sort of things people are looking for > > to show a win. Sometimes you'll even get "oh, that's a really good idea > > I'll just add the feature" (if you're very lucky). > So who else should I involve as soon as I have presentable code? > (coding standard issues I have already fixed - mostly...) The dmaengine maintainers for the DMA stuff - Vinod Koul <vinod.koul-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> and Dan Williams <dan.j.williams-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> - and the GPIO maintainers for that - Linus Walleij <linus.walleij-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org> and Alexandre Courbot <gnurou-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> (check MAINTAINERS). Plus anyone working on the drivers you modified. > the same, but is specific to the PI... > http://skpang.co.uk/catalog/pican-canbus-board-for-raspberry-pi-p-1196.html > And in both cases you probably will need a peer which accepts messages or > sends them... > It is a bit tricky - especially the setting up of the correct Baud rate > on the CAN-Bus side - so people are complaining that it is hard to make > it work... Hrm, OK. Might be too fiddly, dunno. I do also keep thinking I should get a FPGA board and impelement some test devices that way to let me exercise the framework, it's not exactly realistic though. [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 836 bytes --] ^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2014-04-21 22:20 UTC | newest] Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2014-04-02 16:06 ARM: bcm2835: DMA driver + spi_optimize_message - some questions/info Martin Sperl [not found] ` <43389276-E591-4E09-AB84-491C2CB2D9A7-TqfNSX0MhmxHKSADF0wUEw@public.gmane.org> 2014-04-02 18:15 ` Mark Brown [not found] ` <20140402181547.GH2269-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> 2014-04-02 20:40 ` Martin Sperl [not found] ` <1AA37E97-BDD7-4B53-B092-18D5D7439F8B-TqfNSX0MhmxHKSADF0wUEw@public.gmane.org> 2014-04-03 22:02 ` Mark Brown [not found] ` <20140403220232.GE14763-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> 2014-04-04 14:17 ` martin sperl [not found] ` <5AAD4FEA-2887-4A9D-9FE3-588210BFD1A6-TqfNSX0MhmxHKSADF0wUEw@public.gmane.org> 2014-04-10 22:35 ` Mark Brown [not found] ` <20140410223531.GF6518-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> 2014-04-11 12:40 ` martin sperl [not found] ` <25BB67E3-2929-49D2-BDDE-2D6B3D43534E-TqfNSX0MhmxHKSADF0wUEw@public.gmane.org> 2014-04-21 22:20 ` Mark Brown
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.