From: Martin Sperl
Subject: Re: Deprecated spi_master.transfer and "prepared spi messages" for an optimized pipelined-SPI-DMA-driver
Date: Sun, 10 Nov 2013 11:54:16 +0100
To: Mark Brown
Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

Hi Mark!

You seem to be very hard to convince!

On 09.11.2013, at 19:30, Mark Brown wrote:

> You mentioned that systems could be constructed but you didn't mention
> examples of doing that; in general prolonged bus saturation tends to be
> something that people avoid when designing systems.

True, but CAN can in principle work saturated when only 2 nodes talk to each other and the others are quiet. Collision detection is also part of the protocol, and the node "winning" the arbitration can continue transmitting its message without an abort.

And in the firmware-upgrade case the CAN ID is typically configured with the most recessive bits (i.e. the highest possible CAN ID), so that it loses any arbitration and thus allows other nodes to send their data with higher priority, effectively slowing down the firmware upgrade without impacting other devices. So a saturated bus is not "ideal", but it is possible and would not impact other devices.

Also note that some CAN controllers, like the mcp2515, can have HW message filters configured to limit system load. But these are typically limited to a small set of filter patterns (2 masks and 2-3 bit patterns per mask), so for a "central" controller they do not seem to be a good match: you would quickly run out of filters, and the central CPU of a vehicle will want to see all the data on which it may base its decisions - the limited number of filters does not scale for that.

One can also extrapolate: from 100% CAN-bus utilization at 90% system load, to 50% bus utilization at 45% system load, to 25% bus utilization at 22.5% system load. And even 22.5% CPU load is already too much in my opinion, when you can get it down further with a simple API extension and a driver optimized for speed.

Note also that the message-processing overhead is in principle (mostly) dependent on the number of messages per interval, while the possible message rate on the CAN bus depends primarily on the length of the messages.

So with those 2 receive buffers in HW, I have the following time to react to a "both buffers full" situation and avoid an overflow:
* 29-bit IDs and 8 bytes/frame at 500kHz: 774us to 948us (depending on the data bytes, as a stuff bit gets inserted after 5 identical bits in sequence)
* 11-bit IDs and 0 bytes/frame at 500kHz: 282us to 333us

These show the best and worst case from a bit-length perspective.
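To illustrate where those windows come from, here is a quick back-of-the-envelope reconstruction (my bit counts are approximate: depending on how you account for the inter-frame space and for stuffing of the fixed-form fields, the results land within a few percent of the figures above):

/*
 * Assumptions: 2 us per bit at 500 kbit/s, a 3-bit inter-frame space per
 * frame, worst-case stuffing of one stuff bit per 4 stuffable bits, and a
 * reaction window running from the start of the first frame until the end
 * of the third back-to-back frame - the point at which the third frame
 * needs a free mcp2515 RX buffer.
 */
#include <stdio.h>

struct frame {
	const char *name;
	unsigned int nominal_bits;	/* frame + 3-bit inter-frame space */
	unsigned int stuffable_bits;	/* SOF..CRC, subject to bit stuffing */
};

static unsigned int frame_us(const struct frame *f, int worst_case)
{
	unsigned int bits = f->nominal_bits;

	if (worst_case)
		bits += (f->stuffable_bits - 1) / 4;	/* max inserted stuff bits */

	return bits * 2;	/* 2 us per bit at 500 kbit/s */
}

int main(void)
{
	const struct frame frames[] = {
		{ "29-bit ID, 8 data bytes", 131, 118 },
		{ "11-bit ID, 0 data bytes",  47,  34 },
	};

	for (unsigned int i = 0; i < sizeof(frames) / sizeof(frames[0]); i++)
		printf("%s: reaction window %u..%u us\n", frames[i].name,
		       3 * frame_us(&frames[i], 0), 3 * frame_us(&frames[i], 1));

	return 0;
}

This prints roughly 786..960 us and 282..330 us, i.e. the same ballpark as the numbers quoted above.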
So if you have 3 distinct nodes that all want to send the worst case (a single frame of 0 bytes length, what would typically be called an RTR frame) within a window of 188us, you can easily trigger a situation where the RPi cannot keep up with the request rate and drops the packet with the lowest priority - and that is with just 3 distinct packets.

Actually I have a simple case where a single node sends a sequence of 5 CAN messages of 8 bytes each and then stays quiet for 10 seconds (so an effective bus utilization of 0.015%), and the "stock" in-kernel drivers (mcp251x and spi-bcm2835) already run into buffer overruns and in 20% of all cases only report 3 out of those 5 frames.

(Not to mention that the in-tree mcp251x driver stops working after such a situation because of a race condition: the edge interrupt gets "re-enabled" too late, after the INT line has already gone inactive again. This happens on a Foundation-based kernel somewhere between versions 3.2 and 3.6, with 3.6 showing this behaviour explicitly.)

So you see that the "saturated bus" is just a prolonged version of my 5-frame scenario, which I can easily reproduce in real life - and I doubt you can say that this scenario counts as "prolonged bus saturation".

> What I'm missing with this example is the benefit of having an API for
> pre-cooked messages, how it will deliver a performance improvement?
> Flashing should be an infrequent operation and I'd expect it to involve
> little reuse of messages which I'd expect to be the main thing that we
> could gain from the API change. I'd also not expect the controller
> specific work to be especially expensive.

It will deliver a performance improvement by not having to create identical structures or run the sanity tests every time you transmit the prepared message. And for DMA I have to create DMA chains of length 10-20, depending on the number of transfers. So running this piece of code 3000 times per second for minutes or hours is overhead that we could definitely save - IMO it is low-hanging fruit.

This does not mean that every driver necessarily has to implement it - the first candidates are the ones that need very high SPI-message throughput and/or low scheduling latencies.

We can even look into optimizing spi_write, spi_read and spi_write_then_read for the "hot" path by spending a little more memory for "typical" transfer sizes. Then the "generic" drivers would also require less CPU resources - even if the bus driver does not implement prepare_message itself - because the "sanity" checks in __spi_async would not get executed every time. I would say every little bit less CPU utilization helps if it is easy to implement.

>>
>> Also my next work is moving to DMA scheduling multiple messages via "transfer".
>> This should bring down the CPU utilization even further and it should also
>> decrease the context switches as the spi_pump thread goes out of the picture...
>> (and that will probably decrease the number of overall interrupts as well...)
>
> Right, and simply driving transfers from interrupt rather than task
> context probably gets an awfully long way there. This is the sort of
> improvement which will benefit all users of the API - the main reason
> I'm so cautious about changing the API is that I don't want to make it
> more complex to implement this sort of improvement.
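To make the "prepare once, submit many times" part concrete, here is a minimal sketch of how a client driver could fire a prepared message straight from its interrupt handler - exactly the kind of hot path where skipping the per-call validation pays off. spi_prepare_message() is the optional extension proposed here (not an existing call), and the names, lengths and buffers are made up for illustration:

#include <linux/interrupt.h>
#include <linux/spi/spi.h>

/* statically allocated so the buffers are DMA-safe (not on the stack) */
static u8 rx_cmd[14];			/* e.g. a "read RX buffer" command */
static u8 rx_data[14];

static struct spi_transfer rx_xfer = {
	.tx_buf = rx_cmd,
	.rx_buf = rx_data,
	.len    = sizeof(rx_cmd),
};
static struct spi_message rx_msg;

static int example_setup(struct spi_device *spi)
{
	spi_message_init(&rx_msg);
	spi_message_add_tail(&rx_xfer, &rx_msg);

	/*
	 * Run the sanity checks and let the bus driver pre-build its DMA
	 * chain exactly once; a bus driver without a prepare hook would
	 * simply treat this as a no-op.
	 */
	return spi_prepare_message(spi, &rx_msg);
}

/* registered with request_irq(..., example_irq, ..., spi) */
static irqreturn_t example_irq(int irq, void *dev_id)
{
	struct spi_device *spi = dev_id;

	/*
	 * Hot path: no per-call validation, no DMA-chain rebuild.  A real
	 * driver must of course make sure the message is not still in
	 * flight before re-submitting it.
	 */
	spi_async(spi, &rx_msg);

	return IRQ_HANDLED;
}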
Also, as an example: if you wanted the framework to provide "generic" GPIO interrupt handling that sends a (prepared) message without the driver having to write that code itself, then the framework could run the preparation call itself when the message gets "registered" to be sent on a triggered interrupt - so the drivers would not even have to care about it!

>>> I'd like to see both a practical use case and specific analysis showing
>>> that changing the API is delivering a benefit as opposed to the parts
>>> which can be done by improving the implementation of the current API.
>
>> I have already shared at some point and also it shows in the forum:
>
> You've been doing a bunch of other performance improvements as well,
> it's not clear to me how much of this is coming from the API change and
> how much of this is due to changes which can also be done by improving
> the implementation and without requiring drivers to be specifically
> updated to take advantage of it.

I have told you that I created a module parameter which enables/disables the prepare code. Then I loaded the modules, configured the CAN speeds, ... and ran "vmstat 10 6", then removed the modules. Then I ran the same again with the parameter set to disable prepare. And we get the different values: 88% system versus 80% system (and the 80% is actually still too high, as the system got so much faster that it no longer needs to handle 2 packets at a time but can handle each packet individually). Nothing else changed besides this enable/disable parameter.

>> Does this answer your question and convince you of this being realistic?
>
> It's still not clear to me exactly how this works and hence if the
> benefit comes from the API change itself.

There is no change to the API - there is only an optional extension...

If you are not convinced yet, then here is a proposal for measurements I can try to gather to hopefully convince you:

I would apply the previously posted patches that modify the spi_async call so that we can avoid running the "sanity checks" in __spi_async.

Here are the bus drivers I would use for testing:
* spi-bcm2708 (from the Raspbian kernel, using the transfer interface)
* spi-bcm2835 (from the "stock" kernel, using the transfer_one_message interface, with a patch to allow real-time priority in the message pump, enabled via a module parameter)
  * run with and without RT priority
* spi-bcm2835dma with the message_pump interface
  * run with and without RT priority (module parameter as well)
* spi-bcm2835dma with the transfer interface, implementing the message queue in DMA itself without any processing

And 2 device drivers:
* A (hypothetical) ADC read driver, where we want to achieve the highest possible throughput. I would implement this as "2 messages" that get scheduled in sequence, both of them having a callback that immediately schedules the message again from within the callback. Why not work with a single message, you may ask? Well, with a single message and a callback scheduling the message again, the limiting factor becomes the processing of the callback and the subsequent rescheduling, while with the 2-message approach, when one message is finished, the other message can already get executed while the callback of the first message is still running. This makes a difference for the pipelined DMA driver, as we can leverage the DMA/ARM code concurrency - a rough sketch of this ping-pong scheduling follows right below.
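Here is that rough sketch of the ping-pong scheduling (illustration only, not actual driver code; the optional prepare call from above would naturally fit into the setup as well):

#include <linux/slab.h>
#include <linux/spi/spi.h>

/* one self-rescheduling read; two of these keep the bus/DMA pipeline busy */
struct adc_msg {
	struct spi_device  *spi;
	struct spi_message  msg;
	struct spi_transfer xfer;
	u8                  rx[2];	/* e.g. one 16-bit ADC sample per read */
};

static void adc_complete(void *context)
{
	struct adc_msg *am = context;

	/* consume am->rx here (push it into a kfifo, ...) */

	/*
	 * Re-queue this message immediately.  While this callback runs on
	 * the ARM, the *other* in-flight message can already be clocked out
	 * by the DMA engine - that concurrency is what the two-message
	 * approach is after.
	 */
	spi_async(am->spi, &am->msg);
}

static struct adc_msg *adc_msg_start(struct spi_device *spi)
{
	struct adc_msg *am = kzalloc(sizeof(*am), GFP_KERNEL);

	if (!am)
		return NULL;

	am->spi         = spi;
	am->xfer.rx_buf = am->rx;
	am->xfer.len    = sizeof(am->rx);

	spi_message_init(&am->msg);
	spi_message_add_tail(&am->xfer, &am->msg);
	am->msg.complete = adc_complete;
	am->msg.context  = am;

	spi_async(spi, &am->msg);
	return am;
}

/* in probe(): call adc_msg_start(spi) twice to have two messages in flight */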
  (Note that to work effectively it might in the end require more than 2 messages, as there might still be situations where the DMA driver stalls because there is no message left in the pipeline.)

  Important metrics for this test are:
  * higher means better:
    * SPI message throughput
  * lower means better:
    * system CPU%
    * interrupts
    * context switches

* Maybe also a version of the above that mimics reading a MAX187 ADC, which requires the clock to stay idle for 8.5us after CS goes down, during which the conversion runs - only then is the data valid and may the clock start for the transfer.

* The optimized mcp2515 driver that I have been writing, running in loopback mode and connected to a CAN bus at 500kHz that is 100% saturated due to retransmits of unacknowledged CAN messages.

  Important metrics for this test are:
  * higher means better:
    * total number of messages processed (there is a top limit)
  * lower means better:
    * ratio of messages processed from the second HW buffer versus the first HW buffer
    * number of buffer overruns resulting in packet loss
    * number of overruns not even detected due to latencies that miss a second frame
    * system CPU%
    * interrupts
    * context switches

Both drivers would have a parameter to enable/disable the spi_prepare_message(spi, message) calls during their initial setup, based on a module parameter (a minimal sketch of such a toggle is appended after my signature).

Would these scenarios be good enough to finally convince you, if the numbers turn out favorable?

Martin

P.S.: You know, it would be a LOT of work to get such values!
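Appended here is that minimal sketch of the enable/disable toggle (again assuming the proposed spi_prepare_message() extension, which is not an existing mainline call):

#include <linux/module.h>
#include <linux/spi/spi.h>

static bool use_prepare = true;
module_param(use_prepare, bool, 0444);
MODULE_PARM_DESC(use_prepare,
		 "prepare SPI messages once during setup (benchmark toggle)");

static int example_setup_messages(struct spi_device *spi,
				  struct spi_message *msg)
{
	if (!use_prepare)
		return 0;	/* baseline: full validation on every spi_async() */

	/* proposed optional extension: validate + pre-build DMA chain once */
	return spi_prepare_message(spi, msg);
}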