* High CPU load when using MAX14830 SPI UART controller
@ 2020-05-19 14:33 Thomas Petazzoni
  2020-05-19 15:24 ` Mark Brown
  2020-05-20 10:44 ` Jan Kundrát
  0 siblings, 2 replies; 11+ messages in thread
From: Thomas Petazzoni @ 2020-05-19 14:33 UTC (permalink / raw)
  To: linux-spi; +Cc: Mark Brown, Jan Kundrát

Hello,

I have an ARM platform based on the Microchip SAMA5D3 processor (a
single Cortex-A5 at 536 MHz), which has a MAX14830 UART controller
(datasheet at [0], driver at drivers/tty/serial/max310x.c) on SPI. I'm
running an unmodified Linux 5.6 kernel. The SPI controller driver for
that platform is drivers/spi/spi-atmel.c.

[0] https://datasheets.maximintegrated.com/en/ds/MAX14830.pdf

I have a remote machine sending data at 57600 bps over one of the four
UARTs provided by the MAX14830. The data is sent 20 bytes at a time,
every 16 ms. This means 1250 bytes per second in total, i.e. a pretty
ridiculous amount of data being transferred. But interestingly, this
simple data transfer causes a constant 32% CPU load on the SAMA5D3,
all of it on the kernel side. top shows:

Mem: 38204K used, 208404K free, 156K shrd, 4672K buff, 14660K cached
CPU:   0% usr  32% sys   0% nic  66% idle   0% io   0% irq   0% sirq

So, 32% of the time is spent in the "system". More precisely, it shows:

  PID  PPID USER     STAT   VSZ %VSZ %CPU COMMAND
   80     2 root     SW       0   0%  21% [irq/50-spi1.3]
   79     2 root     SW       0   0%   5% [spi1]
  174     2 root     IW       0   0%   5% [kworker/u2:2-ev]
   82     2 root     IW       0   0%   2% [kworker/0:2-eve]
  256   243 root     S     1356   1%   1% ./progread /dev/ttyMAX0

"progread" is the program reading the data out of the UART (it does
nothing with it, except check that it is valid). irq/50-spi1.3 is the
threaded IRQ of the MAX14830 driver; spi1 is, I assume, the kthread of
the SPI subsystem.

With 20-byte data frames being sent every 16 ms, frames arrive ~62.5
times per second.
Looking at /proc/interrupts, I get ~260 interrupts per second from the
MAX14830, which means that the "batch reading" implemented in commit
2b4bac48c10848ccffd484e7cd025dc085c1bd32 is effective, but not to the
point where I have a single interrupt per received frame. Every second,
I'm also getting 7500-7700 interrupts from the SPI controller of the
SAMA5.

"perf top" is not extremely helpful:

   PerfTop: 1097 irqs/sec  kernel:90.0%  exact: 0.0% [4000Hz cycles], (all, 1 CPU)
-------------------------------------------------------------------------------

    40.10%  [kernel]       [k] finish_task_switch
     8.52%  [kernel]       [k] atmel_spi_transfer_one_message
     3.92%  [kernel]       [k] schedule_timeout
     2.41%  [kernel]       [k] __softirqentry_text_start
     2.32%  [kernel]       [k] __sched_text_start
     2.26%  [kernel]       [k] arch_cpu_idle
     1.92%  [kernel]       [k] __spi_pump_messages
     1.79%  [kernel]       [k] spi_finalize_current_message
     1.62%  [kernel]       [k] __spi_validate
     1.29%  [kernel]       [k] schedule
     0.84%  [kernel]       [k] _regmap_raw_read
     0.83%  [kernel]       [k] complete
     0.76%  [kernel]       [k] __spi_sync
     0.76%  [kernel]       [k] __ksize
     0.65%  [kernel]       [k] _regmap_read
     0.64%  libc-2.30.so   [.] strcmp
     0.60%  [kernel]       [k] mmioset
     0.59%  [kernel]       [k] del_timer.part.0
     0.58%  [kernel]       [k] wait_for_completion_timeout
     0.57%  [kernel]       [k] regmap_readable

perf record/report gives:

    19.42%  spi1             [kernel.kallsyms]  [k] finish_task_switch
    17.42%  kworker/u2:1-ev  [kernel.kallsyms]  [k] finish_task_switch
     9.22%  irq/50-spi1.3    [kernel.kallsyms]  [k] atmel_spi_transfer_one_message
     7.76%  irq/50-spi1.3    [kernel.kallsyms]  [k] finish_task_switch
     4.15%  irq/50-spi1.3    [kernel.kallsyms]  [k] schedule_timeout
     4.15%  kworker/0:2-eve  [kernel.kallsyms]  [k] finish_task_switch
     2.25%  swapper          [kernel.kallsyms]  [k] arch_cpu_idle
     1.94%  irq/50-spi1.3    [kernel.kallsyms]  [k] __spi_pump_messages
     1.84%  irq/50-spi1.3    [kernel.kallsyms]  [k] __spi_validate
     1.78%  irq/50-spi1.3    [kernel.kallsyms]  [k] spi_finalize_current_message

Does the very high CPU time spent in finish_task_switch() simply mean
that we're scheduling too much? Any hints on how to improve this
situation?

Best regards,

Thomas Petazzoni
--
Thomas Petazzoni, CTO, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: High CPU load when using MAX14830 SPI UART controller
  2020-05-19 14:33 High CPU load when using MAX14830 SPI UART controller Thomas Petazzoni
@ 2020-05-19 15:24 ` Mark Brown
  2020-05-20 10:18   ` Thomas Petazzoni
  2020-05-20 10:44 ` Jan Kundrát
  1 sibling, 1 reply; 11+ messages in thread
From: Mark Brown @ 2020-05-19 15:24 UTC (permalink / raw)
  To: Thomas Petazzoni; +Cc: linux-spi, Jan Kundrát

[-- Attachment #1: Type: text/plain, Size: 1983 bytes --]

On Tue, May 19, 2020 at 04:33:53PM +0200, Thomas Petazzoni wrote:

> perf record/report gives:

>     19.42%  spi1             [kernel.kallsyms]  [k] finish_task_switch
>     17.42%  kworker/u2:1-ev  [kernel.kallsyms]  [k] finish_task_switch
>      9.22%  irq/50-spi1.3    [kernel.kallsyms]  [k] atmel_spi_transfer_one_message
>      7.76%  irq/50-spi1.3    [kernel.kallsyms]  [k] finish_task_switch
>      4.15%  irq/50-spi1.3    [kernel.kallsyms]  [k] schedule_timeout
>      4.15%  kworker/0:2-eve  [kernel.kallsyms]  [k] finish_task_switch
>      2.25%  swapper          [kernel.kallsyms]  [k] arch_cpu_idle
>      1.94%  irq/50-spi1.3    [kernel.kallsyms]  [k] __spi_pump_messages
>      1.84%  irq/50-spi1.3    [kernel.kallsyms]  [k] __spi_validate
>      1.78%  irq/50-spi1.3    [kernel.kallsyms]  [k] spi_finalize_current_message

> Does the very high CPU time spent in finish_task_switch() simply mean
> that we're scheduling too much?

> Any hints on how to improve this situation?

Not thinking about this too deeply...

The copybreak for DMA in spi-atmel is 16 bytes, so with 20-byte reads
(assuming that's what actually ends up as a SPI transfer) it might be
worth looking at tuning that and seeing if PIO helps, though I'm not
seeing any DMA stuff showing up there.

The other thing that jumps out when looking at the code is that in the
interrupts used to complete transfers, the driver will just complete()
and wake the main transfer thread up; for the reads, where we always
have two transfers per message, this will cause an extra context
switch. If the driver were able to initiate the next transfer directly
from interrupt context, then it should schedule less (and generally
reduce latency too).

I can't really see much obvious in the serial driver - it might want to
consider error checking, or if it doesn't care whether transfers
complete it might want to switch to async SPI calls, but nothing that
looks like there's anything obvious SPI-related that you could do here.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: High CPU load when using MAX14830 SPI UART controller
  2020-05-19 15:24 ` Mark Brown
@ 2020-05-20 10:18   ` Thomas Petazzoni
  2020-05-20 11:26     ` Mark Brown
  0 siblings, 1 reply; 11+ messages in thread
From: Thomas Petazzoni @ 2020-05-20 10:18 UTC (permalink / raw)
  To: Mark Brown; +Cc: linux-spi, Jan Kundrát, Alexandre Belloni, nicolas.ferre

Hello Mark,

Thanks a lot for your fast answer and feedback!

On Tue, 19 May 2020 16:24:49 +0100 Mark Brown <broonie@kernel.org> wrote:

> > Does the very high CPU time spent in finish_task_switch() simply mean
> > that we're scheduling too much?
> > Any hints on how to improve this situation?
>
> Not thinking about this too deeply...
>
> The copybreak for DMA in spi-atmel is 16 bytes, so with 20-byte reads
> (assuming that's what actually ends up as a SPI transfer) it might be
> worth looking at tuning that and seeing if PIO helps, though I'm not
> seeing any DMA stuff showing up there.

I've already tried disabling DMA entirely by dropping the 'dmas'
property from the DT, and it didn't change anything, even though some
transfers were larger than 16 bytes: in practice, with how the UART SPI
controller works, whenever I'm receiving 20 bytes, I'm typically seeing
one transfer of 17 or 18 bytes, and then another one or two transfers
of 1-2 bytes each.

> The other thing that jumps out when looking at the code is that in the
> interrupts used to complete transfers, the driver will just complete()
> and wake the main transfer thread up; for the reads, where we always
> have two transfers per message, this will cause an extra context
> switch. If the driver were able to initiate the next transfer directly
> from interrupt context, then it should schedule less (and generally
> reduce latency too).

Indeed, I see. The spi-atmel driver is however really structured around
synchronously handling each xfer of the spi_message. I guess what we
would want is for atmel_spi_transfer_one_message() to start the first
transfer, and then wait on a completion that indicates the completion
of *all* xfers. It would then be the interrupt handler that dequeues
the next xfer from the spi_message and submits it, so that we don't go
back to the calling thread for each xfer.

> I can't really see much obvious in the serial driver - it might want to
> consider error checking

I'm not sure what you mean by "it might want to consider error
checking". Could you explain?

> , or if it doesn't care whether transfers complete it might want to
> switch to async SPI calls, but nothing that looks like there's anything
> obvious SPI-related that you could do here.

I'm not sure how it can use async transfers. Most of the transfers are
initiated using regmap_read(), which is inherently synchronous: we need
the value of the register being read to decide whether we received some
data, how much data we've received, etc. The only thing that could be
asynchronous, I guess, is retrieving the data from the UART RX FIFO:
the completion callback could push this data up to the TTY layer
instead.

On the MAX14830 driver side, I've however noticed that the tx_work
workqueue is woken up every time the TXEMPTY interrupt fires, and this
happens even in my RX-only workload. I've disabled it entirely; it
saves a few % of CPU time, but really not much.

Mark: in this situation, where a threaded interrupt handler in max310x
is doing SPI transfers, are those SPI transfers then offloaded to the
SPI workqueue, causing another context switch, or are they done
directly within the context of the interrupt thread? I see __spi_sync()
has some logic to push out the messages from the calling context, I
guess as opposed to offloading them to the SPI workqueue?

Overall, do we consider it "normal" to have a ~30% CPU load for what
looks like a very light workload?

Thanks!

Thomas
--
Thomas Petazzoni, CTO, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: High CPU load when using MAX14830 SPI UART controller
  2020-05-20 10:18 ` Thomas Petazzoni
@ 2020-05-20 11:26   ` Mark Brown
  2020-07-21 13:39     ` Thomas Petazzoni
  0 siblings, 1 reply; 11+ messages in thread
From: Mark Brown @ 2020-05-20 11:26 UTC (permalink / raw)
  To: Thomas Petazzoni
  Cc: linux-spi, Jan Kundrát, Alexandre Belloni, nicolas.ferre

[-- Attachment #1: Type: text/plain, Size: 5949 bytes --]

On Wed, May 20, 2020 at 12:18:19PM +0200, Thomas Petazzoni wrote:
> Mark Brown <broonie@kernel.org> wrote:

> > The other thing that jumps out when looking at the code is that in the
> > interrupts used to complete transfers, the driver will just complete()
> > and wake the main transfer thread up; for the reads, where we always
> > have two transfers per message, this will cause an extra context
> > switch. If the driver were able to initiate the next transfer directly
> > from interrupt context, then it should schedule less (and generally
> > reduce latency too).

> Indeed, I see. The spi-atmel driver is however really structured around
> synchronously handling each xfer of the spi_message. I guess what we
> would want is for atmel_spi_transfer_one_message() to start the first
> transfer, and then wait on a completion that indicates the completion
> of *all* xfers. It would then be the interrupt handler that dequeues
> the next xfer from the spi_message and submits it, so that we don't go
> back to the calling thread for each xfer.

Right. You'd need to rearrange it to go through the entire message
setting up any DMA mappings that will be needed, then hand everything
off to interrupt context - or have extra complexity and optionally go
back up to task context if there's anything complicated to handle.

It might also be possible to rearrange the transfers so that the entire
message is done by the hardware, if you can chain things together,
though that only helps with DMA; it would involve adding dummy TX and
RX segments and might be more trouble than it's worth.
> > I can't really see much obvious in the serial driver - it might want to
> > consider error checking

> I'm not sure what you mean by "it might want to consider error
> checking". Could you explain?

It wasn't checking the return values of SPI API calls.

> > , or if it doesn't care whether transfers complete it might want to
> > switch to async SPI calls, but nothing that looks like there's
> > anything obvious SPI-related that you could do here.

> I'm not sure how it can use async transfers. Most of the transfers are
> initiated using regmap_read(), which is inherently synchronous: we need
> the value of the register being read to decide if we received some

Yeah, that's mostly useful on the transmit side I think.

> data, how much data we've received, etc. The only thing that could be
> asynchronous I guess is retrieving the data from the UART RX FIFO: the
> completion callback could push this data up to the TTY layer instead.

That might save you some context thrashing, or allow batching of
switches.

> Mark: in this situation, where a threaded interrupt handler in max310x
> is doing SPI transfers, are those SPI transfers then offloaded to the
> SPI workqueue, causing another context switch, or are they done
> directly within the context of the interrupt thread? I see __spi_sync()
> has some logic to push out the messages from the calling context, I
> guess as opposed to offloading them to the SPI workqueue?

It should be doing transfers in the calling context if the controller
is idle; the SPI thread should only be used when the controller is
already busy, or to clean up when the controller goes idle (which will
unfortunately happen rather a lot in your use case).

Actually, looking at the code in __spi_pump_messages() again, I think
that in the case where we don't have any cleanup to do we should be
able to avoid kicking the thread, which should help a bit for
spi-atmel. Can you give the patch below a go (compile tested only, not
even tried to boot)?
You can generally get a good idea of what's going on with regard to
context switching at the SPI level from the SPI tracepoints, and about
any latencies in there too.

> Overall, do we consider it "normal" to have a ~30% CPU load for what
> looks like a very light workload?

It feels a bit high. It might possibly go down under load if there's
other stuff going on, though I don't think that applies in this use
case - it's mainly cases where we can avoid idling the hardware due to
having a lot of batched up work.

diff --git a/drivers/spi/spi.c b/drivers/spi/spi.c
index c083ee3995e4..9bfc622466c7 100644
--- a/drivers/spi/spi.c
+++ b/drivers/spi/spi.c
@@ -1311,6 +1311,14 @@ void spi_finalize_current_transfer(struct spi_controller *ctlr)
 }
 EXPORT_SYMBOL_GPL(spi_finalize_current_transfer);
 
+static void spi_idle_runtime_pm(struct spi_controller *ctlr)
+{
+	if (ctlr->auto_runtime_pm) {
+		pm_runtime_mark_last_busy(ctlr->dev.parent);
+		pm_runtime_put_autosuspend(ctlr->dev.parent);
+	}
+}
+
 /**
  * __spi_pump_messages - function which processes spi message queue
  * @ctlr: controller to process queue for
@@ -1355,10 +1363,17 @@ static void __spi_pump_messages(struct spi_controller *ctlr, bool in_kthread)
 		return;
 	}
 
-	/* Only do teardown in the thread */
+	/* Defer any non-atomic teardown to the thread */
 	if (!in_kthread) {
-		kthread_queue_work(&ctlr->kworker,
-				   &ctlr->pump_messages);
+		if (!ctlr->dummy_rx && !ctlr->dummy_tx &&
+		    !ctlr->unprepare_transfer_hardware) {
+			spi_idle_runtime_pm(ctlr);
+			ctlr->busy = false;
+			trace_spi_controller_idle(ctlr);
+		} else {
+			kthread_queue_work(&ctlr->kworker,
+					   &ctlr->pump_messages);
+		}
 		spin_unlock_irqrestore(&ctlr->queue_lock, flags);
 		return;
 	}
@@ -1375,10 +1390,7 @@ static void __spi_pump_messages(struct spi_controller *ctlr, bool in_kthread)
 	    ctlr->unprepare_transfer_hardware(ctlr))
 		dev_err(&ctlr->dev,
 			"failed to unprepare transfer hardware\n");
-	if (ctlr->auto_runtime_pm) {
-		pm_runtime_mark_last_busy(ctlr->dev.parent);
-		pm_runtime_put_autosuspend(ctlr->dev.parent);
-	}
+	spi_idle_runtime_pm(ctlr);
 	trace_spi_controller_idle(ctlr);
 
 	spin_lock_irqsave(&ctlr->queue_lock, flags);

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply related	[flat|nested] 11+ messages in thread
* Re: High CPU load when using MAX14830 SPI UART controller
  2020-05-20 11:26 ` Mark Brown
@ 2020-07-21 13:39   ` Thomas Petazzoni
  2020-07-21 22:30     ` Mark Brown
  0 siblings, 1 reply; 11+ messages in thread
From: Thomas Petazzoni @ 2020-07-21 13:39 UTC (permalink / raw)
  To: Mark Brown; +Cc: linux-spi, Jan Kundrát, Alexandre Belloni, nicolas.ferre

Hello Mark,

Sorry for the very long delay in getting back to you on this. I'm
finally back on this topic.

On Wed, 20 May 2020 12:26:59 +0100 Mark Brown <broonie@kernel.org> wrote:

> Right. You'd need to rearrange it to go through the entire message
> setting up any DMA mappings that will be needed, then hand everything
> off to interrupt context - or have extra complexity and optionally go
> back up to task context if there's anything complicated to handle.

This overall requires some fairly significant surgery in the spi-atmel
driver, which has code paths for PIO, peripheral DMA and dmaengine DMA,
all of which would need to be adjusted.

> > I'm not sure what you mean by "it might want to consider error
> > checking". Could you explain?
>
> It wasn't checking the return values of SPI API calls.

Right, but this is unlikely to be the source of the CPU consumption
issue we're seeing.

> > Mark: in this situation, where a threaded interrupt handler in max310x
> > is doing SPI transfers, are those SPI transfers then offloaded to the
> > SPI workqueue, causing another context switch, or are they done
> > directly within the context of the interrupt thread? I see
> > __spi_sync() has some logic to push out the messages from the calling
> > context, I guess as opposed to offloading them to the SPI workqueue?
>
> It should be doing transfers in the calling context if the controller
> is idle; the SPI thread should only be used when the controller is
> already busy, or to clean up when the controller goes idle (which will
> unfortunately happen rather a lot in your use case).
>
> Actually, looking at the code in __spi_pump_messages() again, I think
> that in the case where we don't have any cleanup to do we should be
> able to avoid kicking the thread, which should help a bit for
> spi-atmel. Can you give the patch below a go (compile tested only, not
> even tried to boot)?

I gave it a try, and unfortunately there is no difference. But that is
not too surprising, as top shows something like this:

   80     2 root     SW       0   0%  24% [irq/50-spi1.3]
   57     2 root     IW       0   0%   2% [kworker/0:1-eve]

So it's not the kthread of the SPI subsystem that is consuming most of
the CPU time, but really the threaded handler of the MAX14830 driver.
What your patch does, if I understand correctly, is avoid offloading
some cleanup work to the SPI subsystem kthread when it can be done in
the current context.

> You can generally get a good idea of what's going on with regard to
> context switching at the SPI level from the SPI tracepoints, and about
> any latencies in there too.

I'll have a look there, but I don't really have a latency issue, rather
a CPU consumption issue, which is quite different.

Thanks,

Thomas
--
Thomas Petazzoni, CTO, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: High CPU load when using MAX14830 SPI UART controller
  2020-07-21 13:39 ` Thomas Petazzoni
@ 2020-07-21 22:30   ` Mark Brown
  0 siblings, 0 replies; 11+ messages in thread
From: Mark Brown @ 2020-07-21 22:30 UTC (permalink / raw)
  To: Thomas Petazzoni
  Cc: linux-spi, Jan Kundrát, Alexandre Belloni, nicolas.ferre

[-- Attachment #1: Type: text/plain, Size: 2197 bytes --]

On Tue, Jul 21, 2020 at 03:39:44PM +0200, Thomas Petazzoni wrote:
> Mark Brown <broonie@kernel.org> wrote:

> > > I'm not sure what you mean by "it might want to consider error
> > > checking". Could you explain?

> > It wasn't checking the return values of SPI API calls.

> Right, but this is unlikely to be the source of the CPU consumption
> issue we're seeing.

Indeed, that was just something I noticed while reviewing.

> > Actually, looking at the code in __spi_pump_messages() again, I think
> > that in the case where we don't have any cleanup to do we should be
> > able to avoid kicking the thread, which should help a bit for
> > spi-atmel. Can you give the patch below a go (compile tested only, not
> > even tried to boot)?

> I gave it a try, and unfortunately there is no difference. But that is
> not too surprising, as top shows something like this:

>    80     2 root     SW       0   0%  24% [irq/50-spi1.3]
>    57     2 root     IW       0   0%   2% [kworker/0:1-eve]

> So it's not the kthread of the SPI subsystem that is consuming most of
> the CPU time, but really the threaded handler of the MAX14830 driver.
> What your patch does, if I understand correctly, is avoid offloading
> some cleanup work to the SPI subsystem kthread when it can be done in
> the current context.

It does, yes. Depending on the particular performance characteristics
of the system, it can sometimes help noticeably to cut out those
context thrashes if just the context switching ends up being a
noticeable overhead, regardless of how much work each thread does when
it does get control.

In any case that's now merged in -next, and I got some testing on other
boards, so hopefully it'll help someone.

> > You can generally get a good idea of what's going on with regard to
> > context switching at the SPI level from the SPI tracepoints, and about
> > any latencies in there too.

> I'll have a look there, but I don't really have a latency issue, rather
> a CPU consumption issue, which is quite different.

Yeah, that's basically where the overhead in the SPI subsystem itself
(rather than the specific driver) tends to be, unfortunately.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: High CPU load when using MAX14830 SPI UART controller
  2020-05-19 14:33 High CPU load when using MAX14830 SPI UART controller Thomas Petazzoni
  2020-05-19 15:24 ` Mark Brown
@ 2020-05-20 10:44 ` Jan Kundrát
  2020-07-21 13:51   ` Thomas Petazzoni
  1 sibling, 1 reply; 11+ messages in thread
From: Jan Kundrát @ 2020-05-20 10:44 UTC (permalink / raw)
  To: Thomas Petazzoni; +Cc: linux-spi, Mark Brown

> With 20-byte data frames being sent every 16 ms, frames arrive ~62.5
> times per second.
>
> Looking at /proc/interrupts, I get ~260 interrupts per second from the
> MAX14830, which means that the "batch reading" implemented in commit
> 2b4bac48c10848ccffd484e7cd025dc085c1bd32 is effective, but not to the
> point where I have a single interrupt per received frame.

Hi,

I was getting something similar when I first wrote that patch. The
TL;DR version is that even though this UART has a 128-byte FIFO per
direction of each UART, I wanted my patch to stay conservative and only
batch reads in an opportunistic manner -- I did not want to risk a
possible RX FIFO overflow. That's why the IRQ handler is eager to
trigger as soon as there is some data in the RX FIFO. Our setup has no
HW flow control (no RTS/CTS, no software flow control), so one has to
hope that the IRQ latency is low enough...

The MAX14830 also supports another mode of operation where the
interrupt fires only when the RX buffer gets filled over a configurable
minimal threshold. This can be combined with yet another interrupt
which fires whenever there has been data in the RX FIFO for "too long".
Unfortunately, this timer starts over upon reception of any additional
data, so it's the *youngest* byte in the RX FIFO that controls the
triggering of this delayed interrupt.

I am not sure how to balance latency ("how soon is userspace notified
about this byte") with CPU utilization ("don't read data byte-by-byte
from this 128-byte buffer"). If there's some cross-driver support for
"something like that" in the TTY layer, I'm sure that max310x.c could
be adapted to make use of it. Otherwise, we could always add a pair of
DT properties for controlling:

a) the "start reading from RX FIFO" timeout,
b) "allow up to X bytes in the RX FIFO; I know that my platform has
   enough CPU and SPI bandwidth to finish reading that before the RX
   FIFO overflows"

BTW, there are also real SPI bus throughput limitations, see
2258761213cb239e5e6c11b4ec9b1700fcb4fdcd for some numbers from my
platform. The chip supports up to 26 MHz (but not all bus masters have
clock granularities fine enough to use this), i.e. 3.25 MBps of raw
throughput divided among four UARTs. Reading each register takes at
least two bytes; one has to check the top-level IRQ status register (to
decide which UART to check), then read the per-UART IRQ registers, and
only then start reading the data. Also, batched reading can only happen
if userspace explicitly ignores framing errors, parity errors, etc.;
otherwise the driver has to read byte-by-byte, check the line status
register, and so on.

Anyway, I'll be happy to review and test any patches you might have;
we're using these chips and we are interested in their performance and
error-free operation.

With kind regards,
Jan

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: High CPU load when using MAX14830 SPI UART controller
  2020-05-20 10:44 ` Jan Kundrát
@ 2020-07-21 13:51   ` Thomas Petazzoni
  2020-07-21 14:20     ` Jan Kundrát
  0 siblings, 1 reply; 11+ messages in thread
From: Thomas Petazzoni @ 2020-07-21 13:51 UTC (permalink / raw)
  To: Jan Kundrát; +Cc: linux-spi, Mark Brown

Hello Jan,

On Wed, 20 May 2020 12:44:43 +0200 Jan Kundrát <jan.kundrat@cesnet.cz> wrote:

> I was getting something similar when I first wrote that patch. The
> TL;DR version is that even though this UART has a 128-byte FIFO per
> direction of each UART, I wanted my patch to stay conservative and only
> batch reads in an opportunistic manner -- I did not want to risk a
> possible RX FIFO overflow. That's why the IRQ handler is eager to
> trigger as soon as there is some data in the RX FIFO. Our setup has no
> HW flow control (no RTS/CTS, no software flow control), so one has to
> hope that the IRQ latency is low enough...
>
> The MAX14830 also supports another mode of operation where the
> interrupt fires only when the RX buffer gets filled over a configurable
> minimal threshold. This can be combined with yet another interrupt
> which fires whenever there has been data in the RX FIFO for "too long".
> Unfortunately, this timer starts over upon reception of any additional
> data, so it's the *youngest* byte in the RX FIFO that controls the
> triggering of this delayed interrupt.

Do you have some more details about this mode of operation? I looked a
bit at the MAX14830 datasheet and code, but couldn't spot what allows
an interrupt to be triggered after X bytes have been received *OR* if
data has been sitting in the FIFO for too long.

> I am not sure how to balance latency ("how soon is userspace notified
> about this byte") with CPU utilization ("don't read data byte-by-byte
> from this 128-byte buffer"). If there's some cross-driver support for
> "something like that" in the TTY layer, I'm sure that max310x.c could
> be adapted to make use of it. Otherwise, we could always add a pair of
> DT properties for controlling:
>
> a) the "start reading from RX FIFO" timeout,
> b) "allow up to X bytes in the RX FIFO; I know that my platform has
>    enough CPU and SPI bandwidth to finish reading that before the RX
>    FIFO overflows"
>
> BTW, there are also real SPI bus throughput limitations, see
> 2258761213cb239e5e6c11b4ec9b1700fcb4fdcd for some numbers from my
> platform. The chip supports up to 26 MHz (but not all bus masters have
> clock granularities fine enough to use this), i.e. 3.25 MBps of raw
> throughput divided among four UARTs. Reading each register takes at
> least two bytes; one has to check the top-level IRQ status register (to
> decide which UART to check), then read the per-UART IRQ registers, and
> only then start reading the data. Also, batched reading can only happen
> if userspace explicitly ignores framing errors, parity errors, etc.;
> otherwise the driver has to read byte-by-byte, check the line status
> register, and so on.

Yes, there is definitely some significant overhead with all those SPI
transactions. But here I'm mostly concerned about CPU usage rather than
the actual bandwidth on the UART.

Thomas
--
Thomas Petazzoni, CTO, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: High CPU load when using MAX14830 SPI UART controller
  2020-07-21 13:51 ` Thomas Petazzoni
@ 2020-07-21 14:20   ` Jan Kundrát
  2020-07-21 21:49     ` Andy Shevchenko
  2020-07-28  8:22     ` Thomas Petazzoni
  0 siblings, 2 replies; 11+ messages in thread
From: Jan Kundrát @ 2020-07-21 14:20 UTC (permalink / raw)
  To: Thomas Petazzoni; +Cc: linux-spi, Mark Brown

> > The MAX14830 also supports another mode of operation where the
> > interrupt fires only when the RX buffer gets filled over a
> > configurable minimal threshold. This can be combined with yet another
> > interrupt which fires whenever there has been data in the RX FIFO for
> > "too long". Unfortunately, this timer starts over upon reception of
> > any additional data, so it's the *youngest* byte in the RX FIFO that
> > controls the triggering of this delayed interrupt.
>
> Do you have some more details about this mode of operation? I looked a
> bit at the MAX14830 datasheet and code, but couldn't spot what allows
> an interrupt to be triggered after X bytes have been received *OR* if
> data has been sitting in the FIFO for too long.

I have no code, but according to the datasheet, it's the "RTimeout" bit
(Line Status Register, bit 0). If properly configured (RxTimeOut set
and the interrupt routing enabled via LSRIntEn[0], the "RTimeoutIEn"
bit), we're supposed to get ISR[0] set upon this timeout.

I have not tried it, I just read the datasheet a few years ago. When
you have patches, I'll be happy to test them (likely only in September,
though, because of vacations).

With kind regards,
Jan

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: High CPU load when using MAX14830 SPI UART controller
  2020-07-21 14:20 ` Jan Kundrát
@ 2020-07-21 21:49   ` Andy Shevchenko
  0 siblings, 0 replies; 11+ messages in thread
From: Andy Shevchenko @ 2020-07-21 21:49 UTC (permalink / raw)
  To: Jan Kundrát; +Cc: Thomas Petazzoni, linux-spi, Mark Brown

On Tue, Jul 21, 2020 at 5:32 PM Jan Kundrát <jan.kundrat@cesnet.cz> wrote:

> I have no code, but according to the datasheet, it's the "RTimeout" bit
> (Line Status Register, bit 0). If properly configured (RxTimeOut set
> and the interrupt routing enabled via LSRIntEn[0], the "RTimeoutIEn"
> bit), we're supposed to get ISR[0] set upon this timeout.

And usually the timeout condition means no data for the time it takes
to receive 4 more words at the given baud rate and word size. (For
8-bit words at 115200, that is a time of 10*4 / 115200 ~= 350 us.)

--
With Best Regards,
Andy Shevchenko

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: High CPU load when using MAX14830 SPI UART controller 2020-07-21 14:20 ` Jan Kundrát 2020-07-21 21:49 ` Andy Shevchenko @ 2020-07-28 8:22 ` Thomas Petazzoni 1 sibling, 0 replies; 11+ messages in thread From: Thomas Petazzoni @ 2020-07-28 8:22 UTC (permalink / raw) To: Jan Kundrát; +Cc: linux-spi, Mark Brown Hello Jan, On Tue, 21 Jul 2020 16:20:02 +0200 Jan Kundrát <jan.kundrat@cesnet.cz> wrote: > I have no code, but according to the datasheet, it's the "RTimeout" bit > (Line Status Register, bit 0). If properly configured (RxTimeOut set and > the interrupt routing enabled via LSRIntEn[0], the "RTimeoutIEn" bit), > we're supposed to get ISR[0] set upon this timeout. > > I have not tried it, I just read the datasheet a few years ago. When you > have patches, I'll be happy to test them (likely only in September, though, > because of vacations). Wow, this was really useful! I had somehow missed this when glancing through datasheet, but now that you pointed out, I implemented the following patch, which: - Uses the RX timeout interrupt to fire an interrupt when there's data in the RX FIFO *and* no characters have been received for a duration equivalent to the reception time of 4 characters. - Uses the RX FIFO trigger interrupt to trigger an interrupt when the RX FIFO is half full. This ensure that if we have a continuous flow of data, we do get interrupts. Thanks to that, in my scenario of receiving 20 bytes of data every 16ms, instead of having multiple interrupts each picking max 3-4 bytes of data from the RX FIFO, I get a single interrupt that picks up the full 20 bytes of data in one go. Result: CPU consumption goes down from 25% to 5%. Here is the patch that I have so far. I'm waiting on more testing to happen, and if more extensive testing is successful, I'll submit the patch properly. The question is what are the right thresholds. In a separate e-mail, Andy Shevchenko suggested the "4 characters timeout", which I used arbitrarily. 
Same question for the RX FIFO trigger at half the FIFO size.

Best regards,

Thomas

commit e958c6087aa889f421323314cb33ad9756ee033e
Author: Thomas Petazzoni <thomas.petazzoni@bootlin.com>
Date:   Wed Jul 22 16:18:14 2020 +0200

    serial: max310x: rework RX interrupt handling

    Currently, the RX interrupt logic uses the RXEMPTY interrupt, with
    the RXEMPTYINV bit set, which means we get an RX interrupt as soon
    as the RX FIFO is non-empty. However, with the MAX310X having a
    FIFO of 128 bytes, this makes very poor use of the FIFO: we trigger
    an interrupt as soon as the RX FIFO has a single byte in it, which
    means a lot of interrupts, each collecting only a few bytes from
    the FIFO, causing a significant CPU load.

    Instead, this commit relies on two other RX interrupt events:

    - MAX310X_IRQ_RXFIFO_BIT, which triggers when the RX FIFO has
      reached a certain threshold, which we define to be half of the
      FIFO size. This ensures we get an interrupt before the RX FIFO
      fills up.

    - MAX310X_LSR_RXTO_BIT, which triggers when the RX FIFO has
      received some bytes and then no more bytes are received for a
      certain time. Arbitrarily, this time is defined as the time it
      takes to receive 4 characters.

    On a Microchip SAMA5D3 platform that is receiving 20 bytes every
    16 ms over one MAX310X UART, this patch has allowed reducing the
    CPU consumption of the interrupt handler thread from ~25% to 6-7%.
    Signed-off-by: Thomas Petazzoni <thomas.petazzoni@bootlin.com>

diff --git a/drivers/tty/serial/max310x.c b/drivers/tty/serial/max310x.c
index 8434bd5a8ec78..a1c80850d77ed 100644
--- a/drivers/tty/serial/max310x.c
+++ b/drivers/tty/serial/max310x.c
@@ -1056,9 +1056,9 @@ static int max310x_startup(struct uart_port *port)
 	max310x_port_update(port, MAX310X_MODE1_REG,
 			    MAX310X_MODE1_TRNSCVCTRL_BIT, 0);

-	/* Configure MODE2 register & Reset FIFOs*/
-	val = MAX310X_MODE2_RXEMPTINV_BIT | MAX310X_MODE2_FIFORST_BIT;
-	max310x_port_write(port, MAX310X_MODE2_REG, val);
+	/* Reset FIFOs */
+	max310x_port_write(port, MAX310X_MODE2_REG,
+			   MAX310X_MODE2_FIFORST_BIT);
 	max310x_port_update(port, MAX310X_MODE2_REG,
 			    MAX310X_MODE2_FIFORST_BIT, 0);

@@ -1086,8 +1086,27 @@ static int max310x_startup(struct uart_port *port)
 	/* Clear IRQ status register */
 	max310x_port_read(port, MAX310X_IRQSTS_REG);

-	/* Enable RX, TX, CTS change interrupts */
-	val = MAX310X_IRQ_RXEMPTY_BIT | MAX310X_IRQ_TXEMPTY_BIT;
+	/*
+	 * Let's ask for an interrupt after a timeout equivalent to
+	 * the reception time of 4 characters after the last character
+	 * has been received.
+	 */
+	max310x_port_write(port, MAX310X_RXTO_REG, 4);
+
+	/*
+	 * Make sure we also get RX interrupts when the RX FIFO is
+	 * filling up quickly, so get an interrupt when half of the RX
+	 * FIFO has been filled.
+	 */
+	max310x_port_write(port, MAX310X_FIFOTRIGLVL_REG,
+			   MAX310X_FIFOTRIGLVL_RX(MAX310X_FIFO_SIZE / 2));
+
+	/* Enable RX timeout interrupt in LSR */
+	max310x_port_write(port, MAX310X_LSR_IRQEN_REG,
+			   MAX310X_LSR_RXTO_BIT);
+
+	/* Enable LSR, RX FIFO trigger, CTS change interrupts */
+	val = MAX310X_IRQ_LSR_BIT | MAX310X_IRQ_RXFIFO_BIT | MAX310X_IRQ_TXEMPTY_BIT;
 	max310x_port_write(port, MAX310X_IRQEN_REG, val | MAX310X_IRQ_CTS_BIT);

 	return 0;

-- 
Thomas Petazzoni, CTO, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com

^ permalink raw reply related	[flat|nested] 11+ messages in thread
end of thread, other threads:[~2020-07-28  8:22 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-19 14:33 High CPU load when using MAX14830 SPI UART controller Thomas Petazzoni
2020-05-19 15:24 ` Mark Brown
2020-05-20 10:18   ` Thomas Petazzoni
2020-05-20 11:26     ` Mark Brown
2020-07-21 13:39       ` Thomas Petazzoni
2020-07-21 22:30         ` Mark Brown
2020-05-20 10:44 ` Jan Kundrát
2020-07-21 13:51   ` Thomas Petazzoni
2020-07-21 14:20     ` Jan Kundrát
2020-07-21 21:49       ` Andy Shevchenko
2020-07-28  8:22       ` Thomas Petazzoni