Linux-SPI Archive on lore.kernel.org
 help / color / Atom feed
* High CPU load when using MAX14830 SPI UART controller
@ 2020-05-19 14:33 Thomas Petazzoni
  2020-05-19 15:24 ` Mark Brown
  2020-05-20 10:44 ` Jan Kundrát
  0 siblings, 2 replies; 11+ messages in thread
From: Thomas Petazzoni @ 2020-05-19 14:33 UTC (permalink / raw)
  To: linux-spi; +Cc: Mark Brown, Jan Kundrát

Hello,

I have an ARM platform based on the Microchip SAMA5D3 processor
(single Cortex-A5 at 536 Mhz), which has a MAX14830 (datasheet at [0],
driver at drivers/tty/serial/max310x.c) UART controller on SPI. I'm
running an unmodified Linux 5.6 kernel. The SPI controller driver for
that platform is drivers/spi/spi-atmel.c.

[0] https://datasheets.maximintegrated.com/en/ds/MAX14830.pdf

I have a remote machine sending data at 57600 bps over one of the four
UARTs provided by the MAX14830. The data is sent 20 bytes at a time,
every 16 ms. This means 1250 bytes per second total, i.e a pretty
ridiculous amount of data being transferred.

But interestingly, this simple data transfer causes a constant 32% CPU
load on the SAMA5D3 in the kernel side. top shows:

Mem: 38204K used, 208404K free, 156K shrd, 4672K buff, 14660K cached
CPU:   0% usr  32% sys   0% nic  66% idle   0% io   0% irq   0% sirq

So, 32% of the time spent in the "system". More precisely, it shows:

  PID  PPID USER     STAT   VSZ %VSZ %CPU COMMAND
   80     2 root     SW       0   0%  21% [irq/50-spi1.3]
   79     2 root     SW       0   0%   5% [spi1]
  174     2 root     IW       0   0%   5% [kworker/u2:2-ev]
   82     2 root     IW       0   0%   2% [kworker/0:2-eve]
  256   243 root     S     1356   1%   1% ./progread /dev/ttyMAX0

"progread" is the program reading the data out of the UART (it does
nothing with it, except check it is valid). irq/50-spi1.3 is the
threaded IRQ of the MAX14830 driver. spi1 is I assume the kthread of
the SPI subsystem.

With the sending of 20 bytes data frames every 16 ms, it means data is
send ~62.5 per seconds.

Looking at /proc/interrupts, I get ~260 interrupts per second on the
MAX14830, which means that the "batch reading" implemented in
2b4bac48c10848ccffd484e7cd025dc085c1bd32 is effective but not to the
point where I have a single interrupt per received frame. Every
second, I'm also getting 7500-7700 interrupts from the SPI controller
of the SAMA5.

"perf top" is not extremely helpful:

   PerfTop:    1097 irqs/sec  kernel:90.0%  exact:  0.0% [4000Hz cycles],  (all, 1 CPU)
-------------------------------------------------------------------------------

    40.10%  [kernel]       [k] finish_task_switch
     8.52%  [kernel]       [k] atmel_spi_transfer_one_message
     3.92%  [kernel]       [k] schedule_timeout
     2.41%  [kernel]       [k] __softirqentry_text_start
     2.32%  [kernel]       [k] __sched_text_start
     2.26%  [kernel]       [k] arch_cpu_idle
     1.92%  [kernel]       [k] __spi_pump_messages
     1.79%  [kernel]       [k] spi_finalize_current_message
     1.62%  [kernel]       [k] __spi_validate
     1.29%  [kernel]       [k] schedule
     0.84%  [kernel]       [k] _regmap_raw_read
     0.83%  [kernel]       [k] complete
     0.76%  [kernel]       [k] __spi_sync
     0.76%  [kernel]       [k] __ksize
     0.65%  [kernel]       [k] _regmap_read
     0.64%  libc-2.30.so   [.] strcmp
     0.60%  [kernel]       [k] mmioset
     0.59%  [kernel]       [k] del_timer.part.0
     0.58%  [kernel]       [k] wait_for_completion_timeout
     0.57%  [kernel]       [k] regmap_readable

perf record/report gives:

    19.42%  spi1             [kernel.kallsyms]  [k] finish_task_switch
    17.42%  kworker/u2:1-ev  [kernel.kallsyms]  [k] finish_task_switch
     9.22%  irq/50-spi1.3    [kernel.kallsyms]  [k] atmel_spi_transfer_one_message
     7.76%  irq/50-spi1.3    [kernel.kallsyms]  [k] finish_task_switch
     4.15%  irq/50-spi1.3    [kernel.kallsyms]  [k] schedule_timeout
     4.15%  kworker/0:2-eve  [kernel.kallsyms]  [k] finish_task_switch
     2.25%  swapper          [kernel.kallsyms]  [k] arch_cpu_idle
     1.94%  irq/50-spi1.3    [kernel.kallsyms]  [k] __spi_pump_messages
     1.84%  irq/50-spi1.3    [kernel.kallsyms]  [k] __spi_validate
     1.78%  irq/50-spi1.3    [kernel.kallsyms]  [k] spi_finalize_current_message

Does the very high CPU time spent in finish_task_switch() simply means
that we're scheduling too much ?

Any hints on how to improve this situation ?

Best regards,

Thomas Petazzoni
-- 
Thomas Petazzoni, CTO, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: High CPU load when using MAX14830 SPI UART controller
  2020-05-19 14:33 High CPU load when using MAX14830 SPI UART controller Thomas Petazzoni
@ 2020-05-19 15:24 ` Mark Brown
  2020-05-20 10:18   ` Thomas Petazzoni
  2020-05-20 10:44 ` Jan Kundrát
  1 sibling, 1 reply; 11+ messages in thread
From: Mark Brown @ 2020-05-19 15:24 UTC (permalink / raw)
  To: Thomas Petazzoni; +Cc: linux-spi, Jan Kundrát


[-- Attachment #1: Type: text/plain, Size: 1983 bytes --]

On Tue, May 19, 2020 at 04:33:53PM +0200, Thomas Petazzoni wrote:

> perf record/report gives:

>     19.42%  spi1             [kernel.kallsyms]  [k] finish_task_switch
>     17.42%  kworker/u2:1-ev  [kernel.kallsyms]  [k] finish_task_switch
>      9.22%  irq/50-spi1.3    [kernel.kallsyms]  [k] atmel_spi_transfer_one_message
>      7.76%  irq/50-spi1.3    [kernel.kallsyms]  [k] finish_task_switch
>      4.15%  irq/50-spi1.3    [kernel.kallsyms]  [k] schedule_timeout
>      4.15%  kworker/0:2-eve  [kernel.kallsyms]  [k] finish_task_switch
>      2.25%  swapper          [kernel.kallsyms]  [k] arch_cpu_idle
>      1.94%  irq/50-spi1.3    [kernel.kallsyms]  [k] __spi_pump_messages
>      1.84%  irq/50-spi1.3    [kernel.kallsyms]  [k] __spi_validate
>      1.78%  irq/50-spi1.3    [kernel.kallsyms]  [k] spi_finalize_current_message

> Does the very high CPU time spent in finish_task_switch() simply means
> that we're scheduling too much ?

> Any hints on how to improve this situation ?

Not thinking about this too deeply...

The copybreak for DMA in spi-ateml is 16 bytes so at 20 byte reads
(assuming that's what actually ends up as a SPI transfer) it might be
worth looking at tuning that and seeing if PIO helps, though I'm not
seeing any DMA stuff showing up there.

The other thing that jumps out looking at the code is that in the
interrupts to complete transfers the driver will just complete() and
wake the main transfer thread up, for the reads where we always have two
transfers per message this will cause an extra context switch.  If the
driver were able to initiate the next transfer directly from interrupt
context then it should schedule less (and generally reduce latency too).

I can't really see much obvious in the serial driver - it might want to
consider error checking, or if it doesn't care if transfers complete it
might want to switch to async SPI calls, but nothing that looks like
there's anything obvious for SPI related that you could do here.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: High CPU load when using MAX14830 SPI UART controller
  2020-05-19 15:24 ` Mark Brown
@ 2020-05-20 10:18   ` Thomas Petazzoni
  2020-05-20 11:26     ` Mark Brown
  0 siblings, 1 reply; 11+ messages in thread
From: Thomas Petazzoni @ 2020-05-20 10:18 UTC (permalink / raw)
  To: Mark Brown; +Cc: linux-spi, Jan Kundrát, Alexandre Belloni, nicolas.ferre

Hello Mark,

Thanks a lot for your fast answer, and feedback!

On Tue, 19 May 2020 16:24:49 +0100
Mark Brown <broonie@kernel.org> wrote:

> > Does the very high CPU time spent in finish_task_switch() simply means
> > that we're scheduling too much ?  
> 
> > Any hints on how to improve this situation ?  
> 
> Not thinking about this too deeply...
> 
> The copybreak for DMA in spi-ateml is 16 bytes so at 20 byte reads
> (assuming that's what actually ends up as a SPI transfer) it might be
> worth looking at tuning that and seeing if PIO helps, though I'm not
> seeing any DMA stuff showing up there.

I've already tried disabling DMA entirely by dropping the 'dmas'
property from the DT, and it didn't change anything, even though some
transfers where larger than 16 bytes: in practice with how the UART SPI
controller works, whenever I'm receiving 20 bytes, I'm typically seeing
one transfer of 17 or 18 bytes, and then another one or two transfers
of 1-2 bytes each.

> The other thing that jumps out looking at the code is that in the
> interrupts to complete transfers the driver will just complete() and
> wake the main transfer thread up, for the reads where we always have two
> transfers per message this will cause an extra context switch.  If the
> driver were able to initiate the next transfer directly from interrupt
> context then it should schedule less (and generally reduce latency too).

Indeed, I see. The spi-atmel driver is however really structured around
synchronously handling each xfer of the spi_message. I guess what we
would want to see is atmel_spi_transfer_one_message() start the first
transfer, and then wait on a completion that indicates the completion
of *all* xfers. And then, it's the interrupt handler that dequeues the
next xfer from the spi_message and submits them, so that we don't go
back to the calling thread for each xfer.

> I can't really see much obvious in the serial driver - it might want to
> consider error checking

I'm not sure what you mean by "it might want to consider error
checking". Could you explain?

>, or if it doesn't care if transfers complete it might want to switch
>to async SPI calls, but nothing that looks like
> there's anything obvious for SPI related that you could do here.

I'm not sure how it can use async transfers. Most of the transfers are
initiated using regmap_read(), which inherently is synchronous: we need
the value of the register being read to decide if we received some
data, how much data we've received, etc. The only thing that could be
asynchronous I guess is retrieving the data from the UART RX FIFO: the
completion callback could push this data up to the TTY layer rather.

On the MAX14830 driver side, I've however noticed that the tx_work
workqueue is woken up everytime the TXEMPTY interrupt fires, and this
happens even in my RX-only workload. I've disabled it entirely, it
saves a few % of CPU time, but really not much.

Mark: in this situation where a threaded interrupt handler in max310x
is doing SPI transfers, are those SPI transfers then offloaded to the
SPI workqueue, causing another context switch, or are they done
directly within the context of the interrupt thread ? I see
__spi_sync() has some logic to push out the messages from the calling
context, I guess as opposed to offloading them to the SPI workqueue ?

Overall, do we consider it "normal" to have a ~30% CPU load for what
looks like a very light workload ?

Thanks!

Thomas
-- 
Thomas Petazzoni, CTO, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: High CPU load when using MAX14830 SPI UART controller
  2020-05-19 14:33 High CPU load when using MAX14830 SPI UART controller Thomas Petazzoni
  2020-05-19 15:24 ` Mark Brown
@ 2020-05-20 10:44 ` Jan Kundrát
  2020-07-21 13:51   ` Thomas Petazzoni
  1 sibling, 1 reply; 11+ messages in thread
From: Jan Kundrát @ 2020-05-20 10:44 UTC (permalink / raw)
  To: Thomas Petazzoni; +Cc: linux-spi, Mark Brown

> With the sending of 20 bytes data frames every 16 ms, it means data is
> send ~62.5 per seconds.
>
> Looking at /proc/interrupts, I get ~260 interrupts per second on the
> MAX14830, which means that the "batch reading" implemented in
> 2b4bac48c10848ccffd484e7cd025dc085c1bd32 is effective but not to the
> point where I have a single interrupt per received frame.

Hi,
I was getting something similar when I first wrote that patch. The TL;DR 
version is that even though this UART has 128 byte FIFO per each direction 
of each UART, I wanted my patch to stay conservative and only batch reads 
in an opportunistic manner -- I wanted not to risk a possible RX FIFO 
overflow. That's why the IRQ handler is eager to trigger as soon as there 
are some data in the RX FIFO. Our setup has no HW flow control (no RTS/CTS, 
no software flow control), so one has to hope that the IRQ latency is low 
enough...

The MAX14830 supports also another mode of operation where the interrupt 
fires only when the RX buffer gets filled over a configurable minimal 
threshold. This can be combined with yet another interrupt which fires 
whenever there's some data in the RX FIFO for "too long". Unfortunately, 
this timer starts over again upon reception of any additional data, so it's 
the *youngest* byte in the RX FIFO that controls triggering of this delayed 
interrupt.

I am not sure how to balance the latency ("how soon is userspace notified 
about this byte") with CPU utilization ("don't read data byte-by-byte from 
this 128 byte buffer"). If there's some cross-driver support for "something 
like that" in the TTY layer, I'm sure that max310x.c could be adopted to 
make use of that. Otherwise, we could always add a pair of DT properties 
for controling:

a) the "start reading from RX FIFO" timeout,
b) "allow up to X bytes in the RX FIFO, I know that my platform has enough 
CPU and SPI bandwidth to finish reading that before the RX FIFO overflows"

BTW, there are also real SPI bus throughput limitations, see 
2258761213cb239e5e6c11b4ec9b1700fcb4fdcd for some numbers from my platform. 
The chip supports up to 26 MHz (but not all bus masters have clock 
granularities fine enough to use this), that's 3.25 MBps of raw throughput 
divided among four UARTs. Reading each register takes at least two bytes, 
one has to check the top-level IRQ status register (to decide which UART to 
check), then read per-UART IRQ registers, and only then start reading the 
data. Also, batched reading can only happen if userspace explicitly ignores 
framing errors, parity errors, etc, otherwise the driver will have to read 
byte-by-byte, and check the line status register, etc.

Anyway, I'll be happy to review and test any patches you might have; we're 
using these chips and we are interested in their performance and error-free 
operation.

With kind regards,
Jan

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: High CPU load when using MAX14830 SPI UART controller
  2020-05-20 10:18   ` Thomas Petazzoni
@ 2020-05-20 11:26     ` Mark Brown
  2020-07-21 13:39       ` Thomas Petazzoni
  0 siblings, 1 reply; 11+ messages in thread
From: Mark Brown @ 2020-05-20 11:26 UTC (permalink / raw)
  To: Thomas Petazzoni
  Cc: linux-spi, Jan Kundrát, Alexandre Belloni, nicolas.ferre


[-- Attachment #1: Type: text/plain, Size: 5949 bytes --]

On Wed, May 20, 2020 at 12:18:19PM +0200, Thomas Petazzoni wrote:
> Mark Brown <broonie@kernel.org> wrote:

> > The other thing that jumps out looking at the code is that in the
> > interrupts to complete transfers the driver will just complete() and
> > wake the main transfer thread up, for the reads where we always have two
> > transfers per message this will cause an extra context switch.  If the
> > driver were able to initiate the next transfer directly from interrupt
> > context then it should schedule less (and generally reduce latency too).

> Indeed, I see. The spi-atmel driver is however really structured around
> synchronously handling each xfer of the spi_message. I guess what we
> would want to see is atmel_spi_transfer_one_message() start the first
> transfer, and then wait on a completion that indicates the completion
> of *all* xfers. And then, it's the interrupt handler that dequeues the
> next xfer from the spi_message and submits them, so that we don't go
> back to the calling thread for each xfer.

Right.  You'd need to rearrange it to go through the entire message
setting up DMA mappings that will be needed then had everything off to
interrupt context or have extra complexity and optionally go back up to
task context if there's anything complicated to handle.

It might also be possible to rearrange the transfers so that the entire
message is done by the hardware if you can chain things together, though
that only helps with DMA and it'll involve adding dummy TX and RX
segments and might be more trouble than it's worth.

> > I can't really see much obvious in the serial driver - it might want to
> > consider error checking

> I'm not sure what you mean by "it might want to consider error
> checking". Could you explain?

It wasn't checking the return values of SPI API calls.

> >, or if it doesn't care if transfers complete it might want to switch
> >to async SPI calls, but nothing that looks like
> > there's anything obvious for SPI related that you could do here.

> I'm not sure how it can use async transfers. Most of the transfers are
> initiated using regmap_read(), which inherently is synchronous: we need
> the value of the register being read to decide if we received some

Yeah, that's mostly useful on the transmit side I think.

> data, how much data we've received, etc. The only thing that could be
> asynchronous I guess is retrieving the data from the UART RX FIFO: the
> completion callback could push this data up to the TTY layer rather.

That might save you some context thrashing, or allow batching of
switches.

> Mark: in this situation where a threaded interrupt handler in max310x
> is doing SPI transfers, are those SPI transfers then offloaded to the
> SPI workqueue, causing another context switch, or are they done
> directly within the context of the interrupt thread ? I see
> __spi_sync() has some logic to push out the messages from the calling
> context, I guess as opposed to offloading them to the SPI workqueue ?

It should be doing transfers in calling context if the controller is
idle, the SPI thread should only be used when the controller is already
busy or to clean up when the controller goes idle (which will
unfortunately happen rather a lot in your use case).

Actually looking at the code in __spi_pump_messages() again I think that
in the case where we don't have any cleanup to do we should be able to
avoid kicking the thread for that which should help a bit for spi-atmel.
Can you give the patch below a go (compile tested only, not even tried
to boot)?

You can generally get a good idea of what's going on with regard to
context switching at the SPI level from the SPI tracepoints, and about
any latencies in there too.

> Overall, do we consider it "normal" to have a ~30% CPU load for what
> looks like a very light workload ?

It feels a bit high.  It might possibly go down under load if there's
other stuff going on, though I don't think that applies in this use case
- it's mainly cases where we can avoid idling the hardware due to having
a lot of batched up work.


diff --git a/drivers/spi/spi.c b/drivers/spi/spi.c
index c083ee3995e4..9bfc622466c7 100644
--- a/drivers/spi/spi.c
+++ b/drivers/spi/spi.c
@@ -1311,6 +1311,14 @@ void spi_finalize_current_transfer(struct spi_controller *ctlr)
 }
 EXPORT_SYMBOL_GPL(spi_finalize_current_transfer);
 
+static void spi_idle_runtime_pm(struct spi_controller *ctlr)
+{
+	if (ctlr->auto_runtime_pm) {
+		pm_runtime_mark_last_busy(ctlr->dev.parent);
+		pm_runtime_put_autosuspend(ctlr->dev.parent);
+	}
+}
+
 /**
  * __spi_pump_messages - function which processes spi message queue
  * @ctlr: controller to process queue for
@@ -1355,10 +1363,17 @@ static void __spi_pump_messages(struct spi_controller *ctlr, bool in_kthread)
 			return;
 		}
 
-		/* Only do teardown in the thread */
+		/* Defer any non-atomic teardown to the thread */
 		if (!in_kthread) {
-			kthread_queue_work(&ctlr->kworker,
-					   &ctlr->pump_messages);
+			if (!ctlr->dummy_rx && !ctlr->dummy_tx &&
+			    !ctlr->unprepare_transfer_hardware) {
+				spi_idle_runtime_pm(ctlr);
+				ctlr->busy = false;
+				trace_spi_controller_idle(ctlr);
+			} else {
+				kthread_queue_work(&ctlr->kworker,
+						   &ctlr->pump_messages);
+			}
 			spin_unlock_irqrestore(&ctlr->queue_lock, flags);
 			return;
 		}
@@ -1375,10 +1390,7 @@ static void __spi_pump_messages(struct spi_controller *ctlr, bool in_kthread)
 		    ctlr->unprepare_transfer_hardware(ctlr))
 			dev_err(&ctlr->dev,
 				"failed to unprepare transfer hardware\n");
-		if (ctlr->auto_runtime_pm) {
-			pm_runtime_mark_last_busy(ctlr->dev.parent);
-			pm_runtime_put_autosuspend(ctlr->dev.parent);
-		}
+		spi_idle_runtime_pm(ctlr);
 		trace_spi_controller_idle(ctlr);
 
 		spin_lock_irqsave(&ctlr->queue_lock, flags);

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: High CPU load when using MAX14830 SPI UART controller
  2020-05-20 11:26     ` Mark Brown
@ 2020-07-21 13:39       ` Thomas Petazzoni
  2020-07-21 22:30         ` Mark Brown
  0 siblings, 1 reply; 11+ messages in thread
From: Thomas Petazzoni @ 2020-07-21 13:39 UTC (permalink / raw)
  To: Mark Brown; +Cc: linux-spi, Jan Kundrát, Alexandre Belloni, nicolas.ferre

Hello Mark,

Sorry for the very long delay in getting back to you on this. I'm
finally back on this topic.

On Wed, 20 May 2020 12:26:59 +0100
Mark Brown <broonie@kernel.org> wrote:

> Right.  You'd need to rearrange it to go through the entire message
> setting up DMA mappings that will be needed then had everything off to
> interrupt context or have extra complexity and optionally go back up to
> task context if there's anything complicated to handle.

This overall requires some fairly significant surgery in the spi-atmel
driver, which has code paths for PIO, peripheral DMA and dmaengine DMA,
all of which need to be adjusted.

> > I'm not sure what you mean by "it might want to consider error
> > checking". Could you explain?  
> 
> It wasn't checking the return values of SPI API calls.

Right, but this is unlikely to be the source of the CPU consumption
issue we're seeing.

> > Mark: in this situation where a threaded interrupt handler in max310x
> > is doing SPI transfers, are those SPI transfers then offloaded to the
> > SPI workqueue, causing another context switch, or are they done
> > directly within the context of the interrupt thread ? I see
> > __spi_sync() has some logic to push out the messages from the calling
> > context, I guess as opposed to offloading them to the SPI workqueue ?  
> 
> It should be doing transfers in calling context if the controller is
> idle, the SPI thread should only be used when the controller is already
> busy or to clean up when the controller goes idle (which will
> unfortunately happen rather a lot in your use case).
> 
> Actually looking at the code in __spi_pump_messages() again I think that
> in the case where we don't have any cleanup to do we should be able to
> avoid kicking the thread for that which should help a bit for spi-atmel.
> Can you give the patch below a go (compile tested only, not even tried
> to boot)?

I gave it a try, and unfortunately there is no difference. But it is
not too surprising, as top shows something like this:

   80     2 root     SW       0   0%  24% [irq/50-spi1.3]
   57     2 root     IW       0   0%   2% [kworker/0:1-eve]

So it's not the kthread of the SPI subsystem that is consuming most of
the CPU time, but really the threaded handler of the MAX14830 driver.
What your patch does is avoid offloading to the SPI subsystem kthread
some cleanup work when it can be done in the current context, if I
understand correctly.

> You can generally get a good idea of what's going on with regard to
> context switching at the SPI level from the SPI tracepoints, and about
> any latencies in there too.

I'll have a look there, but I don't really have any latency issue,
rather a CPU consumption issue, which is quite different.

Thanks,

Thomas
-- 
Thomas Petazzoni, CTO, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: High CPU load when using MAX14830 SPI UART controller
  2020-05-20 10:44 ` Jan Kundrát
@ 2020-07-21 13:51   ` Thomas Petazzoni
  2020-07-21 14:20     ` Jan Kundrát
  0 siblings, 1 reply; 11+ messages in thread
From: Thomas Petazzoni @ 2020-07-21 13:51 UTC (permalink / raw)
  To: Jan Kundrát; +Cc: linux-spi, Mark Brown

Hello Jan,

On Wed, 20 May 2020 12:44:43 +0200
Jan Kundrát <jan.kundrat@cesnet.cz> wrote:

> I was getting something similar when I first wrote that patch. The TL;DR 
> version is that even though this UART has 128 byte FIFO per each direction 
> of each UART, I wanted my patch to stay conservative and only batch reads 
> in an opportunistic manner -- I wanted not to risk a possible RX FIFO 
> overflow. That's why the IRQ handler is eager to trigger as soon as there 
> are some data in the RX FIFO. Our setup has no HW flow control (no RTS/CTS, 
> no software flow control), so one has to hope that the IRQ latency is low 
> enough...
> 
> The MAX14830 supports also another mode of operation where the interrupt 
> fires only when the RX buffer gets filled over a configurable minimal 
> threshold. This can be combined with yet another interrupt which fires 
> whenever there's some data in the RX FIFO for "too long". Unfortunately, 
> this timer starts over again upon reception of any additional data, so it's 
> the *youngest* byte in the RX FIFO that controls triggering of this delayed 
> interrupt.

Do you have some more details about this mode of operation? I looked a
bit at the MAX14830 datasheet and code, but couldn't spot what allows
an interrupt to be triggered after X bytes have been received *OR* if
data has been sitting in the FIFO for too long.

> I am not sure how to balance the latency ("how soon is userspace notified 
> about this byte") with CPU utilization ("don't read data byte-by-byte from 
> this 128 byte buffer"). If there's some cross-driver support for "something 
> like that" in the TTY layer, I'm sure that max310x.c could be adopted to 
> make use of that. Otherwise, we could always add a pair of DT properties 
> for controling:
> 
> a) the "start reading from RX FIFO" timeout,
> b) "allow up to X bytes in the RX FIFO, I know that my platform has enough 
> CPU and SPI bandwidth to finish reading that before the RX FIFO overflows"
> 
> BTW, there are also real SPI bus throughput limitations, see 
> 2258761213cb239e5e6c11b4ec9b1700fcb4fdcd for some numbers from my platform. 
> The chip supports up to 26 MHz (but not all bus masters have clock 
> granularities fine enough to use this), that's 3.25 MBps of raw throughput 
> divided among four UARTs. Reading each register takes at least two bytes, 
> one has to check the top-level IRQ status register (to decide which UART to 
> check), then read per-UART IRQ registers, and only then start reading the 
> data. Also, batched reading can only happen if userspace explicitly ignores 
> framing errors, parity errors, etc, otherwise the driver will have to read 
> byte-by-byte, and check the line status register, etc.

Yes, there is definitely some significant overhead with all those SPI
transactions. But here I'm mostly concerned about CPU usage rather than
actual bandwidth on the UART.

Thomas
-- 
Thomas Petazzoni, CTO, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: High CPU load when using MAX14830 SPI UART controller
  2020-07-21 13:51   ` Thomas Petazzoni
@ 2020-07-21 14:20     ` Jan Kundrát
  2020-07-21 21:49       ` Andy Shevchenko
  2020-07-28  8:22       ` Thomas Petazzoni
  0 siblings, 2 replies; 11+ messages in thread
From: Jan Kundrát @ 2020-07-21 14:20 UTC (permalink / raw)
  To: Thomas Petazzoni; +Cc: linux-spi, Mark Brown

>> The MAX14830 supports also another mode of operation where the interrupt 
>> fires only when the RX buffer gets filled over a configurable minimal 
>> threshold. This can be combined with yet another interrupt which fires 
>> whenever there's some data in the RX FIFO for "too long". Unfortunately, 
>> this timer starts over again upon reception of any additional 
>> data, so it's 
>> the *youngest* byte in the RX FIFO that controls triggering of 
>> this delayed 
>> interrupt.
>
> Do you have some more details about this mode of operation? I looked a
> bit at the MAX14830 datasheet and code, but couldn't spot what allows
> an interrupt to be triggered after X bytes have been received *OR* if
> data has been sitting in the FIFO for too long.

I have no code, but according to the datasheet, it's the "RTimeout" bit 
(Line Status Register, bit 0). If properly configured (RxTimeOut set and 
the interrupt routing enabled via LSRIntEn[0], the "RTimeoutIEn" bit), 
we're supposed to get ISR[0] set upon this timeout.

I have not tried it, I just read the datasheet a few years ago. When you 
have patches, I'll be happy to test them (likely only in September, though, 
because of vacations).

With kind regards,
Jan

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: High CPU load when using MAX14830 SPI UART controller
  2020-07-21 14:20     ` Jan Kundrát
@ 2020-07-21 21:49       ` Andy Shevchenko
  2020-07-28  8:22       ` Thomas Petazzoni
  1 sibling, 0 replies; 11+ messages in thread
From: Andy Shevchenko @ 2020-07-21 21:49 UTC (permalink / raw)
  To: Jan Kundrát; +Cc: Thomas Petazzoni, linux-spi, Mark Brown

On Tue, Jul 21, 2020 at 5:32 PM Jan Kundrát <jan.kundrat@cesnet.cz> wrote:

> I have no code, but according to the datasheet, it's the "RTimeout" bit
> (Line Status Register, bit 0). If properly configured (RxTimeOut set and
> the interrupt routing enabled via LSRIntEn[0], the "RTimeoutIEn" bit),
> we're supposed to get ISR[0] set upon this timeout.

And usually timeout condition means no data for the time equal to
receive 4 more words at given baudrate and bits.
(For 8-bit words on 115200 it will be time as 10*4 / 115200 ~= 360us)

-- 
With Best Regards,
Andy Shevchenko

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: High CPU load when using MAX14830 SPI UART controller
  2020-07-21 13:39       ` Thomas Petazzoni
@ 2020-07-21 22:30         ` Mark Brown
  0 siblings, 0 replies; 11+ messages in thread
From: Mark Brown @ 2020-07-21 22:30 UTC (permalink / raw)
  To: Thomas Petazzoni
  Cc: linux-spi, Jan Kundrát, Alexandre Belloni, nicolas.ferre


[-- Attachment #1: Type: text/plain, Size: 2197 bytes --]

On Tue, Jul 21, 2020 at 03:39:44PM +0200, Thomas Petazzoni wrote:
> Mark Brown <broonie@kernel.org> wrote:

> > > I'm not sure what you mean by "it might want to consider error
> > > checking". Could you explain?  

> > It wasn't checking the return values of SPI API calls.

> Right, but this is unlikely to be the source of the CPU consumption
> issue we're seeing.

Indeed, that was just something I noticed while reviewing.

> > Actually looking at the code in __spi_pump_messages() again I think that
> > in the case where we don't have any cleanup to do we should be able to
> > avoid kicking the thread for that which should help a bit for spi-atmel.
> > Can you give the patch below a go (compile tested only, not even tried
> > to boot)?

> I gave it a try, and unfortunately there is no difference. But it is
> not too surprising, as top shows something like this:

>    80     2 root     SW       0   0%  24% [irq/50-spi1.3]
>    57     2 root     IW       0   0%   2% [kworker/0:1-eve]

> So it's not the kthread of the SPI subsystem that is consuming most of
> the CPU time, but really the threaded handler of the MAX14830 driver.
> What your patch does is avoid offloading to the SPI subsystem kthread
> some cleanup work when it can be done in the current context, if I
> understand correctly.

It does, yes.  Depending on the particular performance characteristics
of the system it can sometimes help noticably to cut out those context
thrashes if just the context switching ends up being a noticable
overhead, regardless of how much work each thread does when it does get
control.  In any case that's now merged in -next, I got some testing on
other boards, so hopefully it'll help someone.

> > You can generally get a good idea of what's going on with regard to
> > context switching at the SPI level from the SPI tracepoints, and about
> > any latencies in there too.

> I'll have a look there, but I don't really have any latency issue,
> rather a CPU consumption issue, which is quite different.

Yeah, that's basically where the overhead in the SPI subsystem itself
(rather than the specific driver) tends to be unfortunately.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: High CPU load when using MAX14830 SPI UART controller
  2020-07-21 14:20     ` Jan Kundrát
  2020-07-21 21:49       ` Andy Shevchenko
@ 2020-07-28  8:22       ` Thomas Petazzoni
  1 sibling, 0 replies; 11+ messages in thread
From: Thomas Petazzoni @ 2020-07-28  8:22 UTC (permalink / raw)
  To: Jan Kundrát; +Cc: linux-spi, Mark Brown

Hello Jan,

On Tue, 21 Jul 2020 16:20:02 +0200
Jan Kundrát <jan.kundrat@cesnet.cz> wrote:

> I have no code, but according to the datasheet, it's the "RTimeout" bit 
> (Line Status Register, bit 0). If properly configured (RxTimeOut set and 
> the interrupt routing enabled via LSRIntEn[0], the "RTimeoutIEn" bit), 
> we're supposed to get ISR[0] set upon this timeout.
> 
> I have not tried it, I just read the datasheet a few years ago. When you 
> have patches, I'll be happy to test them (likely only in September, though, 
> because of vacations).

Wow, this was really useful! I had somehow missed this when glancing
through datasheet, but now that you pointed out, I implemented the
following patch, which:

 - Uses the RX timeout interrupt to fire an interrupt when there's data
   in the RX FIFO *and* no characters have been received for a duration
   equivalent to the reception time of 4 characters.

 - Uses the RX FIFO trigger interrupt to trigger an interrupt when the
   RX FIFO is half full. This ensure that if we have a continuous flow
   of data, we do get interrupts.

Thanks to that, in my scenario of receiving 20 bytes of data every
16ms, instead of having multiple interrupts each picking max 3-4 bytes
of data from the RX FIFO, I get a single interrupt that picks up the
full 20 bytes of data in one go. Result: CPU consumption goes down from
25% to 5%.

Here is the patch that I have so far. I'm waiting on more testing to
happen, and if more extensive testing is successful, I'll submit the
patch properly. The question is what are the right thresholds. In a
separate e-mail, Andy Shevchenko suggested the "4 characters timeout",
which I used arbitrarily. Same question for the RX FIFO trigger at half
the FIFO size.

Best regards,

Thomas

commit e958c6087aa889f421323314cb33ad9756ee033e
Author: Thomas Petazzoni <thomas.petazzoni@bootlin.com>
Date:   Wed Jul 22 16:18:14 2020 +0200

    serial: max310x: rework RX interrupt handling
    
    Currently, the RX interrupt logic uses the RXEMPTY interrupt, with the
    RXEMPTYINV bit set, which means we get an RX interrupt as soon as the
    RX FIFO is non-empty.
    
    However, with the MAX310X having a FIFO of 128 bytes, this makes very
    poor use of the FIFO: we trigger an interrupt as soon as the RX FIFO
    has one byte, which means a lot of interrupts, each only collecting a
    few bytes from the FIFO, causing a significant CPU load.
    
    Instead this commit relies on two other RX interrupt events:
    
     - MAX310X_IRQ_RXFIFO_BIT, which triggers when the RX FIFO has reached
       a certain threshold, which we define to be half of the FIFO
       size. This ensure we get an interrupt before the RX FIFO fills up.
    
     - MAX310X_LSR_RXTO_BIT, which triggers when the RX FIFO has received
       some bytes, and then no more bytes are received for a certain
       time. Arbitrarily, this time is defined to the time is takes to
       receive 4 characters.
    
    On a Microchip SAMA5D3 platform that is receiving 20 bytes every 16ms
    over one MAX310X UART, this patch has allowed to reduce the CPU
    consumption of the interrupt handler thread from ~25% to 6-7%.
    
    Signed-off-by: Thomas Petazzoni <thomas.petazzoni@bootlin.com>

diff --git a/drivers/tty/serial/max310x.c b/drivers/tty/serial/max310x.c
index 8434bd5a8ec78..a1c80850d77ed 100644
--- a/drivers/tty/serial/max310x.c
+++ b/drivers/tty/serial/max310x.c
@@ -1056,9 +1056,9 @@ static int max310x_startup(struct uart_port *port)
 	max310x_port_update(port, MAX310X_MODE1_REG,
 			    MAX310X_MODE1_TRNSCVCTRL_BIT, 0);
 
-	/* Configure MODE2 register & Reset FIFOs*/
-	val = MAX310X_MODE2_RXEMPTINV_BIT | MAX310X_MODE2_FIFORST_BIT;
-	max310x_port_write(port, MAX310X_MODE2_REG, val);
+	/* Reset FIFOs*/
+	max310x_port_write(port, MAX310X_MODE2_REG,
+			   MAX310X_MODE2_FIFORST_BIT);
 	max310x_port_update(port, MAX310X_MODE2_REG,
 			    MAX310X_MODE2_FIFORST_BIT, 0);
 
@@ -1086,8 +1086,27 @@ static int max310x_startup(struct uart_port *port)
 	/* Clear IRQ status register */
 	max310x_port_read(port, MAX310X_IRQSTS_REG);
 
-	/* Enable RX, TX, CTS change interrupts */
-	val = MAX310X_IRQ_RXEMPTY_BIT | MAX310X_IRQ_TXEMPTY_BIT;
+	/*
+	 * Let's ask for an interrupt after a timeout equivalent to
+	 * the receiving time of 4 characters after the last character
+	 * has been received.
+	 */
+	max310x_port_write(port, MAX310X_RXTO_REG, 4);
+
+	/*
+	 * Make sure we also get RX interrupts when the RX FIFO is
+	 * filling up quickly, so get an interrupt when half of the RX
+	 * FIFO has been filled in.
+	 */
+	max310x_port_write(port, MAX310X_FIFOTRIGLVL_REG,
+			   MAX310X_FIFOTRIGLVL_RX(MAX310X_FIFO_SIZE / 2));
+
+	/* Enable RX timeout interrupt in LSR */
+	max310x_port_write(port, MAX310X_LSR_IRQEN_REG,
+			   MAX310X_LSR_RXTO_BIT);
+
+	/* Enable LSR, RX FIFO trigger, CTS change interrupts */
+	val = MAX310X_IRQ_LSR_BIT  | MAX310X_IRQ_RXFIFO_BIT | MAX310X_IRQ_TXEMPTY_BIT;
 	max310x_port_write(port, MAX310X_IRQEN_REG, val | MAX310X_IRQ_CTS_BIT);
 
 	return 0;




-- 
Thomas Petazzoni, CTO, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, back to index

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-19 14:33 High CPU load when using MAX14830 SPI UART controller Thomas Petazzoni
2020-05-19 15:24 ` Mark Brown
2020-05-20 10:18   ` Thomas Petazzoni
2020-05-20 11:26     ` Mark Brown
2020-07-21 13:39       ` Thomas Petazzoni
2020-07-21 22:30         ` Mark Brown
2020-05-20 10:44 ` Jan Kundrát
2020-07-21 13:51   ` Thomas Petazzoni
2020-07-21 14:20     ` Jan Kundrát
2020-07-21 21:49       ` Andy Shevchenko
2020-07-28  8:22       ` Thomas Petazzoni

Linux-SPI Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-spi/0 linux-spi/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-spi linux-spi/ https://lore.kernel.org/linux-spi \
		linux-spi@vger.kernel.org
	public-inbox-index linux-spi

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-spi


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git