All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC] A new SPI API for fast, low-latency regmap peripheral access
@ 2022-05-12 14:34 David Jander
  2022-05-12 20:37 ` Mark Brown
  2022-05-13 12:10 ` Andrew Lunn
  0 siblings, 2 replies; 25+ messages in thread
From: David Jander @ 2022-05-12 14:34 UTC (permalink / raw)
  To: linux-spi; +Cc: Mark Brown, Marc Kleine-Budde, Oleksij Rempel


Hi Mark, all,

Sorry for sending an RFC without actual patches. What I am about to propose
would require a lot of work, and I am sure I will not get it right without
first asking for comments. I also assume that I might not be the only one
asking/proposing this, and may ignore the existence of discussions that may
have happened in the past. If I am committing a crime, please accept my
apologies and let me know.

TL;DR: drivers/spi/spi.c API has too much overhead for some use cases.

1. How did I get here?
----------------------

Software involved: Linux v5.18-rc1:
 drivers/spi/spi.c
 drivers/spi/spi-imx.c (note in text for extra patches)
 drivers/net/can/spi/mcp251xfd/*

Hardware involved:
 i.MX8MP SoC with external SPI CAN controller MCP2518FD.

Using an i.MX8MP SoC with an external CAN interface via SPI, the mcp2518fd
controller, I noticed suspiciously high latency when running a simple tool that
would just echo CAN messages back. After applying Marc Kleine-Budde's patches
here [1], things got a lot better, but I was already looking too close.
Analyzing the SPI/CAN timing on a scope, I noticed at first a big delay from
CS to the start of the actual SPI transfer and started digging to find out
what caused this delay.

2. What did I do?
-----------------

Looking at spi.c, I noticed a lot of code that is being executed even for the
SYNC code-path, that should avoid the workqueue, doing statistics mainly,
involving quite a few spinlocks. I wanted to know what the impact of all of
that code was, so I hacked together a new API postfixed with *_fast that did
roughly the same as the spi_sync_* API, but without all of this accounting
code. Specifically, I created substitutes for the following functions, which
are all SPI API calls used by the MCP2518FD driver:

 spi_sync()
 spi_sync_transfer()
 spi_write()
 spi_write_then_read()
 spi_async() (replaced with a sync equivalent)

3. Measurement results
----------------------

I distinguish between 3 different kernels in the following measurement results:

 kernel A: Vanilla Linux 5.18-rc1
 kernel B: Linux 5.18-rc1 with Marc's polling patches from linux-next applied
 to spi-imx.c [1]
 kernel C: Linux 5.18-rc1 with polling patches and hacked *_fast SPI API.

The measurements were conducted by running "canecho.c" [2] on the target
board, while executing the following command on another machine connected via
CAN (baud-rate 250kBaud):

$ cangen can0 -g 10 -n 1000 -p 2 -I 0x077 -L 8 -D r

For CPU load measurements, the following command was used instead:

$ cangen can0 -g 0 -n 50000 -p 2 -I 0x077 -L 8 -D r

The machine running cangen is able to load the CAN bus to 100% capacity
consistently this way.

3.1. SPI signal timing measurements

Scope images are available on request.

3.1.1. Gap between RX CAN frame EOF and TX echo response SOF:

Kernel A: 380us
Kernel B: 310us
Kernel C: 152us

3.1.2. Total time the SPI controller IRQ line is low:

Kernel A: 160us
Kernel B: 144us
Kernel C: 55us

3.1.3. Time from SPI CS active to actual start of transfer:

Kernel A: ca. 10us
Kernel B: 9.8us
Kernel C: 2.6us

3.1.4. Time of CS high between 1st and 2nd SPI sync access from IRQ thread:

kernel A: ca 25us
kernel B: ca 30us
kernel C: 5us

3.2. CPU usage measurements with 100% bus load running canecho at 250kBaud:

kernel A: 13.3% [irq/210-spi2.0], 78.5% idle
kernel B: 10.2% [irq/210-spi2.0], 81.6% idle
kernel C: 4.4%  [irq/210-spi2.0], 92.9% idle

Overall performance improvements from kernel B to kernel C:

CAN message round trip time: 50% faster
CPU load: 61% less

4. Rationale
------------

There are many use-cases for SPI in the Linux kernel, and they all have
special requirements and trade-offs: On one side one wants to transfer large
amount of data to the peripheral efficiently (FLASH memory) and without
blocking. On the opposite side there is the need to modify registers in a
SPI-mapped peripheral, ideally using regmap.
There are two APIs already: sync and async, that should cover these use-cases.
Unfortunately both share a large portion of the code path, which causes the
sync use-case for small, fast register manipulation to be bogged down by all
the accounting and statistics code. There's also sometimes memcpy()'s involved
to put buffers into DMA space (driver dependent), even though DMA isn't used.

So a "peripheral" driver (the P from SPI coincidentally), that accesses a
SPI-based regmap doing register manipulation, leading to several very short
and small, fast transfers end's up with an unreasonable CPU load and access
latency, even when using the *sync* API.

Assuming the *sync* API cannot be changed, this leads to the need to introduce
a new API optimized specifically for this use-case. IMHO it is also reasonable
to say that when accessing registers on a peripheral device, the whole
statistics and accounting machinery in spi.c isn't really so valuable, and
definitely not worth its overhead in a production system.

5. Details of the SPI transfers involved
----------------------------------------

The MCP2518FD CAN driver does a series of small SPI transfers when running a
single CAN message from cangen to canecho and back:

 1. CAN message RX, IRQ line asserted
 2. Hard IRQ empty, starts IRQ thread
 3. IRQ thread interrogates MCP2518FD via register access:
 3.1. SPI transfer 1: CS low, 72bit xfer, CS high
 3.2. SPI transfer 2: CS low, 200bit xfer, CS high
 3.3. SPI transfer 3: CS low, 72bit xfer, CS high
 3.4. SPI transfer 4: CS low, 104bit xfer, CS high
 4. IRQ thread ended, RX message gets delivered to user-space
 5. canecho.c recv()
 6. canecho.c send()
 7. TX message gets delivered to CAN driver
 8. CAN driver does spi_async to queue 2 xfers (replace by spi_sync equivalent
 in kernel C):
 8.1. SPI message 1: CS low, 168bit xfer, CS high, CS low, 48bit xfer, CS high
 9. CAN message SOF starts appearing on the bus just before last CS high.

6. Some regions of code that were inspected
-------------------------------------------

The code in spi.c that gets executed contains a lot of ifs and foresees a lot
of different situations, so it might not be trivial to look at a single place
and find a smoking gun. It is more the sum of everything that just takes a
long time to execute, even on this relatively fast ARM Cortex-A53 running at
1.2GHz.
Some places I tried to single out:

6.1. Accounting spinlocks:

Spinlocks are supposed to be fast, especially for the case that they are not
contested, but in such critical paths their impact shouldn't be neglected.

SPI_STATISTICS_ADD_TO_FIELD: This macro defined in spi.h has a spinlock, and
it is used 4 times directly in __spi_sync(). It is also used in
spi_transfer_one_message() which is called from there. Removing the spinlocks
(thus introducing races) makes the code measurably faster (several us).

spi_statistics_add_transfer_stats(): Called twice from
spi_transfer_one_message(), and also contains a spinlock. Removing these again
has a measurable impact of several us.

6.2. Misc other places:

ptp_read_system_prets(): Called once, since the hardware lacks a usable TS
counter. Removing this did not have a significant impact, although it was
still detectable, but barely so.

spi_set_cs(): Removing all delay code and leaving the bare minimum for GPIO
based CS activation again has a measurable impact. Most (all?) simple SPI
peripheral chips don't have any special CS->clock->CS timing requirements, so
it might be a good idea to have a simpler version of this function.

7. Requirements and compromises for the new API
-----------------------------------------------

Since this hypothetical new API would be used only for very short, very fast
transfers where latency and overhead should be minimized, the best way to do
it is obviate all scheduling work and do it strictly synchronous and based on
polling. The context switch of even a hard-IRQ can quickly cost a lot more CPU
cycles than busy waiting for 48 bits to be shifted through the transmitter at
20+MHz clock. This requires that SPI drivers offer low-level functions that do
such simple transfers on polling basis. The patches [1] from Marc Kleine-Budde
already do this, but it is the SPI driver that choses whether to use polling or
IRQ based transfers based on heuristics calculating the theoretical transfer
time given the clock frequency and its size. While it improves the performance
in a lot of cases already, peripheral drivers have no choice but to still go
through all the heavy code in spi.c.
Since these are low-latency applications, chances are very high that the
hardware is also designed for low-latency access, which implies that CS
control via GPIO most probably uses local GPIO controllers instead of I2C GPIO
expanders for example, so CS access can be assumed to be fast and direct and
not involve any context switches. It could be argued that it might even be
beneficial to have an API that can be called from hard IRQ context, but
experiments in this case showed that the gain of doing the CAN message read
out directly in hard-IRQ and removing the IRQ thread is minimal. But better
use-cases could be conceived, so this possibility might need consideration
also.
Obviously all statistics accounting should be skipped for this API, since it
simply impacts performance far too much.
Care should be taken to solve locking in such a way, that it doesn't impact
performance for the fast API, while still allowing safe concurrency with
spi_sync and spi_async. I did not go as far as to solve this issue. I just
used a simple spinlock and carefully avoided using any call to the old API for
doing these proof-of-concept measurements.

8. Conclusion
-------------

Performance of spi.c API for the specified use-cases is not ideal.
Unfortunately there is no single smoking gun to be pointed at, but instead
many different bits which are not needed for the given use-case that add to
the bloat and ultimately have a big combined performance impact.
The stated usage scenario is fairly common in the Linux kernel. A simple
investigation counted 60+ IIO drivers and 9 input drivers alone that use
spi_sync*() for example, up to a total of 171 .c files. In contrast only 11 .c
files use the spi_async*() calls. This does not account for all users of
regmap_spi.
Due to this, IMHO one can ask for a better, simpler, more efficient API for
these use-cases, am I want to propose to create it.

Thanks a lot if you read this far. Please let me know if such a thing is even
thinkable in mainline Linux.


[1] https://lore.kernel.org/all/20220502175457.1977983-9-mkl@pengutronix.de/
[2] https://github.com/linux-can/can-tests/blob/master/raw/canecho.c

Best regards,

-- 
David Jander
Protonic Holland.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC] A new SPI API for fast, low-latency regmap peripheral access
  2022-05-12 14:34 [RFC] A new SPI API for fast, low-latency regmap peripheral access David Jander
@ 2022-05-12 20:37 ` Mark Brown
  2022-05-12 22:12   ` Mark Brown
  2022-05-13 12:46   ` David Jander
  2022-05-13 12:10 ` Andrew Lunn
  1 sibling, 2 replies; 25+ messages in thread
From: Mark Brown @ 2022-05-12 20:37 UTC (permalink / raw)
  To: David Jander; +Cc: linux-spi, Marc Kleine-Budde, Oleksij Rempel

[-- Attachment #1: Type: text/plain, Size: 14077 bytes --]

On Thu, May 12, 2022 at 04:34:45PM +0200, David Jander wrote:

> Sorry for sending an RFC without actual patches. What I am about to propose
> would require a lot of work, and I am sure I will not get it right without
> first asking for comments. I also assume that I might not be the only one
> asking/proposing this, and may ignore the existence of discussions that may
> have happened in the past. If I am committing a crime, please accept my
> apologies and let me know.

> TL;DR: drivers/spi/spi.c API has too much overhead for some use cases.

It would be really helpful if you could describe in concrete terms your
actual use case here.  According to your mail you are testing with a
single threaded echo server which is obviously not a realistic
application and will be a bit of a limiting factor, it would be good to
understand what's going on in a way that's less microbenchmarky.

High level it feels like you are approaching this at the wrong level, as
far as I can tell you are proposing a completely separate API for
drivers which cuts out all the locking which doesn't seem like a very
usable API for them as the performance characteristics are going to end
up being very system and application specific, it's going to be hard for
them to assess which API to use.  I'm not seeing anything here that says
add a new API, there's certainly room to make the existing API more
performant but I don't see the jump to a new API.

> There are many use-cases for SPI in the Linux kernel, and they all have
> special requirements and trade-offs: On one side one wants to transfer large
> amount of data to the peripheral efficiently (FLASH memory) and without
> blocking. On the opposite side there is the need to modify registers in a
> SPI-mapped peripheral, ideally using regmap.

> There are two APIs already: sync and async, that should cover these use-cases.
> Unfortunately both share a large portion of the code path, which causes the
> sync use-case for small, fast register manipulation to be bogged down by all
> the accounting and statistics code. There's also sometimes memcpy()'s involved
> to put buffers into DMA space (driver dependent), even though DMA isn't used.

That's not really a good description of what's going on with those APIs
or choosing between them - there's certainly no short/long message
distinction intended, the distinction is more about if the user needs to
get the results of the transfer before proceeding with a side order of
the sync APIs being more convenient to use.  In general if a user has a
lot of work to do and doesn't specifically need to block for it there
will often be a performance advantage in using the async API, even if
the individual messages are short.  Submitting asynchronously means that
we are more likely to be able to start pushing a new message immediately
after completion of one message which minimises dead time on the bus,
and opens up opportunities for preparing one message while the current
one is in flight that don't otherwise exist (opportunities than we
currently make much use of).  Conversely a large message where we need
the result in order to proceed is going to do just fine with the sync
API, it is actually a synchronous operation after all.

One example of lots of potentially short messages is regmap's cache sync
code, it uses async writes to sync the register map so that it can
overlap working out what to sync and marshalling the data with the
actual writes - that tends to pan out as meaning that the sync completes
faster since we can often identify and queue several more registers to
write in the time it takes to send the first one which works out faster
even with PIO controllers and bouncing to the worker thread, the context
switching ends up being less than the time taken to work out what to
send for even a fairly small number of registers.

> Assuming the *sync* API cannot be changed, this leads to the need to introduce
> a new API optimized specifically for this use-case. IMHO it is also reasonable
> to say that when accessing registers on a peripheral device, the whole
> statistics and accounting machinery in spi.c isn't really so valuable, and
> definitely not worth its overhead in a production system.

That seems like a massive assumption, one could equally say that any
application that is saturating the SPI bus is going to be likely to want
to be able to monitor the performance and utilisation of the bus in
order to facilitate optimisation and so want the statistics.

> 5. Details of the SPI transfers involved
> ----------------------------------------
> 
> The MCP2518FD CAN driver does a series of small SPI transfers when running a
> single CAN message from cangen to canecho and back:
> 
>  1. CAN message RX, IRQ line asserted
>  2. Hard IRQ empty, starts IRQ thread
>  3. IRQ thread interrogates MCP2518FD via register access:
>  3.1. SPI transfer 1: CS low, 72bit xfer, CS high
>  3.2. SPI transfer 2: CS low, 200bit xfer, CS high
>  3.3. SPI transfer 3: CS low, 72bit xfer, CS high
>  3.4. SPI transfer 4: CS low, 104bit xfer, CS high
>  4. IRQ thread ended, RX message gets delivered to user-space
>  5. canecho.c recv()
>  6. canecho.c send()
>  7. TX message gets delivered to CAN driver
>  8. CAN driver does spi_async to queue 2 xfers (replace by spi_sync equivalent
>  in kernel C):
>  8.1. SPI message 1: CS low, 168bit xfer, CS high, CS low, 48bit xfer, CS high
>  9. CAN message SOF starts appearing on the bus just before last CS high.

Note that this is all totally single threaded and sequential which is
going to change the performance characteristics substantially, for
example adding and driving another echo server or even just having the
remote end push another message into flight before it waits for a
response would get more mileage out of the async API.

> 6.1. Accounting spinlocks:

> Spinlocks are supposed to be fast, especially for the case that they are not
> contested, but in such critical paths their impact shouldn't be neglected.

> SPI_STATISTICS_ADD_TO_FIELD: This macro defined in spi.h has a spinlock, and
> it is used 4 times directly in __spi_sync(). It is also used in
> spi_transfer_one_message() which is called from there. Removing the spinlocks
> (thus introducing races) makes the code measurably faster (several us).

> spi_statistics_add_transfer_stats(): Called twice from
> spi_transfer_one_message(), and also contains a spinlock. Removing these again
> has a measurable impact of several us.

So for example a sysctl to suppress stats, or making the stats code
fancier with per cpu data or whatever so it doesn't need to lock in the
hot path would help here (obviously the latter is a bit nicer).  Might
be interesting seeing if it's the irqsave bit that's causing trouble
here, I'm not sure that's actually required other than for the error
counters.

> spi_set_cs(): Removing all delay code and leaving the bare minimum for GPIO
> based CS activation again has a measurable impact. Most (all?) simple SPI
> peripheral chips don't have any special CS->clock->CS timing requirements, so
> it might be a good idea to have a simpler version of this function.

Surely the thing there would just be to set the delays to zero if they
can actually be zero (and add a special case for actually zero delay
which we don't currently have)?  But in any case devices do always have
some minimum requirements for delays at various points around asserting
chip select, plus there's considerations around providing enough ramp
time for signals to reach appropriate levels which are more system level
than chip level.  The required delays are normally very small so
effectively pan out as zero in a lot of systems but the faster things
run (both on the bus and for the SoC) the more visible they get and more
attention needs to be paid.  It should be possible to do something to
assess a delay as being effectively zero and round down which would feed
in here when we're managing the chip select from software but we can't
just ignore the delays.

I note that for example that the MPC2515 quotes minimum chip select
setup, hold and disable times of 50ns - those are very small, but they
are non-zero.

> Since this hypothetical new API would be used only for very short, very fast
> transfers where latency and overhead should be minimized, the best way to do
> it is obviate all scheduling work and do it strictly synchronous and based on
> polling. The context switch of even a hard-IRQ can quickly cost a lot more CPU
> cycles than busy waiting for 48 bits to be shifted through the transmitter at
> 20+MHz clock. This requires that SPI drivers offer low-level functions that do
> such simple transfers on polling basis. The patches [1] from Marc Kleine-Budde
> already do this, but it is the SPI driver that choses whether to use polling or
> IRQ based transfers based on heuristics calculating the theoretical transfer
> time given the clock frequency and its size. While it improves the performance
> in a lot of cases already, peripheral drivers have no choice but to still go
> through all the heavy code in spi.c.

There's a whole pile of assumptions in there about the specific system
you're running on and how it's going to perform.  Like I said above it
really feels like this is the wrong level to approach things at, it's
pushing decisions to the client driver that are really system specific.
Why would a client doing "short" transfers (short itself being a fuzzy
term) not want to use this interface, and what happens when for example
someone puts one of these CAN controllers on a USB dongle which simply
can't implement a non-blocking mode?  We should aim to do things which
just benefit any client driver using the APIs idiomatically without them
having to make assumptions about either the performance characteristics
of the system they're running on or the features it has, especially if
those assumptions would make the driver unusuable on some systems.

> Since these are low-latency applications, chances are very high that the
> hardware is also designed for low-latency access, which implies that CS
> control via GPIO most probably uses local GPIO controllers instead of I2C GPIO
> expanders for example, so CS access can be assumed to be fast and direct and

High chances are not guarantees, and none of this sounds like things
that should require specific coding in the client drivers.

> not involve any context switches. It could be argued that it might even be
> beneficial to have an API that can be called from hard IRQ context, but
> experiments in this case showed that the gain of doing the CAN message read
> out directly in hard-IRQ and removing the IRQ thread is minimal. But better
> use-cases could be conceived, so this possibility might need consideration
> also.

That's something that the async API (or the sync API in the contended
case) can enable - if we have another transfer queued then we would if
someone did the work be able to arrange to start pushing it immediately
the prior transfer completes.

> Care should be taken to solve locking in such a way, that it doesn't impact
> performance for the fast API, while still allowing safe concurrency with
> spi_sync and spi_async. I did not go as far as to solve this issue. I just
> used a simple spinlock and carefully avoided using any call to the old API for
> doing these proof-of-concept measurements.

Apart from the hard bits...  :P

The only bits of the existing code that you've specifically identified
as taking substantial time here are the delays and the statistics, both
of these seem like areas which could just be improved in place without
requiring changes outside of the SPI subsystem that benefit all users.
It sounds like the bits you've profiled as causing trouble are delays
and stats synchronisation which does sound plausible but those do also
seem like they can be specifically targetted - being smarter about when
we actually do a delay, and either improving the locking or providing
more optimisation for the stats code.

If there are other bits of the message setup code which are getting in
the way and don't have obvious paths for optimisation (the validation
potentially?) then if your application is spamming a lot of the same
operation (eg, with the status reading in the interrupt handler) then
quite a while ago Martin Sparl was looking at providing an interface
which would allow client drivers to pre-cook messages so that they could
be submitted multiple times without going through the validation that we
normally do (and perhaps get some driver preparation done as well).  He
was looking at it for other reasons but it seems like a productive
approach for cutting down on the setup overhead, it would require more
up front work in the client but cut down on the amount of work done per
operation and seems like it should scale well over different systems.

> Performance of spi.c API for the specified use-cases is not ideal.
> Unfortunately there is no single smoking gun to be pointed at, but instead
> many different bits which are not needed for the given use-case that add to
> the bloat and ultimately have a big combined performance impact.
> The stated usage scenario is fairly common in the Linux kernel. A simple
> investigation counted 60+ IIO drivers and 9 input drivers alone that use
> spi_sync*() for example, up to a total of 171 .c files. In contrast only 11 .c
> files use the spi_async*() calls. This does not account for all users of
> regmap_spi.
> Due to this, IMHO one can ask for a better, simpler, more efficient API for
> these use-cases, am I want to propose to create it.

I see the problem, what I don't see is why it requires a new externally
visible API to solve it beyond the suggestion about potentially
preparing messages for repeated use if that's even something that's
really registering.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC] A new SPI API for fast, low-latency regmap peripheral access
  2022-05-12 20:37 ` Mark Brown
@ 2022-05-12 22:12   ` Mark Brown
  2022-05-13 12:46   ` David Jander
  1 sibling, 0 replies; 25+ messages in thread
From: Mark Brown @ 2022-05-12 22:12 UTC (permalink / raw)
  To: David Jander; +Cc: linux-spi, Marc Kleine-Budde, Oleksij Rempel

[-- Attachment #1: Type: text/plain, Size: 583 bytes --]

On Thu, May 12, 2022 at 09:37:44PM +0100, Mark Brown wrote:

> I see the problem, what I don't see is why it requires a new externally
> visible API to solve it beyond the suggestion about potentially
> preparing messages for repeated use if that's even something that's
> really registering.

BTW what I mean here is that even if it turns out that it really
is essential to have a completely separate path it's not clear to
me that we can't just have the core decide to do that based on
the characteristics of the message and system rather than having
to change the client drivers.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC] A new SPI API for fast, low-latency regmap peripheral access
  2022-05-12 14:34 [RFC] A new SPI API for fast, low-latency regmap peripheral access David Jander
  2022-05-12 20:37 ` Mark Brown
@ 2022-05-13 12:10 ` Andrew Lunn
  1 sibling, 0 replies; 25+ messages in thread
From: Andrew Lunn @ 2022-05-13 12:10 UTC (permalink / raw)
  To: David Jander; +Cc: linux-spi, Mark Brown, Marc Kleine-Budde, Oleksij Rempel

> 6.1. Accounting spinlocks:
> 
> Spinlocks are supposed to be fast, especially for the case that they are not
> contested, but in such critical paths their impact shouldn't be neglected.
> 
> SPI_STATISTICS_ADD_TO_FIELD: This macro defined in spi.h has a spinlock, and
> it is used 4 times directly in __spi_sync(). It is also used in
> spi_transfer_one_message() which is called from there. Removing the spinlocks
> (thus introducing races) makes the code measurably faster (several us).
> 
> spi_statistics_add_transfer_stats(): Called twice from
> spi_transfer_one_message(), and also contains a spinlock. Removing these again
> has a measurable impact of several us.

Maybe something from the network stack can be learnt here:

https://www.kernel.org/doc/html/latest/locking/seqlock.html

https://elixir.bootlin.com/linux/v5.18-rc6/source/include/linux/u64_stats_sync.h

	Andrew

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC] A new SPI API for fast, low-latency regmap peripheral access
  2022-05-12 20:37 ` Mark Brown
  2022-05-12 22:12   ` Mark Brown
@ 2022-05-13 12:46   ` David Jander
  2022-05-13 19:36     ` Mark Brown
  1 sibling, 1 reply; 25+ messages in thread
From: David Jander @ 2022-05-13 12:46 UTC (permalink / raw)
  To: Mark Brown; +Cc: linux-spi, Marc Kleine-Budde, Oleksij Rempel


Hi Mark,

Thanks a lot for your reply!

On Thu, 12 May 2022 21:37:39 +0100
Mark Brown <broonie@kernel.org> wrote:

> On Thu, May 12, 2022 at 04:34:45PM +0200, David Jander wrote:
> 
> > Sorry for sending an RFC without actual patches. What I am about to propose
> > would require a lot of work, and I am sure I will not get it right without
> > first asking for comments. I also assume that I might not be the only one
> > asking/proposing this, and may ignore the existence of discussions that may
> > have happened in the past. If I am committing a crime, please accept my
> > apologies and let me know.  
> 
> > TL;DR: drivers/spi/spi.c API has too much overhead for some use cases.  
> 
> It would be really helpful if you could describe in concrete terms your
> actual use case here.  According to your mail you are testing with a
> single threaded echo server which is obviously not a realistic
> application and will be a bit of a limiting factor, it would be good to
> understand what's going on in a way that's less microbenchmarky.

I agree that from the SPI bus standpoint I did not take multiple SPI users
into account, but that is kind of a special scenario on purpose. I am also a
hardware designer, and as such I would never hook up an MCP2518FD to an SPI
bus that I know will also be used for other peripherals at the same time. It
would kill CAN bus performance. In essence, it would be a broken design. I
know this is short sighted, because others might still want to do this, so you
have a good point there.
On what respects the user-space application talking to the network interface,
of course it could be multi-threaded, or even multiple applications using CAN,
but that doesn't change much of what happens in the kernel. The CAN interface
is still one physical interface, so at the CAN driver level all transactions
are serialized and can be viewed as single threaded from the SPI driver point
of view.
Maybe I need to explain some more details about the nature of the CAN bus
(network): CAN is based on small frames of up to ca 130 bits max, at bit-rates
that can go up to 1Mbit/s. It is intended for semi-hard real-time communication
with low or very-low latency. An application should ideally be able to receive
a single CAN frame and send out a response almost immediately. This is not a
micro-benchmark in this sense, but a realistic scenario of how CAN bus
applications work. The J1939 protocol for example is mostly based of
single-frame command-response schemes, which do not allow for in-flight
commands. This means that each command needs to be responded to before the next
command can be sent by the other party. This makes it essential, that reaction
to incoming messages and processing of them be as fast as possible, specially
for single messages. The CAN networking layer in the kernel even has a special
CAN receive off-loading ring-buffer mechanism that can be called in hard-IRQ
context [1]. Regular (ideal) CAN controllers are connected to a fast local bus,
so that the driver can access interrupt-flag and receive data registers in
real-time from hard-IRQ context, and that works very well.
Obviously, an SPI connected CAN controller is far from ideal, but in practice
it is possible to get quite close to local-bus CAN controller performance...
just not with current spi.c code.

> High level it feels like you are approaching this at the wrong level, as
> far as I can tell you are proposing a completely separate API for
> drivers which cuts out all the locking which doesn't seem like a very
> usable API for them as the performance characteristics are going to end
> up being very system and application specific, it's going to be hard for
> them to assess which API to use.  I'm not seeing anything here that says
> add a new API, there's certainly room to make the existing API more
> performant but I don't see the jump to a new API.

Of course it would be great if we could just make the current sync API more
performant, but I had given up trying. Maybe if we can somehow implement the
fast-path in the same API that might work, but that seems like a daunting task
to me. This is an RFC, so I am open to suggestions of course.

> > There are many use-cases for SPI in the Linux kernel, and they all have
> > special requirements and trade-offs: On one side one wants to transfer large
> > amount of data to the peripheral efficiently (FLASH memory) and without
> > blocking. On the opposite side there is the need to modify registers in a
> > SPI-mapped peripheral, ideally using regmap.  
> 
> > There are two APIs already: sync and async, that should cover these use-cases.
> > Unfortunately both share a large portion of the code path, which causes the
> > sync use-case for small, fast register manipulation to be bogged down by all
> > the accounting and statistics code. There's also sometimes memcpy()'s involved
> > to put buffers into DMA space (driver dependent), even though DMA isn't used.  
> 
> That's not really a good description of what's going on with those APIs
> or choosing between them - there's certainly no short/long message
> distinction intended, the distinction is more about if the user needs to
> get the results of the transfer before proceeding with a side order of
> the sync APIs being more convenient to use.  In general if a user has a
> lot of work to do and doesn't specifically need to block for it there
> will often be a performance advantage in using the async API, even if
> the individual messages are short.  Submitting asynchronously means that

I agree that the async API has perfectly valid use-cases, even for shorter
messages, but this is very rarely the case for register-based peripherals like
this CAN controller. To prove my point, have a look at the CPU load figures I
showed in my first email: The current sync API uses 2.5 times the CPU time of
the fast API. And this is still without a context switch, but it tells
you something about the portion of CPU cycles consumed by the polling loop. I
have a hard time believing that doing this same task asynchronously wouldn't
use significantly more CPU time even. Maybe we should measure it to settle the
case?
In any case, it would obliterate latency.
Just for illustration: In kernel 5.16, the spi-imx.c driver still didn't have
a polling mode and relied on a workqueue. With this SPI driver the CAN driver
was so slow, it couldn't keep up with CAN traffic at anything higher than 10%
bus load.... not due to CPU load but due to latency, which makes CPU load
impossible to asses or to compare in that case unfortunately.

> we are more likely to be able to start pushing a new message immediately
> after completion of one message which minimises dead time on the bus,

10us per SPI transfer with the fast API vs 30us per transfer on the current
sync API, with the possibility of a single context switch taking several
microseconds more (at least), make me think that this is not always the case,
but maybe I am missing something?

> and opens up opportunities for preparing one message while the current
> one is in flight that don't otherwise exist (opportunities than we
> currently make much use of).  Conversely a large message where we need
> the result in order to proceed is going to do just fine with the sync
> API, it is actually a synchronous operation after all.

I agree for the case of big sync transfers (under the assumption that
interruptible sleep is used of course).
The point I am trying to make is that the time required for small, fast
transfers (16-64bits on 20+MHz clock, a very common use-case) is often far
less, or at best in the order of magnitude of the time all the extra code in
spi.c takes to execute, and definitely much lower than what a context switch
would take. A simple look at the scope signals of such SPI transfers just
tells all the story.
What this means, is that you can optimize for concurrency however much you
want, you will ALWAYS be slower and consume more CPU time than simple fast
low-overhead uninterruptible polling approach. Independent of whether you need
the result for the next operation or not.

> One example of lots of potentially short messages is regmap's cache sync
> code, it uses async writes to sync the register map so that it can
> overlap working out what to sync and marshalling the data with the
> actual writes - that tends to pan out as meaning that the sync completes
> faster since we can often identify and queue several more registers to
> write in the time it takes to send the first one which works out faster
> even with PIO controllers and bouncing to the worker thread, the context
> switching ends up being less than the time taken to work out what to
> send for even a fairly small number of registers.

Can you point to an example of this? I think it will depend a lot on SPI clock
and the size and amount of registers in the regmap... and whether multiple
registers can be written in one SPI transfer or not.

> > Assuming the *sync* API cannot be changed, this leads to the need to introduce
> > a new API optimized specifically for this use-case. IMHO it is also reasonable
> > to say that when accessing registers on a peripheral device, the whole
> > statistics and accounting machinery in spi.c isn't really so valuable, and
> > definitely not worth its overhead in a production system.  
> 
> That seems like a massive assumption, one could equally say that any
> application that is saturating the SPI bus is going to be likely to want
> to be able to monitor the performance and utilisation of the bus in
> order to facilitate optimisation and so want the statistics.

I agree. But do you really need to have that code in the hot-path always,
instead of only during development, debugging and testing?

> > 5. Details of the SPI transfers involved
> > ----------------------------------------
> > 
> > The MCP2518FD CAN driver does a series of small SPI transfers when running a
> > single CAN message from cangen to canecho and back:
> > 
> >  1. CAN message RX, IRQ line asserted
> >  2. Hard IRQ empty, starts IRQ thread
> >  3. IRQ thread interrogates MCP2518FD via register access:
> >  3.1. SPI transfer 1: CS low, 72bit xfer, CS high
> >  3.2. SPI transfer 2: CS low, 200bit xfer, CS high
> >  3.3. SPI transfer 3: CS low, 72bit xfer, CS high
> >  3.4. SPI transfer 4: CS low, 104bit xfer, CS high
> >  4. IRQ thread ended, RX message gets delivered to user-space
> >  5. canecho.c recv()
> >  6. canecho.c send()
> >  7. TX message gets delivered to CAN driver
> >  8. CAN driver does spi_async to queue 2 xfers (replace by spi_sync equivalent
> >  in kernel C):
> >  8.1. SPI message 1: CS low, 168bit xfer, CS high, CS low, 48bit xfer, CS high
> >  9. CAN message SOF starts appearing on the bus just before last CS high.  
> 
> Note that this is all totally single threaded and sequential which is
> going to change the performance characteristics substantially, for
> example adding and driving another echo server or even just having the
> remote end push another message into flight before it waits for a
> response would get more mileage out of the async API.

AFAIK, it doesn't really matter how many echo servers you have in user-space.
In the end, inside the CAN driver it is just one queue that talks to the SPI
driver in exactly the same way. And like I explained above, this is a pretty
common situation in CAN networks like J1939, where 90% of the time a single
remote-ECU is talking to a single process on the local machine, and the only
thing that matters is single-in-flight response time.

> > 6.1. Accounting spinlocks:  
> 
> > Spinlocks are supposed to be fast, especially for the case that they are not
> > contested, but in such critical paths their impact shouldn't be neglected.  
> 
> > SPI_STATISTICS_ADD_TO_FIELD: This macro defined in spi.h has a spinlock, and
> > it is used 4 times directly in __spi_sync(). It is also used in
> > spi_transfer_one_message() which is called from there. Removing the spinlocks
> > (thus introducing races) makes the code measurably faster (several us).  
> 
> > spi_statistics_add_transfer_stats(): Called twice from
> > spi_transfer_one_message(), and also contains a spinlock. Removing these again
> > has a measurable impact of several us.  
> 
> So for example a sysctl to suppress stats, or making the stats code
> fancier with per cpu data or whatever so it doesn't need to lock in the
> hot path would help here (obviously the latter is a bit nicer).  Might
> be interesting seeing if it's the irqsave bit that's causing trouble
> here, I'm not sure that's actually required other than for the error
> counters.

Sure. It would be great if that was possible, and if it could at least get
close to the performance of my proposal. I assumed it wasn't, because I
thought the accounting interface being part of the user-space API was set in
stone. But if it can be disabled (efficiently), then of course that might be a
far better solution.

> > spi_set_cs(): Removing all delay code and leaving the bare minimum for GPIO
> > based CS activation again has a measurable impact. Most (all?) simple SPI
> > peripheral chips don't have any special CS->clock->CS timing requirements, so
> > it might be a good idea to have a simpler version of this function.  
> 
> Surely the thing there would just be to set the delays to zero if they
> can actually be zero (and add a special case for actually zero delay
> which we don't currently have)?  But in any case devices do always have
> some minimum requirements for delays at various points around asserting
> chip select, plus there's considerations around providing enough ramp
> time for signals to reach appropriate levels which are more system level
> than chip level.  The required delays are normally very small so
> effectively pan out as zero in a lot of systems but the faster things
> run (both on the bus and for the SoC) the more visible they get and more
> attention needs to be paid.  It should be possible to do something to
> assess a delay as being effectively zero and round down which would feed
> in here when we're managing the chip select from software but we can't
> just ignore the delays.
> 
> I note that for example that the MPC2515 quotes minimum chip select
> setup, hold and disable times of 50ns - those are very small, but they
> are non-zero.

In practice, when moving a GPIO pin from code and then start an SPI transfer
immediately, you couldn't get close to 50ns even if you did your best to try.
This is of course under the assumption that the GPIO is local (not an I2C
expander or something like that) and the IO pin slew-rate is in the same
range as the SPI signals... which is probably the case in 99% (100%?) of such
hardware.
SPI peripherals very rarely have CS setup and hold time requirements that
deviate far from the period time of a SPI clock cycle, let alone are so long
that software manipulating a GPIO pin needs to be delayed artificially.
Keep in mind that the _theoretical_ granularity of the current software
implementation is 10us... that is 200 time the CS setup/hold time of the chip
you mention! So if you introduce even the minimal non-zero delay, you'd kill
SPI bandwidth by a factor of 2 for 50 bit transfers.

> > Since this hypothetical new API would be used only for very short, very fast
> > transfers where latency and overhead should be minimized, the best way to do
> > it is obviate all scheduling work and do it strictly synchronous and based on
> > polling. The context switch of even a hard-IRQ can quickly cost a lot more CPU
> > cycles than busy waiting for 48 bits to be shifted through the transmitter at
> > 20+MHz clock. This requires that SPI drivers offer low-level functions that do
> > such simple transfers on polling basis. The patches [1] from Marc Kleine-Budde
> > already do this, but it is the SPI driver that choses whether to use polling or
> > IRQ based transfers based on heuristics calculating the theoretical transfer
> > time given the clock frequency and its size. While it improves the performance
> > in a lot of cases already, peripheral drivers have no choice but to still go
> > through all the heavy code in spi.c.  
> 
> There's a whole pile of assumptions in there about the specific system
> you're running on and how it's going to perform.  Like I said above it
> really feels like this is the wrong level to approach things at, it's
> pushing decisions to the client driver that are really system specific.

Is that always the case? I mean, client drivers should know best what the
requirements are for their SPI transfers, right? Does it need to optimize
latency for very short fast transfers, or does it need to send fire-and-forget
bulk transfers? Does it need to do transfers from interrupt context? Does it
need strictly local-GPIO based CS?
I agree that this will make some combinations of client drivers with some SPI
controllers or GPIO controllers incompatible, but that's just the way it is.
An MCP2518FD is simply unworkable without local-GPIO CS or polling mode SPI.

> Why would a client doing "short" transfers (short itself being a fuzzy
> term) not want to use this interface, and what happens when for example

With "short" I always mean that it takes a time which lies in the order of
magnitude or less than that of context-switch overhead.

> someone puts one of these CAN controllers on a USB dongle which simply
> can't implement a non-blocking mode?  We should aim to do things which

It would need a local microcontroller that does buffering, so the problem
would be translated to the USB domain and the latencies achievable there.
If you are thinking of some sort of SPI over USB interface, that would just
not work for this chip.

> just benefit any client driver using the APIs idiomatically without them
> having to make assumptions about either the performance characteristics
> of the system they're running on or the features it has, especially if
> those assumptions would make the driver unusuable on some systems.

I get what you mean. Yet besides this CAN controller I can point to another
example where this is not so obvious: Oleksij Rempel recently contributed a
IIO-ADC based driver for the TSC2046 touchscreen controller. While this
controller works at a very slow SPI clock where these things are much less of
a problem, it is very sensitive to the SPI CS setup time. Right now, in
order to get the timing right, the driver will add extra data to the beginning
of the SPI transfer [2], which depends on the SPI clock to calculate an exact
delay. Any deviation of CS setup time introduces uncertainty to this timing.
Fortunately each board can specify and tune this time in their device-tree.

> > Since these are low-latency applications, chances are very high that the
> > hardware is also designed for low-latency access, which implies that CS
> > control via GPIO most probably uses local GPIO controllers instead of I2C GPIO
> > expanders for example, so CS access can be assumed to be fast and direct and  
> 
> High chances are not guarantees, and none of this sounds like things
> that should require specific coding in the client drivers.

Ok, but if the hardware is designed to require low latency of SPI messages,
based on a certain SPI clock, bus load and maximum CS setup time? In those
cases, an API that introduces so much overhead, that the maximum achievable
SPI bus load is less than 10% and CS setup time alone exceeds the complete
transfer time needed, might be just unusable.
What I am saying is that IMHO one cannot assume that any combination of SPI
peripheral with any SPI controller is a workable combination. That's just not
the case. A hardware developer that didn't take that into consideration didn't
do his job. But if it is the software that kills the use-case... then it is
not the hardware developers fault anymore... although I'd argue it is at least
in part, because IMHO one should only use hardware components (and
combinations) that are supported in mainline Linux, or account fot the needed
software development ;-)

> > not involve any context switches. It could be argued that it might even be
> > beneficial to have an API that can be called from hard IRQ context, but
> > experiments in this case showed that the gain of doing the CAN message read
> > out directly in hard-IRQ and removing the IRQ thread is minimal. But better
> > use-cases could be conceived, so this possibility might need consideration
> > also.  
> 
> That's something that the async API (or the sync API in the contended
> case) can enable - if we have another transfer queued then we would if
> someone did the work be able to arrange to start pushing it immediately
> the prior transfer completes.

Don't know if I follow you here. You mean that a hard-IRQ schedules the start
of an async transfer and later the interrupt thread function just picks up the
result? That would be a cool way of reading out some registers indeed.
Unfortunately though, again, the context switch from hard-IRQ to interrupt
thread would be preempted by the SPI worker queue, introducing an unbearable
delay due to context-switch overhead alone... at least on my machine (quad
core ARM Cortex-A53).

> > Care should be taken to solve locking in such a way, that it doesn't impact
> > performance for the fast API, while still allowing safe concurrency with
> > spi_sync and spi_async. I did not go as far as to solve this issue. I just
> > used a simple spinlock and carefully avoided using any call to the old API for
> > doing these proof-of-concept measurements.  
> 
> Apart from the hard bits...  :P

Yeah... I am not an expert in locking mechanisms in the kernel. I figured
someone more versed in this could come up with an efficient solution.

> The only bits of the existing code that you've specifically identified
> as taking substantial time here are the delays and the statistics, both
> of these seem like areas which could just be improved in place without
> requiring changes outside of the SPI subsystem that benefit all users.
> It sounds like the bits you've profiled as causing trouble are delays
> and stats synchronisation which does sound plausible but those do also
> seem like they can be specifically targetted - being smarter about when
> we actually do a delay, and either improving the locking or providing
> more optimisation for the stats code.

I tried... but in the end I got the impression that nothing really takes that
much time. It's the whole bunch of code together that is just slow.
To be fair, "slow" is a relative term. We are talking microseconds here.
And I think that's the whole point: The use-case here is comparing direct
local-bus access to peripheral registers (in the order of 100ns) with SPI
transactions that in theory don't take more than 5-10us. It is very easy to
get into single microsecond territory with a few ifs, even if they are not
taken, and even on a decently fast machine.
With the advent of quad-spi in peripheral chips (as opposed to FLASH memory),
this will be even more critical I suppose.

> If there are other bits of the message setup code which are getting in
> the way and don't have obvious paths for optimisation (the validation
> potentially?) then if your application is spamming a lot of the same
> operation (eg, with the status reading in the interrupt handler) then
> quite a while ago Martin Sparl was looking at providing an interface
> which would allow client drivers to pre-cook messages so that they could
> be submitted multiple times without going through the validation that we
> normally do (and perhaps get some driver preparation done as well).  He
> was looking at it for other reasons but it seems like a productive
> approach for cutting down on the setup overhead, it would require more
> up front work in the client but cut down on the amount of work done per
> operation and seems like it should scale well over different systems.

That sounds indeed interesting. I suspect that the message setup is also
relatively CPU intensive.
Do you have a pointer to Martin's work or discussion?
The idea of being able to setup a message (or transfer) once and then only
modify the TX buffer and re-send it avoiding all the message validation and
checking sounds like a worthwhile optimization. That together with a way to
disable all the accounting code efficiently, sounds like it might be just what
is needed here.

> > Performance of spi.c API for the specified use-cases is not ideal.
> > Unfortunately there is no single smoking gun to be pointed at, but instead
> > many different bits which are not needed for the given use-case that add to
> > the bloat and ultimately have a big combined performance impact.
> > The stated usage scenario is fairly common in the Linux kernel. A simple
> > investigation counted 60+ IIO drivers and 9 input drivers alone that use
> > spi_sync*() for example, up to a total of 171 .c files. In contrast only 11 .c
> > files use the spi_async*() calls. This does not account for all users of
> > regmap_spi.
> > Due to this, IMHO one can ask for a better, simpler, more efficient API for
> > these use-cases, am I want to propose to create it.  
> 
> I see the problem, what I don't see is why it requires a new externally
> visible API to solve it beyond the suggestion about potentially
> preparing messages for repeated use if that's even something that's
> really registering.

I doubted it was possible, but you are convincing me that it maybe can be done.
Like I said above: re-use of messages and skipping all the accounting code
would probably be good enough.

[1]
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/can/dev/rx-offload.c?h=v5.18-rc6#n204

[2]
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/iio/adc/ti-tsc2046.c?h=v5.18-rc6#n356

Best regards,

-- 
David Jander
Protonic Holland.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC] A new SPI API for fast, low-latency regmap peripheral access
  2022-05-13 12:46   ` David Jander
@ 2022-05-13 19:36     ` Mark Brown
  2022-05-16 16:28       ` Marc Kleine-Budde
  0 siblings, 1 reply; 25+ messages in thread
From: Mark Brown @ 2022-05-13 19:36 UTC (permalink / raw)
  To: David Jander; +Cc: linux-spi, Marc Kleine-Budde, Oleksij Rempel

[-- Attachment #1: Type: text/plain, Size: 26751 bytes --]

On Fri, May 13, 2022 at 02:46:45PM +0200, David Jander wrote:
> Mark Brown <broonie@kernel.org> wrote:
> > On Thu, May 12, 2022 at 04:34:45PM +0200, David Jander wrote:

> I agree that from the SPI bus standpoint I did not take multiple SPI users
> into account, but that is kind of a special scenario on purpose. I am also a
> hardware designer, and as such I would never hook up an MCP2518FD to an SPI
> bus that I know will also be used for other peripherals at the same time. It
> would kill CAN bus performance. In essence, it would be a broken design. I
> know this is short sighted, because others might still want to do this, so you
> have a good point there.

Right, and of course other system integrators may not be bothered about
CAN bus performance (eg, just needing to send an occasional command vs a
constant data stream).

> On what respects the user-space application talking to the network interface,
> of course it could be multi-threaded, or even multiple applications using CAN,
> but that doesn't change much of what happens in the kernel. The CAN interface
> is still one physical interface, so at the CAN driver level all transactions
> are serialized and can be viewed as single threaded from the SPI driver point
> of view.

That's not entirely true - there's some degree of buffering in your CAN
controller, and obviously more than one device transacting on the bus,
so you can improve throughput by having a new transaction ready to
proceed as soon as the bus becomes free.  Eliminating the round trip to
the host CPU as much as possible is generally going to be beneficial for
a load like you've got, assuming throughput is the main goal.

> Maybe I need to explain some more details about the nature of the CAN bus
> (network): CAN is based on small frames of up to ca 130 bits max, at bit-rates
> that can go up to 1Mbit/s. It is intended for semi-hard real-time communication
> with low or very-low latency. An application should ideally be able to receive
> a single CAN frame and send out a response almost immediately. This is not a
> micro-benchmark in this sense, but a realistic scenario of how CAN bus
> applications work. The J1939 protocol for example is mostly based of
> single-frame command-response schemes, which do not allow for in-flight
> commands. This means that each command needs to be responded to before the next

My understanding was that CAN is a multi-device bus so even if you're
doing single threaded RPC to an individual target there may be multiple
targest being transacted with at once?

> > High level it feels like you are approaching this at the wrong level, as
> > far as I can tell you are proposing a completely separate API for
> > drivers which cuts out all the locking which doesn't seem like a very
> > usable API for them as the performance characteristics are going to end
> > up being very system and application specific, it's going to be hard for
> > them to assess which API to use.  I'm not seeing anything here that says
> > add a new API, there's certainly room to make the existing API more
> > performant but I don't see the jump to a new API.

> Of course it would be great if we could just make the current sync API more
> performant, but I had given up trying. Maybe if we can somehow implement the
> fast-path in the same API that might work, but that seems like a daunting task
> to me. This is an RFC, so I am open to suggestions of course.

Well, you've identifed a couple of candidates for optimisation here...
I would start off by trying to address the specific hot points that
you've identified and see where that takes us, like I said nothing I'm
hearing sounds like anything other than trying to optimise the hot path
and it doesn't seem like anything was really tried there yet.

> > That's not really a good description of what's going on with those APIs
> > or choosing between them - there's certainly no short/long message
> > distinction intended, the distinction is more about if the user needs to
> > get the results of the transfer before proceeding with a side order of
> > the sync APIs being more convenient to use.  In general if a user has a
> > lot of work to do and doesn't specifically need to block for it there
> > will often be a performance advantage in using the async API, even if
> > the individual messages are short.  Submitting asynchronously means that

> I agree that the async API has perfectly valid use-cases, even for shorter
> messages, but this is very rarely the case for register-based peripherals like
> this CAN controller. To prove my point, have a look at the CPU load figures I
> showed in my first email: The current sync API uses 2.5 times the CPU time of
> the fast API. And this is still without a context switch, but it tells
> you something about the portion of CPU cycles consumed by the polling loop. I
> have a hard time believing that doing this same task asynchronously wouldn't
> use significantly more CPU time even. Maybe we should measure it to settle the
> case?

How much of that time is due to the API being async vs just nobody
having tried to optimise the slow bits?  Also bear in mind that
a huge proportion of systems have multiple CPUs available these days so
CPU time isn't wall clock time - one of the advantages is that we can
start sending the initial transfers while working out what else to send
which is useful if there's anything non-trivial going on there.  If the
driver is doing no work in between messages and the messages are all
short enough to do using PIO on the system concerned then yes, it's not
going to get us terribly far unless we can arrange to push the entire
queue from DMA or something which would be complicated.

> > we are more likely to be able to start pushing a new message immediately
> > after completion of one message which minimises dead time on the bus,

> 10us per SPI transfer with the fast API vs 30us per transfer on the current
> sync API, with the possibility of a single context switch taking several
> microseconds more (at least), make me think that this is not always the case,
> but maybe I am missing something?

It's not 100% guaranteed but it's fairly easy to build up, especially if
there's anything non-trivial going on between register writes or any
writes get large enough to be worth DMAing.

> > and opens up opportunities for preparing one message while the current
> > one is in flight that don't otherwise exist (opportunities than we
> > currently make much use of).  Conversely a large message where we need
> > the result in order to proceed is going to do just fine with the sync
> > API, it is actually a synchronous operation after all.
> 
> I agree for the case of big sync transfers (under the assumption that
> interruptible sleep is used of course).
> The point I am trying to make is that the time required for small, fast
> transfers (16-64bits on 20+MHz clock, a very common use-case) is often far
> less, or at best in the order of magnitude of the time all the extra code in
> spi.c takes to execute, and definitely much lower than what a context switch
> would take. A simple look at the scope signals of such SPI transfers just
> tells all the story.
> What this means, is that you can optimize for concurrency however much you
> want, you will ALWAYS be slower and consume more CPU time than simple fast
> low-overhead uninterruptible polling approach. Independent of whether you need
> the result for the next operation or not.

This is the entire reason that spi_sync() runs inline when the bus is
uncontended rather than punting to the workqueue - but it just does that
without requiring that client drivers use a completely separate API to
trigger this behaviour.  If a client needs to wait for the message to
finish before doing the next one or is only doing a small number of
transfers before it needs to wait or do something else then spi_sync()
is absolutely a good interface for it, the more work is being done in a
single batch either for each message or in number of messages the more
likely it is that we will be able to get something out of spi_async(),
either now or in the future.

> > One example of lots of potentially short messages is regmap's cache sync
> > code, it uses async writes to sync the register map so that it can
> > overlap working out what to sync and marshalling the data with the
> > actual writes - that tends to pan out as meaning that the sync completes
> > faster since we can often identify and queue several more registers to
> > write in the time it takes to send the first one which works out faster
> > even with PIO controllers and bouncing to the worker thread, the context
> > switching ends up being less than the time taken to work out what to
> > send for even a fairly small number of registers.

> Can you point to an example of this? I think it will depend a lot on SPI clock
> and the size and amount of registers in the regmap... and whether multiple
> registers can be written in one SPI transfer or not.

Look at register maps like wm5102 for an example of a larger register
map where the wins start to add up - I can't remember exactly but I
suspect that was one of my main targets when I was benchmarking this.

IIRC for smaller cases the difference between sync and async ended up
being mostly a wash, especially with multiple CPUs available but even on
a single CPU you ended up doing a couple of context switches or
something (one switching to the SPI pump, then another switching back
when everything was done) which is non-zero but in this context totally
fine.

> > > Assuming the *sync* API cannot be changed, this leads to the need to introduce
> > > a new API optimized specifically for this use-case. IMHO it is also reasonable
> > > to say that when accessing registers on a peripheral device, the whole
> > > statistics and accounting machinery in spi.c isn't really so valuable, and
> > > definitely not worth its overhead in a production system.  

> > That seems like a massive assumption, one could equally say that any
> > application that is saturating the SPI bus is going to be likely to want
> > to be able to monitor the performance and utilisation of the bus in
> > order to facilitate optimisation and so want the statistics.

> I agree. But do you really need to have that code in the hot-path always,
> instead of only during development, debugging and testing?

As Andrew pointed out the networking code manages to have stats code in
hot paths that a large number of people care deeply about, I am fairly
confident that the overhead can be reduced to the point where it is much
less impactful next to the actual I/O.  The closer we can keep people's
production and development setups the easier life tends to be for them,
and even if it's something selected at build time having to use a
completely separate set of calls in the client driver is going to be
needlessly painful to use.

Not saying there's nothing to improve here, just that a completely
separate API doesn't seem like the answer.

> > >  1. CAN message RX, IRQ line asserted
> > >  2. Hard IRQ empty, starts IRQ thread
> > >  3. IRQ thread interrogates MCP2518FD via register access:
> > >  3.1. SPI transfer 1: CS low, 72bit xfer, CS high
> > >  3.2. SPI transfer 2: CS low, 200bit xfer, CS high
> > >  3.3. SPI transfer 3: CS low, 72bit xfer, CS high
> > >  3.4. SPI transfer 4: CS low, 104bit xfer, CS high
> > >  4. IRQ thread ended, RX message gets delivered to user-space
> > >  5. canecho.c recv()
> > >  6. canecho.c send()
> > >  7. TX message gets delivered to CAN driver
> > >  8. CAN driver does spi_async to queue 2 xfers (replace by spi_sync equivalent
> > >  in kernel C):
> > >  8.1. SPI message 1: CS low, 168bit xfer, CS high, CS low, 48bit xfer, CS high
> > >  9. CAN message SOF starts appearing on the bus just before last CS high.  
> > Note that this is all totally single threaded and sequential which is
> > going to change the performance characteristics substantially, for
> > example adding and driving another echo server or even just having the
> > remote end push another message into flight before it waits for a
> > response would get more mileage out of the async API.

> AFAIK, it doesn't really matter how many echo servers you have in user-space.

You should be able to get a situation where one end of the link is
reading a message from it's CAN controller, the other end of the link is
sending a message to it's CAN controller and one of the CAN controllers
is sending on the bus.  Ideally (assuming CAN and the controllers cope
well with this) at least one of the controllers should always have a
message queued so that as soon as the bus goes idle another message can
start sending.  It did look like that was viable from the datasheet I
glanced through but...

> In the end, inside the CAN driver it is just one queue that talks to the SPI
> driver in exactly the same way. And like I explained above, this is a pretty
> common situation in CAN networks like J1939, where 90% of the time a single
> remote-ECU is talking to a single process on the local machine, and the only
> thing that matters is single-in-flight response time.

...if that's not really what an application is going to do I guess it
doesn't matter here.  The test looked like it was going for throughput
rather than latency.

> > So for example a sysctl to suppress stats, or making the stats code
> > fancier with per cpu data or whatever so it doesn't need to lock in the
> > hot path would help here (obviously the latter is a bit nicer).  Might
> > be interesting seeing if it's the irqsave bit that's causing trouble
> > here, I'm not sure that's actually required other than for the error
> > counters.

> Sure. It would be great if that was possible, and if it could at least get
> close to the performance of my proposal. I assumed it wasn't, because I
> thought the accounting interface being part of the user-space API was set in
> stone. But if it can be disabled (efficiently), then of course that might be a
> far better solution.

Worst case if it's not possible to make it fast enough and people are
relying on the numbers then we could have a control that defaults it to
on, then if someone explicitly chooses to turn it off then they can
presumably cope.

> > > spi_set_cs(): Removing all delay code and leaving the bare minimum for GPIO
> > > based CS activation again has a measurable impact. Most (all?) simple SPI
> > > peripheral chips don't have any special CS->clock->CS timing requirements, so
> > > it might be a good idea to have a simpler version of this function.  

> > Surely the thing there would just be to set the delays to zero if they
> > can actually be zero (and add a special case for actually zero delay
> > which we don't currently have)?  But in any case devices do always have

...

> > attention needs to be paid.  It should be possible to do something to
> > assess a delay as being effectively zero and round down which would feed
> > in here when we're managing the chip select from software but we can't
> > just ignore the delays.

> In practice, when moving a GPIO pin from code and then start an SPI transfer
> immediately, you couldn't get close to 50ns even if you did your best to try.
> This is of course under the assumption that the GPIO is local (not an I2C
> expander or something like that) and the IO pin slew-rate is in the same
> range as the SPI signals... which is probably the case in 99% (100%?) of such
> hardware.

That's the "assess the delay as effectively zero" case I mentioned
above, we should definitely do that.

> SPI peripherals very rarely have CS setup and hold time requirements that
> deviate far from the period time of a SPI clock cycle, let alone are so long
> that software manipulating a GPIO pin needs to be delayed artificially.

There's a bunch of things which have a requirement expressed in terms of
their master clock rather than the SPI clock (see SPI_DELAY_UNIT_SCK) or
absolute numbers which can push things up, or special requirements when
in low power states (not just low power states where the MCLK speed is
reduced!).  It's not the default case by any mens but this control is
there for a reason.

> > There's a whole pile of assumptions in there about the specific system
> > you're running on and how it's going to perform.  Like I said above it
> > really feels like this is the wrong level to approach things at, it's
> > pushing decisions to the client driver that are really system specific.

> Is that always the case? I mean, client drivers should know best what the
> requirements are for their SPI transfers, right? Does it need to optimize

The definitions of "short" and "fast" are going to be per system things,
and the imagination of system integrators can be surprising.  Even with
something like CAN you're looking at it for what sounds like the
original ECU application but someone else might decide that it's a good
fit for some completely different application that uses the bus in a
different way and so don't build the hardware in a way that's expected.

Clients should just do whatever they need to do and let the core figure
out how that maps on to the particular controller hardware that the
system has, the core should be able to scale things gracefully even for
surprising system designs.  We don't want to have clients doing things
that make the driver totally unusable on some systems, or which cause
performance to fall off a cliff if the system behaves unexpectedly.

> latency for very short fast transfers, or does it need to send fire-and-forget
> bulk transfers?

That's already in the API, a client can send whatever transfer length it
needs and it can send them async.  To put it another way, why would a
driver ever choose something other than the fast API?

>                 Does it need to do transfers from interrupt context? Does it
> need strictly local-GPIO based CS?

We would need something new to have transfers run entirely in hard
interrupt context (ie, waiting for the transfer to complete rather than
starting/stopping) and TBH I'd take some convincing that it was a use
case that makes sense - it's really rude to everything else in the
system, and we do have tools for configuring the scheduling threaded
work.

Requiring GPIO based chip selects definitely feels suspicious, even
ignoring system integrators being surprising there's perfectly
performant and flexible controllers that don't have GPIO based chip
selects at all.

> > Why would a client doing "short" transfers (short itself being a fuzzy
> > term) not want to use this interface, and what happens when for example

> With "short" I always mean that it takes a time which lies in the order of
> magnitude or less than that of context-switch overhead.

Both the transfer and context switch times being variable of course -
there's going to be cutoffs somewhere and those cutoffs have much wider
ranges than a driver author might anticipate.  A small microcontroller
style SoC and a cutting edge mobile phone SoC are going to have rather
different characteristics on the CPU side.

> > just benefit any client driver using the APIs idiomatically without them
> > having to make assumptions about either the performance characteristics
> > of the system they're running on or the features it has, especially if
> > those assumptions would make the driver unusuable on some systems.

> I get what you mean. Yet besides this CAN controller I can point to another
> example where this is not so obvious: Oleksij Rempel recently contributed a
> IIO-ADC based driver for the TSC2046 touchscreen controller. While this
> controller works at a very slow SPI clock where these things are much less of
> a problem, it is very sensitive to the SPI CS setup time. Right now, in
> order to get the timing right, the driver will add extra data to the beginning
> of the SPI transfer [2], which depends on the SPI clock to calculate an exact
> delay. Any deviation of CS setup time introduces uncertainty to this timing.
> Fortunately each board can specify and tune this time in their device-tree.

I really don't see the concern or conflict here?  The DT property there
is the settling time before the samples are usable which is perfectly
normal for ADCs.

> > High chances are not guarantees, and none of this sounds like things
> > that should require specific coding in the client drivers.

> Ok, but if the hardware is designed to require low latency of SPI messages,
> based on a certain SPI clock, bus load and maximum CS setup time? In those
> cases, an API that introduces so much overhead, that the maximum achievable
> SPI bus load is less than 10% and CS setup time alone exceeds the complete
> transfer time needed, might be just unusable.

If the performance is that bad then why only improve the performance for
some subset of client drivers that happen to use some different API
rather than for any driver?  I'm not sure there's any devices out there
that explicitly want poor performance.

> > > not involve any context switches. It could be argued that it might even be
> > > beneficial to have an API that can be called from hard IRQ context, but
> > > experiments in this case showed that the gain of doing the CAN message read
> > > out directly in hard-IRQ and removing the IRQ thread is minimal. But better
> > > use-cases could be conceived, so this possibility might need consideration
> > > also.  

> > That's something that the async API (or the sync API in the contended
> > case) can enable - if we have another transfer queued then we would if
> > someone did the work be able to arrange to start pushing it immediately
> > the prior transfer completes.

> Don't know if I follow you here. You mean that a hard-IRQ schedules the start
> of an async transfer and later the interrupt thread function just picks up the
> result? That would be a cool way of reading out some registers indeed.

I was thinking more about the case where there's a series of operations
to perform so that when one completes we can immediately initiate the
next, for example in a multi transfer message starting the second
transfer as the first transfer finishes.  However the case you mention
where we know the operation we will want to do when the device
interrupts and have it similarly prepared for immediate start from
interrupt context is another valid one which makes a lot of sense.

> Unfortunately though, again, the context switch from hard-IRQ to interrupt
> thread would be preempted by the SPI worker queue, introducing an unbearable
> delay due to context-switch overhead alone... at least on my machine (quad
> core ARM Cortex-A53).

To be clear we don't do this yet.  The whole point would be to avoid the
need to involve any thread in the hot path, in a way that would benefit
as many client and controller drivers as possible.

> > If there are other bits of the message setup code which are getting in
> > the way and don't have obvious paths for optimisation (the validation
> > potentially?) then if your application is spamming a lot of the same
> > operation (eg, with the status reading in the interrupt handler) then
> > quite a while ago Martin Sparl was looking at providing an interface
> > which would allow client drivers to pre-cook messages so that they could
> > be submitted multiple times without going through the validation that we
> > normally do (and perhaps get some driver preparation done as well).  He
> > was looking at it for other reasons but it seems like a productive
> > approach for cutting down on the setup overhead, it would require more
> > up front work in the client but cut down on the amount of work done per
> > operation and seems like it should scale well over different systems.

> That sounds indeed interesting. I suspect that the message setup is also
> relatively CPU intensive.
> Do you have a pointer to Martin's work or discussion?

Not off hand sorry, and it was more at the design stage than
implementation IIRC.  Hopefuly lore can help?  I'll try to dig out some
references later but it was quite a while ago.  His application was even
lower latency than what you're proposing, basically getting messages
prepared so that all the management of the chip select and controller
as well as the data transfer could be done through a sequence of DMA
operations with no intervention from the CPU and we just needed to queue
those DMA operations (on a Pi, which has suitable controller and DMA
hardware) and was therefore both faster and lower CPU load in the hot
path.  Extremely cool, but hard to abstract on the controller side.

> The idea of being able to setup a message (or transfer) once and then only
> modify the TX buffer and re-send it avoiding all the message validation and
> checking sounds like a worthwhile optimization. That together with a way to
> disable all the accounting code efficiently, sounds like it might be just what
> is needed here.

High level (and certainly for your application I think) that would be
making an API that calls __spi_validate() and then sets a flag saying
validation was done.  Then when a message is actually sent if the flag
was set then don't bother doing the validation, and have an unvalidate
(or rather unprepare I guess) API to be called when done with the
message.  So something like

	spi_prepare_message(msg);

	spi_sync(msg);
	spi_sync(msg);

	spi_unprepare_message(msg);

(we already have a prepare on the controller side so probably best think
of another name there.)  We'd probably need to require that callers
check to see if things are DMA mapped and handle mapping/unmapping if
they access the data in between uses.

> > I see the problem, what I don't see is why it requires a new externally
> > visible API to solve it beyond the suggestion about potentially
> > preparing messages for repeated use if that's even something that's
> > really registering.

> I doubted it was possible, but you are convincing me that it maybe can be done.
> Like I said above: re-use of messages and skipping all the accounting code
> would probably be good enough.

OK, great.  You've definitely pointed out some things that could do with
improvement in the existing code regardless of any API level stuff.
I'll try to poke at the delay code myself.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC] A new SPI API for fast, low-latency regmap peripheral access
  2022-05-13 19:36     ` Mark Brown
@ 2022-05-16 16:28       ` Marc Kleine-Budde
  2022-05-16 17:46         ` Mark Brown
  0 siblings, 1 reply; 25+ messages in thread
From: Marc Kleine-Budde @ 2022-05-16 16:28 UTC (permalink / raw)
  To: Mark Brown; +Cc: David Jander, linux-spi, Oleksij Rempel

[-- Attachment #1: Type: text/plain, Size: 31959 bytes --]

On 13.05.2022 20:36:51, Mark Brown wrote:
> > On what respects the user-space application talking to the network interface,
> > of course it could be multi-threaded, or even multiple applications using CAN,
> > but that doesn't change much of what happens in the kernel. The CAN interface
> > is still one physical interface, so at the CAN driver level all transactions
> > are serialized and can be viewed as single threaded from the SPI driver point
> > of view.
> 
> That's not entirely true - there's some degree of buffering in your CAN
> controller, and obviously more than one device transacting on the bus,
> so you can improve throughput by having a new transaction ready to
> proceed as soon as the bus becomes free.  Eliminating the round trip to
> the host CPU as much as possible is generally going to be beneficial for
> a load like you've got, assuming throughput is the main goal.

The TX path in the mcp251xfd driver uses spi_async() call directly from
the ndo_start_xmit callback. I've a proof of concept patch (not
mainline) that adds xmit_more() support, resulting in sending more than
one TX CAN frame with one SPI transfer.

The mainline mcp251xfd CAN driver already supports software timer based
IRQ coalescing, where the RX and/or TX-complete IRQ stay disabled for a
configurable amount of time after serving the interrupt. That
dramatically reduces the number of SPI transfers as several
RX/TX-complete CAN frames can be transferred in one go.

Both of these measures trade latency for throughput, but David is
looking especially for low latency.

> > Maybe I need to explain some more details about the nature of the CAN bus
> > (network): CAN is based on small frames of up to ca 130 bits max, at bit-rates
> > that can go up to 1Mbit/s. It is intended for semi-hard real-time communication
> > with low or very-low latency. An application should ideally be able to receive
> > a single CAN frame and send out a response almost immediately. This is not a
> > micro-benchmark in this sense, but a realistic scenario of how CAN bus
> > applications work. The J1939 protocol for example is mostly based of
> > single-frame command-response schemes, which do not allow for in-flight
> > commands. This means that each command needs to be responded to before the next
> 
> My understanding was that CAN is a multi-device bus so even if you're
> doing single threaded RPC to an individual target there may be multiple
> targest being transacted with at once?

There are several use cases for CAN busses. Some prefer low latency
(from RX-IRQ to CAN frame in user space), while others have to deal with
100% CAN bus load. 100% bus load translates to 8k...16k CAN frames per
second, depending on CAN frame size and bit rate. Without coalescing
there's 1 IRQ per CAN frame, with coalescing it's much better (< 1/10 of
IRQ rate).

> > > High level it feels like you are approaching this at the wrong level, as
> > > far as I can tell you are proposing a completely separate API for
> > > drivers which cuts out all the locking which doesn't seem like a very
> > > usable API for them as the performance characteristics are going to end
> > > up being very system and application specific, it's going to be hard for
> > > them to assess which API to use.  I'm not seeing anything here that says
> > > add a new API, there's certainly room to make the existing API more
> > > performant but I don't see the jump to a new API.
> 
> > Of course it would be great if we could just make the current sync API more
> > performant, but I had given up trying. Maybe if we can somehow implement the
> > fast-path in the same API that might work, but that seems like a daunting task
> > to me. This is an RFC, so I am open to suggestions of course.
> 
> Well, you've identifed a couple of candidates for optimisation here...
> I would start off by trying to address the specific hot points that
> you've identified and see where that takes us, like I said nothing I'm
> hearing sounds like anything other than trying to optimise the hot path
> and it doesn't seem like anything was really tried there yet.

I thinks the statistics are an easy target.

> > > That's not really a good description of what's going on with those APIs
> > > or choosing between them - there's certainly no short/long message
> > > distinction intended, the distinction is more about if the user needs to
> > > get the results of the transfer before proceeding with a side order of
> > > the sync APIs being more convenient to use.  In general if a user has a
> > > lot of work to do and doesn't specifically need to block for it there
> > > will often be a performance advantage in using the async API, even if
> > > the individual messages are short.  Submitting asynchronously means that
> 
> > I agree that the async API has perfectly valid use-cases, even for shorter
> > messages, but this is very rarely the case for register-based peripherals like
> > this CAN controller. To prove my point, have a look at the CPU load figures I
> > showed in my first email: The current sync API uses 2.5 times the CPU time of
> > the fast API. And this is still without a context switch, but it tells
> > you something about the portion of CPU cycles consumed by the polling loop. I
> > have a hard time believing that doing this same task asynchronously wouldn't
> > use significantly more CPU time even. Maybe we should measure it to settle the
> > case?
> 
> How much of that time is due to the API being async vs just nobody
> having tried to optimise the slow bits?

You mean slow bits of the mcp251xfd driver?

The TX path of the mcp251xfd driver is fire and forget (see above) and
the spi_async() API fits quite good here....

> Also bear in mind that
> a huge proportion of systems have multiple CPUs available these days so
> CPU time isn't wall clock time - one of the advantages is that we can
> start sending the initial transfers while working out what else to send
> which is useful if there's anything non-trivial going on there.  If the
> driver is doing no work in between messages and the messages are all
> short enough to do using PIO on the system concerned then yes,

If the chip throws an IRQ it's more complicated. The driver has to read
the IRQ status register before it knows which IRQ to handle. For the RX
IRQ it has to read the fill level from the RX FIFO, then it can finally
read the RX'ed CAN frames. The driver does read as many frames as
possible with one transfer.

So the hot path is the RX IRQ, TX-complete IRQ handling is similar but
not as important (as the hardware has a buffer for outgoing TX frames),
the other IRQ sources are various kinds of error handling.

> it's not
> going to get us terribly far unless we can arrange to push the entire
> queue from DMA or something which would be complicated.

One optimization is to not only read the the IRQ status register, but
also the RX FIFO status register, in hope the IRQ is a RX IRQ. If there
are pending TX frames, the TX FIFO status register can be read, too.

The driver uses regmap, which makes this a bit more complicated. I hope
I can hide this behind regmap's caching feature.

The memory layout of this chip is not really great for this. Reading IRQ
status + RX FIFO status (with one transfer) means a transfer length of
72 byes, in other words an overhead of 64 bytes. :/

What's the typical length of a SPI FIFO? On the imx6 it's 64 bytes. If
the SPI host controller needs to refill the TX FIFO you probably have an
additional IRQ per transfer, unless you are in polling mode.

In our measurements of the spi-imx driver we found that IRQ based PIO
mode is faster than DMA mode for these medium sized transfers, while
polled PIO mode is faster for small transfers. I don't have the exact
cutoff numbers. Even for this driver they depends on the used SoC.

You see, it's quite a guessing game, depending on the SPI host driver
(quality) and SoC, if 2 small or 1 lager transfer is better.

> > > we are more likely to be able to start pushing a new message immediately
> > > after completion of one message which minimises dead time on the bus,
> 
> > 10us per SPI transfer with the fast API vs 30us per transfer on the current
> > sync API, with the possibility of a single context switch taking several
> > microseconds more (at least), make me think that this is not always the case,
> > but maybe I am missing something?
> 
> It's not 100% guaranteed but it's fairly easy to build up, especially if
> there's anything non-trivial going on between register writes or any
> writes get large enough to be worth DMAing.

It's mostly a register _read_ problem, not a _write_ problem.

Compared to the regmap the mcp251xfd driver does trivial stuff in the RX
path. We need the information of the 1st read to do the 2nd read and the
information of the 2nd read to do the actual read of the CAN frame. The
length of these individual transfers is not worth DMAing - polled PIO is
the fastest approach here.

The writes in the hot path of the interrupt handler are done in a single
call to spi_sync_transfer() (consisting of several transfers). This
might be converted into an async transfer.

> > > and opens up opportunities for preparing one message while the current
> > > one is in flight that don't otherwise exist (opportunities than we
> > > currently make much use of).  Conversely a large message where we need
> > > the result in order to proceed is going to do just fine with the sync
> > > API, it is actually a synchronous operation after all.
> > 
> > I agree for the case of big sync transfers (under the assumption that
> > interruptible sleep is used of course).
> > The point I am trying to make is that the time required for small, fast
> > transfers (16-64bits on 20+MHz clock, a very common use-case) is often far
> > less, or at best in the order of magnitude of the time all the extra code in
> > spi.c takes to execute, and definitely much lower than what a context switch
> > would take. A simple look at the scope signals of such SPI transfers just
> > tells all the story.
> > What this means, is that you can optimize for concurrency however much you
> > want, you will ALWAYS be slower and consume more CPU time than simple fast
> > low-overhead uninterruptible polling approach. Independent of whether you need
> > the result for the next operation or not.
> 
> This is the entire reason that spi_sync() runs inline when the bus is
> uncontended rather than punting to the workqueue - but it just does that
> without requiring that client drivers use a completely separate API to
> trigger this behaviour.  If a client needs to wait for the message to
> finish before doing the next one or is only doing a small number of
> transfers before it needs to wait or do something else then spi_sync()
> is absolutely a good interface for it, the more work is being done in a
> single batch either for each message or in number of messages the more
> likely it is that we will be able to get something out of spi_async(),
> either now or in the future.

As David measured the overhead of checking, housekeeping, statistics,
etc... in spi.c for sync transfers is more than twice as big as the
actual transfer itself.

> > > One example of lots of potentially short messages is regmap's cache sync
> > > code, it uses async writes to sync the register map so that it can
> > > overlap working out what to sync and marshalling the data with the
> > > actual writes - that tends to pan out as meaning that the sync completes
> > > faster since we can often identify and queue several more registers to
> > > write in the time it takes to send the first one which works out faster
> > > even with PIO controllers and bouncing to the worker thread, the context
> > > switching ends up being less than the time taken to work out what to
> > > send for even a fairly small number of registers.

All writes in the hot path either async (for TX) already or they are
several writes of the same value to the same register with toggling CS
in between. These are done with a single spi_sync_transfer() (with
work^W opportunity on our side to convert to async, too).

> > Can you point to an example of this? I think it will depend a lot on SPI clock
> > and the size and amount of registers in the regmap... and whether multiple
> > registers can be written in one SPI transfer or not.
> 
> Look at register maps like wm5102 for an example of a larger register
> map where the wins start to add up - I can't remember exactly but I
> suspect that was one of my main targets when I was benchmarking this.
> 
> IIRC for smaller cases the difference between sync and async ended up
> being mostly a wash, especially with multiple CPUs available but even on
> a single CPU you ended up doing a couple of context switches or
> something (one switching to the SPI pump, then another switching back
> when everything was done) which is non-zero but in this context totally
> fine.
> 
> > > > Assuming the *sync* API cannot be changed, this leads to the need to introduce
> > > > a new API optimized specifically for this use-case. IMHO it is also reasonable
> > > > to say that when accessing registers on a peripheral device, the whole
> > > > statistics and accounting machinery in spi.c isn't really so valuable, and
> > > > definitely not worth its overhead in a production system.  
> 
> > > That seems like a massive assumption, one could equally say that any
> > > application that is saturating the SPI bus is going to be likely to want
> > > to be able to monitor the performance and utilisation of the bus in
> > > order to facilitate optimisation and so want the statistics.
> 
> > I agree. But do you really need to have that code in the hot-path always,
> > instead of only during development, debugging and testing?
> 
> As Andrew pointed out the networking code manages to have stats code in
> hot paths that a large number of people care deeply about, I am fairly
> confident that the overhead can be reduced to the point where it is much
> less impactful next to the actual I/O.  The closer we can keep people's
> production and development setups the easier life tends to be for them,
> and even if it's something selected at build time having to use a
> completely separate set of calls in the client driver is going to be
> needlessly painful to use.
> 
> Not saying there's nothing to improve here, just that a completely
> separate API doesn't seem like the answer.
> 
> > > >  1. CAN message RX, IRQ line asserted
> > > >  2. Hard IRQ empty, starts IRQ thread
> > > >  3. IRQ thread interrogates MCP2518FD via register access:
> > > >  3.1. SPI transfer 1: CS low, 72bit xfer, CS high
> > > >  3.2. SPI transfer 2: CS low, 200bit xfer, CS high
> > > >  3.3. SPI transfer 3: CS low, 72bit xfer, CS high
> > > >  3.4. SPI transfer 4: CS low, 104bit xfer, CS high
> > > >  4. IRQ thread ended, RX message gets delivered to user-space
> > > >  5. canecho.c recv()
> > > >  6. canecho.c send()
> > > >  7. TX message gets delivered to CAN driver
> > > >  8. CAN driver does spi_async to queue 2 xfers (replace by spi_sync equivalent
> > > >  in kernel C):
> > > >  8.1. SPI message 1: CS low, 168bit xfer, CS high, CS low, 48bit xfer, CS high
> > > >  9. CAN message SOF starts appearing on the bus just before last CS high.  
> > > Note that this is all totally single threaded and sequential which is
> > > going to change the performance characteristics substantially, for
> > > example adding and driving another echo server or even just having the
> > > remote end push another message into flight before it waits for a
> > > response would get more mileage out of the async API.
> 
> > AFAIK, it doesn't really matter how many echo servers you have in user-space.
> 
> You should be able to get a situation where one end of the link is
> reading a message from it's CAN controller, the other end of the link is
> sending a message to it's CAN controller and one of the CAN controllers
> is sending on the bus.  Ideally (assuming CAN and the controllers cope
> well with this) at least one of the controllers should always have a
> message queued so that as soon as the bus goes idle another message can
> start sending.  It did look like that was viable from the datasheet I
> glanced through but...
>
> > In the end, inside the CAN driver it is just one queue that talks to the SPI
> > driver in exactly the same way. And like I explained above, this is a pretty
> > common situation in CAN networks like J1939, where 90% of the time a single
> > remote-ECU is talking to a single process on the local machine, and the only
> > thing that matters is single-in-flight response time.
> 
> ...if that's not really what an application is going to do I guess it
> doesn't matter here.  The test looked like it was going for throughput
> rather than latency.

See above, not all use cases are about throughput, David's about latency.

> > > So for example a sysctl to suppress stats, or making the stats code
> > > fancier with per cpu data or whatever so it doesn't need to lock in the
> > > hot path would help here (obviously the latter is a bit nicer).  Might
> > > be interesting seeing if it's the irqsave bit that's causing trouble
> > > here, I'm not sure that's actually required other than for the error
> > > counters.
> 
> > Sure. It would be great if that was possible, and if it could at least get
> > close to the performance of my proposal. I assumed it wasn't, because I
> > thought the accounting interface being part of the user-space API was set in
> > stone. But if it can be disabled (efficiently), then of course that might be a
> > far better solution.
> 
> Worst case if it's not possible to make it fast enough and people are
> relying on the numbers then we could have a control that defaults it to
> on, then if someone explicitly chooses to turn it off then they can
> presumably cope.
> 
> > > > spi_set_cs(): Removing all delay code and leaving the bare minimum for GPIO
> > > > based CS activation again has a measurable impact. Most (all?) simple SPI
> > > > peripheral chips don't have any special CS->clock->CS timing requirements, so
> > > > it might be a good idea to have a simpler version of this function.  
> 
> > > Surely the thing there would just be to set the delays to zero if they
> > > can actually be zero (and add a special case for actually zero delay
> > > which we don't currently have)?  But in any case devices do always have
> 
> ...
> 
> > > attention needs to be paid.  It should be possible to do something to
> > > assess a delay as being effectively zero and round down which would feed
> > > in here when we're managing the chip select from software but we can't
> > > just ignore the delays.
> 
> > In practice, when moving a GPIO pin from code and then start an SPI transfer
> > immediately, you couldn't get close to 50ns even if you did your best to try.
> > This is of course under the assumption that the GPIO is local (not an I2C
> > expander or something like that) and the IO pin slew-rate is in the same
> > range as the SPI signals... which is probably the case in 99% (100%?) of such
> > hardware.
> 
> That's the "assess the delay as effectively zero" case I mentioned
> above, we should definitely do that.
> 
> > SPI peripherals very rarely have CS setup and hold time requirements that
> > deviate far from the period time of a SPI clock cycle, let alone are so long
> > that software manipulating a GPIO pin needs to be delayed artificially.
> 
> There's a bunch of things which have a requirement expressed in terms of
> their master clock rather than the SPI clock (see SPI_DELAY_UNIT_SCK) or
> absolute numbers which can push things up, or special requirements when
> in low power states (not just low power states where the MCLK speed is
> reduced!).  It's not the default case by any mens but this control is
> there for a reason.
> 
> > > There's a whole pile of assumptions in there about the specific system
> > > you're running on and how it's going to perform.  Like I said above it
> > > really feels like this is the wrong level to approach things at, it's
> > > pushing decisions to the client driver that are really system specific.
> 
> > Is that always the case? I mean, client drivers should know best what the
> > requirements are for their SPI transfers, right? Does it need to optimize
> 
> The definitions of "short" and "fast" are going to be per system things,
> and the imagination of system integrators can be surprising.  Even with
> something like CAN you're looking at it for what sounds like the
> original ECU application but someone else might decide that it's a good
> fit for some completely different application that uses the bus in a
> different way and so don't build the hardware in a way that's expected.
> 
> Clients should just do whatever they need to do and let the core figure
> out how that maps on to the particular controller hardware that the
> system has, the core should be able to scale things gracefully even for
> surprising system designs.  We don't want to have clients doing things
> that make the driver totally unusable on some systems, or which cause
> performance to fall off a cliff if the system behaves unexpectedly.
> 
> > latency for very short fast transfers, or does it need to send fire-and-forget
> > bulk transfers?
> 
> That's already in the API, a client can send whatever transfer length it
> needs and it can send them async.  To put it another way, why would a
> driver ever choose something other than the fast API?
> 
> >                 Does it need to do transfers from interrupt context? Does it
> > need strictly local-GPIO based CS?
> 
> We would need something new to have transfers run entirely in hard
> interrupt context (ie, waiting for the transfer to complete rather than
> starting/stopping) and TBH I'd take some convincing that it was a use
> case that makes sense - it's really rude to everything else in the
> system, and we do have tools for configuring the scheduling threaded
> work.
> 
> Requiring GPIO based chip selects definitely feels suspicious, even
> ignoring system integrators being surprising there's perfectly
> performant and flexible controllers that don't have GPIO based chip
> selects at all.
> 
> > > Why would a client doing "short" transfers (short itself being a fuzzy
> > > term) not want to use this interface, and what happens when for example
> 
> > With "short" I always mean that it takes a time which lies in the order of
> > magnitude or less than that of context-switch overhead.
> 
> Both the transfer and context switch times being variable of course -
> there's going to be cutoffs somewhere and those cutoffs have much wider
> ranges than a driver author might anticipate.  A small microcontroller
> style SoC and a cutting edge mobile phone SoC are going to have rather
> different characteristics on the CPU side.
> 
> > > just benefit any client driver using the APIs idiomatically without them
> > > having to make assumptions about either the performance characteristics
> > > of the system they're running on or the features it has, especially if
> > > those assumptions would make the driver unusuable on some systems.
> 
> > I get what you mean. Yet besides this CAN controller I can point to another
> > example where this is not so obvious: Oleksij Rempel recently contributed a
> > IIO-ADC based driver for the TSC2046 touchscreen controller. While this
> > controller works at a very slow SPI clock where these things are much less of
> > a problem, it is very sensitive to the SPI CS setup time. Right now, in
> > order to get the timing right, the driver will add extra data to the beginning
> > of the SPI transfer [2], which depends on the SPI clock to calculate an exact
> > delay. Any deviation of CS setup time introduces uncertainty to this timing.
> > Fortunately each board can specify and tune this time in their device-tree.
> 
> I really don't see the concern or conflict here?  The DT property there
> is the settling time before the samples are usable which is perfectly
> normal for ADCs.
> 
> > > High chances are not guarantees, and none of this sounds like things
> > > that should require specific coding in the client drivers.
> 
> > Ok, but if the hardware is designed to require low latency of SPI messages,
> > based on a certain SPI clock, bus load and maximum CS setup time? In those
> > cases, an API that introduces so much overhead, that the maximum achievable
> > SPI bus load is less than 10% and CS setup time alone exceeds the complete
> > transfer time needed, might be just unusable.
> 
> If the performance is that bad then why only improve the performance for
> some subset of client drivers that happen to use some different API
> rather than for any driver?  I'm not sure there's any devices out there
> that explicitly want poor performance.
> 
> > > > not involve any context switches. It could be argued that it might even be
> > > > beneficial to have an API that can be called from hard IRQ context, but
> > > > experiments in this case showed that the gain of doing the CAN message read
> > > > out directly in hard-IRQ and removing the IRQ thread is minimal. But better
> > > > use-cases could be conceived, so this possibility might need consideration
> > > > also.  
> 
> > > That's something that the async API (or the sync API in the contended
> > > case) can enable - if we have another transfer queued then we would if
> > > someone did the work be able to arrange to start pushing it immediately
> > > the prior transfer completes.
> 
> > Don't know if I follow you here. You mean that a hard-IRQ schedules the start
> > of an async transfer and later the interrupt thread function just picks up the
> > result? That would be a cool way of reading out some registers indeed.
> 
> I was thinking more about the case where there's a series of operations
> to perform so that when one completes we can immediately initiate the
> next, for example in a multi transfer message starting the second
> transfer as the first transfer finishes.

In the IRQ case we need the information of the 1st transfer (IRQ status
reg) to start the 2nd (read RX FIFO status reg or TX-complete FIFO
status reg).

> However the case you mention
> where we know the operation we will want to do when the device
> interrupts and have it similarly prepared for immediate start from
> interrupt context is another valid one which makes a lot of sense.

You mean start SPI transfer from hard IRQ? In which context will this
SPI transfer be done?

> > Unfortunately though, again, the context switch from hard-IRQ to interrupt
> > thread would be preempted by the SPI worker queue, introducing an unbearable
> > delay due to context-switch overhead alone... at least on my machine (quad
> > core ARM Cortex-A53).
> 
> To be clear we don't do this yet.  The whole point would be to avoid the
> need to involve any thread in the hot path, in a way that would benefit
> as many client and controller drivers as possible.

It seems I haven't got the whole picture, can you explain a bit more?

> > > If there are other bits of the message setup code which are getting in
> > > the way and don't have obvious paths for optimisation (the validation
> > > potentially?) then if your application is spamming a lot of the same
> > > operation (eg, with the status reading in the interrupt handler) then
> > > quite a while ago Martin Sparl was looking at providing an interface
> > > which would allow client drivers to pre-cook messages so that they could
> > > be submitted multiple times without going through the validation that we
> > > normally do (and perhaps get some driver preparation done as
> > > well).

Martin's version of the mcp251xfd driver make heavily use of
pre-generated messages, maybe Martin came up with the idea while working
on this.

> > > He
> > > was looking at it for other reasons but it seems like a productive
> > > approach for cutting down on the setup overhead, it would require more
> > > up front work in the client but cut down on the amount of work done per
> > > operation and seems like it should scale well over different systems.
> 
> > That sounds indeed interesting. I suspect that the message setup is also
> > relatively CPU intensive.
> > Do you have a pointer to Martin's work or discussion?
> 
> Not off hand sorry, and it was more at the design stage than
> implementation IIRC.  Hopefuly lore can help?  I'll try to dig out some
> references later but it was quite a while ago.  His application was even
> lower latency than what you're proposing, basically getting messages
> prepared so that all the management of the chip select and controller
> as well as the data transfer could be done through a sequence of DMA
> operations with no intervention from the CPU and we just needed to queue
> those DMA operations (on a Pi, which has suitable controller and DMA
> hardware) and was therefore both faster and lower CPU load in the hot
> path.  Extremely cool, but hard to abstract on the controller side.

For register write this is interesting. Basically scatter gather for SPI
based register access? For read this is harder. Just read any
information you might need in the hot path and hope that it's faster
than individual transfers?

> > The idea of being able to setup a message (or transfer) once and then only
> > modify the TX buffer and re-send it avoiding all the message validation and
> > checking sounds like a worthwhile optimization. That together with a way to
> > disable all the accounting code efficiently, sounds like it might be just what
> > is needed here.
> 
> High level (and certainly for your application I think) that would be
> making an API that calls __spi_validate() and then sets a flag saying
> validation was done.  Then when a message is actually sent if the flag
> was set then don't bother doing the validation, and have an unvalidate
> (or rather unprepare I guess) API to be called when done with the
> message.  So something like
> 
> 	spi_prepare_message(msg);
> 
> 	spi_sync(msg);
> 	spi_sync(msg);
> 
> 	spi_unprepare_message(msg);
> 
> (we already have a prepare on the controller side so probably best think
> of another name there.)  We'd probably need to require that callers
> check to see if things are DMA mapped and handle mapping/unmapping if
> they access the data in between uses.

This sounds like a good approach to me!

> > > I see the problem, what I don't see is why it requires a new externally
> > > visible API to solve it beyond the suggestion about potentially
> > > preparing messages for repeated use if that's even something that's
> > > really registering.
> 
> > I doubted it was possible, but you are convincing me that it maybe can be done.
> > Like I said above: re-use of messages and skipping all the accounting code
> > would probably be good enough.
> 
> OK, great.  You've definitely pointed out some things that could do with
> improvement in the existing code regardless of any API level stuff.
> I'll try to poke at the delay code myself.

regards,
Marc

-- 
Pengutronix e.K.                 | Marc Kleine-Budde           |
Embedded Linux                   | https://www.pengutronix.de  |
Vertretung West/Dortmund         | Phone: +49-231-2826-924     |
Amtsgericht Hildesheim, HRA 2686 | Fax:   +49-5121-206917-5555 |

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC] A new SPI API for fast, low-latency regmap peripheral access
  2022-05-16 16:28       ` Marc Kleine-Budde
@ 2022-05-16 17:46         ` Mark Brown
  2022-05-17 10:24           ` David Jander
  0 siblings, 1 reply; 25+ messages in thread
From: Mark Brown @ 2022-05-16 17:46 UTC (permalink / raw)
  To: Marc Kleine-Budde; +Cc: David Jander, linux-spi, Oleksij Rempel

[-- Attachment #1: Type: text/plain, Size: 9523 bytes --]

On Mon, May 16, 2022 at 06:28:51PM +0200, Marc Kleine-Budde wrote:
> On 13.05.2022 20:36:51, Mark Brown wrote:

> > Well, you've identifed a couple of candidates for optimisation here...
> > I would start off by trying to address the specific hot points that
> > you've identified and see where that takes us, like I said nothing I'm
> > hearing sounds like anything other than trying to optimise the hot path
> > and it doesn't seem like anything was really tried there yet.

> I thinks the statistics are an easy target.

Yes, that seems fairly clear.  I hadn't realised how much of an impact
that was having, this is certainly the first time anyone mentioned it on
the list.

> > How much of that time is due to the API being async vs just nobody
> > having tried to optimise the slow bits?

> You mean slow bits of the mcp251xfd driver?

No, of the SPI core.

> > Also bear in mind that
> > a huge proportion of systems have multiple CPUs available these days so
> > CPU time isn't wall clock time - one of the advantages is that we can
> > start sending the initial transfers while working out what else to send
> > which is useful if there's anything non-trivial going on there.  If the
> > driver is doing no work in between messages and the messages are all
> > short enough to do using PIO on the system concerned then yes,

> If the chip throws an IRQ it's more complicated. The driver has to read
> the IRQ status register before it knows which IRQ to handle. For the RX
> IRQ it has to read the fill level from the RX FIFO, then it can finally
> read the RX'ed CAN frames. The driver does read as many frames as
> possible with one transfer.

> So the hot path is the RX IRQ, TX-complete IRQ handling is similar but
> not as important (as the hardware has a buffer for outgoing TX frames),
> the other IRQ sources are various kinds of error handling.

Yes, indeed - like I was saying elsewhere in the message the async stuff
is mainly a win when generating a stream of data the device, as soon as
you have to wait for information from the device before you proceed
you're into the sync case.

> > it's not
> > going to get us terribly far unless we can arrange to push the entire
> > queue from DMA or something which would be complicated.

> One optimization is to not only read the the IRQ status register, but
> also the RX FIFO status register, in hope the IRQ is a RX IRQ. If there
> are pending TX frames, the TX FIFO status register can be read, too.

Yes, that'd fit in with the preprepared list of transactions idea.

> The driver uses regmap, which makes this a bit more complicated. I hope
> I can hide this behind regmap's caching feature.

We could probably do something that looks like the interface we use for
patching which allows the user to submit a batch of reads at once.

> The memory layout of this chip is not really great for this. Reading IRQ
> status + RX FIFO status (with one transfer) means a transfer length of
> 72 byes, in other words an overhead of 64 bytes. :/

At that point I'd guess you're probably better off doing several reads.

> What's the typical length of a SPI FIFO? On the imx6 it's 64 bytes. If
> the SPI host controller needs to refill the TX FIFO you probably have an
> additional IRQ per transfer, unless you are in polling mode.

It's kind of all over the place to be honest, 64 bytes is possibly on
the larger side.

> In our measurements of the spi-imx driver we found that IRQ based PIO
> mode is faster than DMA mode for these medium sized transfers, while
> polled PIO mode is faster for small transfers. I don't have the exact
> cutoff numbers. Even for this driver they depends on the used SoC.

> You see, it's quite a guessing game, depending on the SPI host driver
> (quality) and SoC, if 2 small or 1 lager transfer is better.

Yes, there's room for a lot of tuning of the threasholds for switching
to DMA - I've not seen much work in that area but it does feel like we
could get some useful wins.  I suspect that there is some room for
tuning depending on if we've got more work queued up as well, if we are
already running and can start the next transfer as DMA from hard IRQ
context that's probably going to be better at a lower threashold than if
the controller is idle and we can run the entire message with PIO in the
context of the requesting thread.

> > > 10us per SPI transfer with the fast API vs 30us per transfer on the current
> > > sync API, with the possibility of a single context switch taking several
> > > microseconds more (at least), make me think that this is not always the case,
> > > but maybe I am missing something?

> > It's not 100% guaranteed but it's fairly easy to build up, especially if
> > there's anything non-trivial going on between register writes or any
> > writes get large enough to be worth DMAing.

> It's mostly a register _read_ problem, not a _write_ problem.

Well, serialisation problem as much as anything - the issue with reads
is more needing one read to complete before we can work out what the
next one we need to do is than anything specific to reads.  That is
going to be much more likely to be an issue for reads though.

> The writes in the hot path of the interrupt handler are done in a single
> call to spi_sync_transfer() (consisting of several transfers). This
> might be converted into an async transfer.

That would certainly improve the interoperability with controllers -
there's a whole bunch where we just can't control the chip select so
anything that needs non-standard chip select handling in the middle of a
message just won't work.  Or possibly it's as well to just do a series
of spi_sync() operations.

> > This is the entire reason that spi_sync() runs inline when the bus is
> > uncontended rather than punting to the workqueue - but it just does that
> > without requiring that client drivers use a completely separate API to
> > trigger this behaviour.  If a client needs to wait for the message to
> > finish before doing the next one or is only doing a small number of
> > transfers before it needs to wait or do something else then spi_sync()
> > is absolutely a good interface for it, the more work is being done in a
> > single batch either for each message or in number of messages the more
> > likely it is that we will be able to get something out of spi_async(),
> > either now or in the future.

> As David measured the overhead of checking, housekeeping, statistics,
> etc... in spi.c for sync transfers is more than twice as big as the
> actual transfer itself.

Sure, the point here is that we don't need a new API but rather to
optimise the implementation of the current one - running spi_sync()
inline in the uncontended case is an example of doing that.

> > However the case you mention
> > where we know the operation we will want to do when the device
> > interrupts and have it similarly prepared for immediate start from
> > interrupt context is another valid one which makes a lot of sense.

> You mean start SPI transfer from hard IRQ? In which context will this
> SPI transfer be done?

Hardware ideally.

> > > > If there are other bits of the message setup code which are getting in
> > > > the way and don't have obvious paths for optimisation (the validation
> > > > potentially?) then if your application is spamming a lot of the same
> > > > operation (eg, with the status reading in the interrupt handler) then
> > > > quite a while ago Martin Sparl was looking at providing an interface
> > > > which would allow client drivers to pre-cook messages so that they could
> > > > be submitted multiple times without going through the validation that we
> > > > normally do (and perhaps get some driver preparation done as
> > > > well).

> Martin's version of the mcp251xfd driver make heavily use of
> pre-generated messages, maybe Martin came up with the idea while working
> on this.

Yes, IIRC it was a CAN bus optimising for throughput.

> > Not off hand sorry, and it was more at the design stage than
> > implementation IIRC.  Hopefuly lore can help?  I'll try to dig out some
> > references later but it was quite a while ago.  His application was even
> > lower latency than what you're proposing, basically getting messages
> > prepared so that all the management of the chip select and controller
> > as well as the data transfer could be done through a sequence of DMA
> > operations with no intervention from the CPU and we just needed to queue
> > those DMA operations (on a Pi, which has suitable controller and DMA
> > hardware) and was therefore both faster and lower CPU load in the hot
> > path.  Extremely cool, but hard to abstract on the controller side.

> For register write this is interesting. Basically scatter gather for SPI
> based register access? For read this is harder. Just read any
> information you might need in the hot path and hope that it's faster
> than individual transfers?

Yes, it's a bit more interesting for things like checking interrupt
status.

One thing that might be useful would be if we could start the initial
status read message from within the hard interrupt handler of the client
driver with the goal that by the time it's threaded interrupt handler
runs we might have the data available.  That could go wrong on a lightly
loaded system where we might end up running the threaded handler while
the transfer is still running, OTOH if it's lightly loaded that might
not matter.  Or perhaps just use a completion from the SPI operation and
not bother with the threaded handler at all.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC] A new SPI API for fast, low-latency regmap peripheral access
  2022-05-16 17:46         ` Mark Brown
@ 2022-05-17 10:24           ` David Jander
  2022-05-17 11:57             ` Mark Brown
  0 siblings, 1 reply; 25+ messages in thread
From: David Jander @ 2022-05-17 10:24 UTC (permalink / raw)
  To: Mark Brown; +Cc: Marc Kleine-Budde, linux-spi, Oleksij Rempel

On Mon, 16 May 2022 18:46:38 +0100
Mark Brown <broonie@kernel.org> wrote:

> On Mon, May 16, 2022 at 06:28:51PM +0200, Marc Kleine-Budde wrote:
> > On 13.05.2022 20:36:51, Mark Brown wrote:  
> 
> > > Well, you've identifed a couple of candidates for optimisation here...
> > > I would start off by trying to address the specific hot points that
> > > you've identified and see where that takes us, like I said nothing I'm
> > > hearing sounds like anything other than trying to optimise the hot path
> > > and it doesn't seem like anything was really tried there yet.  
> 
> > I thinks the statistics are an easy target.  
> 
> Yes, that seems fairly clear.  I hadn't realised how much of an impact
> that was having, this is certainly the first time anyone mentioned it on
> the list.

I did some more detailed measurements. Since the start of the IRQ on the
MCP2518FD leads to 4 small size sync transfers, a good comparative measurement
was to look at the time the interrupt line stayed active. This proved to be
fairly consistent, and spans the time from where the hard IRQ is activated,
the IRQ thread is started, IRQ status and FIFO status are read, the FIFO is
emptied of a single message and finally the IRQ flags are cleared.
I used this measurement to asses the impact of modifications done to the code
(mainly in spi.c for now). Time interrupt line stays low:

 1. Kernel 5.18-rc1 with only polling patches from spi-next: 135us

 2. #if 0 around all stats and accounting calls: 100us

 3. The _fast API of my original RFC: 55us

This shows that the accounting code is a bit less than half of the dispensable
overhead for my use-case. Indeed an easy target.

Still need to asses what the other half is exactly. For that I started
modifying the MCP2518FD driver and adding exported functions to spi.c to be
able to prepare and validate messages only once and then re-use them. This is
quite a bit of work, so I am not done yet.
Please note that this is something my OP _fast API proposal didn't entirely
save on, since I still needed to do most of what prepare_message and
map_message do sans the DMA mapping. There is also quite a lot locking going
on, so I wonder whether there is something to gain if one could just call
spi_bus_lock() at the start of several such small sync transfers and use
non-locking calls (skip the queue lock and io_mutex)? Not sure that would have
a meaningful impact, but to get an idea, I replaced the bus_lock_spinlock and
queue_lock in __spi_sync() and __spi_queued_transfer() with the bare code in
__spi_queued_transfer(), since it won't submit work to the queue in this case
anyway. The resulting interrupt-active time decreased by another 4us, which is
approximately 5% of the dispensable overhead. For the record, that's 2us per
spinlock lock/unlock pair.

> > > How much of that time is due to the API being async vs just nobody
> > > having tried to optimise the slow bits?  
> 
> > You mean slow bits of the mcp251xfd driver?  
> 
> No, of the SPI core.
> 
> > > Also bear in mind that
> > > a huge proportion of systems have multiple CPUs available these days so
> > > CPU time isn't wall clock time - one of the advantages is that we can
> > > start sending the initial transfers while working out what else to send
> > > which is useful if there's anything non-trivial going on there.  If the
> > > driver is doing no work in between messages and the messages are all
> > > short enough to do using PIO on the system concerned then yes,  
> 
> > If the chip throws an IRQ it's more complicated. The driver has to read
> > the IRQ status register before it knows which IRQ to handle. For the RX
> > IRQ it has to read the fill level from the RX FIFO, then it can finally
> > read the RX'ed CAN frames. The driver does read as many frames as
> > possible with one transfer.  
> 
> > So the hot path is the RX IRQ, TX-complete IRQ handling is similar but
> > not as important (as the hardware has a buffer for outgoing TX frames),
> > the other IRQ sources are various kinds of error handling.  
> 
> Yes, indeed - like I was saying elsewhere in the message the async stuff
> is mainly a win when generating a stream of data the device, as soon as
> you have to wait for information from the device before you proceed
> you're into the sync case.
> 
> > > it's not
> > > going to get us terribly far unless we can arrange to push the entire
> > > queue from DMA or something which would be complicated.  
> 
> > One optimization is to not only read the the IRQ status register, but
> > also the RX FIFO status register, in hope the IRQ is a RX IRQ. If there
> > are pending TX frames, the TX FIFO status register can be read, too.  
> 
> Yes, that'd fit in with the preprepared list of transactions idea.
> 
> > The driver uses regmap, which makes this a bit more complicated. I hope
> > I can hide this behind regmap's caching feature.  
> 
> We could probably do something that looks like the interface we use for
> patching which allows the user to submit a batch of reads at once.
> 
> > The memory layout of this chip is not really great for this. Reading IRQ
> > status + RX FIFO status (with one transfer) means a transfer length of
> > 72 byes, in other words an overhead of 64 bytes. :/  
> 
> At that point I'd guess you're probably better off doing several reads.

Looking at the scope, I'd agree for the i.MX8M (ARM Cortex-A53). For older
SoCs that use the same spi controller, like the i.MX6 (Cortex-A9) it might be
a wash.

> > What's the typical length of a SPI FIFO? On the imx6 it's 64 bytes. If
> > the SPI host controller needs to refill the TX FIFO you probably have an
> > additional IRQ per transfer, unless you are in polling mode.  
> 
> It's kind of all over the place to be honest, 64 bytes is possibly on
> the larger side.
> 
> > In our measurements of the spi-imx driver we found that IRQ based PIO
> > mode is faster than DMA mode for these medium sized transfers, while
> > polled PIO mode is faster for small transfers. I don't have the exact
> > cutoff numbers. Even for this driver they depends on the used SoC.  
> 
> > You see, it's quite a guessing game, depending on the SPI host driver
> > (quality) and SoC, if 2 small or 1 lager transfer is better.  
> 
> Yes, there's room for a lot of tuning of the threasholds for switching
> to DMA - I've not seen much work in that area but it does feel like we
> could get some useful wins.  I suspect that there is some room for
> tuning depending on if we've got more work queued up as well, if we are
> already running and can start the next transfer as DMA from hard IRQ
> context that's probably going to be better at a lower threashold than if
> the controller is idle and we can run the entire message with PIO in the
> context of the requesting thread.
> 
> > > > 10us per SPI transfer with the fast API vs 30us per transfer on the current
> > > > sync API, with the possibility of a single context switch taking several
> > > > microseconds more (at least), make me think that this is not always the case,
> > > > but maybe I am missing something?  
> 
> > > It's not 100% guaranteed but it's fairly easy to build up, especially if
> > > there's anything non-trivial going on between register writes or any
> > > writes get large enough to be worth DMAing.  
> 
> > It's mostly a register _read_ problem, not a _write_ problem.  
> 
> Well, serialisation problem as much as anything - the issue with reads
> is more needing one read to complete before we can work out what the
> next one we need to do is than anything specific to reads.  That is
> going to be much more likely to be an issue for reads though.
> 
> > The writes in the hot path of the interrupt handler are done in a single
> > call to spi_sync_transfer() (consisting of several transfers). This
> > might be converted into an async transfer.  
> 
> That would certainly improve the interoperability with controllers -
> there's a whole bunch where we just can't control the chip select so
> anything that needs non-standard chip select handling in the middle of a
> message just won't work.  Or possibly it's as well to just do a series
> of spi_sync() operations.
> 
> > > This is the entire reason that spi_sync() runs inline when the bus is
> > > uncontended rather than punting to the workqueue - but it just does that
> > > without requiring that client drivers use a completely separate API to
> > > trigger this behaviour.  If a client needs to wait for the message to
> > > finish before doing the next one or is only doing a small number of
> > > transfers before it needs to wait or do something else then spi_sync()
> > > is absolutely a good interface for it, the more work is being done in a
> > > single batch either for each message or in number of messages the more
> > > likely it is that we will be able to get something out of spi_async(),
> > > either now or in the future.  
> 
> > As David measured the overhead of checking, housekeeping, statistics,
> > etc... in spi.c for sync transfers is more than twice as big as the
> > actual transfer itself.  
> 
> Sure, the point here is that we don't need a new API but rather to
> optimise the implementation of the current one - running spi_sync()
> inline in the uncontended case is an example of doing that.

Ack. Currently on it, testing whether this has the potential to lead to
acceptable results. Promising so far. See above.

> > > However the case you mention
> > > where we know the operation we will want to do when the device
> > > interrupts and have it similarly prepared for immediate start from
> > > interrupt context is another valid one which makes a lot of sense.  
> 
> > You mean start SPI transfer from hard IRQ? In which context will this
> > SPI transfer be done?  
> 
> Hardware ideally.
>
> > > > > If there are other bits of the message setup code which are getting in
> > > > > the way and don't have obvious paths for optimisation (the validation
> > > > > potentially?) then if your application is spamming a lot of the same
> > > > > operation (eg, with the status reading in the interrupt handler) then
> > > > > quite a while ago Martin Sparl was looking at providing an interface
> > > > > which would allow client drivers to pre-cook messages so that they could
> > > > > be submitted multiple times without going through the validation that we
> > > > > normally do (and perhaps get some driver preparation done as
> > > > > well).  
> 
> > Martin's version of the mcp251xfd driver make heavily use of
> > pre-generated messages, maybe Martin came up with the idea while working
> > on this.  
> 
> Yes, IIRC it was a CAN bus optimising for throughput.

I will take a look at his version of the driver. Thanks for the hint.

> > > Not off hand sorry, and it was more at the design stage than
> > > implementation IIRC.  Hopefuly lore can help?  I'll try to dig out some
> > > references later but it was quite a while ago.  His application was even
> > > lower latency than what you're proposing, basically getting messages
> > > prepared so that all the management of the chip select and controller
> > > as well as the data transfer could be done through a sequence of DMA
> > > operations with no intervention from the CPU and we just needed to queue
> > > those DMA operations (on a Pi, which has suitable controller and DMA
> > > hardware) and was therefore both faster and lower CPU load in the hot
> > > path.  Extremely cool, but hard to abstract on the controller side.  
> 
> > How much of that time is due to the API being async Basically scatter gather for SPI
> > based register access? For read this is harder. Just read any
> > information you might need in the hot path and hope that it's faster
> > than individual transfers?  
> 
> Yes, it's a bit more interesting for things like checking interrupt
> status.
> 
> One thing that might be useful would be if we could start the initial
> status read message from within the hard interrupt handler of the client
> driver with the goal that by the time it's threaded interrupt handler
> runs we might have the data available.  That could go wrong on a lightly
> loaded system where we might end up running the threaded handler while
> the transfer is still running, OTOH if it's lightly loaded that might
> not matter.  Or perhaps just use a completion from the SPI operation and
> not bother with the threaded handler at all.

You mean ("ctx" == context switch):

 1. hard-IRQ, queue msg --ctx--> SPI worker, call msg->complete() which does
 thread IRQ work (but can only do additional sync xfers from this context).

vs.

 2. hard-IRQ, queue msg --ctx--> SPI worker, call completion --ctx--> IRQ
 thread wait for completion and does more xfers...

vs (and this was my idea).

 3. hard-IRQ, pump FIFO (if available) --ctx--> IRQ thread, poll FIFO, do more
 sync xfers...

Option 3 would require a separation of spi_sync_transfer into two halves. One
half just activates CS (non-sleep GPIO api!) and fills the FIFO. The second
half polls the FIFO for transfer completion. This path could only be chosen if
the SPI controller has a FIFO that can hold the whole message. In other words a
lot of spacial case handling for what it is worth probably... but still
interesting.

Option 2 is probably not that bad if the SPI worker can run on another core?

> [re-quoted from previous email:]
> [...]
> > > In the end, inside the CAN driver it is just one queue that talks to the SPI
> > > driver in exactly the same way. And like I explained above, this is a pretty
> > > common situation in CAN networks like J1939, where 90% of the time a single
> > > remote-ECU is talking to a single process on the local machine, and the only
> > > thing that matters is single-in-flight response time.  
> > 
> > ...if that's not really what an application is going to do I guess it
> > doesn't matter here.  The test looked like it was going for throughput
> > rather than latency.  
> 
> See above, not all use cases are about throughput, David's about latency.

I'd say reducing overhead in spi.c for small transfers will benefit both
throughput AND latency. To prove this, one just needs to look at the amount of
time that CS is held active. During that time no other transfer can use the
bus, and it is drastically reduced by these changes.

Best regards,

-- 
David Jander

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC] A new SPI API for fast, low-latency regmap peripheral access
  2022-05-17 10:24           ` David Jander
@ 2022-05-17 11:57             ` Mark Brown
  2022-05-17 13:09               ` David Jander
  0 siblings, 1 reply; 25+ messages in thread
From: Mark Brown @ 2022-05-17 11:57 UTC (permalink / raw)
  To: David Jander; +Cc: Marc Kleine-Budde, linux-spi, Oleksij Rempel

[-- Attachment #1: Type: text/plain, Size: 3109 bytes --]

On Tue, May 17, 2022 at 12:24:39PM +0200, David Jander wrote:

> (mainly in spi.c for now). Time interrupt line stays low:

>  1. Kernel 5.18-rc1 with only polling patches from spi-next: 135us

>  2. #if 0 around all stats and accounting calls: 100us

>  3. The _fast API of my original RFC: 55us

> This shows that the accounting code is a bit less than half of the dispensable
> overhead for my use-case. Indeed an easy target.

Good.

> on, so I wonder whether there is something to gain if one could just call
> spi_bus_lock() at the start of several such small sync transfers and use
> non-locking calls (skip the queue lock and io_mutex)? Not sure that would have
> a meaningful impact, but to get an idea, I replaced the bus_lock_spinlock and
> queue_lock in __spi_sync() and __spi_queued_transfer() with the bare code in
> __spi_queued_transfer(), since it won't submit work to the queue in this case
> anyway. The resulting interrupt-active time decreased by another 4us, which is
> approximately 5% of the dispensable overhead. For the record, that's 2us per
> spinlock lock/unlock pair.

I do worry about how this might perform under different loads where
there are things coming in from more than one thread.

> > One thing that might be useful would be if we could start the initial
> > status read message from within the hard interrupt handler of the client
> > driver with the goal that by the time it's threaded interrupt handler
> > runs we might have the data available.  That could go wrong on a lightly
> > loaded system where we might end up running the threaded handler while
> > the transfer is still running, OTOH if it's lightly loaded that might
> > not matter.  Or perhaps just use a completion from the SPI operation and
> > not bother with the threaded handler at all.

> You mean ("ctx" == context switch):

>  1. hard-IRQ, queue msg --ctx--> SPI worker, call msg->complete() which does
>  thread IRQ work (but can only do additional sync xfers from this context).

> vs.

>  2. hard-IRQ, queue msg --ctx--> SPI worker, call completion --ctx--> IRQ
>  thread wait for completion and does more xfers...

> vs (and this was my idea).

>  3. hard-IRQ, pump FIFO (if available) --ctx--> IRQ thread, poll FIFO, do more
>  sync xfers...

Roughly 1, but with a lot of overlap with option 3.  I'm unclear what
you mean by "queue message" here.

> Option 3 would require a separation of spi_sync_transfer into two halves. One
> half just activates CS (non-sleep GPIO api!) and fills the FIFO. The second
> half polls the FIFO for transfer completion. This path could only be chosen if
> the SPI controller has a FIFO that can hold the whole message. In other words a
> lot of spacial case handling for what it is worth probably... but still
> interesting.

Yes, that's the whole point.  This also flows nicely when you've got a
queue since you can restart the hardware from the interrupt context
without waiting to complete the transfer that just finished.

> Option 2 is probably not that bad if the SPI worker can run on another core?

Pretty much anything benefits with another core.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC] A new SPI API for fast, low-latency regmap peripheral access
  2022-05-17 11:57             ` Mark Brown
@ 2022-05-17 13:09               ` David Jander
  2022-05-17 13:43                 ` Mark Brown
  0 siblings, 1 reply; 25+ messages in thread
From: David Jander @ 2022-05-17 13:09 UTC (permalink / raw)
  To: Mark Brown; +Cc: Marc Kleine-Budde, linux-spi, Oleksij Rempel

On Tue, 17 May 2022 12:57:18 +0100
Mark Brown <broonie@kernel.org> wrote:

> On Tue, May 17, 2022 at 12:24:39PM +0200, David Jander wrote:
> 
> > (mainly in spi.c for now). Time interrupt line stays low:  
> 
> >  1. Kernel 5.18-rc1 with only polling patches from spi-next: 135us  
> 
> >  2. #if 0 around all stats and accounting calls: 100us  
> 
> >  3. The _fast API of my original RFC: 55us  
> 
> > This shows that the accounting code is a bit less than half of the dispensable
> > overhead for my use-case. Indeed an easy target.  
> 
> Good.
> 
> > on, so I wonder whether there is something to gain if one could just call
> > spi_bus_lock() at the start of several such small sync transfers and use
> > non-locking calls (skip the queue lock and io_mutex)? Not sure that would have
> > a meaningful impact, but to get an idea, I replaced the bus_lock_spinlock and
> > queue_lock in __spi_sync() and __spi_queued_transfer() with the bare code in
> > __spi_queued_transfer(), since it won't submit work to the queue in this case
> > anyway. The resulting interrupt-active time decreased by another 4us, which is
> > approximately 5% of the dispensable overhead. For the record, that's 2us per
> > spinlock lock/unlock pair.  
> 
> I do worry about how this might perform under different loads where
> there are things coming in from more than one thread.

I understand your concern. The biggest issue is probably the fact that client
drivers can claim exclusive access to the bus this way, and who knows if they
take care to not claim it for too long?
In the end, if multiple latency-sensitive client drivers share the same SPI
bus, besides this being a questionable HW design, this will be asking for
trouble. But I agree that usage of spi_bus_lock() by client drivers is
probably not a good idea.

> > > One thing that might be useful would be if we could start the initial
> > > status read message from within the hard interrupt handler of the client
> > > driver with the goal that by the time it's threaded interrupt handler
> > > runs we might have the data available.  That could go wrong on a lightly
> > > loaded system where we might end up running the threaded handler while
> > > the transfer is still running, OTOH if it's lightly loaded that might
> > > not matter.  Or perhaps just use a completion from the SPI operation and
> > > not bother with the threaded handler at all.  
> 
> > You mean ("ctx" == context switch):  
> 
> >  1. hard-IRQ, queue msg --ctx--> SPI worker, call msg->complete() which does
> >  thread IRQ work (but can only do additional sync xfers from this context).  
> 
> > vs.  
> 
> >  2. hard-IRQ, queue msg --ctx--> SPI worker, call completion --ctx--> IRQ
> >  thread wait for completion and does more xfers...  
> 
> > vs (and this was my idea).  
> 
> >  3. hard-IRQ, pump FIFO (if available) --ctx--> IRQ thread, poll FIFO, do more
> >  sync xfers...  
> 
> Roughly 1, but with a lot of overlap with option 3.  I'm unclear what
> you mean by "queue message" here.

In the above, I meant:

 "queue message": 
	list_add_tail(&msg->queue, &ctlr->queue);
	kthread_queue_work(ctlr->kworker, &ctlr->pump_messages);

 "pump FIFO":
	A special function in the spi driver that fills the TX FIFO so that a
	transfer starts and returns immediately.

 "poll FIFO": 
	Busy-loop wait until TX-FIFO is empty and RX-FIFO has the
	complete RX message.

> > Option 3 would require a separation of spi_sync_transfer into two halves. One
> > half just activates CS (non-sleep GPIO api!) and fills the FIFO. The second
> > half polls the FIFO for transfer completion. This path could only be chosen if
> > the SPI controller has a FIFO that can hold the whole message. In other words a
> > lot of spacial case handling for what it is worth probably... but still
> > interesting.  
> 
> Yes, that's the whole point.  This also flows nicely when you've got a
> queue since you can restart the hardware from the interrupt context
> without waiting to complete the transfer that just finished.

Ack. Only caveat I see is the requirement for CS to be settable in a
non-sleepable context. Not all GPIO pins have gpiod_set_value(). Some might
only have gpiod_set_value_cansleep(). Although the latter case is not a good
choice for a CS signal, so I doubt it can be found in the wild.
Example: I2C GPIO expanders. In such a (very unlikely) case, the spi subsystem
would need to fall back to the worker-queue, so probably not a big deal.

> > Option 2 is probably not that bad if the SPI worker can run on another core?  
> 
> Pretty much anything benefits with another core.

Ack. But there are still quite a few single-core SoCs out there, and for
those, Option 2 is the worst.

Best regards,

-- 
David Jander

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC] A new SPI API for fast, low-latency regmap peripheral access
  2022-05-17 13:09               ` David Jander
@ 2022-05-17 13:43                 ` Mark Brown
  2022-05-17 15:16                   ` David Jander
  0 siblings, 1 reply; 25+ messages in thread
From: Mark Brown @ 2022-05-17 13:43 UTC (permalink / raw)
  To: David Jander; +Cc: Marc Kleine-Budde, linux-spi, Oleksij Rempel

[-- Attachment #1: Type: text/plain, Size: 2995 bytes --]

On Tue, May 17, 2022 at 03:09:06PM +0200, David Jander wrote:
> Mark Brown <broonie@kernel.org> wrote:
> > On Tue, May 17, 2022 at 12:24:39PM +0200, David Jander wrote:

> > > on, so I wonder whether there is something to gain if one could just call
> > > spi_bus_lock() at the start of several such small sync transfers and use
> > > non-locking calls (skip the queue lock and io_mutex)? Not sure that would have

> > I do worry about how this might perform under different loads where
> > there are things coming in from more than one thread.

> I understand your concern. The biggest issue is probably the fact that client
> drivers can claim exclusive access to the bus this way, and who knows if they
> take care to not claim it for too long?

Yes, claiming the bus for a long time.

> In the end, if multiple latency-sensitive client drivers share the same SPI
> bus, besides this being a questionable HW design, this will be asking for
> trouble. But I agree that usage of spi_bus_lock() by client drivers is
> probably not a good idea.

I think the worst case would be mixing latency sensitive and throughput
sensitive devices, or possibly tasks on a device.  As you say there's an
element of system design here.

> > >  1. hard-IRQ, queue msg --ctx--> SPI worker, call msg->complete() which does
> > >  thread IRQ work (but can only do additional sync xfers from this context).  
> > 
> > > vs.  
> > 
> > >  2. hard-IRQ, queue msg --ctx--> SPI worker, call completion --ctx--> IRQ
> > >  thread wait for completion and does more xfers...  
> > 
> > > vs (and this was my idea).  
> > 
> > >  3. hard-IRQ, pump FIFO (if available) --ctx--> IRQ thread, poll FIFO, do more
> > >  sync xfers...  
> > 
> > Roughly 1, but with a lot of overlap with option 3.  I'm unclear what
> > you mean by "queue message" here.
> 
> In the above, I meant:

>  "queue message": 
> 	list_add_tail(&msg->queue, &ctlr->queue);
> 	kthread_queue_work(ctlr->kworker, &ctlr->pump_messages);

OK, no - I'm proposing actually putting the message onto the hardware
from interrupt context.

> > Yes, that's the whole point.  This also flows nicely when you've got a
> > queue since you can restart the hardware from the interrupt context
> > without waiting to complete the transfer that just finished.

> Ack. Only caveat I see is the requirement for CS to be settable in a
> non-sleepable context. Not all GPIO pins have gpiod_set_value(). Some might
> only have gpiod_set_value_cansleep(). Although the latter case is not a good
> choice for a CS signal, so I doubt it can be found in the wild.
> Example: I2C GPIO expanders. In such a (very unlikely) case, the spi subsystem
> would need to fall back to the worker-queue, so probably not a big deal.

Yes, there's a bunch of things where we have to fall back on a queue.
There's fully bitbanged controllers, excessively large messages or delay
requirements, clock reprogramming and other things.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC] A new SPI API for fast, low-latency regmap peripheral access
  2022-05-17 13:43                 ` Mark Brown
@ 2022-05-17 15:16                   ` David Jander
  2022-05-17 18:17                     ` Mark Brown
  0 siblings, 1 reply; 25+ messages in thread
From: David Jander @ 2022-05-17 15:16 UTC (permalink / raw)
  To: Mark Brown; +Cc: Marc Kleine-Budde, linux-spi, Oleksij Rempel

On Tue, 17 May 2022 14:43:59 +0100
Mark Brown <broonie@kernel.org> wrote:

> On Tue, May 17, 2022 at 03:09:06PM +0200, David Jander wrote:
> > Mark Brown <broonie@kernel.org> wrote:  
> > > On Tue, May 17, 2022 at 12:24:39PM +0200, David Jander wrote:  
> 
> > > > on, so I wonder whether there is something to gain if one could just call
> > > > spi_bus_lock() at the start of several such small sync transfers and use
> > > > non-locking calls (skip the queue lock and io_mutex)? Not sure that would have  
> 
> > > I do worry about how this might perform under different loads where
> > > there are things coming in from more than one thread.  
> 
> > I understand your concern. The biggest issue is probably the fact that client
> > drivers can claim exclusive access to the bus this way, and who knows if they
> > take care to not claim it for too long?  
> 
> Yes, claiming the bus for a long time.
> 
> > In the end, if multiple latency-sensitive client drivers share the same SPI
> > bus, besides this being a questionable HW design, this will be asking for
> > trouble. But I agree that usage of spi_bus_lock() by client drivers is
> > probably not a good idea.  
> 
> I think the worst case would be mixing latency sensitive and throughput
> sensitive devices, or possibly tasks on a device.  As you say there's an
> element of system design here.

Sure. I have combined NOR-flash chips on the same SPI bus with an MCP2515 CAN
controller in the past. But I knew that the NOR-flash chip would only ever be
accessed during a firmware update or from the bootloader. Never together with
CAN communication. If I did, I would have lost CAN messages guaranteed. So
you can have compromises. I (as HW designer) in this case would never expect
the kernel to try to make this work concurrently, and IMHO we (as
kernel developers) shouldn't try in such extreme cases either.

> > > >  1. hard-IRQ, queue msg --ctx--> SPI worker, call msg->complete() which does
> > > >  thread IRQ work (but can only do additional sync xfers from this context).    
> > >   
> > > > vs.    
> > >   
> > > >  2. hard-IRQ, queue msg --ctx--> SPI worker, call completion --ctx--> IRQ
> > > >  thread wait for completion and does more xfers...    
> > >   
> > > > vs (and this was my idea).    
> > >   
> > > >  3. hard-IRQ, pump FIFO (if available) --ctx--> IRQ thread, poll FIFO, do more
> > > >  sync xfers...    
> > > 
> > > Roughly 1, but with a lot of overlap with option 3.  I'm unclear what
> > > you mean by "queue message" here.  
> > 
> > In the above, I meant:  
> 
> >  "queue message": 
> > 	list_add_tail(&msg->queue, &ctlr->queue);
> > 	kthread_queue_work(ctlr->kworker, &ctlr->pump_messages);  
> 
> OK, no - I'm proposing actually putting the message onto the hardware
> from interrupt context.

Nice! I like that idea. Do you want to somehow extend spi_async() to do this
transparently? So we just need to introduce a second function
("spi_async_await()" ?) which would wait for completion and collect the RX
buffer?

> > > Yes, that's the whole point.  This also flows nicely when you've got a
> > > queue since you can restart the hardware from the interrupt context
> > > without waiting to complete the transfer that just finished.  
> 
> > Ack. Only caveat I see is the requirement for CS to be settable in a
> > non-sleepable context. Not all GPIO pins have gpiod_set_value(). Some might
> > only have gpiod_set_value_cansleep(). Although the latter case is not a good
> > choice for a CS signal, so I doubt it can be found in the wild.
> > Example: I2C GPIO expanders. In such a (very unlikely) case, the spi subsystem
> > would need to fall back to the worker-queue, so probably not a big deal.  
> 
> Yes, there's a bunch of things where we have to fall back on a queue.
> There's fully bitbanged controllers, excessively large messages or delay
> requirements, clock reprogramming and other things.

I like it. Thanks for the discussions so far.

To sum up all possible patches you would accept if I understood correctly:

 1. Make the stats/accounting code be NOP with a sysfs or similar toggle.

 2. Enable the re-use of messages with once in lifetime prepare/map/validate.

 3. Introduce spi_async_await() (or similar), to wait for completion of an
 async message.

 4. Enable SPI drivers to tell the core (spi.c) under which conditions it can
 fire a message asynchronously without the need for the worker queue and
 implement support for those cases. Conditions involve max. transfer size, CS
 non-sleep access, etc... but it should probably be up to the SPI driver to
 decide I guess (ctlr->can_send_uninterruptible(msg)).

Do I miss something?

Minor concern about 4. above: Hopefully the decision can be made very quickly
(i.e. without trying and failing). Maybe this decision result can be cached in
the struct spi_message, so it can be re-used (see point 2)? Maybe as part of
prepare or validate?

I feel confident that these 4 modifications will have enough of a performance
impact if fully exploited by the MCP2518FD driver, that overhead will no
longer be a concern.

Best regards,

-- 
David Jander


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC] A new SPI API for fast, low-latency regmap peripheral access
  2022-05-17 15:16                   ` David Jander
@ 2022-05-17 18:17                     ` Mark Brown
  2022-05-19  8:12                       ` David Jander
  0 siblings, 1 reply; 25+ messages in thread
From: Mark Brown @ 2022-05-17 18:17 UTC (permalink / raw)
  To: David Jander; +Cc: Marc Kleine-Budde, linux-spi, Oleksij Rempel

[-- Attachment #1: Type: text/plain, Size: 3011 bytes --]

On Tue, May 17, 2022 at 05:16:26PM +0200, David Jander wrote:
> Mark Brown <broonie@kernel.org> wrote:

> > I think the worst case would be mixing latency sensitive and throughput
> > sensitive devices, or possibly tasks on a device.  As you say there's an
> > element of system design here.

> Sure. I have combined NOR-flash chips on the same SPI bus with an MCP2515 CAN
> controller in the past. But I knew that the NOR-flash chip would only ever be
> accessed during a firmware update or from the bootloader. Never together with
> CAN communication. If I did, I would have lost CAN messages guaranteed. So
> you can have compromises. I (as HW designer) in this case would never expect
> the kernel to try to make this work concurrently, and IMHO we (as
> kernel developers) shouldn't try in such extreme cases either.

Part of the worry here is if we manage to do something that plays badly
with tools like real time scheduling that allows people to manage this
stuff, causing problems for something that otherwise works fine.

> > OK, no - I'm proposing actually putting the message onto the hardware
> > from interrupt context.

> Nice! I like that idea. Do you want to somehow extend spi_async() to do this
> transparently? So we just need to introduce a second function
> ("spi_async_await()" ?) which would wait for completion and collect the RX
> buffer?

We shouldn't need a new API to wait for the async operation to complete,
hopefully the existing one is fine.

> To sum up all possible patches you would accept if I understood correctly:

>  1. Make the stats/accounting code be NOP with a sysfs or similar toggle.

Or otherwise make it unobtrusive (eg, with similar techniques to those
used by the networking API).

>  2. Enable the re-use of messages with once in lifetime prepare/map/validate.
> 
>  3. Introduce spi_async_await() (or similar), to wait for completion of an
>  async message.
> 
>  4. Enable SPI drivers to tell the core (spi.c) under which conditions it can
>  fire a message asynchronously without the need for the worker queue and
>  implement support for those cases. Conditions involve max. transfer size, CS
>  non-sleep access, etc... but it should probably be up to the SPI driver to
>  decide I guess (ctlr->can_send_uninterruptible(msg)).
> 
> Do I miss something?

That's roughly it, plus a general push to optimise the hot path.

> Minor concern about 4. above: Hopefully the decision can be made very quickly
> (i.e. without trying and failing). Maybe this decision result can be cached in
> the struct spi_message, so it can be re-used (see point 2)? Maybe as part of
> prepare or validate?

Yes, we need to do this at validation time to play with the reuse I
think.

> I feel confident that these 4 modifications will have enough of a performance
> impact if fully exploited by the MCP2518FD driver, that overhead will no
> longer be a concern.

Just the small matter of implementing them then :/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC] A new SPI API for fast, low-latency regmap peripheral access
  2022-05-17 18:17                     ` Mark Brown
@ 2022-05-19  8:12                       ` David Jander
  2022-05-19  8:24                         ` Marc Kleine-Budde
                                           ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: David Jander @ 2022-05-19  8:12 UTC (permalink / raw)
  To: Mark Brown; +Cc: Marc Kleine-Budde, linux-spi, Oleksij Rempel, Andrew Lunn

On Tue, 17 May 2022 19:17:54 +0100
Mark Brown <broonie@kernel.org> wrote:

> On Tue, May 17, 2022 at 05:16:26PM +0200, David Jander wrote:
> > Mark Brown <broonie@kernel.org> wrote:  
> [...]
> > > OK, no - I'm proposing actually putting the message onto the hardware
> > > from interrupt context.  
> 
> > Nice! I like that idea. Do you want to somehow extend spi_async() to do this
> > transparently? So we just need to introduce a second function
> > ("spi_async_await()" ?) which would wait for completion and collect the RX
> > buffer?  
> 
> We shouldn't need a new API to wait for the async operation to complete,
> hopefully the existing one is fine.

Maybe there is something I am not seeing then. The client needs to make two
function calls. One to fill the FIFO and start the transfer, and a second one
to poll the FIFO and read out the RX data. With the existing API, I can only
think of these options:

 1. Call spi_async() twice for just one transfer. While that might work, IMHO
 it is confusing and would require to store the state of the transfer (i.e.
 "need_finishing=true") in the spi_message. But wouldn't that potentially
 break existing code that (for some reason) calls spi_async() twice on the
 same spi_message?

 2. Use a completion or callback. But I don't see how that will work without a
 context switch to some code that completes the completion or calls the
 callback, which is what we are trying to avoid having in the first place.

If you know something I am ignoring here, please advice.

> > To sum up all possible patches you would accept if I understood correctly:  
> 
> >  1. Make the stats/accounting code be NOP with a sysfs or similar toggle.  
> 
> Or otherwise make it unobtrusive (eg, with similar techniques to those
> used by the networking API).

I just tried this out by re-writing the statistics code using u64_stats_sync
and per-cpu statistics, which get totaled on sysfs read access as Andrew Lunn
suggested.
The results are truly amazing!

The overhead caused by statistics in my test dropped from 43us to just 1-2us.

This was tested on a 64-bit machine though, so I don't know how it will affect
32-bit systems. Nor do I have an easy means of testing this. Any ideas?

Also, I have converted all the struct spi_statistics members to u64_stats_t.
It was easier to test this way. Some of the original types were unsigned long,
which can have different sizes on 64bit or 32bit systems... is that
intentional?

> >  2. Enable the re-use of messages with once in lifetime prepare/map/validate.
> > 
> >  3. Introduce spi_async_await() (or similar), to wait for completion of an
> >  async message.
> > 
> >  4. Enable SPI drivers to tell the core (spi.c) under which conditions it can
> >  fire a message asynchronously without the need for the worker queue and
> >  implement support for those cases. Conditions involve max. transfer size, CS
> >  non-sleep access, etc... but it should probably be up to the SPI driver to
> >  decide I guess (ctlr->can_send_uninterruptible(msg)).
> > 
> > Do I miss something?  
> 
> That's roughly it, plus a general push to optimise the hot path.

Perfect.

> > Minor concern about 4. above: Hopefully the decision can be made very quickly
> > (i.e. without trying and failing). Maybe this decision result can be cached in
> > the struct spi_message, so it can be re-used (see point 2)? Maybe as part of
> > prepare or validate?  
> 
> Yes, we need to do this at validation time to play with the reuse I
> think.
> 
> > I feel confident that these 4 modifications will have enough of a performance
> > impact if fully exploited by the MCP2518FD driver, that overhead will no
> > longer be a concern.  
> 
> Just the small matter of implementing them then :/

Well, at least we can agree on a plan. I can't promise anything, but I will
try to continue working on this. Thanks again for the discussion.

Best regards,

-- 
David Jander

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC] A new SPI API for fast, low-latency regmap peripheral access
  2022-05-19  8:12                       ` David Jander
@ 2022-05-19  8:24                         ` Marc Kleine-Budde
  2022-05-19 12:14                         ` Andrew Lunn
  2022-05-20 15:22                         ` Mark Brown
  2 siblings, 0 replies; 25+ messages in thread
From: Marc Kleine-Budde @ 2022-05-19  8:24 UTC (permalink / raw)
  To: David Jander; +Cc: Mark Brown, linux-spi, Oleksij Rempel, Andrew Lunn

[-- Attachment #1: Type: text/plain, Size: 806 bytes --]

On 19.05.2022 10:12:38, David Jander wrote:
> I just tried this out by re-writing the statistics code using u64_stats_sync
> and per-cpu statistics, which get totaled on sysfs read access as Andrew Lunn
> suggested.
> The results are truly amazing!
> 
> The overhead caused by statistics in my test dropped from 43us to just 1-2us.

\o/

> This was tested on a 64-bit machine though, so I don't know how it will affect
> 32-bit systems. Nor do I have an easy means of testing this. Any ideas?

Test on an imx6 :)

Marc

-- 
Pengutronix e.K.                 | Marc Kleine-Budde           |
Embedded Linux                   | https://www.pengutronix.de  |
Vertretung West/Dortmund         | Phone: +49-231-2826-924     |
Amtsgericht Hildesheim, HRA 2686 | Fax:   +49-5121-206917-5555 |

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC] A new SPI API for fast, low-latency regmap peripheral access
  2022-05-19  8:12                       ` David Jander
  2022-05-19  8:24                         ` Marc Kleine-Budde
@ 2022-05-19 12:14                         ` Andrew Lunn
  2022-05-19 14:33                           ` David Jander
  2022-05-20 15:22                         ` Mark Brown
  2 siblings, 1 reply; 25+ messages in thread
From: Andrew Lunn @ 2022-05-19 12:14 UTC (permalink / raw)
  To: David Jander; +Cc: Mark Brown, Marc Kleine-Budde, linux-spi, Oleksij Rempel

> > Or otherwise make it unobtrusive (eg, with similar techniques to those
> > used by the networking API).
> 
> I just tried this out by re-writing the statistics code using u64_stats_sync
> and per-cpu statistics, which get totaled on sysfs read access as Andrew Lunn
> suggested.
> The results are truly amazing!
> 
> The overhead caused by statistics in my test dropped from 43us to just 1-2us.

When you are potentially dealing with 10 million packets a second, you
cannot spend long on each individual packet incrementing a counter...

> This was tested on a 64-bit machine though, so I don't know how it will affect
> 32-bit systems. Nor do I have an easy means of testing this. Any ideas?

It does make a difference. On a 64 system, you can increment a counter
in a single instruction so you either see the old value, or the new
value. With 32 bit systems, which needs multiple instructions to
increment the counter, so the code takes are you cannot see anything
odd when it needs to overflow from the lower 32bits into the upper 32
bits. So 32bit systems will be a little bit more expensive. However,
not by a lot.

> Also, I have converted all the struct spi_statistics members to u64_stats_t.
> It was easier to test this way. Some of the original types were unsigned long,
> which can have different sizes on 64bit or 32bit systems... is that
> intentional?

You can keep with uint32, if you want to, and keep with the sequence
counter style locking. For networking, 32bit counters can wrap around
pretty fast, so the main counters are 64 bit. But the concept works
O.K. for smaller types.

     Andrew

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC] A new SPI API for fast, low-latency regmap peripheral access
  2022-05-19 12:14                         ` Andrew Lunn
@ 2022-05-19 14:33                           ` David Jander
  2022-05-19 15:21                             ` Andrew Lunn
  0 siblings, 1 reply; 25+ messages in thread
From: David Jander @ 2022-05-19 14:33 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: Mark Brown, Marc Kleine-Budde, linux-spi, Oleksij Rempel

On Thu, 19 May 2022 14:14:16 +0200
Andrew Lunn <andrew@lunn.ch> wrote:

> > > Or otherwise make it unobtrusive (eg, with similar techniques to those
> > > used by the networking API).  
> > 
> > I just tried this out by re-writing the statistics code using u64_stats_sync
> > and per-cpu statistics, which get totaled on sysfs read access as Andrew Lunn
> > suggested.
> > The results are truly amazing!
> > 
> > The overhead caused by statistics in my test dropped from 43us to just 1-2us.  
> 
> When you are potentially dealing with 10 million packets a second, you
> cannot spend long on each individual packet incrementing a counter...

I guess that is a different category of hardware you are talking about
there... ;-)

> > This was tested on a 64-bit machine though, so I don't know how it will affect
> > 32-bit systems. Nor do I have an easy means of testing this. Any ideas?  
> 
> It does make a difference. On a 64 system, you can increment a counter
> in a single instruction so you either see the old value, or the new
> value. With 32 bit systems, which needs multiple instructions to
> increment the counter, so the code takes are you cannot see anything
> odd when it needs to overflow from the lower 32bits into the upper 32
> bits. So 32bit systems will be a little bit more expensive. However,
> not by a lot.

Ok, good to know.

> > Also, I have converted all the struct spi_statistics members to u64_stats_t.
> > It was easier to test this way. Some of the original types were unsigned long,
> > which can have different sizes on 64bit or 32bit systems... is that
> > intentional?  
> 
> You can keep with uint32, if you want to, and keep with the sequence
> counter style locking. For networking, 32bit counters can wrap around
> pretty fast, so the main counters are 64 bit. But the concept works
> O.K. for smaller types.

Since I am already moving the stats to per-cpu structs, wouldn't atomic_inc()
be sufficient and even faster for uint32 stats on 32bit systems?
I still would like to hear Mark's idea on the exact types. All bytes stats
were ULL and transfer counter values were UL. Possibly the reason behind this
was to make the transfer counters as efficient to increment as possible for
the given platform, as should be the case for "unsigned long".
If we move to per-cpu and seqcount, the kernel APIs (u64_stats_t) make us chose
explicitly u64 or u32 though. On most 64bit platforms, all those stats are
64bit anyway, while on most 32bit platforms the UL stats will likely be 32bit
while the ULL are 64bit. I am inclined to make them all u64, but might decide
otherwise if I can detect a performance difference on 32bit systems.
What types should I chose?

Best regards,

-- 
David Jander

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC] A new SPI API for fast, low-latency regmap peripheral access
  2022-05-19 14:33                           ` David Jander
@ 2022-05-19 15:21                             ` Andrew Lunn
  0 siblings, 0 replies; 25+ messages in thread
From: Andrew Lunn @ 2022-05-19 15:21 UTC (permalink / raw)
  To: David Jander; +Cc: Mark Brown, Marc Kleine-Budde, linux-spi, Oleksij Rempel

> Since I am already moving the stats to per-cpu structs, wouldn't atomic_inc()
> be sufficient and even faster for uint32 stats on 32bit systems?

atomic is expensive, because it needs to synchronise across all
CPUs. And that synchoniszation across all CPUs is pointless when you
are doing per-cpu counters.

> I still would like to hear Mark's idea on the exact types. All bytes stats
> were ULL and transfer counter values were UL. Possibly the reason behind this
> was to make the transfer counters as efficient to increment as possible for
> the given platform, as should be the case for "unsigned long".

That efficiency argument is however broken by the use of a spinlock,
which is much more expensive than the actually increment. Except for a
UP machine when it becomes a NOP. But UP machines are becoming less
and less common.

      Andrew

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC] A new SPI API for fast, low-latency regmap peripheral access
  2022-05-19  8:12                       ` David Jander
  2022-05-19  8:24                         ` Marc Kleine-Budde
  2022-05-19 12:14                         ` Andrew Lunn
@ 2022-05-20 15:22                         ` Mark Brown
  2022-05-23 14:48                           ` David Jander
  2 siblings, 1 reply; 25+ messages in thread
From: Mark Brown @ 2022-05-20 15:22 UTC (permalink / raw)
  To: David Jander; +Cc: Marc Kleine-Budde, linux-spi, Oleksij Rempel, Andrew Lunn

[-- Attachment #1: Type: text/plain, Size: 2516 bytes --]

On Thu, May 19, 2022 at 10:12:38AM +0200, David Jander wrote:
> Mark Brown <broonie@kernel.org> wrote:
> > On Tue, May 17, 2022 at 05:16:26PM +0200, David Jander wrote:
> > > Mark Brown <broonie@kernel.org> wrote:  

> > > Nice! I like that idea. Do you want to somehow extend spi_async() to do this
> > > transparently? So we just need to introduce a second function
> > > ("spi_async_await()" ?) which would wait for completion and collect the RX
> > > buffer?  

> > We shouldn't need a new API to wait for the async operation to complete,
> > hopefully the existing one is fine.

> Maybe there is something I am not seeing then. The client needs to make two
> function calls. One to fill the FIFO and start the transfer, and a second one
> to poll the FIFO and read out the RX data. With the existing API, I can only
> think of these options:

The client just submits a message using spi_async() and can use the
complete() callback like spi_sync() does to find out when the transfer
is done.  The core is responsible for actually interacting with the
device.

>  2. Use a completion or callback. But I don't see how that will work without a
>  context switch to some code that completes the completion or calls the
>  callback, which is what we are trying to avoid having in the first place.

If the core doesn't need to context switch then the core should just be
able to complete without context switching hopefully.  At most we'd need
a mechanism to say that the completion is OK to call from atomic
context.

> > Or otherwise make it unobtrusive (eg, with similar techniques to those
> > used by the networking API).

> I just tried this out by re-writing the statistics code using u64_stats_sync
> and per-cpu statistics, which get totaled on sysfs read access as Andrew Lunn
> suggested.
> The results are truly amazing!

> The overhead caused by statistics in my test dropped from 43us to just 1-2us.

Ah, cool.  I look forward to the patches.

> This was tested on a 64-bit machine though, so I don't know how it will affect
> 32-bit systems. Nor do I have an easy means of testing this. Any ideas?

Hopefully someone with a 32 bit system who's concerned about performance
can test.

> Also, I have converted all the struct spi_statistics members to u64_stats_t.
> It was easier to test this way. Some of the original types were unsigned long,
> which can have different sizes on 64bit or 32bit systems... is that
> intentional?

I don't think so.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC] A new SPI API for fast, low-latency regmap peripheral access
  2022-05-20 15:22                         ` Mark Brown
@ 2022-05-23 14:48                           ` David Jander
  2022-05-23 14:59                             ` Marc Kleine-Budde
  0 siblings, 1 reply; 25+ messages in thread
From: David Jander @ 2022-05-23 14:48 UTC (permalink / raw)
  To: Mark Brown; +Cc: Marc Kleine-Budde, linux-spi, Oleksij Rempel, Andrew Lunn

On Fri, 20 May 2022 16:22:08 +0100
Mark Brown <broonie@kernel.org> wrote:

> On Thu, May 19, 2022 at 10:12:38AM +0200, David Jander wrote:
> > Mark Brown <broonie@kernel.org> wrote:  
> > > On Tue, May 17, 2022 at 05:16:26PM +0200, David Jander wrote:  
> > > > Mark Brown <broonie@kernel.org> wrote:    
> 
> > > > Nice! I like that idea. Do you want to somehow extend spi_async() to do this
> > > > transparently? So we just need to introduce a second function
> > > > ("spi_async_await()" ?) which would wait for completion and collect the RX
> > > > buffer?    
> 
> > > We shouldn't need a new API to wait for the async operation to complete,
> > > hopefully the existing one is fine.  
> 
> > Maybe there is something I am not seeing then. The client needs to make two
> > function calls. One to fill the FIFO and start the transfer, and a second one
> > to poll the FIFO and read out the RX data. With the existing API, I can only
> > think of these options:  
> 
> The client just submits a message using spi_async() and can use the
> complete() callback like spi_sync() does to find out when the transfer
> is done.  The core is responsible for actually interacting with the
> device.

Having someone call the complete() callback would imply a context switch which
we are trying to avoid. To summarize the context switches involved in my
example:

 Your proposal:

 HardIRQ: calls spi_async() --ctx--> SPI worker: calls complete() --ctx--> IRQ
 thread: checks completion.

 My proposal:

 HardIRQ: calls spi_async() --ctx--> IRQ thread: calls spi_async_await()

The SPI worker (or whoever else) preempting the IRQ thread forcibly would kill
latency, and the result would be even worse as if HardIRQ did nothing and
instead the IRQ thread just called spi_sync() as is now the case.

> >  2. Use a completion or callback. But I don't see how that will work without a
> >  context switch to some code that completes the completion or calls the
> >  callback, which is what we are trying to avoid having in the first place.  
> 
> If the core doesn't need to context switch then the core should just be
> able to complete without context switching hopefully.  At most we'd need
> a mechanism to say that the completion is OK to call from atomic
> context.

Btw, I just discovered that there was indeed another often unnecessary context
switch happening in spi_sync(). When spi_finalize_current_message() is called,
it will always queue work, even if we just managed to do everything in the
calling context:

https://elixir.bootlin.com/linux/v5.18-rc7/source/drivers/spi/spi.c#L1909

This is needed to have the SPI worker thread unprepare transfer hardware and
free the dummy buffers if required. My question is why this needs to be done
from the thread. Putting the relevant code after the call to
ctlr->transfer_one_message() in __spi_pump_messages(), saves this extra bounce
to the worker thread if no more messages are queued, saving ca 10-12us extra
time between consecutive spi_sync messages.

> > > Or otherwise make it unobtrusive (eg, with similar techniques to those
> > > used by the networking API).  
> 
> > I just tried this out by re-writing the statistics code using u64_stats_sync
> > and per-cpu statistics, which get totaled on sysfs read access as Andrew Lunn
> > suggested.
> > The results are truly amazing!  
> 
> > The overhead caused by statistics in my test dropped from 43us to just 1-2us.  
> 
> Ah, cool.  I look forward to the patches.

See here:

https://marc.info/?l=linux-spi&m=165331560907187&w=2

> > This was tested on a 64-bit machine though, so I don't know how it will affect
> > 32-bit systems. Nor do I have an easy means of testing this. Any ideas?  
> 
> Hopefully someone with a 32 bit system who's concerned about performance
> can test.

I will try to do some testing soon. I need to tear down my current debug setup
and swap the i.MX8M board for a i.MX6 board to do this. Will take a while.

> > Also, I have converted all the struct spi_statistics members to u64_stats_t.
> > It was easier to test this way. Some of the original types were unsigned long,
> > which can have different sizes on 64bit or 32bit systems... is that
> > intentional?  
> 
> I don't think so.

Ok, great. I just changed everything to u64_stats_t to keep it uniform across
platforms.

Best regards,

-- 
David Jander

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC] A new SPI API for fast, low-latency regmap peripheral access
  2022-05-23 14:48                           ` David Jander
@ 2022-05-23 14:59                             ` Marc Kleine-Budde
  2022-05-24 11:30                               ` David Jander
  0 siblings, 1 reply; 25+ messages in thread
From: Marc Kleine-Budde @ 2022-05-23 14:59 UTC (permalink / raw)
  To: David Jander; +Cc: Mark Brown, linux-spi, Oleksij Rempel, Andrew Lunn

[-- Attachment #1: Type: text/plain, Size: 1387 bytes --]

On 23.05.2022 16:48:32, David Jander wrote:
> Btw, I just discovered that there was indeed another often unnecessary context
> switch happening in spi_sync(). When spi_finalize_current_message() is called,
> it will always queue work, even if we just managed to do everything in the
> calling context:
> 
> https://elixir.bootlin.com/linux/v5.18-rc7/source/drivers/spi/spi.c#L1909
> 
> This is needed to have the SPI worker thread unprepare transfer
> hardware and free the dummy buffers if required. My question is why
> this needs to be done from the thread. Putting the relevant code after
> the call to ctlr->transfer_one_message() in __spi_pump_messages(),
> saves this extra bounce to the worker thread if no more messages are
> queued, saving ca 10-12us extra time between consecutive spi_sync
> messages.

Martin Sperl tried to do a delayed teardown, see:

| 412e60373245 spi: core: avoid waking pump thread from spi_sync instead run teardown delayed

But that turned out be not working properly:

| https://lore.kernel.org/all/f86eaebb-0359-13be-f4a2-4f2b8832252e@nvidia.com/

Marc

-- 
Pengutronix e.K.                 | Marc Kleine-Budde           |
Embedded Linux                   | https://www.pengutronix.de  |
Vertretung West/Dortmund         | Phone: +49-231-2826-924     |
Amtsgericht Hildesheim, HRA 2686 | Fax:   +49-5121-206917-5555 |

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC] A new SPI API for fast, low-latency regmap peripheral access
  2022-05-23 14:59                             ` Marc Kleine-Budde
@ 2022-05-24 11:30                               ` David Jander
  2022-05-24 19:46                                 ` Mark Brown
  0 siblings, 1 reply; 25+ messages in thread
From: David Jander @ 2022-05-24 11:30 UTC (permalink / raw)
  To: Marc Kleine-Budde
  Cc: Mark Brown, linux-spi, Oleksij Rempel, Andrew Lunn, Martin Sperl

On Mon, 23 May 2022 16:59:35 +0200
Marc Kleine-Budde <mkl@pengutronix.de> wrote:

> On 23.05.2022 16:48:32, David Jander wrote:
> > Btw, I just discovered that there was indeed another often unnecessary context
> > switch happening in spi_sync(). When spi_finalize_current_message() is called,
> > it will always queue work, even if we just managed to do everything in the
> > calling context:
> > 
> > https://elixir.bootlin.com/linux/v5.18-rc7/source/drivers/spi/spi.c#L1909
> > 
> > This is needed to have the SPI worker thread unprepare transfer
> > hardware and free the dummy buffers if required. My question is why
> > this needs to be done from the thread. Putting the relevant code after
> > the call to ctlr->transfer_one_message() in __spi_pump_messages(),
> > saves this extra bounce to the worker thread if no more messages are
> > queued, saving ca 10-12us extra time between consecutive spi_sync
> > messages.  
> 
> Martin Sperl tried to do a delayed teardown, see:
> 
> | 412e60373245 spi: core: avoid waking pump thread from spi_sync instead run teardown delayed
> 
> But that turned out be not working properly:
> 
> | https://lore.kernel.org/all/f86eaebb-0359-13be-f4a2-4f2b8832252e@nvidia.com/

Ah, thanks for sharing. Added Martin to CC here.

I have been struggling with this too. There are definitely dragons somewhere.
I have tried to do tear-down in the same context if possible, similar to this:

@@ -1718,6 +1718,21 @@ static void __spi_pump_messages(struct spi_controller *ctlr, bool in_kthread)
                goto out;
        }
 
+       spin_lock_irqsave(&ctlr->queue_lock, flags);
+       /* Check if we can shutdown the controller now */
+       if ((list_empty(&ctlr->queue) || !ctlr->running) && (!in_kthread)) {
+               if (!ctlr->dummy_rx && !ctlr->dummy_tx &&
+                   !ctlr->unprepare_transfer_hardware) {
+                       spi_idle_runtime_pm(ctlr);
+                       ctlr->busy = false;
+                       trace_spi_controller_idle(ctlr);
+               } else {
+                       /* Ensure non-atomic teardown is done in the thread */
+                       kthread_queue_work(ctlr->kworker,
+                                          &ctlr->pump_messages);
+               }
+       }
+       spin_unlock_irqrestore(&ctlr->queue_lock, flags);
 out:
        mutex_unlock(&ctlr->io_mutex);
 
Which is almost a copy/paste from the same piece of code above. The idea was
that in the case the whole transfer was done in the same context anyway, one
could just check after calling ctlr->transfer_one_message() if the controller
can be made idle immediately. This turned out to be racy for some reason, and
I have not yet discovered exactly why. In the occasions the code didn't hit
the race, it seemed to have a notable performance impact though, so removing
this context switch may be worth the effort.

Unfortunately, besides being racy, this change does introduce a new call to
the queue_lock spinlock, which (as my previous patch shows) negates part of
the performance gain.

This made me think about ways to reduce these spinlocks even further, so I
mapped out the different code-paths that access the queue-spinlock, the
bus-spinlock, the bus-mutex and the io-mutex. It appears the code is optimized
for concurrency, assuming the overhead of locking is negligible. Unfortunately
we now know this isn't the case when dealing with small sync transfers at
medium to high clock speeds. Currently there is a ctlr->queue and
ctlr->cur_msg and some other state variables and flags that are accessed from
three different places: spi_sync, spi_async and the worker thread. The locking
paths look approximately like this (for one single message):

 1. spi_sync()
    --> bus_lock_mutex
        --> __spi_sync()
            --> bus_lock_spinlock
                --> queue_lock
                    --> list_add_tail()
            --> __spi_pump_messages() (also entered here from WQ)
                --> queue_lock
                    --> list_first_entry()
                --> io_mutex
                    --> transfer_one_message()
                        --> spi_finalize_current_message
                            --> queue_lock
                                --> get cur_msg
                            --> queue_lock
                                --> kthread_queue_work() -> reschedule
                            --> complete() **
    (from thread) __spi_pump_messages()
                    --> queue_lock
                        --> teardown -> done.
    (from main context) wait_for_completion() **
                        --> internal spinlock.

 2. worker thread:
    --> __spi_pump_messages()
        ... same as above from first __spi_pump_messages()

 3. spi_async()
    --> bus_lock_spinlock
        --> __spi_async()
            --> queue_lock
                --> list_add_tail()
                --> kthread_queue_work()

From what I understand of this, bus_lock_mutex is used to serialize spi_sync
calls for this bus, so there cannot be any concurrency from different threads
doing spi_sync() calls to the same bus. This means, that if spi_sync was the
only path in existence, bus_lock_mutex would suffice, and all the other
spinlocks and mutexes could go. Big win. But the async path is what
complicates everything. So I have been thinking... what if we could make the
sync and the async case two separate paths, with two separate message queues?
In fact the sync path doesn't even need a queue really, since it will just
handle one message beginning to end, and not return until the message is done.
It doesn't need the worker thread either AFAICS. Or am I missing something?
In the end, both paths would converge at the io_mutex. I am tempted to try
this route. Any thoughts?

Best regards,

-- 
David Jander

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC] A new SPI API for fast, low-latency regmap peripheral access
  2022-05-24 11:30                               ` David Jander
@ 2022-05-24 19:46                                 ` Mark Brown
  2022-05-25 14:39                                   ` David Jander
  0 siblings, 1 reply; 25+ messages in thread
From: Mark Brown @ 2022-05-24 19:46 UTC (permalink / raw)
  To: David Jander
  Cc: Marc Kleine-Budde, linux-spi, Oleksij Rempel, Andrew Lunn, Martin Sperl

[-- Attachment #1: Type: text/plain, Size: 4267 bytes --]

On Tue, May 24, 2022 at 01:30:48PM +0200, David Jander wrote:

> > But that turned out be not working properly:
> >
> > | https://lore.kernel.org/all/f86eaebb-0359-13be-f4a2-4f2b8832252e@nvidia.com/

> Ah, thanks for sharing. Added Martin to CC here.

> I have been struggling with this too. There are definitely dragons somewhere.
> I have tried to do tear-down in the same context if possible, similar to this:

There's a potential issue there with ending up spending noticable extra
time turning the controller on and off, how costly that is might be
variable.

> I have not yet discovered exactly why. In the occasions the code didn't hit
> the race, it seemed to have a notable performance impact though, so removing
> this context switch may be worth the effort.

Or come up with a mechanism for ensuring we only switch to do the
cleanup when we're not otherwise busy.

> From what I understand of this, bus_lock_mutex is used to serialize spi_sync
> calls for this bus, so there cannot be any concurrency from different threads
> doing spi_sync() calls to the same bus. This means, that if spi_sync was the
> only path in existence, bus_lock_mutex would suffice, and all the other

The bus lock is there because some devices have an unfortunate need to
do multiple SPI transfers with no other devices able to generate any
traffic on the bus in between.  It seems that even more sadly some of
the users are using it to protect against multiple calls into
themselves so we can't just do the simple thing and turn the bus locks
into noops if there's only one client on the bus.  However it *is* quite
rarely used so I'm thinking that what we could do is default to not
having it and then arrange to create it on first use, or just update
the clients to do something during initialisation to cause it to be
created.  That way only controllers with an affected client would pay
the cost.

I don't *think* it's otherwise needed but would need to go through and
verify that.

> spinlocks and mutexes could go. Big win. But the async path is what
> complicates everything. So I have been thinking... what if we could make the
> sync and the async case two separate paths, with two separate message queues?
> In fact the sync path doesn't even need a queue really, since it will just
> handle one message beginning to end, and not return until the message is done.
> It doesn't need the worker thread either AFAICS. Or am I missing something?
> In the end, both paths would converge at the io_mutex. I am tempted to try
> this route. Any thoughts?

The sync path like you say doesn't exactly need a queue itself, it's
partly looking at the queue so it can fall back to pushing work through
the thread in the case that the controller is busy (hopefully opening up
opportunities for us to minimise the latency between completing whatever
is going on already and starting the next message) and partly about
pushing the work idling the hardware into the thread so that it's
deferred a bit and we're less likely to end up spending time bouncing
the controller on and off if there's a sequence of synchronous
operations going on.  That second bit doesn't need us to actually look
at the queue though, we just need to kick the thread so it gets run at
some point and sees that the queue is empty.

Again I need to think this through properly but we can probably arrange
things so that 

>         --> __spi_sync()
>             --> bus_lock_spinlock
>                 --> queue_lock
>                     --> list_add_tail()
>             --> __spi_pump_messages() (also entered here from WQ)
>                 --> queue_lock
>                     --> list_first_entry()

the work we do under the first queue_lock here gets shuffled into
__spin_pump_messages() (possibly replace in_kthread with passing in a
message?  Would need comments.).  That'd mean we'd at least only be
taking the queue lock once.

The other potential issue with eliminating the queue entirely would be
if there's clients which are mixing async and sync operations but
expecting them to complete in order (eg, start a bunch of async stuff
then do a sync operation at the end rather than have a custom
wait/completion).

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC] A new SPI API for fast, low-latency regmap peripheral access
  2022-05-24 19:46                                 ` Mark Brown
@ 2022-05-25 14:39                                   ` David Jander
  0 siblings, 0 replies; 25+ messages in thread
From: David Jander @ 2022-05-25 14:39 UTC (permalink / raw)
  To: Mark Brown
  Cc: Marc Kleine-Budde, linux-spi, Oleksij Rempel, Andrew Lunn, Martin Sperl

On Tue, 24 May 2022 20:46:16 +0100
Mark Brown <broonie@kernel.org> wrote:

> On Tue, May 24, 2022 at 01:30:48PM +0200, David Jander wrote:
> 
> > > But that turned out be not working properly:
> > >
> > > | https://lore.kernel.org/all/f86eaebb-0359-13be-f4a2-4f2b8832252e@nvidia.com/  
> 
> > Ah, thanks for sharing. Added Martin to CC here.  
> 
> > I have been struggling with this too. There are definitely dragons somewhere.
> > I have tried to do tear-down in the same context if possible, similar to this:  
> 
> There's a potential issue there with ending up spending noticable extra
> time turning the controller on and off, how costly that is might be
> variable.
> 
> > I have not yet discovered exactly why. In the occasions the code didn't hit
> > the race, it seemed to have a notable performance impact though, so removing
> > this context switch may be worth the effort.  
> 
> Or come up with a mechanism for ensuring we only switch to do the
> cleanup when we're not otherwise busy.

I think the PM part might benefit from some optimizations too...

> > From what I understand of this, bus_lock_mutex is used to serialize spi_sync
> > calls for this bus, so there cannot be any concurrency from different threads
> > doing spi_sync() calls to the same bus. This means, that if spi_sync was the
> > only path in existence, bus_lock_mutex would suffice, and all the other  
> 
> The bus lock is there because some devices have an unfortunate need to
> do multiple SPI transfers with no other devices able to generate any
> traffic on the bus in between.  It seems that even more sadly some of
> the users are using it to protect against multiple calls into
> themselves so we can't just do the simple thing and turn the bus locks
> into noops if there's only one client on the bus.  However it *is* quite
> rarely used so I'm thinking that what we could do is default to not
> having it and then arrange to create it on first use, or just update
> the clients to do something during initialisation to cause it to be
> created.  That way only controllers with an affected client would pay
> the cost.
> 
> I don't *think* it's otherwise needed but would need to go through and
> verify that.
> 
> > spinlocks and mutexes could go. Big win. But the async path is what
> > complicates everything. So I have been thinking... what if we could make the
> > sync and the async case two separate paths, with two separate message queues?
> > In fact the sync path doesn't even need a queue really, since it will just
> > handle one message beginning to end, and not return until the message is done.
> > It doesn't need the worker thread either AFAICS. Or am I missing something?
> > In the end, both paths would converge at the io_mutex. I am tempted to try
> > this route. Any thoughts?  
> 
> The sync path like you say doesn't exactly need a queue itself, it's
> partly looking at the queue so it can fall back to pushing work through
> the thread in the case that the controller is busy (hopefully opening up
> opportunities for us to minimise the latency between completing whatever
> is going on already and starting the next message) and partly about
> pushing the work idling the hardware into the thread so that it's
> deferred a bit and we're less likely to end up spending time bouncing
> the controller on and off if there's a sequence of synchronous
> operations going on.  That second bit doesn't need us to actually look
> at the queue though, we just need to kick the thread so it gets run at
> some point and sees that the queue is empty.
> 
> Again I need to think this through properly but we can probably arrange
> things so that 
> 
> >         --> __spi_sync()
> >             --> bus_lock_spinlock
> >                 --> queue_lock
> >                     --> list_add_tail()
> >             --> __spi_pump_messages() (also entered here from WQ)
> >                 --> queue_lock
> >                     --> list_first_entry()  
> 
> the work we do under the first queue_lock here gets shuffled into
> __spin_pump_messages() (possibly replace in_kthread with passing in a
> message?  Would need comments.).  That'd mean we'd at least only be
> taking the queue lock once.
> 
> The other potential issue with eliminating the queue entirely would be
> if there's clients which are mixing async and sync operations but
> expecting them to complete in order (eg, start a bunch of async stuff
> then do a sync operation at the end rather than have a custom
> wait/completion).

I thought it would be easier to look at some code at this point, so I just
sent out an RFC patch series for the discussion. As for the ordering problem
you mention, I think I have solved for that. See here:

https://marc.info/?l=linux-spi&m=165348892007041&w=2

I can understand if you ultimately reject this series though, since I could
not avoid making a small change to the client driver API. I just can't figure
out how to do it without that change. The problem is that
spi_finalize_current_message() relies on ctlr->cur_msg, and we cannot touch
that if we skip the queue. Sorry for that.

Best regards,

-- 
David Jander

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2022-05-25 14:40 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-12 14:34 [RFC] A new SPI API for fast, low-latency regmap peripheral access David Jander
2022-05-12 20:37 ` Mark Brown
2022-05-12 22:12   ` Mark Brown
2022-05-13 12:46   ` David Jander
2022-05-13 19:36     ` Mark Brown
2022-05-16 16:28       ` Marc Kleine-Budde
2022-05-16 17:46         ` Mark Brown
2022-05-17 10:24           ` David Jander
2022-05-17 11:57             ` Mark Brown
2022-05-17 13:09               ` David Jander
2022-05-17 13:43                 ` Mark Brown
2022-05-17 15:16                   ` David Jander
2022-05-17 18:17                     ` Mark Brown
2022-05-19  8:12                       ` David Jander
2022-05-19  8:24                         ` Marc Kleine-Budde
2022-05-19 12:14                         ` Andrew Lunn
2022-05-19 14:33                           ` David Jander
2022-05-19 15:21                             ` Andrew Lunn
2022-05-20 15:22                         ` Mark Brown
2022-05-23 14:48                           ` David Jander
2022-05-23 14:59                             ` Marc Kleine-Budde
2022-05-24 11:30                               ` David Jander
2022-05-24 19:46                                 ` Mark Brown
2022-05-25 14:39                                   ` David Jander
2022-05-13 12:10 ` Andrew Lunn

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.