Re: [RFC] A new SPI API for fast, low-latency regmap peripheral access

From: David Jander <david@protonic.nl>
To: Mark Brown <broonie@kernel.org>
Cc: Marc Kleine-Budde <mkl@pengutronix.de>,
	linux-spi@vger.kernel.org, Oleksij Rempel <ore@pengutronix.de>,
	Andrew Lunn <andrew@lunn.ch>
Subject: Re: [RFC] A new SPI API for fast, low-latency regmap peripheral access
Date: Mon, 23 May 2022 16:48:32 +0200	[thread overview]
Message-ID: <20220523164832.4d9a0bb8@erd992> (raw)
In-Reply-To: <YoeyIG+RGpRBm9Cc@sirena.org.uk>

On Fri, 20 May 2022 16:22:08 +0100
Mark Brown <broonie@kernel.org> wrote:

> On Thu, May 19, 2022 at 10:12:38AM +0200, David Jander wrote:
> > Mark Brown <broonie@kernel.org> wrote:  
> > > On Tue, May 17, 2022 at 05:16:26PM +0200, David Jander wrote:  
> > > > Mark Brown <broonie@kernel.org> wrote:    
> 
> > > > Nice! I like that idea. Do you want to somehow extend spi_async() to do this
> > > > transparently? So we just need to introduce a second function
> > > > ("spi_async_await()" ?) which would wait for completion and collect the RX
> > > > buffer?    
> 
> > > We shouldn't need a new API to wait for the async operation to complete,
> > > hopefully the existing one is fine.  
> 
> > Maybe there is something I am not seeing then. The client needs to make two
> > function calls. One to fill the FIFO and start the transfer, and a second one
> > to poll the FIFO and read out the RX data. With the existing API, I can only
> > think of these options:  
> 
> The client just submits a message using spi_async() and can use the
> complete() callback like spi_sync() does to find out when the transfer
> is done.  The core is responsible for actually interacting with the
> device.

Having someone call the complete() callback would imply a context switch which
we are trying to avoid. To summarize the context switches involved in my
example:

 Your proposal:

 HardIRQ: calls spi_async() --ctx--> SPI worker: calls complete() --ctx--> IRQ
 thread: checks completion.

 My proposal:

 HardIRQ: calls spi_async() --ctx--> IRQ thread: calls spi_async_await()

The SPI worker (or whoever else) preempting the IRQ thread forcibly would kill
latency, and the result would be even worse as if HardIRQ did nothing and
instead the IRQ thread just called spi_sync() as is now the case.

> >  2. Use a completion or callback. But I don't see how that will work without a
> >  context switch to some code that completes the completion or calls the
> >  callback, which is what we are trying to avoid having in the first place.  
> 
> If the core doesn't need to context switch then the core should just be
> able to complete without context switching hopefully.  At most we'd need
> a mechanism to say that the completion is OK to call from atomic
> context.

Btw, I just discovered that there was indeed another often unnecessary context
switch happening in spi_sync(). When spi_finalize_current_message() is called,
it will always queue work, even if we just managed to do everything in the
calling context:

https://elixir.bootlin.com/linux/v5.18-rc7/source/drivers/spi/spi.c#L1909

This is needed to have the SPI worker thread unprepare transfer hardware and
free the dummy buffers if required. My question is why this needs to be done
from the thread. Putting the relevant code after the call to
ctlr->transfer_one_message() in __spi_pump_messages(), saves this extra bounce
to the worker thread if no more messages are queued, saving ca 10-12us extra
time between consecutive spi_sync messages.

> > > Or otherwise make it unobtrusive (eg, with similar techniques to those
> > > used by the networking API).  
> 
> > I just tried this out by re-writing the statistics code using u64_stats_sync
> > and per-cpu statistics, which get totaled on sysfs read access as Andrew Lunn
> > suggested.
> > The results are truly amazing!  
> 
> > The overhead caused by statistics in my test dropped from 43us to just 1-2us.  
> 
> Ah, cool.  I look forward to the patches.

See here:

https://marc.info/?l=linux-spi&m=165331560907187&w=2

> > This was tested on a 64-bit machine though, so I don't know how it will affect
> > 32-bit systems. Nor do I have an easy means of testing this. Any ideas?  
> 
> Hopefully someone with a 32 bit system who's concerned about performance
> can test.

I will try to do some testing soon. I need to tear down my current debug setup
and swap the i.MX8M board for a i.MX6 board to do this. Will take a while.

> > Also, I have converted all the struct spi_statistics members to u64_stats_t.
> > It was easier to test this way. Some of the original types were unsigned long,
> > which can have different sizes on 64bit or 32bit systems... is that
> > intentional?  
> 
> I don't think so.

Ok, great. I just changed everything to u64_stats_t to keep it uniform across
platforms.

Best regards,

-- 
David Jander