SPI: performance regression when using the common message queuing infrastructure

linux-arm-kernel.lists.infradead.org archive mirror
 help / color / mirror / Atom feed

* SPI: performance regression when using the common message queuing infrastructure
@ 2016-07-06  9:50 Cyrille Pitchen
  2016-07-06 10:03 ` Grygorii Strashko
  2016-07-07  9:50 ` Mark Brown
  0 siblings, 2 replies; 7+ messages in thread
From: Cyrille Pitchen @ 2016-07-06  9:50 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Mark,

recently Heiko reported to us a performance regression with Atmel SPI
controllers. He noticed the issue on a sam9g15ek board and I was also able to
reproduce it on a sama5d36ek board.

We found out that the performance regression was introduced in 3.14 by commit:
8090d6d1a415d3ae1a7208995decfab8f60f4f36
spi: atmel: Refactor spi-atmel to use SPI framework queue

For the test, I connected a Spansion S25FL512 memory on the SPI1 controller of
a sama5d36ek board. Then with an oscilloscope I monitored the chip-select, clock
and MOSI signals on the SPI bus.

1 - Reading 512 bytes from the memory

# dd if=/dev/mtd6 bs=512 count=1 of=/dev/null

With the oscilloscope, I measured the time between the chip-select fell before
the Read Status command (05h) and the chip-select rose after all data had been
read by the 4-byte address Fast Read 1-1-1 command (13h).

3.14 vanilla                      : 305 ?s
3.14 commit 8090d6d1a415 reverted : 242 ?s   -21%

2 - Reading 1000 x 1024 bytes from the memory

# dd if=/dev/mtd6 bs=1024 count=1000 of=/dev/null

Still with the scope, I measured the time to read all data.

3.14 vanilla                      : 435 ms
3.14 commit 8090d6d1a415 reverted : 361 ms   -17%

Indeed the oscilloscope shows that more time is spent between messages and
transfers.

commit 8090d6d1a415 replaced the tasklet used to manage a SPI message/transfer
queue by a workqueue provided by the SPI framework.

The support of this (optional) workqueue was introduced by commit:
ffbbdd21329f3e15eeca6df2d4bc11c04d9d91c0
spi: create a message queuing infrastructure

Though the commit message claims that is common infrastructure is optional,
the patch also claims the .transfer() hook is deprecated, suggesting drivers
should implement the new .transfer_one_message() hook instead.

This is the reason why commit 8090d6d1a415 was submitted. However we lost
quite amount of performances moving from our tasklet to the generic workqueue.

So do you recommend us to keep our current generic implementation relying on
the SPI framework workqueue or to go back to a custom implementation using
tasklet?
If we keep the current implementation, is there a way to improve the
performances so we go back to something close to what he had before?

We saw in commit ffbbdd21329f that we can change the workqueue thread
scheduling policy to SCHED_FIFO by setting master->rt.

Best regards,

Cyrille

^ permalink raw reply	[flat|nested] 7+ messages in thread

* SPI: performance regression when using the common message queuing infrastructure
  2016-07-06  9:50 SPI: performance regression when using the common message queuing infrastructure Cyrille Pitchen
@ 2016-07-06 10:03 ` Grygorii Strashko
  2016-07-07  8:12   ` Cyrille Pitchen
  2016-07-07  9:50 ` Mark Brown
  1 sibling, 1 reply; 7+ messages in thread
From: Grygorii Strashko @ 2016-07-06 10:03 UTC (permalink / raw)
  To: linux-arm-kernel

On 07/06/2016 12:50 PM, Cyrille Pitchen wrote:
> Hi Mark,
>
> recently Heiko reported to us a performance regression with Atmel SPI
> controllers. He noticed the issue on a sam9g15ek board and I was also able to
> reproduce it on a sama5d36ek board.
>
> We found out that the performance regression was introduced in 3.14 by commit:
> 8090d6d1a415d3ae1a7208995decfab8f60f4f36
> spi: atmel: Refactor spi-atmel to use SPI framework queue
>
> For the test, I connected a Spansion S25FL512 memory on the SPI1 controller of
> a sama5d36ek board. Then with an oscilloscope I monitored the chip-select, clock
> and MOSI signals on the SPI bus.
>
>
> 1 - Reading 512 bytes from the memory
>
> # dd if=/dev/mtd6 bs=512 count=1 of=/dev/null
>
> With the oscilloscope, I measured the time between the chip-select fell before
> the Read Status command (05h) and the chip-select rose after all data had been
> read by the 4-byte address Fast Read 1-1-1 command (13h).
>
> 3.14 vanilla                      : 305 ?s
> 3.14 commit 8090d6d1a415 reverted : 242 ?s   -21%
>
> 2 - Reading 1000 x 1024 bytes from the memory
>
> # dd if=/dev/mtd6 bs=1024 count=1000 of=/dev/null
>
> Still with the scope, I measured the time to read all data.
>
> 3.14 vanilla                      : 435 ms
> 3.14 commit 8090d6d1a415 reverted : 361 ms   -17%
>
>
> Indeed the oscilloscope shows that more time is spent between messages and
> transfers.
>
> commit 8090d6d1a415 replaced the tasklet used to manage a SPI message/transfer
> queue by a workqueue provided by the SPI framework.
>
> The support of this (optional) workqueue was introduced by commit:
> ffbbdd21329f3e15eeca6df2d4bc11c04d9d91c0
> spi: create a message queuing infrastructure
>
> Though the commit message claims that is common infrastructure is optional,
> the patch also claims the .transfer() hook is deprecated, suggesting drivers
> should implement the new .transfer_one_message() hook instead.
>
> This is the reason why commit 8090d6d1a415 was submitted. However we lost
> quite amount of performances moving from our tasklet to the generic workqueue.
>
> So do you recommend us to keep our current generic implementation relying on
> the SPI framework workqueue or to go back to a custom implementation using
> tasklet?
> If we keep the current implementation, is there a way to improve the
> performances so we go back to something close to what he had before?
>
> We saw in commit ffbbdd21329f that we can change the workqueue thread
> scheduling policy to SCHED_FIFO by setting master->rt.
>

master->rt is not a good choice as i know and
you may find thread [1] useful for you.

[1] http://www.spinics.net/lists/linux-rt-users/msg14347.html

-- 
regards,
-grygorii

^ permalink raw reply	[flat|nested] 7+ messages in thread

* SPI: performance regression when using the common message queuing infrastructure
  2016-07-06 10:03 ` Grygorii Strashko
@ 2016-07-07  8:12   ` Cyrille Pitchen
  2016-07-25  4:51     ` Heiko Schocher
  0 siblings, 1 reply; 7+ messages in thread
From: Cyrille Pitchen @ 2016-07-07  8:12 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Grygorii,

Le 06/07/2016 12:03, Grygorii Strashko a ?crit :
> On 07/06/2016 12:50 PM, Cyrille Pitchen wrote:
>> Hi Mark,
>>
>> recently Heiko reported to us a performance regression with Atmel SPI
>> controllers. He noticed the issue on a sam9g15ek board and I was also able to
>> reproduce it on a sama5d36ek board.
>>
>> We found out that the performance regression was introduced in 3.14 by commit:
>> 8090d6d1a415d3ae1a7208995decfab8f60f4f36
>> spi: atmel: Refactor spi-atmel to use SPI framework queue
>>
>> For the test, I connected a Spansion S25FL512 memory on the SPI1 controller of
>> a sama5d36ek board. Then with an oscilloscope I monitored the chip-select, clock
>> and MOSI signals on the SPI bus.
>>
>>
>> 1 - Reading 512 bytes from the memory
>>
>> # dd if=/dev/mtd6 bs=512 count=1 of=/dev/null
>>
>> With the oscilloscope, I measured the time between the chip-select fell before
>> the Read Status command (05h) and the chip-select rose after all data had been
>> read by the 4-byte address Fast Read 1-1-1 command (13h).
>>
>> 3.14 vanilla                      : 305 ?s
>> 3.14 commit 8090d6d1a415 reverted : 242 ?s   -21%
>>
>> 2 - Reading 1000 x 1024 bytes from the memory
>>
>> # dd if=/dev/mtd6 bs=1024 count=1000 of=/dev/null
>>
>> Still with the scope, I measured the time to read all data.
>>
>> 3.14 vanilla                      : 435 ms
>> 3.14 commit 8090d6d1a415 reverted : 361 ms   -17%
>>
>>
>> Indeed the oscilloscope shows that more time is spent between messages and
>> transfers.
>>
>> commit 8090d6d1a415 replaced the tasklet used to manage a SPI message/transfer
>> queue by a workqueue provided by the SPI framework.
>>
>> The support of this (optional) workqueue was introduced by commit:
>> ffbbdd21329f3e15eeca6df2d4bc11c04d9d91c0
>> spi: create a message queuing infrastructure
>>
>> Though the commit message claims that is common infrastructure is optional,
>> the patch also claims the .transfer() hook is deprecated, suggesting drivers
>> should implement the new .transfer_one_message() hook instead.
>>
>> This is the reason why commit 8090d6d1a415 was submitted. However we lost
>> quite amount of performances moving from our tasklet to the generic workqueue.
>>
>> So do you recommend us to keep our current generic implementation relying on
>> the SPI framework workqueue or to go back to a custom implementation using
>> tasklet?
>> If we keep the current implementation, is there a way to improve the
>> performances so we go back to something close to what he had before?
>>
>> We saw in commit ffbbdd21329f that we can change the workqueue thread
>> scheduling policy to SCHED_FIFO by setting master->rt.
>>
> 
> master->rt is not a good choice as i know and
> you may find thread [1] useful for you.
> 
> [1] http://www.spinics.net/lists/linux-rt-users/msg14347.html
> 

thanks for the link, I'll look at it :)

Best regards,

Cyrille

^ permalink raw reply	[flat|nested] 7+ messages in thread

* SPI: performance regression when using the common message queuing infrastructure
  2016-07-06  9:50 SPI: performance regression when using the common message queuing infrastructure Cyrille Pitchen
  2016-07-06 10:03 ` Grygorii Strashko
@ 2016-07-07  9:50 ` Mark Brown
  1 sibling, 0 replies; 7+ messages in thread
From: Mark Brown @ 2016-07-07  9:50 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Jul 06, 2016 at 11:50:28AM +0200, Cyrille Pitchen wrote:

> So do you recommend us to keep our current generic implementation relying on
> the SPI framework workqueue or to go back to a custom implementation using
> tasklet?
> If we keep the current implementation, is there a way to improve the
> performances so we go back to something close to what he had before?

I'd suggest trying to implement whatever it is that your customm
implementation was doing in the core, having individual drivers trying
to open code the message pump is obviously not sensible.  If what you're
doing makes your controllers run faster it'll probably make everyone
else's run faster too.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 473 bytes
Desc: not available
URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20160707/e453244b/attachment.sig>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* SPI: performance regression when using the common message queuing infrastructure
  2016-07-07  8:12   ` Cyrille Pitchen
@ 2016-07-25  4:51     ` Heiko Schocher
  2016-07-29  9:33       ` Cyrille Pitchen
  0 siblings, 1 reply; 7+ messages in thread
From: Heiko Schocher @ 2016-07-25  4:51 UTC (permalink / raw)
  To: linux-arm-kernel

Hello Cyrille,

sorry for the late answer, but just back from holidays ...

Am 07.07.2016 um 10:12 schrieb Cyrille Pitchen:
> Hi Grygorii,
>
> Le 06/07/2016 12:03, Grygorii Strashko a ?crit :
>> On 07/06/2016 12:50 PM, Cyrille Pitchen wrote:
>>> Hi Mark,
>>>
>>> recently Heiko reported to us a performance regression with Atmel SPI
>>> controllers. He noticed the issue on a sam9g15ek board and I was also able to
>>> reproduce it on a sama5d36ek board.
>>>
>>> We found out that the performance regression was introduced in 3.14 by commit:
>>> 8090d6d1a415d3ae1a7208995decfab8f60f4f36
>>> spi: atmel: Refactor spi-atmel to use SPI framework queue
>>>
>>> For the test, I connected a Spansion S25FL512 memory on the SPI1 controller of
>>> a sama5d36ek board. Then with an oscilloscope I monitored the chip-select, clock
>>> and MOSI signals on the SPI bus.
>>>
>>>
>>> 1 - Reading 512 bytes from the memory
>>>
>>> # dd if=/dev/mtd6 bs=512 count=1 of=/dev/null
>>>
>>> With the oscilloscope, I measured the time between the chip-select fell before
>>> the Read Status command (05h) and the chip-select rose after all data had been
>>> read by the 4-byte address Fast Read 1-1-1 command (13h).
>>>
>>> 3.14 vanilla                      : 305 ?s
>>> 3.14 commit 8090d6d1a415 reverted : 242 ?s   -21%
>>>
>>> 2 - Reading 1000 x 1024 bytes from the memory
>>>
>>> # dd if=/dev/mtd6 bs=1024 count=1000 of=/dev/null
>>>
>>> Still with the scope, I measured the time to read all data.
>>>
>>> 3.14 vanilla                      : 435 ms
>>> 3.14 commit 8090d6d1a415 reverted : 361 ms   -17%
>>>
>>>
>>> Indeed the oscilloscope shows that more time is spent between messages and
>>> transfers.

Yes this fits with my observations.

>>> commit 8090d6d1a415 replaced the tasklet used to manage a SPI message/transfer
>>> queue by a workqueue provided by the SPI framework.
>>>
>>> The support of this (optional) workqueue was introduced by commit:
>>> ffbbdd21329f3e15eeca6df2d4bc11c04d9d91c0
>>> spi: create a message queuing infrastructure
>>>
>>> Though the commit message claims that is common infrastructure is optional,
>>> the patch also claims the .transfer() hook is deprecated, suggesting drivers
>>> should implement the new .transfer_one_message() hook instead.
>>>
>>> This is the reason why commit 8090d6d1a415 was submitted. However we lost
>>> quite amount of performances moving from our tasklet to the generic workqueue.
>>>
>>> So do you recommend us to keep our current generic implementation relying on
>>> the SPI framework workqueue or to go back to a custom implementation using
>>> tasklet?
>>> If we keep the current implementation, is there a way to improve the
>>> performances so we go back to something close to what he had before?
>>>
>>> We saw in commit ffbbdd21329f that we can change the workqueue thread
>>> scheduling policy to SCHED_FIFO by setting master->rt.
>>>
>>
>> master->rt is not a good choice as i know and
>> you may find thread [1] useful for you.
>>
>> [1] http://www.spinics.net/lists/linux-rt-users/msg14347.html
>>
>
> thanks for the link, I'll look at it :)

Thanks for digging into this issue and your tests!

Do you have some new results? Can I help you?

bye,
Heiko
-- 
DENX Software Engineering GmbH,      Managing Director: Wolfgang Denk
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany

^ permalink raw reply	[flat|nested] 7+ messages in thread

* SPI: performance regression when using the common message queuing infrastructure
  2016-07-25  4:51     ` Heiko Schocher
@ 2016-07-29  9:33       ` Cyrille Pitchen
  2016-07-29 12:11         ` Mark Brown
  0 siblings, 1 reply; 7+ messages in thread
From: Cyrille Pitchen @ 2016-07-29  9:33 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Heiko,

Le 25/07/2016 ? 06:51, Heiko Schocher a ?crit :
> Hello Cyrille,
> 
> sorry for the late answer, but just back from holidays ...
> 
> Am 07.07.2016 um 10:12 schrieb Cyrille Pitchen:
>> Hi Grygorii,
>>
>> Le 06/07/2016 12:03, Grygorii Strashko a ?crit :
>>> On 07/06/2016 12:50 PM, Cyrille Pitchen wrote:
>>>> Hi Mark,
>>>>
>>>> recently Heiko reported to us a performance regression with Atmel SPI
>>>> controllers. He noticed the issue on a sam9g15ek board and I was also able to
>>>> reproduce it on a sama5d36ek board.
>>>>
>>>> We found out that the performance regression was introduced in 3.14 by commit:
>>>> 8090d6d1a415d3ae1a7208995decfab8f60f4f36
>>>> spi: atmel: Refactor spi-atmel to use SPI framework queue
>>>>
>>>> For the test, I connected a Spansion S25FL512 memory on the SPI1 controller of
>>>> a sama5d36ek board. Then with an oscilloscope I monitored the chip-select, clock
>>>> and MOSI signals on the SPI bus.
>>>>
>>>>
>>>> 1 - Reading 512 bytes from the memory
>>>>
>>>> # dd if=/dev/mtd6 bs=512 count=1 of=/dev/null
>>>>
>>>> With the oscilloscope, I measured the time between the chip-select fell before
>>>> the Read Status command (05h) and the chip-select rose after all data had been
>>>> read by the 4-byte address Fast Read 1-1-1 command (13h).
>>>>
>>>> 3.14 vanilla                      : 305 ?s
>>>> 3.14 commit 8090d6d1a415 reverted : 242 ?s   -21%
>>>>
>>>> 2 - Reading 1000 x 1024 bytes from the memory
>>>>
>>>> # dd if=/dev/mtd6 bs=1024 count=1000 of=/dev/null
>>>>
>>>> Still with the scope, I measured the time to read all data.
>>>>
>>>> 3.14 vanilla                      : 435 ms
>>>> 3.14 commit 8090d6d1a415 reverted : 361 ms   -17%
>>>>
>>>>
>>>> Indeed the oscilloscope shows that more time is spent between messages and
>>>> transfers.
> 
> Yes this fits with my observations.
> 
>>>> commit 8090d6d1a415 replaced the tasklet used to manage a SPI message/transfer
>>>> queue by a workqueue provided by the SPI framework.
>>>>
>>>> The support of this (optional) workqueue was introduced by commit:
>>>> ffbbdd21329f3e15eeca6df2d4bc11c04d9d91c0
>>>> spi: create a message queuing infrastructure
>>>>
>>>> Though the commit message claims that is common infrastructure is optional,
>>>> the patch also claims the .transfer() hook is deprecated, suggesting drivers
>>>> should implement the new .transfer_one_message() hook instead.
>>>>
>>>> This is the reason why commit 8090d6d1a415 was submitted. However we lost
>>>> quite amount of performances moving from our tasklet to the generic workqueue.
>>>>
>>>> So do you recommend us to keep our current generic implementation relying on
>>>> the SPI framework workqueue or to go back to a custom implementation using
>>>> tasklet?
>>>> If we keep the current implementation, is there a way to improve the
>>>> performances so we go back to something close to what he had before?
>>>>
>>>> We saw in commit ffbbdd21329f that we can change the workqueue thread
>>>> scheduling policy to SCHED_FIFO by setting master->rt.
>>>>
>>>
>>> master->rt is not a good choice as i know and
>>> you may find thread [1] useful for you.
>>>
>>> [1] http://www.spinics.net/lists/linux-rt-users/msg14347.html
>>>
>>
>> thanks for the link, I'll look at it :)
> 
> Thanks for digging into this issue and your tests!
> 
> Do you have some new results? Can I help you?
> 
> bye,
> Heiko


We talked about moving back to a tasklet implementation but nothing was done
yet so nothing new for now, sorry.
Also, I will be out of office for the next 3 weeks: I will be back on August,
22th.


Best regards,

Cyrille

^ permalink raw reply	[flat|nested] 7+ messages in thread

* SPI: performance regression when using the common message queuing infrastructure
  2016-07-29  9:33       ` Cyrille Pitchen
@ 2016-07-29 12:11         ` Mark Brown
  0 siblings, 0 replies; 7+ messages in thread
From: Mark Brown @ 2016-07-29 12:11 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Jul 29, 2016 at 11:33:00AM +0200, Cyrille Pitchen wrote:

> We talked about moving back to a tasklet implementation but nothing was done
> yet so nothing new for now, sorry.
> Also, I will be out of office for the next 3 weeks: I will be back on August,
> 22th.

To repeat what I said before the thing to do here is to improve the
performance of the core implementation rather than just open coding
something.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 473 bytes
Desc: not available
URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20160729/05ddab58/attachment.sig>

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2016-07-29 12:11 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-07-06  9:50 SPI: performance regression when using the common message queuing infrastructure Cyrille Pitchen
2016-07-06 10:03 ` Grygorii Strashko
2016-07-07  8:12   ` Cyrille Pitchen
2016-07-25  4:51     ` Heiko Schocher
2016-07-29  9:33       ` Cyrille Pitchen
2016-07-29 12:11         ` Mark Brown
2016-07-07  9:50 ` Mark Brown

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).