All of lore.kernel.org
 help / color / mirror / Atom feed
* [PROBLEM] spi driver internal error during boot on sparx5
@ 2022-08-26  9:41 Casper Andersson
  2022-08-29  8:56 ` David Jander
  0 siblings, 1 reply; 11+ messages in thread
From: Casper Andersson @ 2022-08-26  9:41 UTC (permalink / raw)
  To: Mark Brown, David Jander
  Cc: linux-spi, Marc Kleine-Budde, Andrew Lunn, Lars Povlsen,
	Steen Hegelund, Microchip Linux Driver Support

Hi,

I'm having some issues on my SparX5 switch (PCB135) from Microchip.
Since this patch series by David Jander the spi driver errors during boot.
https://lore.kernel.org/all/20220621061234.3626638-1-david@protonic.nl/

ae7d2346dc89 ("spi: Don't use the message queue if possible in spi_sync")
On this commit it starts failing to mount the partitions during boot.
This causes the output marked ERROR 1 below.

69fa95905d40 ("spi: Ensure the io_mutex is held until spi_finalize_current_message()")
On this commit it no longer boots properly. I am able to enter login
info, but then I'm unable to do anything. Though, when running latest
versions of e.g. net and net-next trees it boots sometimes. I have
observed some different errors which seems to occur seemingly at random,
show in ERROR 2-4 below. ERROR 2 and 3 seems to be the most common ones.

ERROR 1:
[    1.333629] Internal error: Oops: 96000044 [#1] PREEMPT SMP
[    1.333636] Modules linked in:
[    1.333644] CPU: 0 PID: 292 Comm: spi0 Not tainted 5.19.0-rc1 #18
[    1.333653] Hardware name: microchip,sparx5 (DT)
[    1.333657] pstate: a0000005 (NzCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[    1.333665] pc : ktime_get_real_ts64+0x4c/0x110
[    1.333681] lr : spi_transfer_one_message+0x13c/0x690
[    1.333693] sp : ffff80000aadbce0
[    1.333696] x29: ffff80000aadbce0 x28: 0000000000000000 x27: 00007bff79822ce0
[    1.333709] x26: ffff000004d1f400 x25: ffff000004d1f000 x24: ffff800009d03798
[    1.333720] x23: ffff8000093f3008 x22: 0000006809d03830 x21: 00000000000001c2
[    1.333732] x20: ffff800009a20e80 x19: 0000000000000000 x18: ffffffffffffffff
[    1.333743] x17: 0000000000000002 x16: 000001d25a50f10c x15: 00074f2fcdde90a0
[    1.333754] x14: ffff800008dfb8a0 x13: 000000000000ac08 x12: 00000000f5257d14
[    1.333766] x11: 00000000000002f7 x10: 0000000000000a00 x9 : ffff8000088690bc
[    1.333777] x8 : ffff000005242760 x7 : 0000000000000004 x6 : fffffbffeffcbd28
[    1.333788] x5 : 0000000000000000 x4 : 0000000009a51c00 x3 : 0000000000000000
[    1.333798] x2 : 0000000009a51c00 x1 : 0000000000000000 x0 : 0000000000000001
[    1.333810] Call trace:
[    1.333812]  ktime_get_real_ts64+0x4c/0x110
[    1.333821]  spi_transfer_one_message+0x13c/0x690
[    1.333831]  __spi_pump_transfer_message+0x174/0x550
[    1.333841]  __spi_pump_messages+0xb8/0x330
[    1.333850]  spi_pump_messages+0x24/0x30
[    1.333859]  kthread_worker_fn+0xb8/0x290
[    1.333870]  kthread+0x118/0x120
[    1.333879]  ret_from_fork+0x10/0x20
[    1.333892] Code: 120002b3 370004d5 d50339bf f9403e80 (f90002c0)
[    1.333898] ---[ end trace 0000000000000000 ]---

Error 2:
[    5.527818] Internal error: Oops: 8600000f [#1] PREEMPT SMP
[    5.534020] Modules linked in:
[    5.536959] CPU: 0 PID: 292 Comm: spi0 Not tainted 6.0.0-rc1 #7
[    5.542627] Hardware name: microchip,sparx5 (DT)
[    5.547043] pstate: 40000005 (nZcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[    5.553704] pc : 0xffff0000042062d8
[    5.557046] lr : spi_finalize_current_message+0x1ac/0x2e0
[    5.562216] sp : ffff80000aa2bc50
[    5.565384] x29: ffff80000aa2bc50 x28: 0000000000000000 x27: 00007bff79858be0
[    5.572217] x26: ffff000007258c20 x25: ffff000007258800 x24: ffff80000cb8b0e0
[    5.579049] x23: ffff80000cb8b1b0 x22: ffff80000cb8b158 x21: ffff000004d1c800
[    5.585881] x20: ffff80000cb8b1b0 x19: ffff80000cb8b1b0 x18: 0000000000000000
[    5.592713] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[    5.599544] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
[    5.606375] x11: 00000000000001f5 x10: 0000000000000a10 x9 : ffff80000808e890
[    5.613206] x8 : ffff000004966170 x7 : 0000000000000004 x6 : 00000001497a5800
[    5.620037] x5 : 00000000410fd030 x4 : 0000000000c0000e x3 : ffff800076773000
[    5.626869] x2 : 0000000000000000 x1 : ffff0000042062d8 x0 : ffff0000037addb8
[    5.633701] Call trace:
[    5.636040]  0xffff0000042062d8
[    5.639046]  spi_mux_complete_cb+0x48/0x60
[    5.642969]  spi_finalize_current_message+0x1ac/0x2e0
[    5.647803]  spi_transfer_one_message+0x298/0x680
[    5.652304]  __spi_pump_transfer_message+0x188/0x5a0
[    5.657055]  __spi_pump_messages+0xdc/0x330
[    5.661058]  spi_pump_messages+0x24/0x30
[    5.664812]  kthread_worker_fn+0xb8/0x290
[    5.668653]  kthread+0x118/0x120
[    5.671742]  ret_from_fork+0x10/0x20
[    5.675170] Code: 00000000 00000000 08f4b570 ffff8000 (00000000)
[    5.680999] ---[ end trace 0000000000000000 ]---
[    5.678207] note: spi0[291] exited with preempt_count 1

ERROR 3:
[    4.443467] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000108
[    4.443479] Mem abort info:
[    4.443481]   ESR = 0x0000000086000004
[    4.443485]   EC = 0x21: IABT (current EL), IL = 32 bits
[    4.443490]   SET = 0, FnV = 0
[    4.443494]   EA = 0, S1PTW = 0
[    4.443497]   FSC = 0x04: level 0 translation fault
[    4.443502] user pgtable: 4k pages, 48-bit VAs, pgdp=00000007070f2000
[    4.443508] [0000000000000108] pgd=0000000000000000, p4d=0000000000000000
[    4.443520] Internal error: Oops: 86000004 [#1] PREEMPT SMP
[    4.443527] Modules linked in:
[    4.443534] CPU: 0 PID: 292 Comm: spi0 Not tainted 5.19.0-rc1 #25
[    4.443542] Hardware name: microchip,sparx5 (DT)
[    4.443546] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[    4.443555] pc : 0x108
[    4.443564] lr : spi_finalize_current_message+0x1ac/0x2e0
[    4.443578] sp : ffff80000ab03c50
[    4.443581] x29: ffff80000ab03c50 x28: 0000000000000000 x27: 00007bff79822ce0
[    4.443594] x26: ffff0000049c3420 x25: ffff0000049c3000 x24: ffff80000b7733b0
[    4.443607] x23: ffff80000b773480 x22: ffff80000b773428 x21: ffff0000036ac800
[    4.443619] x20: ffff80000b773480 x19: ffff80000b773480 x18: 0000000040bd0097
[    4.443632] x17: 0000000000000004 x16: 0000000000000121 x15: 007c2a66ba4f78c6
[    4.443644] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000002
[    4.443655] x11: 000000000000026b x10: 0000000000000a00 x9 : ffff80000808c94c
[    4.443667] x8 : ffff00000735c460 x7 : 0000000000000001 x6 : 0000000108d9d000
[    4.443679] x5 : 00000000410fd030 x4 : 0000000000c0000e x3 : ffff8000767a9000
[    4.443691] x2 : 0000000000000000 x1 : 0000000000000108 x0 : ffff80000b773750
[    4.443703] Call trace:
[    4.443706]  0x108
[    4.443712]  spi_mux_complete_cb+0x48/0x60
[    4.443720]  spi_finalize_current_message+0x1ac/0x2e0
[    4.443730]  spi_transfer_one_message+0x2b0/0x690
[    4.443739]  __spi_pump_transfer_message+0x188/0x5a0
[    4.443749]  __spi_pump_messages+0xdc/0x330
[    4.443759]  spi_pump_messages+0x24/0x30
[    4.443768]  kthread_worker_fn+0xb8/0x290
[    4.443779]  kthread+0x118/0x120
[    4.443788]  ret_from_fork+0x10/0x20
[    4.443802] Code: bad PC value
[    4.443807] ---[ end trace 0000000000000000 ]---

Error 4:
[    4.012013] Unable to handle kernel execute from non-executable memory at virtual address ffff80000b79b498
[    4.012027] Mem abort info:
[    4.012029]   ESR = 0x000000008600000f
[    4.012033]   EC = 0x21: IABT (current EL), IL = 32 bits
[    4.012038]   SET = 0, FnV = 0
[    4.012042]   EA = 0, S1PTW = 0
[    4.012045]   FSC = 0x0f: level 3 permission fault
[    4.012049] swapper pgtable: 4k pages, 48-bit VAs, pgdp=00000007012ec000
[    4.012055] [ffff80000b79b498] pgd=100000077ffff003, p4d=100000077ffff003, pud=100000077fffe003, pmd=10000007036c7003, pte=0068000704f29703
[    4.012077] Internal error: Oops: 8600000f [#1] PREEMPT SMP
[    4.012084] Modules linked in:
[    4.012091] CPU: 0 PID: 292 Comm: spi0 Not tainted 5.19.0-rc1 #25
[    4.012099] Hardware name: microchip,sparx5 (DT)
[    4.012103] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[    4.012112] pc : 0xffff80000b79b498
[    4.012121] lr : 0xffff80000b79b498
[    4.012127] sp : ffff80000ab5bc90
[    4.012130] x29: ffff0000042ea800 x28: 0000000000000000 x27: 00007bff79822ce0
[    4.012143] x26: ffff0000042eac20 x25: ffff0000042ea800 x24: ffff80000b79b420
[    4.012155] x23: ffff80000b79b4f0 x22: ffff80000b79b498 x21: ffff000004c35800
[    4.012168] x20: ffff80000b79b4f0 x19: ffff80000b79b4f0 x18: 0000000000000000
[    4.012180] x17: 0000000000000001 x16: 0000000000000001 x15: 0117d93fc9cfcb72
[    4.012192] x14: 0114def7e22c5168 x13: ffff800008dfb8a0 x12: 00000000fa83b2da
[    4.012204] x11: 0000000000000077 x10: 000000000000008a x9 : ffff8000080b2724
[    4.012216] x8 : 0000000000000000 x7 : 0000000000000001 x6 : 0000000000000096
[    4.012227] x5 : 0000000000000000 x4 : ffff00007fbacd80 x3 : ffff000004dec880
[    4.012239] x2 : 0000000000000000 x1 : ffff000004dec880 x0 : 0000000000000000
[    4.012251] Call trace:
[    4.012254]  0xffff80000b79b498
[    4.012265] Code: 04dfc880 ffff0000 0b79b4b0 ffff8000 (08dd5984)
[    4.012271] ---[ end trace 0000000000000000 ]---

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PROBLEM] spi driver internal error during boot on sparx5
  2022-08-26  9:41 [PROBLEM] spi driver internal error during boot on sparx5 Casper Andersson
@ 2022-08-29  8:56 ` David Jander
  2022-08-31 16:07   ` Mark Brown
  0 siblings, 1 reply; 11+ messages in thread
From: David Jander @ 2022-08-29  8:56 UTC (permalink / raw)
  To: Casper Andersson
  Cc: Mark Brown, linux-spi, Marc Kleine-Budde, Andrew Lunn,
	Lars Povlsen, Steen Hegelund, Microchip Linux Driver Support


Hi Casper,

On Fri, 26 Aug 2022 11:41:43 +0200
Casper Andersson <casper.casan@gmail.com> wrote:

> Hi,
> 
> I'm having some issues on my SparX5 switch (PCB135) from Microchip.
> Since this patch series by David Jander the spi driver errors during boot.
> https://lore.kernel.org/all/20220621061234.3626638-1-david@protonic.nl/
> 
> ae7d2346dc89 ("spi: Don't use the message queue if possible in spi_sync")
> On this commit it starts failing to mount the partitions during boot.
> This causes the output marked ERROR 1 below.
> 
> 69fa95905d40 ("spi: Ensure the io_mutex is held until spi_finalize_current_message()")
> On this commit it no longer boots properly. I am able to enter login
> info, but then I'm unable to do anything. Though, when running latest
> versions of e.g. net and net-next trees it boots sometimes. I have
> observed some different errors which seems to occur seemingly at random,
> show in ERROR 2-4 below. ERROR 2 and 3 seems to be the most common ones.
> 
> ERROR 1:
> [    1.333629] Internal error: Oops: 96000044 [#1] PREEMPT SMP
> [    1.333636] Modules linked in:
> [    1.333644] CPU: 0 PID: 292 Comm: spi0 Not tainted 5.19.0-rc1 #18
> [    1.333653] Hardware name: microchip,sparx5 (DT)
> [    1.333657] pstate: a0000005 (NzCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [    1.333665] pc : ktime_get_real_ts64+0x4c/0x110
> [    1.333681] lr : spi_transfer_one_message+0x13c/0x690
> [    1.333693] sp : ffff80000aadbce0
> [    1.333696] x29: ffff80000aadbce0 x28: 0000000000000000 x27: 00007bff79822ce0
> [    1.333709] x26: ffff000004d1f400 x25: ffff000004d1f000 x24: ffff800009d03798
> [    1.333720] x23: ffff8000093f3008 x22: 0000006809d03830 x21: 00000000000001c2
> [    1.333732] x20: ffff800009a20e80 x19: 0000000000000000 x18: ffffffffffffffff
> [    1.333743] x17: 0000000000000002 x16: 000001d25a50f10c x15: 00074f2fcdde90a0
> [    1.333754] x14: ffff800008dfb8a0 x13: 000000000000ac08 x12: 00000000f5257d14
> [    1.333766] x11: 00000000000002f7 x10: 0000000000000a00 x9 : ffff8000088690bc
> [    1.333777] x8 : ffff000005242760 x7 : 0000000000000004 x6 : fffffbffeffcbd28
> [    1.333788] x5 : 0000000000000000 x4 : 0000000009a51c00 x3 : 0000000000000000
> [    1.333798] x2 : 0000000009a51c00 x1 : 0000000000000000 x0 : 0000000000000001
> [    1.333810] Call trace:
> [    1.333812]  ktime_get_real_ts64+0x4c/0x110
> [    1.333821]  spi_transfer_one_message+0x13c/0x690
> [    1.333831]  __spi_pump_transfer_message+0x174/0x550
> [    1.333841]  __spi_pump_messages+0xb8/0x330
> [    1.333850]  spi_pump_messages+0x24/0x30
> [    1.333859]  kthread_worker_fn+0xb8/0x290
> [    1.333870]  kthread+0x118/0x120
> [    1.333879]  ret_from_fork+0x10/0x20
> [    1.333892] Code: 120002b3 370004d5 d50339bf f9403e80 (f90002c0)
> [    1.333898] ---[ end trace 0000000000000000 ]---
> 
> Error 2:
> [    5.527818] Internal error: Oops: 8600000f [#1] PREEMPT SMP
> [    5.534020] Modules linked in:
> [    5.536959] CPU: 0 PID: 292 Comm: spi0 Not tainted 6.0.0-rc1 #7
> [    5.542627] Hardware name: microchip,sparx5 (DT)
> [    5.547043] pstate: 40000005 (nZcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [    5.553704] pc : 0xffff0000042062d8
> [    5.557046] lr : spi_finalize_current_message+0x1ac/0x2e0
> [    5.562216] sp : ffff80000aa2bc50
> [    5.565384] x29: ffff80000aa2bc50 x28: 0000000000000000 x27: 00007bff79858be0
> [    5.572217] x26: ffff000007258c20 x25: ffff000007258800 x24: ffff80000cb8b0e0
> [    5.579049] x23: ffff80000cb8b1b0 x22: ffff80000cb8b158 x21: ffff000004d1c800
> [    5.585881] x20: ffff80000cb8b1b0 x19: ffff80000cb8b1b0 x18: 0000000000000000
> [    5.592713] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
> [    5.599544] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
> [    5.606375] x11: 00000000000001f5 x10: 0000000000000a10 x9 : ffff80000808e890
> [    5.613206] x8 : ffff000004966170 x7 : 0000000000000004 x6 : 00000001497a5800
> [    5.620037] x5 : 00000000410fd030 x4 : 0000000000c0000e x3 : ffff800076773000
> [    5.626869] x2 : 0000000000000000 x1 : ffff0000042062d8 x0 : ffff0000037addb8
> [    5.633701] Call trace:
> [    5.636040]  0xffff0000042062d8
> [    5.639046]  spi_mux_complete_cb+0x48/0x60
> [    5.642969]  spi_finalize_current_message+0x1ac/0x2e0
> [    5.647803]  spi_transfer_one_message+0x298/0x680
> [    5.652304]  __spi_pump_transfer_message+0x188/0x5a0
> [    5.657055]  __spi_pump_messages+0xdc/0x330
> [    5.661058]  spi_pump_messages+0x24/0x30
> [    5.664812]  kthread_worker_fn+0xb8/0x290
> [    5.668653]  kthread+0x118/0x120
> [    5.671742]  ret_from_fork+0x10/0x20
> [    5.675170] Code: 00000000 00000000 08f4b570 ffff8000 (00000000)
> [    5.680999] ---[ end trace 0000000000000000 ]---
> [    5.678207] note: spi0[291] exited with preempt_count 1
> 
> ERROR 3:
> [    4.443467] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000108
> [    4.443479] Mem abort info:
> [    4.443481]   ESR = 0x0000000086000004
> [    4.443485]   EC = 0x21: IABT (current EL), IL = 32 bits
> [    4.443490]   SET = 0, FnV = 0
> [    4.443494]   EA = 0, S1PTW = 0
> [    4.443497]   FSC = 0x04: level 0 translation fault
> [    4.443502] user pgtable: 4k pages, 48-bit VAs, pgdp=00000007070f2000
> [    4.443508] [0000000000000108] pgd=0000000000000000, p4d=0000000000000000
> [    4.443520] Internal error: Oops: 86000004 [#1] PREEMPT SMP
> [    4.443527] Modules linked in:
> [    4.443534] CPU: 0 PID: 292 Comm: spi0 Not tainted 5.19.0-rc1 #25
> [    4.443542] Hardware name: microchip,sparx5 (DT)
> [    4.443546] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [    4.443555] pc : 0x108
> [    4.443564] lr : spi_finalize_current_message+0x1ac/0x2e0
> [    4.443578] sp : ffff80000ab03c50
> [    4.443581] x29: ffff80000ab03c50 x28: 0000000000000000 x27: 00007bff79822ce0
> [    4.443594] x26: ffff0000049c3420 x25: ffff0000049c3000 x24: ffff80000b7733b0
> [    4.443607] x23: ffff80000b773480 x22: ffff80000b773428 x21: ffff0000036ac800
> [    4.443619] x20: ffff80000b773480 x19: ffff80000b773480 x18: 0000000040bd0097
> [    4.443632] x17: 0000000000000004 x16: 0000000000000121 x15: 007c2a66ba4f78c6
> [    4.443644] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000002
> [    4.443655] x11: 000000000000026b x10: 0000000000000a00 x9 : ffff80000808c94c
> [    4.443667] x8 : ffff00000735c460 x7 : 0000000000000001 x6 : 0000000108d9d000
> [    4.443679] x5 : 00000000410fd030 x4 : 0000000000c0000e x3 : ffff8000767a9000
> [    4.443691] x2 : 0000000000000000 x1 : 0000000000000108 x0 : ffff80000b773750
> [    4.443703] Call trace:
> [    4.443706]  0x108
> [    4.443712]  spi_mux_complete_cb+0x48/0x60
> [    4.443720]  spi_finalize_current_message+0x1ac/0x2e0
> [    4.443730]  spi_transfer_one_message+0x2b0/0x690
> [    4.443739]  __spi_pump_transfer_message+0x188/0x5a0
> [    4.443749]  __spi_pump_messages+0xdc/0x330
> [    4.443759]  spi_pump_messages+0x24/0x30
> [    4.443768]  kthread_worker_fn+0xb8/0x290
> [    4.443779]  kthread+0x118/0x120
> [    4.443788]  ret_from_fork+0x10/0x20
> [    4.443802] Code: bad PC value
> [    4.443807] ---[ end trace 0000000000000000 ]---
> 
> Error 4:
> [    4.012013] Unable to handle kernel execute from non-executable memory at virtual address ffff80000b79b498
> [    4.012027] Mem abort info:
> [    4.012029]   ESR = 0x000000008600000f
> [    4.012033]   EC = 0x21: IABT (current EL), IL = 32 bits
> [    4.012038]   SET = 0, FnV = 0
> [    4.012042]   EA = 0, S1PTW = 0
> [    4.012045]   FSC = 0x0f: level 3 permission fault
> [    4.012049] swapper pgtable: 4k pages, 48-bit VAs, pgdp=00000007012ec000
> [    4.012055] [ffff80000b79b498] pgd=100000077ffff003, p4d=100000077ffff003, pud=100000077fffe003, pmd=10000007036c7003, pte=0068000704f29703
> [    4.012077] Internal error: Oops: 8600000f [#1] PREEMPT SMP
> [    4.012084] Modules linked in:
> [    4.012091] CPU: 0 PID: 292 Comm: spi0 Not tainted 5.19.0-rc1 #25
> [    4.012099] Hardware name: microchip,sparx5 (DT)
> [    4.012103] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [    4.012112] pc : 0xffff80000b79b498
> [    4.012121] lr : 0xffff80000b79b498
> [    4.012127] sp : ffff80000ab5bc90
> [    4.012130] x29: ffff0000042ea800 x28: 0000000000000000 x27: 00007bff79822ce0
> [    4.012143] x26: ffff0000042eac20 x25: ffff0000042ea800 x24: ffff80000b79b420
> [    4.012155] x23: ffff80000b79b4f0 x22: ffff80000b79b498 x21: ffff000004c35800
> [    4.012168] x20: ffff80000b79b4f0 x19: ffff80000b79b4f0 x18: 0000000000000000
> [    4.012180] x17: 0000000000000001 x16: 0000000000000001 x15: 0117d93fc9cfcb72
> [    4.012192] x14: 0114def7e22c5168 x13: ffff800008dfb8a0 x12: 00000000fa83b2da
> [    4.012204] x11: 0000000000000077 x10: 000000000000008a x9 : ffff8000080b2724
> [    4.012216] x8 : 0000000000000000 x7 : 0000000000000001 x6 : 0000000000000096
> [    4.012227] x5 : 0000000000000000 x4 : ffff00007fbacd80 x3 : ffff000004dec880
> [    4.012239] x2 : 0000000000000000 x1 : ffff000004dec880 x0 : 0000000000000000
> [    4.012251] Call trace:
> [    4.012254]  0xffff80000b79b498
> [    4.012265] Code: 04dfc880 ffff0000 0b79b4b0 ffff8000 (08dd5984)
> [    4.012271] ---[ end trace 0000000000000000 ]---

Thanks for reporting.
Looking at Errors 2 and 3, I suspect there might be a race in the SPI mux
driver. After a quick inspection, I see this:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/spi/spi-mux.c?h=v6.0-rc3#n123

spi_mux_transfer_one_message() returns before the message is transferred (in
spi_async()), which is not expected. AFAIK, an ctlr->transfer_one_message()
implementation should not return until the transfer is completed.

To check if that is causing the problem, could you try the following change:

--- a/drivers/spi/spi-mux.c
+++ b/drivers/spi/spi-mux.c
@@ -120,7 +120,7 @@ static int spi_mux_transfer_one_message(struct spi_controller *ctlr,
        m->spi = priv->spi;
 
        /* do the transfer */
-       return spi_async(priv->spi, m);
+       return spi_sync(priv->spi, m);
 }
 
 static int spi_mux_probe(struct spi_device *spi)

Not sure if this is a correct fix, but I'd like to know if your situation
changes this way, if you could try it.
I don't have access to any hardware with a mux unfortunately, so I can't test
it myself.

Best regards,

-- 
David Jander
Protonic Holland.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PROBLEM] spi driver internal error during boot on sparx5
  2022-08-29  8:56 ` David Jander
@ 2022-08-31 16:07   ` Mark Brown
  2022-09-01  6:57     ` Vincent Whitchurch
  0 siblings, 1 reply; 11+ messages in thread
From: Mark Brown @ 2022-08-31 16:07 UTC (permalink / raw)
  To: David Jander
  Cc: Casper Andersson, linux-spi, Marc Kleine-Budde, Andrew Lunn,
	Lars Povlsen, Steen Hegelund, Microchip Linux Driver Support

[-- Attachment #1: Type: text/plain, Size: 3687 bytes --]

On Mon, Aug 29, 2022 at 10:56:13AM +0200, David Jander wrote:

> spi_mux_transfer_one_message() returns before the message is transferred (in
> spi_async()), which is not expected. AFAIK, an ctlr->transfer_one_message()
> implementation should not return until the transfer is completed.

That's because what it wants to happen here is that the controller then
runs the message that it's being asked to perform, it'll then get a
callback when the message completes which it'll use to deselect the
device and then complete the original callback.  This is a horrible
hack.

It should be fine for transfer_one_message() to return immediately, it's
called from __spi_pump_transfer_message() which will wait for the
message to be finalized.  However I do note that if we get a message
going in via the sync path skipping the queue we set msg->sync, among
other things, and then the mux will try to reuse the same message and
resubmit it as an async message with the sync flag set which I can't see
is going to go well.

> Not sure if this is a correct fix, but I'd like to know if your situation
> changes this way, if you could try it.
> I don't have access to any hardware with a mux unfortunately, so I can't test
> it myself.

I guess claiming to have a noop mux might work for testing, though I'd
be dubious that it was actually doing the mux operations properly?  I
think we need to either change spi_mux to duplicate the incoming message
(that's probably "cleaner") or teach the core that spi-mux exists and
should always use the pump/queue.  The below might do the trick but in
spite of my suggestion above I've not tested either yet:

diff --git a/drivers/spi/spi-mux.c b/drivers/spi/spi-mux.c
index f5d32ec4634e..0709e987bd5a 100644
--- a/drivers/spi/spi-mux.c
+++ b/drivers/spi/spi-mux.c
@@ -161,6 +161,7 @@ static int spi_mux_probe(struct spi_device *spi)
 	ctlr->num_chipselect = mux_control_states(priv->mux);
 	ctlr->bus_num = -1;
 	ctlr->dev.of_node = spi->dev.of_node;
+	ctlr->must_async = true;
 
 	ret = devm_spi_register_controller(&spi->dev, ctlr);
 	if (ret)
diff --git a/drivers/spi/spi.c b/drivers/spi/spi.c
index 1cfed874f7ae..88d48a105d3c 100644
--- a/drivers/spi/spi.c
+++ b/drivers/spi/spi.c
@@ -4033,7 +4033,7 @@ static int __spi_sync(struct spi_device *spi, struct spi_message *message)
 	 * guard against reentrancy from a different context. The io_mutex
 	 * will catch those cases.
 	 */
-	if (READ_ONCE(ctlr->queue_empty)) {
+	if (READ_ONCE(ctlr->queue_empty) && !ctlr->must_async) {
 		message->actual_length = 0;
 		message->status = -EINPROGRESS;
 
diff --git a/include/linux/spi/spi.h b/include/linux/spi/spi.h
index e6c73d5ff1a8..f089ee1ead58 100644
--- a/include/linux/spi/spi.h
+++ b/include/linux/spi/spi.h
@@ -469,6 +469,7 @@ extern struct spi_device *spi_new_ancillary_device(struct spi_device *spi, u8 ch
  *	SPI_TRANS_FAIL_NO_START.
  * @queue_empty: signal green light for opportunistically skipping the queue
  *	for spi_sync transfers.
+ * @must_async: disable all fast paths in the core
  *
  * Each SPI controller can communicate with one or more @spi_device
  * children.  These make a small bus, sharing MOSI, MISO and SCK signals
@@ -690,6 +691,7 @@ struct spi_controller {
 
 	/* Flag for enabling opportunistic skipping of the queue in spi_sync */
 	bool			queue_empty;
+	bool			must_async;
 };
 
 static inline void *spi_controller_get_devdata(struct spi_controller *ctlr)

Assuming that works out there'll be an extra test in the fast path but
no sync operations, and a performance hit for spi-mux users.  Hopefully
that works as well as it did before.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PROBLEM] spi driver internal error during boot on sparx5
  2022-08-31 16:07   ` Mark Brown
@ 2022-09-01  6:57     ` Vincent Whitchurch
  2022-09-01 11:02       ` David Jander
  2022-09-01 11:11       ` Mark Brown
  0 siblings, 2 replies; 11+ messages in thread
From: Vincent Whitchurch @ 2022-09-01  6:57 UTC (permalink / raw)
  To: Mark Brown
  Cc: David Jander, Casper Andersson, linux-spi, Marc Kleine-Budde,
	Andrew Lunn, Lars Povlsen, Steen Hegelund,
	Microchip Linux Driver Support

On Wed, Aug 31, 2022 at 05:07:42PM +0100, Mark Brown wrote:
> On Mon, Aug 29, 2022 at 10:56:13AM +0200, David Jander wrote:
> > Not sure if this is a correct fix, but I'd like to know if your situation
> > changes this way, if you could try it.
> > I don't have access to any hardware with a mux unfortunately, so I can't test
> > it myself.
> 
> I guess claiming to have a noop mux might work for testing, though I'd
> be dubious that it was actually doing the mux operations properly?  I

I'm seeing these problems with the roadtest test framework which uses
UML and doesn't need any special hardware.  Roadtest puts its emulated
devices under a spi-mux with gpio-mockup and there are no tests for the
actual muxing, but other driver tests break since transfers using
spi-mux don't work properly.

I pushed the latest version with SPI support to the tree below.  The
test_adc084s021 tests a SPI device.  Roadtest is placed inside a linux
tree, but it doesn't need any patches to the kernel:

 https://github.com/vwax/linux/commits/roadtest/devel

You can reproduce the problem with:

 git remote add vwax https://github.com/vwax/linux.git
 git fetch vwax
 git archive vwax/roadtest/devel tools/testing/roadtest | tar xf -
 make -C tools/testing/roadtest/ -j24 OPTS="-k adc --rt-timeout 10 -v"

The test passes on v5.19 but on current mainline it hangs and times out.

> think we need to either change spi_mux to duplicate the incoming message
> (that's probably "cleaner") or teach the core that spi-mux exists and
> should always use the pump/queue.  The below might do the trick but in
> spite of my suggestion above I've not tested either yet:
> 
> diff --git a/drivers/spi/spi-mux.c b/drivers/spi/spi-mux.c
> index f5d32ec4634e..0709e987bd5a 100644
> --- a/drivers/spi/spi-mux.c
> +++ b/drivers/spi/spi-mux.c
> @@ -161,6 +161,7 @@ static int spi_mux_probe(struct spi_device *spi)
>  	ctlr->num_chipselect = mux_control_states(priv->mux);
>  	ctlr->bus_num = -1;
>  	ctlr->dev.of_node = spi->dev.of_node;
> +	ctlr->must_async = true;
>  
>  	ret = devm_spi_register_controller(&spi->dev, ctlr);
>  	if (ret)
> diff --git a/drivers/spi/spi.c b/drivers/spi/spi.c
> index 1cfed874f7ae..88d48a105d3c 100644
> --- a/drivers/spi/spi.c
> +++ b/drivers/spi/spi.c
> @@ -4033,7 +4033,7 @@ static int __spi_sync(struct spi_device *spi, struct spi_message *message)
>  	 * guard against reentrancy from a different context. The io_mutex
>  	 * will catch those cases.
>  	 */
> -	if (READ_ONCE(ctlr->queue_empty)) {
> +	if (READ_ONCE(ctlr->queue_empty) && !ctlr->must_async) {
>  		message->actual_length = 0;
>  		message->status = -EINPROGRESS;
>  
> diff --git a/include/linux/spi/spi.h b/include/linux/spi/spi.h
> index e6c73d5ff1a8..f089ee1ead58 100644
> --- a/include/linux/spi/spi.h
> +++ b/include/linux/spi/spi.h
> @@ -469,6 +469,7 @@ extern struct spi_device *spi_new_ancillary_device(struct spi_device *spi, u8 ch
>   *	SPI_TRANS_FAIL_NO_START.
>   * @queue_empty: signal green light for opportunistically skipping the queue
>   *	for spi_sync transfers.
> + * @must_async: disable all fast paths in the core
>   *
>   * Each SPI controller can communicate with one or more @spi_device
>   * children.  These make a small bus, sharing MOSI, MISO and SCK signals
> @@ -690,6 +691,7 @@ struct spi_controller {
>  
>  	/* Flag for enabling opportunistic skipping of the queue in spi_sync */
>  	bool			queue_empty;
> +	bool			must_async;
>  };
>  
>  static inline void *spi_controller_get_devdata(struct spi_controller *ctlr)
> 
> Assuming that works out there'll be an extra test in the fast path but
> no sync operations, and a performance hit for spi-mux users.  Hopefully
> that works as well as it did before.

The above patch will fix the problem introduced by commit ae7d2346dc89
("spi: Don't use the message queue if possible in spi_sync"), but it's
not enough.

There is another bug introduced by the commit 69fa95905d40 ("spi: Ensure
the io_mutex is held until spi_finalize_current_message()") which the
above proposed patch does not fix.  After one SPI transfer fails the
next transfer via the spi-mux never gets to the parent bus.  The above
test_adc084s021 uses roadtest's flaky_bus feature to inject bus
failures.

Note that if you're backing the tree to these commits for testing,
there's another problem you'll run into.  Both these commits cause
crashes, and to fix those you have to cherry-pick commit 72c5c59b659d
("spi: Set ctlr->cur_msg also in the sync transfer case").

And the SPI tree where these patches are on was based on v5.19-rc1,
which has an unrelated problem with the IIO sysfs trigger which can also
cause hangs and which was fixed by commit 78601726d4a5 ("iio: trigger:
sysfs: fix use-after-free on remove"), although those hangs are rare, as
opposed to the SPI problems which happen every time.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PROBLEM] spi driver internal error during boot on sparx5
  2022-09-01  6:57     ` Vincent Whitchurch
@ 2022-09-01 11:02       ` David Jander
  2022-09-01 11:42         ` Vincent Whitchurch
                           ` (2 more replies)
  2022-09-01 11:11       ` Mark Brown
  1 sibling, 3 replies; 11+ messages in thread
From: David Jander @ 2022-09-01 11:02 UTC (permalink / raw)
  To: Vincent Whitchurch
  Cc: Mark Brown, Casper Andersson, linux-spi, Marc Kleine-Budde,
	Andrew Lunn, Lars Povlsen, Steen Hegelund,
	Microchip Linux Driver Support

On Thu, 1 Sep 2022 08:57:37 +0200
Vincent Whitchurch <vincent.whitchurch@axis.com> wrote:

> On Wed, Aug 31, 2022 at 05:07:42PM +0100, Mark Brown wrote:
> > On Mon, Aug 29, 2022 at 10:56:13AM +0200, David Jander wrote:  
> > > Not sure if this is a correct fix, but I'd like to know if your situation
> > > changes this way, if you could try it.
> > > I don't have access to any hardware with a mux unfortunately, so I can't test
> > > it myself.  
> > 
> > I guess claiming to have a noop mux might work for testing, though I'd
> > be dubious that it was actually doing the mux operations properly?  I  
> 
> I'm seeing these problems with the roadtest test framework which uses
> UML and doesn't need any special hardware.  Roadtest puts its emulated
> devices under a spi-mux with gpio-mockup and there are no tests for the
> actual muxing, but other driver tests break since transfers using
> spi-mux don't work properly.
> 
> I pushed the latest version with SPI support to the tree below.  The
> test_adc084s021 tests a SPI device.  Roadtest is placed inside a linux
> tree, but it doesn't need any patches to the kernel:
> 
>  https://github.com/vwax/linux/commits/roadtest/devel
> 
> You can reproduce the problem with:
> 
>  git remote add vwax https://github.com/vwax/linux.git
>  git fetch vwax
>  git archive vwax/roadtest/devel tools/testing/roadtest | tar xf -
>  make -C tools/testing/roadtest/ -j24 OPTS="-k adc --rt-timeout 10 -v"
> 
> The test passes on v5.19 but on current mainline it hangs and times out.

This is very nice. Thanks.
I followed your instructions, and if I apply the following, all tests are
passed.

diff --git a/drivers/spi/spi.c b/drivers/spi/spi.c
index 83da8862b8f2..32c01e684af3 100644
--- a/drivers/spi/spi.c
+++ b/drivers/spi/spi.c
@@ -1727,8 +1727,7 @@ static void __spi_pump_messages(struct spi_controller *ctlr, bool in_kthread)
        spin_unlock_irqrestore(&ctlr->queue_lock, flags);
 
        ret = __spi_pump_transfer_message(ctlr, msg, was_busy);
-       if (!ret)
-               kthread_queue_work(ctlr->kworker, &ctlr->pump_messages);
+       kthread_queue_work(ctlr->kworker, &ctlr->pump_messages);
 
        ctlr->cur_msg = NULL;
        ctlr->fallback = false;

The problem is that if __spi_pump_transfer_message() fails, the ctlr->busy
flag is left true, so __spi_async() is not going to queue new work. The busy
transition is handled right above that piece of code, in
__spi_pump_transfer_message(), and the mechanism is to queue more work to do
it. Apparently work was only queued when the transfer was successful, and I am
not sure why it was like that. Queuing work unconditionally solves the issue
and should not be a problem.

Curious thing is, that this change alone is sufficient to make all the
roadtest tests pass. But I still think that does not fix the bug reported by
Casper. For that, Mark's patch is also necessary.

@Casper: can you test Mark's patch with above addition?

> > think we need to either change spi_mux to duplicate the incoming message
> > (that's probably "cleaner") or teach the core that spi-mux exists and
> > should always use the pump/queue.  The below might do the trick but in
> > spite of my suggestion above I've not tested either yet:
> > 
> > diff --git a/drivers/spi/spi-mux.c b/drivers/spi/spi-mux.c
> > index f5d32ec4634e..0709e987bd5a 100644
> > --- a/drivers/spi/spi-mux.c
> > +++ b/drivers/spi/spi-mux.c
> > @@ -161,6 +161,7 @@ static int spi_mux_probe(struct spi_device *spi)
> >  	ctlr->num_chipselect = mux_control_states(priv->mux);
> >  	ctlr->bus_num = -1;
> >  	ctlr->dev.of_node = spi->dev.of_node;
> > +	ctlr->must_async = true;
> >  
> >  	ret = devm_spi_register_controller(&spi->dev, ctlr);
> >  	if (ret)
> > diff --git a/drivers/spi/spi.c b/drivers/spi/spi.c
> > index 1cfed874f7ae..88d48a105d3c 100644
> > --- a/drivers/spi/spi.c
> > +++ b/drivers/spi/spi.c
> > @@ -4033,7 +4033,7 @@ static int __spi_sync(struct spi_device *spi, struct spi_message *message)
> >  	 * guard against reentrancy from a different context. The io_mutex
> >  	 * will catch those cases.
> >  	 */
> > -	if (READ_ONCE(ctlr->queue_empty)) {
> > +	if (READ_ONCE(ctlr->queue_empty) && !ctlr->must_async) {
> >  		message->actual_length = 0;
> >  		message->status = -EINPROGRESS;
> >  
> > diff --git a/include/linux/spi/spi.h b/include/linux/spi/spi.h
> > index e6c73d5ff1a8..f089ee1ead58 100644
> > --- a/include/linux/spi/spi.h
> > +++ b/include/linux/spi/spi.h
> > @@ -469,6 +469,7 @@ extern struct spi_device *spi_new_ancillary_device(struct spi_device *spi, u8 ch
> >   *	SPI_TRANS_FAIL_NO_START.
> >   * @queue_empty: signal green light for opportunistically skipping the queue
> >   *	for spi_sync transfers.
> > + * @must_async: disable all fast paths in the core
> >   *
> >   * Each SPI controller can communicate with one or more @spi_device
> >   * children.  These make a small bus, sharing MOSI, MISO and SCK signals
> > @@ -690,6 +691,7 @@ struct spi_controller {
> >  
> >  	/* Flag for enabling opportunistic skipping of the queue in spi_sync */
> >  	bool			queue_empty;
> > +	bool			must_async;
> >  };
> >  
> >  static inline void *spi_controller_get_devdata(struct spi_controller *ctlr)
> > 
> > Assuming that works out there'll be an extra test in the fast path but
> > no sync operations, and a performance hit for spi-mux users.  Hopefully
> > that works as well as it did before.  
>[...]

Best regards,

-- 
David Jander
Protonic Holland.

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PROBLEM] spi driver internal error during boot on sparx5
  2022-09-01  6:57     ` Vincent Whitchurch
  2022-09-01 11:02       ` David Jander
@ 2022-09-01 11:11       ` Mark Brown
  1 sibling, 0 replies; 11+ messages in thread
From: Mark Brown @ 2022-09-01 11:11 UTC (permalink / raw)
  To: Vincent Whitchurch
  Cc: David Jander, Casper Andersson, linux-spi, Marc Kleine-Budde,
	Andrew Lunn, Lars Povlsen, Steen Hegelund,
	Microchip Linux Driver Support

[-- Attachment #1: Type: text/plain, Size: 565 bytes --]

On Thu, Sep 01, 2022 at 08:57:37AM +0200, Vincent Whitchurch wrote:

> I'm seeing these problems with the roadtest test framework which uses
> UML and doesn't need any special hardware.  Roadtest puts its emulated
> devices under a spi-mux with gpio-mockup and there are no tests for the
> actual muxing, but other driver tests break since transfers using
> spi-mux don't work properly.

Oh, excellent - I was just trying to remember the name of roadtest a
couple of days ago so thanks for reminding me of that as well as for the
testing!  I'll try this out later.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PROBLEM] spi driver internal error during boot on sparx5
  2022-09-01 11:02       ` David Jander
@ 2022-09-01 11:42         ` Vincent Whitchurch
  2022-09-01 12:08           ` David Jander
  2022-09-01 11:51         ` Mark Brown
  2022-09-01 15:16         ` Casper Andersson
  2 siblings, 1 reply; 11+ messages in thread
From: Vincent Whitchurch @ 2022-09-01 11:42 UTC (permalink / raw)
  To: David Jander
  Cc: Mark Brown, Casper Andersson, linux-spi, Marc Kleine-Budde,
	Andrew Lunn, Lars Povlsen, Steen Hegelund,
	Microchip Linux Driver Support

On Thu, Sep 01, 2022 at 01:02:22PM +0200, David Jander wrote:
> On Thu, 1 Sep 2022 08:57:37 +0200
> Vincent Whitchurch <vincent.whitchurch@axis.com> wrote:
> > You can reproduce the problem with:
> > 
> >  git remote add vwax https://github.com/vwax/linux.git
> >  git fetch vwax
> >  git archive vwax/roadtest/devel tools/testing/roadtest | tar xf -
> >  make -C tools/testing/roadtest/ -j24 OPTS="-k adc --rt-timeout 10 -v"
> > 
> > The test passes on v5.19 but on current mainline it hangs and times out.
> 
> This is very nice. Thanks.

Great to hear that it was useful!

> I followed your instructions, and if I apply the following, all tests are
> passed.
> 
> diff --git a/drivers/spi/spi.c b/drivers/spi/spi.c
> index 83da8862b8f2..32c01e684af3 100644
> --- a/drivers/spi/spi.c
> +++ b/drivers/spi/spi.c
> @@ -1727,8 +1727,7 @@ static void __spi_pump_messages(struct spi_controller *ctlr, bool in_kthread)
>         spin_unlock_irqrestore(&ctlr->queue_lock, flags);
>  
>         ret = __spi_pump_transfer_message(ctlr, msg, was_busy);
> -       if (!ret)
> -               kthread_queue_work(ctlr->kworker, &ctlr->pump_messages);
> +       kthread_queue_work(ctlr->kworker, &ctlr->pump_messages);
>  
>         ctlr->cur_msg = NULL;
>         ctlr->fallback = false;
> 
> The problem is that if __spi_pump_transfer_message() fails, the ctlr->busy
> flag is left true, so __spi_async() is not going to queue new work. The busy
> transition is handled right above that piece of code, in
> __spi_pump_transfer_message(), and the mechanism is to queue more work to do
> it. Apparently work was only queued when the transfer was successful, and I am
> not sure why it was like that. Queuing work unconditionally solves the issue
> and should not be a problem.
> 
> Curious thing is, that this change alone is sufficient to make all the
> roadtest tests pass. But I still think that does not fix the bug reported by
> Casper. For that, Mark's patch is also necessary.

Yes, I noticed that too.  If you comment out the two usages of
"@flaky_bus" in tools/testing/roadtest/roadtest/tests/iio/adc/test_adc084s021.py then
the bus error injection will be disabled and only the success paths will
be tested.  If you do that and test ae7d2346dc89ae89a6e0 ("spi: Don't
use the message queue if possible in spi_sync") with the crash fix and
iio sysfs fixes cherry-picked in, you'll see the following failure
without Mark's patch:

 FAILED roadtest/tests/iio/adc/test_adc084s021.py::test_illuminance - BlockingIOError: [Errno 115] Operation now in progress

But if you move forward to 69fa95905d40846756d22 ("spi: Ensure the
io_mutex is held until spi_finalize_current_message()"), then the tests
start passing again, if they're run without the error injection.

So 69fa95905d40846756d22 seems to be masking the original problem from
ae7d2346dc89ae89a6e0 (while at the same time introducing the other
problem in the error handling).

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PROBLEM] spi driver internal error during boot on sparx5
  2022-09-01 11:02       ` David Jander
  2022-09-01 11:42         ` Vincent Whitchurch
@ 2022-09-01 11:51         ` Mark Brown
  2022-09-01 15:16         ` Casper Andersson
  2 siblings, 0 replies; 11+ messages in thread
From: Mark Brown @ 2022-09-01 11:51 UTC (permalink / raw)
  To: David Jander
  Cc: Vincent Whitchurch, Casper Andersson, linux-spi,
	Marc Kleine-Budde, Andrew Lunn, Lars Povlsen, Steen Hegelund,
	Microchip Linux Driver Support

[-- Attachment #1: Type: text/plain, Size: 1155 bytes --]

On Thu, Sep 01, 2022 at 01:02:22PM +0200, David Jander wrote:

> @@ -1727,8 +1727,7 @@ static void __spi_pump_messages(struct spi_controller *ctlr, bool in_kthread)
>         spin_unlock_irqrestore(&ctlr->queue_lock, flags);
>  
>         ret = __spi_pump_transfer_message(ctlr, msg, was_busy);
> -       if (!ret)
> -               kthread_queue_work(ctlr->kworker, &ctlr->pump_messages);
> +       kthread_queue_work(ctlr->kworker, &ctlr->pump_messages);
>  
>         ctlr->cur_msg = NULL;
>         ctlr->fallback = false;
> 
> The problem is that if __spi_pump_transfer_message() fails, the ctlr->busy
> flag is left true, so __spi_async() is not going to queue new work. The busy
> transition is handled right above that piece of code, in
> __spi_pump_transfer_message(), and the mechanism is to queue more work to do
> it. Apparently work was only queued when the transfer was successful, and I am
> not sure why it was like that. Queuing work unconditionally solves the issue
> and should not be a problem.

This looks like a good spot regardless of if it fixes the issues with
spi-mux, can you submit it as a patch please?

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PROBLEM] spi driver internal error during boot on sparx5
  2022-09-01 11:42         ` Vincent Whitchurch
@ 2022-09-01 12:08           ` David Jander
  0 siblings, 0 replies; 11+ messages in thread
From: David Jander @ 2022-09-01 12:08 UTC (permalink / raw)
  To: Vincent Whitchurch
  Cc: Mark Brown, Casper Andersson, linux-spi, Marc Kleine-Budde,
	Andrew Lunn, Lars Povlsen, Steen Hegelund,
	Microchip Linux Driver Support

On Thu, 1 Sep 2022 13:42:19 +0200
Vincent Whitchurch <vincent.whitchurch@axis.com> wrote:

> On Thu, Sep 01, 2022 at 01:02:22PM +0200, David Jander wrote:
> > On Thu, 1 Sep 2022 08:57:37 +0200
> > Vincent Whitchurch <vincent.whitchurch@axis.com> wrote:  
> > > You can reproduce the problem with:
> > > 
> > >  git remote add vwax https://github.com/vwax/linux.git
> > >  git fetch vwax
> > >  git archive vwax/roadtest/devel tools/testing/roadtest | tar xf -
> > >  make -C tools/testing/roadtest/ -j24 OPTS="-k adc --rt-timeout 10 -v"
> > > 
> > > The test passes on v5.19 but on current mainline it hangs and times out.  
> > 
> > This is very nice. Thanks.  
> 
> Great to hear that it was useful!
> 
> > I followed your instructions, and if I apply the following, all tests are
> > passed.
> > 
> > diff --git a/drivers/spi/spi.c b/drivers/spi/spi.c
> > index 83da8862b8f2..32c01e684af3 100644
> > --- a/drivers/spi/spi.c
> > +++ b/drivers/spi/spi.c
> > @@ -1727,8 +1727,7 @@ static void __spi_pump_messages(struct spi_controller *ctlr, bool in_kthread)
> >         spin_unlock_irqrestore(&ctlr->queue_lock, flags);
> >  
> >         ret = __spi_pump_transfer_message(ctlr, msg, was_busy);
> > -       if (!ret)
> > -               kthread_queue_work(ctlr->kworker, &ctlr->pump_messages);
> > +       kthread_queue_work(ctlr->kworker, &ctlr->pump_messages);
> >  
> >         ctlr->cur_msg = NULL;
> >         ctlr->fallback = false;
> > 
> > The problem is that if __spi_pump_transfer_message() fails, the ctlr->busy
> > flag is left true, so __spi_async() is not going to queue new work. The busy
> > transition is handled right above that piece of code, in
> > __spi_pump_transfer_message(), and the mechanism is to queue more work to do
> > it. Apparently work was only queued when the transfer was successful, and I am
> > not sure why it was like that. Queuing work unconditionally solves the issue
> > and should not be a problem.
> > 
> > Curious thing is, that this change alone is sufficient to make all the
> > roadtest tests pass. But I still think that does not fix the bug reported by
> > Casper. For that, Mark's patch is also necessary.  
> 
> Yes, I noticed that too.  If you comment out the two usages of
> "@flaky_bus" in tools/testing/roadtest/roadtest/tests/iio/adc/test_adc084s021.py then
> the bus error injection will be disabled and only the success paths will
> be tested.  If you do that and test ae7d2346dc89ae89a6e0 ("spi: Don't
> use the message queue if possible in spi_sync") with the crash fix and
> iio sysfs fixes cherry-picked in, you'll see the following failure
> without Mark's patch:
> 
>  FAILED roadtest/tests/iio/adc/test_adc084s021.py::test_illuminance - BlockingIOError: [Errno 115] Operation now in progress
> 
> But if you move forward to 69fa95905d40846756d22 ("spi: Ensure the
> io_mutex is held until spi_finalize_current_message()"), then the tests
> start passing again, if they're run without the error injection.
> 
> So 69fa95905d40846756d22 seems to be masking the original problem from
> ae7d2346dc89ae89a6e0 (while at the same time introducing the other
> problem in the error handling).

The following patch set should be viewed as one single change:

dc3029056b02 spi: opportunistically skip ctlr->cur_msg_completion
69fa95905d40 spi: Ensure the io_mutex is held until spi_finalize_current_message()
72c5c59b659d spi: Set ctlr->cur_msg also in the sync transfer case
1a9cafcb57b7 spi: Remove unneeded READ_ONCE for ctlr->busy flag
66a221593cb2 spi: Remove the now unused ctlr->idling flag
049d6ccc4da8 spi: Remove check for idling in __spi_pump_messages()
d5256cce1f50 spi: Remove check for controller idling in spi sync path
8711a2ab51dd spi: __spi_pump_messages: Consolidate spin_unlocks to goto target
c1038165fbbf spi: Lock controller idling transition inside the io_mutex
ae7d2346dc89 spi: Don't use the message queue if possible in spi_sync
1714582a3a08 spi: Move ctlr->cur_msg_prepared to struct spi_message

I left the individual steps, in order to make it easier to view how the code
mutated, but the individual steps have had very little testing in between, and
some have known problems in corner-cases.
Particularly patch 69fa95905d40 does fix support for controllers that do not
call spi_finalize_current_message() in the same context as transfer_one(), for
example from IRQ, or in the case of the mux from a completion. If you replace
spi_async() with spi_sync() in spi-mux.c you will accomplish the same.

Best regards,

-- 
David Jander

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PROBLEM] spi driver internal error during boot on sparx5
  2022-09-01 11:02       ` David Jander
  2022-09-01 11:42         ` Vincent Whitchurch
  2022-09-01 11:51         ` Mark Brown
@ 2022-09-01 15:16         ` Casper Andersson
  2022-09-02  6:38           ` David Jander
  2 siblings, 1 reply; 11+ messages in thread
From: Casper Andersson @ 2022-09-01 15:16 UTC (permalink / raw)
  To: David Jander
  Cc: Vincent Whitchurch, Mark Brown, linux-spi, Marc Kleine-Budde,
	Andrew Lunn, Lars Povlsen, Steen Hegelund,
	Microchip Linux Driver Support

Hi,

Thanks for the help and fixes. I did some testing with your patches. I
saw you already submitted the patches, but for reference here's my
results.

David's initial fix (changing async->sync):
no issues

Mark's fix:
no issues

David's second fix:
Issue still occurs. But does not cause any problems and together with
Mark's fix it works fine.

Best Regards,
Casper

On 2022-09-01 13:02, David Jander wrote:
> On Thu, 1 Sep 2022 08:57:37 +0200
> Vincent Whitchurch <vincent.whitchurch@axis.com> wrote:
> 
> > On Wed, Aug 31, 2022 at 05:07:42PM +0100, Mark Brown wrote:
> > > On Mon, Aug 29, 2022 at 10:56:13AM +0200, David Jander wrote:  
> > > > Not sure if this is a correct fix, but I'd like to know if your situation
> > > > changes this way, if you could try it.
> > > > I don't have access to any hardware with a mux unfortunately, so I can't test
> > > > it myself.  
> > > 
> > > I guess claiming to have a noop mux might work for testing, though I'd
> > > be dubious that it was actually doing the mux operations properly?  I  
> > 
> > I'm seeing these problems with the roadtest test framework which uses
> > UML and doesn't need any special hardware.  Roadtest puts its emulated
> > devices under a spi-mux with gpio-mockup and there are no tests for the
> > actual muxing, but other driver tests break since transfers using
> > spi-mux don't work properly.
> > 
> > I pushed the latest version with SPI support to the tree below.  The
> > test_adc084s021 tests a SPI device.  Roadtest is placed inside a linux
> > tree, but it doesn't need any patches to the kernel:
> > 
> >  https://github.com/vwax/linux/commits/roadtest/devel
> > 
> > You can reproduce the problem with:
> > 
> >  git remote add vwax https://github.com/vwax/linux.git
> >  git fetch vwax
> >  git archive vwax/roadtest/devel tools/testing/roadtest | tar xf -
> >  make -C tools/testing/roadtest/ -j24 OPTS="-k adc --rt-timeout 10 -v"
> > 
> > The test passes on v5.19 but on current mainline it hangs and times out.
> 
> This is very nice. Thanks.
> I followed your instructions, and if I apply the following, all tests are
> passed.
> 
> diff --git a/drivers/spi/spi.c b/drivers/spi/spi.c
> index 83da8862b8f2..32c01e684af3 100644
> --- a/drivers/spi/spi.c
> +++ b/drivers/spi/spi.c
> @@ -1727,8 +1727,7 @@ static void __spi_pump_messages(struct spi_controller *ctlr, bool in_kthread)
>         spin_unlock_irqrestore(&ctlr->queue_lock, flags);
>  
>         ret = __spi_pump_transfer_message(ctlr, msg, was_busy);
> -       if (!ret)
> -               kthread_queue_work(ctlr->kworker, &ctlr->pump_messages);
> +       kthread_queue_work(ctlr->kworker, &ctlr->pump_messages);
>  
>         ctlr->cur_msg = NULL;
>         ctlr->fallback = false;
> 
> The problem is that if __spi_pump_transfer_message() fails, the ctlr->busy
> flag is left true, so __spi_async() is not going to queue new work. The busy
> transition is handled right above that piece of code, in
> __spi_pump_transfer_message(), and the mechanism is to queue more work to do
> it. Apparently work was only queued when the transfer was successful, and I am
> not sure why it was like that. Queuing work unconditionally solves the issue
> and should not be a problem.
> 
> Curious thing is, that this change alone is sufficient to make all the
> roadtest tests pass. But I still think that does not fix the bug reported by
> Casper. For that, Mark's patch is also necessary.
> 
> @Casper: can you test Mark's patch with above addition?
> 
> > > think we need to either change spi_mux to duplicate the incoming message
> > > (that's probably "cleaner") or teach the core that spi-mux exists and
> > > should always use the pump/queue.  The below might do the trick but in
> > > spite of my suggestion above I've not tested either yet:
> > > 
> > > diff --git a/drivers/spi/spi-mux.c b/drivers/spi/spi-mux.c
> > > index f5d32ec4634e..0709e987bd5a 100644
> > > --- a/drivers/spi/spi-mux.c
> > > +++ b/drivers/spi/spi-mux.c
> > > @@ -161,6 +161,7 @@ static int spi_mux_probe(struct spi_device *spi)
> > >  	ctlr->num_chipselect = mux_control_states(priv->mux);
> > >  	ctlr->bus_num = -1;
> > >  	ctlr->dev.of_node = spi->dev.of_node;
> > > +	ctlr->must_async = true;
> > >  
> > >  	ret = devm_spi_register_controller(&spi->dev, ctlr);
> > >  	if (ret)
> > > diff --git a/drivers/spi/spi.c b/drivers/spi/spi.c
> > > index 1cfed874f7ae..88d48a105d3c 100644
> > > --- a/drivers/spi/spi.c
> > > +++ b/drivers/spi/spi.c
> > > @@ -4033,7 +4033,7 @@ static int __spi_sync(struct spi_device *spi, struct spi_message *message)
> > >  	 * guard against reentrancy from a different context. The io_mutex
> > >  	 * will catch those cases.
> > >  	 */
> > > -	if (READ_ONCE(ctlr->queue_empty)) {
> > > +	if (READ_ONCE(ctlr->queue_empty) && !ctlr->must_async) {
> > >  		message->actual_length = 0;
> > >  		message->status = -EINPROGRESS;
> > >  
> > > diff --git a/include/linux/spi/spi.h b/include/linux/spi/spi.h
> > > index e6c73d5ff1a8..f089ee1ead58 100644
> > > --- a/include/linux/spi/spi.h
> > > +++ b/include/linux/spi/spi.h
> > > @@ -469,6 +469,7 @@ extern struct spi_device *spi_new_ancillary_device(struct spi_device *spi, u8 ch
> > >   *	SPI_TRANS_FAIL_NO_START.
> > >   * @queue_empty: signal green light for opportunistically skipping the queue
> > >   *	for spi_sync transfers.
> > > + * @must_async: disable all fast paths in the core
> > >   *
> > >   * Each SPI controller can communicate with one or more @spi_device
> > >   * children.  These make a small bus, sharing MOSI, MISO and SCK signals
> > > @@ -690,6 +691,7 @@ struct spi_controller {
> > >  
> > >  	/* Flag for enabling opportunistic skipping of the queue in spi_sync */
> > >  	bool			queue_empty;
> > > +	bool			must_async;
> > >  };
> > >  
> > >  static inline void *spi_controller_get_devdata(struct spi_controller *ctlr)
> > > 
> > > Assuming that works out there'll be an extra test in the fast path but
> > > no sync operations, and a performance hit for spi-mux users.  Hopefully
> > > that works as well as it did before.  
> >[...]
> 
> Best regards,
> 
> -- 
> David Jander
> Protonic Holland.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PROBLEM] spi driver internal error during boot on sparx5
  2022-09-01 15:16         ` Casper Andersson
@ 2022-09-02  6:38           ` David Jander
  0 siblings, 0 replies; 11+ messages in thread
From: David Jander @ 2022-09-02  6:38 UTC (permalink / raw)
  To: Casper Andersson
  Cc: Vincent Whitchurch, Mark Brown, linux-spi, Marc Kleine-Budde,
	Andrew Lunn, Lars Povlsen, Steen Hegelund,
	Microchip Linux Driver Support

On Thu, 1 Sep 2022 17:16:54 +0200
Casper Andersson <casper.casan@gmail.com> wrote:

> Thanks for the help and fixes. I did some testing with your patches. I
> saw you already submitted the patches, but for reference here's my
> results.
> 
> David's initial fix (changing async->sync):
> no issues
>
> Mark's fix:
> no issues
> 
> David's second fix:
> Issue still occurs. But does not cause any problems and together with
> Mark's fix it works fine.

Thanks a lot for reporting. Happy to know it worked.

Best regards,

-- 
David Jander

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2022-09-02  6:38 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-08-26  9:41 [PROBLEM] spi driver internal error during boot on sparx5 Casper Andersson
2022-08-29  8:56 ` David Jander
2022-08-31 16:07   ` Mark Brown
2022-09-01  6:57     ` Vincent Whitchurch
2022-09-01 11:02       ` David Jander
2022-09-01 11:42         ` Vincent Whitchurch
2022-09-01 12:08           ` David Jander
2022-09-01 11:51         ` Mark Brown
2022-09-01 15:16         ` Casper Andersson
2022-09-02  6:38           ` David Jander
2022-09-01 11:11       ` Mark Brown

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.