All of lore.kernel.org
 help / color / mirror / Atom feed
* mtd: pxa3xx_nand: issue with command time out
@ 2016-01-13  4:13 ` Michael Wang
  0 siblings, 0 replies; 30+ messages in thread
From: Michael Wang @ 2016-01-13  4:13 UTC (permalink / raw)
  To: robert.jarzmik; +Cc: linux-mtd, linux-arm-kernel, ezequiel, computersforpeace

Hi Robert,

I am currently working with Marvell 98DX4251 switchchip with control and
management subsystem SOC based on ARMADA XP, and after upgrading kernel from
3.16.7 to 4.3, the nand controller command have a high chance of timing out.

I have tried adding my own debugs, which result in the problem going away,
and therefore lead me think it is a timing sensitive issue. After going 
through
git commit history, I have tried reverting this patch:

"mtd: pxa3xx-nand: handle PIO in threaded interrupt"
http://lists.infradead.org/pipermail/linux-mtd/2015-February/057913.html

and this allows the nand controller to work properly again.

In addition, since this problem seems to be timing sensitive, I have 
also tried
increasing the CHIP_DELAY_TIMEOUT value, which also got the NAND controller
working properly without commands timing out.

diff --git a/drivers/mtd/nand/pxa3xx_nand.c b/drivers/mtd/nand/pxa3xx_nand.c
index 740983a..63a53e2 100644
--- a/drivers/mtd/nand/pxa3xx_nand.c
+++ b/drivers/mtd/nand/pxa3xx_nand.c
@@ -39,7 +39,7 @@

  #include <linux/platform_data/mtd-nand-pxa3xx.h>

-#define        CHIP_DELAY_TIMEOUT      msecs_to_jiffies(200)
+#define        CHIP_DELAY_TIMEOUT      msecs_to_jiffies(300)
  #define NAND_STOP_DELAY                msecs_to_jiffies(40)
  #define PAGE_CHUNK_SIZE                (2048)


So I guess my question is whether the CHIP_DELAY_TIMEOUT should be increased
for 98DX4251, or is this likely to be caused by some other issue?


Another thing I have tried is cherry-picking the following new patches 
in 4.4,
but the command time out issue still exist:
234b0ab mtd: pxa3xx_nand: clean up the pxa3xx timings
dc5f40d mtd: pxa3xx_nand: rework flash detection and timing setup
7bfd112 mtd: pxa3xx_nand: add helpers to setup the timings
05b8cd0 mtd: pxa3xx_nand: fix some compile issues on non-ARM arches
6369e7f mtd: pxa3xx_nand: switch to device PM
16cf509 mtd: pxa3xx_nand: don't duplicate MTD suspend/resume
932fbc8 mtd: nand: pxa3xx_nand: show parent device in sysfs
3819b67 mtd: nand: pxa3xx-nand: prevent DFI bus lockup on removal
c36eb77 mtd: nand: pxa3xx-nand: switch to dmaengine
c141e2c mtd: pxa3xx_nand: Remove unused platform-data flash specification


I found this patch, which sounds very relevant to my issue, and is 
within the
update, but the issue still exists on 98DX4251:
"mtd: nand: pxa3xx-nand: fix random command timeouts"
http://lists.infradead.org/pipermail/linux-mtd/2015-August/061107.html


I have copied portion of the kernel boot log below:

pxa3xx-nand f10d0000.nand: This platform can't do DMA on this device
nand: device found, Manufacturer ID: 0x2c, Chip ID: 0x38
nand: Micron MT29F8G08ABABAWP
nand: 1024 MiB, SLC, erase size: 512 KiB, page size: 4096, OOB size: 224
pxa3xx-nand f10d0000.nand: ECC strength 16, ECC step size 2048
Bad block table found at page 262016, version 0x01
Bad block table found at page 261888, version 0x01
2 ofpart partitions found on MTD device pxa3xx_nand-0
Creating 2 MTD partitions on "pxa3xx_nand-0":
0x000000000000-0x00003f800000 : "user"
0x00003f800000-0x000040000000 : "errlog"

ubi0: attaching mtd0
random: nonblocking pool is initialized
ubi0: scanning is finished
ubi0 warning: print_rsvd_warning: cannot reserve enough PEBs for bad PEB 
handling, reserved 24, need 40
ubi0: attached mtd0 (name "user", size 1016 MiB)
ubi0: PEB size: 524288 bytes (512 KiB), LEB size: 516096 bytes
ubi0: min./max. I/O unit sizes: 4096/4096, sub-page size 4096
ubi0: VID header offset: 4096 (aligned 4096), data offset: 8192
ubi0: good PEBs: 2032, bad PEBs: 0, corrupted PEBs: 0
ubi0: user volume: 1, internal volumes: 1, max. volumes count: 128
ubi0: max/mean erase counter: 140/75, WL threshold: 4096, image sequence 
number: 581719033
ubi0: available PEBs: 0, total reserved PEBs: 2032, PEBs reserved for 
bad PEB handling: 24
ubi0: background thread "ubi_bgt0d" started, PID 715
UBIFS (ubi0:0): background thread "ubifs_bgt0_0" started, PID 720
UBIFS (ubi0:0): UBIFS: mounted UBI device 0, volume 0, name "user"
UBIFS (ubi0:0): LEB size: 516096 bytes (504 KiB), min./max. I/O unit 
sizes: 4096 bytes/4096 bytes
UBIFS (ubi0:0): FS size: 1029095424 bytes (981 MiB, 1994 LEBs), journal 
size 10452992 bytes (9 MiB, 21 LEBs)
UBIFS (ubi0:0): reserved for root: 0 bytes (0 KiB)
UBIFS (ubi0:0): media format: w4/r0 (latest is w4/r0), UUID 
A12323F1-CA09-47D2-BDD4-50807C99149A, small LPT model
pxa3xx-nand f10d0000.nand: Wait time out!!!
UBIFS error (ubi0:0 pid 724): ubifs_read_node: bad node type (255 but 
expected 1)
pxa3xx-nand f10d0000.nand: handle_data_pio: invalid state 0
------------[ cut here ]------------
kernel BUG at 
/home/maker/jenkins/workspace/1901_linux_upgrade_continuous_build_targets/TARGET/SBx81CFC960/label/i7/lin
ux/drivers/mtd/nand/pxa3xx_nand.c:553!
Internal error: Oops - BUG: 0 [#1] SMP ARM
Modules linked in:
CPU: 0 PID: 434 Comm: irq/28-f10d0000 Not tainted 4.3.0-at1 #1
Hardware name: Marvell Armada 370/XP (Device Tree)
task: bc378900 ti: bd350000 task.ti: bd350000
PC is at pxa3xx_nand_irq_thread+0x40/0x12c
LR is at pxa3xx_nand_irq_thread+0x40/0x12c
pc : [<802f2a28>]    lr : [<802f2a28>]    psr: 60000013
sp : bd351f20  ip : 00000000  fp : 00000000
r10: 00000001  r9 : bc0d1ad8  r8 : 00000000
r7 : 80065a8c  r6 : bd2497c0  r5 : 00000800  r4 : bc249810
r3 : bd350000  r2 : 806a6cf4  r1 : 60000013  r0 : 0000003b
Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
Control: 10c5387d  Table: 3d54006a  DAC: 00000051
Process irq/28-f10d0000 (pid: 434, stack limit = 0xbd350220)
Stack: (0xbd351f20 to 0xbd352000)
1f20: 802f29e8 bd2497c0 bc0d1a80 80065aa8 bd2497e0 bc0d1a80 bd2497c0 
80065878
1f40: 00000000 800659c4 bd249800 00000000 bd2497c0 8006574c 00000000 
00000000
1f60: 00000000 8003be24 e08f3003 00000000 e28330c0 bd2497c0 00000000 
00000000
1f80: bd351f80 bd351f80 00000000 00000000 bd351f90 bd351f90 bd351fac 
bd249800
1fa0: 8003bd4c 00000000 00000000 8000f6b8 00000000 00000000 00000000 
00000000
1fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 
00000000
1fe0: 00000000 00000000 00000000 00000000 00000013 00000000 000033fc 
00003414
[<802f2a28>] (pxa3xx_nand_irq_thread) from [<80065aa8>] 
(irq_thread_fn+0x1c/0x40)
[<80065aa8>] (irq_thread_fn) from [<80065878>] (irq_thread+0x12c/0x178)
[<80065878>] (irq_thread) from [<8003be24>] (kthread+0xd8/0xf4)
[<8003be24>] (kthread) from [<8000f6b8>] (ret_from_fork+0x14/0x3c)
Code: e59f20ec e59f10ec e2800010 ebfeefd5 (e7f001f2)
---[ end trace dce614962c53f62c ]---
Unable to handle kernel paging request at virtual address ffffffec
pgd = 80004000
[ffffffec] *pgd=3fffd861, *pte=00000000, *ppte=00000000
Internal error: Oops: 37 [#2] SMP ARM
Modules linked in:
CPU: 0 PID: 434 Comm: irq/28-f10d0000 Tainted: G      D 4.3.0-at1 #1
Hardware name: Marvell Armada 370/XP (Device Tree)
task: bc378900 ti: bd350000 task.ti: bd350000
PC is at kthread_data+0x4/0xc
LR is at irq_thread_dtor+0x28/0xc8
pc : [<8003c404>]    lr : [<800659ec>]    psr: 20000093
sp : bd351da8  ip : bd351f84  fp : 00000001
r10: bd351dd4  r9 : 60000093  r8 : bc378900
r7 : 806d232c  r6 : 00000000  r5 : bc378900  r4 : bc378c94
r3 : 00000000  r2 : bd351da8  r1 : 000039a2  r0 : bc378900
Flags: nzCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment none
Control: 10c5387d  Table: 3d54006a  DAC: 00000051
Process irq/28-f10d0000 (pid: 434, stack limit = 0xbd350220)
Stack: (0xbd351da8 to 0xbd352000)
1da0:                   800659c4 bc378c94 bc378900 8003a81c 00000000 
bc378900
1dc0: 00000000 80591d94 806a3cdc 80024e64 806d3a38 00000002 806a6cd4 
80063f9c
1de0: 00000000 806d14c4 0000000b 80591d94 806a3cdc bc378900 60000093 
00000000
1e00: 00000000 80012ca8 bd350220 0000000b bd351ed0 802f2a28 e7f001f2 
00000000
1e20: e7100000 80013660 bd350000 800090bc 00000006 6f667461 00000004 
00000000
1e40: 00030001 802f2a28 3a6d726f 64303166 30303030 6e616e2e 00000064 
0000002e
1e60: bffcb278 00000008 00000003 00000020 11e2054f 00000000 00000000 
00000002
1e80: bffceb80 cb550fa0 00000002 bc01f500 1dcd6500 bffceb80 000004a8 
bffceb80
1ea0: 806d2940 bffcecb0 00000000 bd351f04 bc16fa10 00000003 80065a8c 
802f2a2c
1ec0: 00000000 80013a98 00000000 80013660 0000003b 60000013 806a6cf4 
bd350000
1ee0: bc249810 00000800 bd2497c0 80065a8c 00000000 bc0d1ad8 00000001 
00000000
1f00: 00000000 bd351f20 802f2a28 802f2a28 60000013 ffffffff 00000051 
00000000
1f20: 802f29e8 bd2497c0 bc0d1a80 80065aa8 bd2497e0 bc0d1a80 bd2497c0 
80065878
1f40: 00000000 800659c4 bd249800 00000000 bd2497c0 8006574c 00000000 
00000000
1f60: 00000000 8003be24 e08f3003 00000000 e28330c0 bd2497c0 00000000 
00000000
1f80: bd351f80 bd351f80 00000001 00010001 bd351f90 bd351f90 bd351fac 
bd249800
1fa0: 8003bd4c 00000000 00000000 8000f6b8 00000000 00000000 00000000 
00000000
1fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 
00000000
1fe0: 00000000 00000000 00000000 00000000 00000013 00000000 000033fc 
00003414
[<8003c404>] (kthread_data) from [<800659ec>] (irq_thread_dtor+0x28/0xc8)
[<800659ec>] (irq_thread_dtor) from [<8003a81c>] (task_work_run+0x88/0xbc)
[<8003a81c>] (task_work_run) from [<80024e64>] (do_exit+0x2fc/0x940)
[<80024e64>] (do_exit) from [<80012ca8>] (die+0x214/0x2e8)
[<80012ca8>] (die) from [<800090bc>] (do_undefinstr+0xb8/0x220)
[<800090bc>] (do_undefinstr) from [<80013660>] (__und_svc_finish+0x0/0x20)
Exception stack(0xbd351ed0 to 0xbd351f18)
1ec0:                                     0000003b 60000013 806a6cf4 
bd350000
1ee0: bc249810 00000800 bd2497c0 80065a8c 00000000 bc0d1ad8 00000001 
00000000
1f00: 00000000 bd351f20 802f2a28 802f2a28 60000013 ffffffff
[<80013660>] (__und_svc_finish) from [<802f2a28>] 
(pxa3xx_nand_irq_thread+0x40/0x12c)
[<802f2a28>] (pxa3xx_nand_irq_thread) from [<80065aa8>] 
(irq_thread_fn+0x1c/0x40)
[<80065aa8>] (irq_thread_fn) from [<80065878>] (irq_thread+0x12c/0x178)
[<80065878>] (irq_thread) from [<8003be24>] (kthread+0xd8/0xf4)
[<8003be24>] (kthread) from [<8000f6b8>] (ret_from_fork+0x14/0x3c)
Code: 012fff1e e241101c eaffffbe e59032a0 (e5130014)
---[ end trace dce614962c53f62d ]---


Any suggestion, help or hint will be very much appreciated.

Thanks,
Michael

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* mtd: pxa3xx_nand: issue with command time out
@ 2016-01-13  4:13 ` Michael Wang
  0 siblings, 0 replies; 30+ messages in thread
From: Michael Wang @ 2016-01-13  4:13 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Robert,

I am currently working with Marvell 98DX4251 switchchip with control and
management subsystem SOC based on ARMADA XP, and after upgrading kernel from
3.16.7 to 4.3, the nand controller command have a high chance of timing out.

I have tried adding my own debugs, which result in the problem going away,
and therefore lead me think it is a timing sensitive issue. After going 
through
git commit history, I have tried reverting this patch:

"mtd: pxa3xx-nand: handle PIO in threaded interrupt"
http://lists.infradead.org/pipermail/linux-mtd/2015-February/057913.html

and this allows the nand controller to work properly again.

In addition, since this problem seems to be timing sensitive, I have 
also tried
increasing the CHIP_DELAY_TIMEOUT value, which also got the NAND controller
working properly without commands timing out.

diff --git a/drivers/mtd/nand/pxa3xx_nand.c b/drivers/mtd/nand/pxa3xx_nand.c
index 740983a..63a53e2 100644
--- a/drivers/mtd/nand/pxa3xx_nand.c
+++ b/drivers/mtd/nand/pxa3xx_nand.c
@@ -39,7 +39,7 @@

  #include <linux/platform_data/mtd-nand-pxa3xx.h>

-#define        CHIP_DELAY_TIMEOUT      msecs_to_jiffies(200)
+#define        CHIP_DELAY_TIMEOUT      msecs_to_jiffies(300)
  #define NAND_STOP_DELAY                msecs_to_jiffies(40)
  #define PAGE_CHUNK_SIZE                (2048)


So I guess my question is whether the CHIP_DELAY_TIMEOUT should be increased
for 98DX4251, or is this likely to be caused by some other issue?


Another thing I have tried is cherry-picking the following new patches 
in 4.4,
but the command time out issue still exist:
234b0ab mtd: pxa3xx_nand: clean up the pxa3xx timings
dc5f40d mtd: pxa3xx_nand: rework flash detection and timing setup
7bfd112 mtd: pxa3xx_nand: add helpers to setup the timings
05b8cd0 mtd: pxa3xx_nand: fix some compile issues on non-ARM arches
6369e7f mtd: pxa3xx_nand: switch to device PM
16cf509 mtd: pxa3xx_nand: don't duplicate MTD suspend/resume
932fbc8 mtd: nand: pxa3xx_nand: show parent device in sysfs
3819b67 mtd: nand: pxa3xx-nand: prevent DFI bus lockup on removal
c36eb77 mtd: nand: pxa3xx-nand: switch to dmaengine
c141e2c mtd: pxa3xx_nand: Remove unused platform-data flash specification


I found this patch, which sounds very relevant to my issue, and is 
within the
update, but the issue still exists on 98DX4251:
"mtd: nand: pxa3xx-nand: fix random command timeouts"
http://lists.infradead.org/pipermail/linux-mtd/2015-August/061107.html


I have copied portion of the kernel boot log below:

pxa3xx-nand f10d0000.nand: This platform can't do DMA on this device
nand: device found, Manufacturer ID: 0x2c, Chip ID: 0x38
nand: Micron MT29F8G08ABABAWP
nand: 1024 MiB, SLC, erase size: 512 KiB, page size: 4096, OOB size: 224
pxa3xx-nand f10d0000.nand: ECC strength 16, ECC step size 2048
Bad block table found at page 262016, version 0x01
Bad block table found at page 261888, version 0x01
2 ofpart partitions found on MTD device pxa3xx_nand-0
Creating 2 MTD partitions on "pxa3xx_nand-0":
0x000000000000-0x00003f800000 : "user"
0x00003f800000-0x000040000000 : "errlog"

ubi0: attaching mtd0
random: nonblocking pool is initialized
ubi0: scanning is finished
ubi0 warning: print_rsvd_warning: cannot reserve enough PEBs for bad PEB 
handling, reserved 24, need 40
ubi0: attached mtd0 (name "user", size 1016 MiB)
ubi0: PEB size: 524288 bytes (512 KiB), LEB size: 516096 bytes
ubi0: min./max. I/O unit sizes: 4096/4096, sub-page size 4096
ubi0: VID header offset: 4096 (aligned 4096), data offset: 8192
ubi0: good PEBs: 2032, bad PEBs: 0, corrupted PEBs: 0
ubi0: user volume: 1, internal volumes: 1, max. volumes count: 128
ubi0: max/mean erase counter: 140/75, WL threshold: 4096, image sequence 
number: 581719033
ubi0: available PEBs: 0, total reserved PEBs: 2032, PEBs reserved for 
bad PEB handling: 24
ubi0: background thread "ubi_bgt0d" started, PID 715
UBIFS (ubi0:0): background thread "ubifs_bgt0_0" started, PID 720
UBIFS (ubi0:0): UBIFS: mounted UBI device 0, volume 0, name "user"
UBIFS (ubi0:0): LEB size: 516096 bytes (504 KiB), min./max. I/O unit 
sizes: 4096 bytes/4096 bytes
UBIFS (ubi0:0): FS size: 1029095424 bytes (981 MiB, 1994 LEBs), journal 
size 10452992 bytes (9 MiB, 21 LEBs)
UBIFS (ubi0:0): reserved for root: 0 bytes (0 KiB)
UBIFS (ubi0:0): media format: w4/r0 (latest is w4/r0), UUID 
A12323F1-CA09-47D2-BDD4-50807C99149A, small LPT model
pxa3xx-nand f10d0000.nand: Wait time out!!!
UBIFS error (ubi0:0 pid 724): ubifs_read_node: bad node type (255 but 
expected 1)
pxa3xx-nand f10d0000.nand: handle_data_pio: invalid state 0
------------[ cut here ]------------
kernel BUG at 
/home/maker/jenkins/workspace/1901_linux_upgrade_continuous_build_targets/TARGET/SBx81CFC960/label/i7/lin
ux/drivers/mtd/nand/pxa3xx_nand.c:553!
Internal error: Oops - BUG: 0 [#1] SMP ARM
Modules linked in:
CPU: 0 PID: 434 Comm: irq/28-f10d0000 Not tainted 4.3.0-at1 #1
Hardware name: Marvell Armada 370/XP (Device Tree)
task: bc378900 ti: bd350000 task.ti: bd350000
PC is at pxa3xx_nand_irq_thread+0x40/0x12c
LR is at pxa3xx_nand_irq_thread+0x40/0x12c
pc : [<802f2a28>]    lr : [<802f2a28>]    psr: 60000013
sp : bd351f20  ip : 00000000  fp : 00000000
r10: 00000001  r9 : bc0d1ad8  r8 : 00000000
r7 : 80065a8c  r6 : bd2497c0  r5 : 00000800  r4 : bc249810
r3 : bd350000  r2 : 806a6cf4  r1 : 60000013  r0 : 0000003b
Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
Control: 10c5387d  Table: 3d54006a  DAC: 00000051
Process irq/28-f10d0000 (pid: 434, stack limit = 0xbd350220)
Stack: (0xbd351f20 to 0xbd352000)
1f20: 802f29e8 bd2497c0 bc0d1a80 80065aa8 bd2497e0 bc0d1a80 bd2497c0 
80065878
1f40: 00000000 800659c4 bd249800 00000000 bd2497c0 8006574c 00000000 
00000000
1f60: 00000000 8003be24 e08f3003 00000000 e28330c0 bd2497c0 00000000 
00000000
1f80: bd351f80 bd351f80 00000000 00000000 bd351f90 bd351f90 bd351fac 
bd249800
1fa0: 8003bd4c 00000000 00000000 8000f6b8 00000000 00000000 00000000 
00000000
1fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 
00000000
1fe0: 00000000 00000000 00000000 00000000 00000013 00000000 000033fc 
00003414
[<802f2a28>] (pxa3xx_nand_irq_thread) from [<80065aa8>] 
(irq_thread_fn+0x1c/0x40)
[<80065aa8>] (irq_thread_fn) from [<80065878>] (irq_thread+0x12c/0x178)
[<80065878>] (irq_thread) from [<8003be24>] (kthread+0xd8/0xf4)
[<8003be24>] (kthread) from [<8000f6b8>] (ret_from_fork+0x14/0x3c)
Code: e59f20ec e59f10ec e2800010 ebfeefd5 (e7f001f2)
---[ end trace dce614962c53f62c ]---
Unable to handle kernel paging request at virtual address ffffffec
pgd = 80004000
[ffffffec] *pgd=3fffd861, *pte=00000000, *ppte=00000000
Internal error: Oops: 37 [#2] SMP ARM
Modules linked in:
CPU: 0 PID: 434 Comm: irq/28-f10d0000 Tainted: G      D 4.3.0-at1 #1
Hardware name: Marvell Armada 370/XP (Device Tree)
task: bc378900 ti: bd350000 task.ti: bd350000
PC is at kthread_data+0x4/0xc
LR is at irq_thread_dtor+0x28/0xc8
pc : [<8003c404>]    lr : [<800659ec>]    psr: 20000093
sp : bd351da8  ip : bd351f84  fp : 00000001
r10: bd351dd4  r9 : 60000093  r8 : bc378900
r7 : 806d232c  r6 : 00000000  r5 : bc378900  r4 : bc378c94
r3 : 00000000  r2 : bd351da8  r1 : 000039a2  r0 : bc378900
Flags: nzCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment none
Control: 10c5387d  Table: 3d54006a  DAC: 00000051
Process irq/28-f10d0000 (pid: 434, stack limit = 0xbd350220)
Stack: (0xbd351da8 to 0xbd352000)
1da0:                   800659c4 bc378c94 bc378900 8003a81c 00000000 
bc378900
1dc0: 00000000 80591d94 806a3cdc 80024e64 806d3a38 00000002 806a6cd4 
80063f9c
1de0: 00000000 806d14c4 0000000b 80591d94 806a3cdc bc378900 60000093 
00000000
1e00: 00000000 80012ca8 bd350220 0000000b bd351ed0 802f2a28 e7f001f2 
00000000
1e20: e7100000 80013660 bd350000 800090bc 00000006 6f667461 00000004 
00000000
1e40: 00030001 802f2a28 3a6d726f 64303166 30303030 6e616e2e 00000064 
0000002e
1e60: bffcb278 00000008 00000003 00000020 11e2054f 00000000 00000000 
00000002
1e80: bffceb80 cb550fa0 00000002 bc01f500 1dcd6500 bffceb80 000004a8 
bffceb80
1ea0: 806d2940 bffcecb0 00000000 bd351f04 bc16fa10 00000003 80065a8c 
802f2a2c
1ec0: 00000000 80013a98 00000000 80013660 0000003b 60000013 806a6cf4 
bd350000
1ee0: bc249810 00000800 bd2497c0 80065a8c 00000000 bc0d1ad8 00000001 
00000000
1f00: 00000000 bd351f20 802f2a28 802f2a28 60000013 ffffffff 00000051 
00000000
1f20: 802f29e8 bd2497c0 bc0d1a80 80065aa8 bd2497e0 bc0d1a80 bd2497c0 
80065878
1f40: 00000000 800659c4 bd249800 00000000 bd2497c0 8006574c 00000000 
00000000
1f60: 00000000 8003be24 e08f3003 00000000 e28330c0 bd2497c0 00000000 
00000000
1f80: bd351f80 bd351f80 00000001 00010001 bd351f90 bd351f90 bd351fac 
bd249800
1fa0: 8003bd4c 00000000 00000000 8000f6b8 00000000 00000000 00000000 
00000000
1fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 
00000000
1fe0: 00000000 00000000 00000000 00000000 00000013 00000000 000033fc 
00003414
[<8003c404>] (kthread_data) from [<800659ec>] (irq_thread_dtor+0x28/0xc8)
[<800659ec>] (irq_thread_dtor) from [<8003a81c>] (task_work_run+0x88/0xbc)
[<8003a81c>] (task_work_run) from [<80024e64>] (do_exit+0x2fc/0x940)
[<80024e64>] (do_exit) from [<80012ca8>] (die+0x214/0x2e8)
[<80012ca8>] (die) from [<800090bc>] (do_undefinstr+0xb8/0x220)
[<800090bc>] (do_undefinstr) from [<80013660>] (__und_svc_finish+0x0/0x20)
Exception stack(0xbd351ed0 to 0xbd351f18)
1ec0:                                     0000003b 60000013 806a6cf4 
bd350000
1ee0: bc249810 00000800 bd2497c0 80065a8c 00000000 bc0d1ad8 00000001 
00000000
1f00: 00000000 bd351f20 802f2a28 802f2a28 60000013 ffffffff
[<80013660>] (__und_svc_finish) from [<802f2a28>] 
(pxa3xx_nand_irq_thread+0x40/0x12c)
[<802f2a28>] (pxa3xx_nand_irq_thread) from [<80065aa8>] 
(irq_thread_fn+0x1c/0x40)
[<80065aa8>] (irq_thread_fn) from [<80065878>] (irq_thread+0x12c/0x178)
[<80065878>] (irq_thread) from [<8003be24>] (kthread+0xd8/0xf4)
[<8003be24>] (kthread) from [<8000f6b8>] (ret_from_fork+0x14/0x3c)
Code: 012fff1e e241101c eaffffbe e59032a0 (e5130014)
---[ end trace dce614962c53f62d ]---


Any suggestion, help or hint will be very much appreciated.

Thanks,
Michael

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: mtd: pxa3xx_nand: issue with command time out
  2016-01-13  4:13 ` Michael Wang
@ 2016-01-13  7:47   ` Robert Jarzmik
  -1 siblings, 0 replies; 30+ messages in thread
From: Robert Jarzmik @ 2016-01-13  7:47 UTC (permalink / raw)
  To: Michael Wang; +Cc: linux-mtd, linux-arm-kernel, ezequiel, computersforpeace

Michael Wang <Michael.Wang@alliedtelesis.co.nz> writes:

> Hi Robert,
>
> I am currently working with Marvell 98DX4251 switchchip with control and
> management subsystem SOC based on ARMADA XP, and after upgrading kernel from
...zip...

> 3.16.7 to 4.3, the nand controller command have a high chance of timing out.
> -#define        CHIP_DELAY_TIMEOUT      msecs_to_jiffies(200)
> +#define        CHIP_DELAY_TIMEOUT      msecs_to_jiffies(300)
>   #define NAND_STOP_DELAY                msecs_to_jiffies(40)
>   #define PAGE_CHUNK_SIZE                (2048)
>
> So I guess my question is whether the CHIP_DELAY_TIMEOUT should be increased
> for 98DX4251, or is this likely to be caused by some other issue?

Hi Michael,

> Any suggestion, help or hint will be very much appreciated.
Actually I have one.
Before we change the timeout, I'd like to see the sequence of register accesses.

As I owe a patch to Brian on this topic, I'd like to have you as a tester. In
return, I'll analyse the output of debug traces it generates for you.

The patch is at the bottom of this mail, and a "git am" on the mail should apply
it to your kernel. It leaves up to you to add the "#define VERBOSE_DEBUG 1" in
pxa3xx-nand.c, recompile, and take the traces.

Once I have the traces, I will be able to compare with my traces on pxa3xx
platforms and spot if there is a difference. Then we'll be able to see if
increasing the timeout is the only option.

Cheers.

-- 
Robert

--->8---
>From 2ab54517c4162e0041d11dd9ae376818d9a3a3e3 Mon Sep 17 00:00:00 2001
From: Robert Jarzmik <robert.jarzmik@free.fr>
Date: Tue, 11 Aug 2015 21:54:11 +0200
Subject: [PATCH v2] mtd: nand: pxa3xx_nand: add register access debug

Add verbose debug for register accesses. This enables easier debugging
by following where and how hardware is stimulated, and how it answers.

Signed-off-by: Robert Jarzmik <robert.jarzmik@free.fr>
---
 drivers/mtd/nand/pxa3xx_nand.c | 22 +++++++++++++++++-----
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/drivers/mtd/nand/pxa3xx_nand.c b/drivers/mtd/nand/pxa3xx_nand.c
index e453ae9a17fa..0845d09e14c1 100644
--- a/drivers/mtd/nand/pxa3xx_nand.c
+++ b/drivers/mtd/nand/pxa3xx_nand.c
@@ -136,11 +136,23 @@
 #define READ_ID_BYTES		7
 
 /* macros for registers read/write */
-#define nand_writel(info, off, val)	\
-	writel_relaxed((val), (info)->mmio_base + (off))
-
-#define nand_readl(info, off)		\
-	readl_relaxed((info)->mmio_base + (off))
+#define nand_writel(info, off, val)					\
+	do {								\
+		dev_vdbg(&info->pdev->dev,				\
+			 "%s():%d nand_writel(0x%x, 0x%04x)\n",		\
+			 __func__, __LINE__, (val), (off));		\
+		writel_relaxed((val), (info)->mmio_base + (off));	\
+	} while (0)
+
+#define nand_readl(info, off)						\
+	({								\
+		unsigned int _v;					\
+		_v = readl_relaxed((info)->mmio_base + (off));		\
+		dev_vdbg(&info->pdev->dev,				\
+			 "%s():%d nand_readl(0x%04x): 0x%x\n",		\
+			 __func__, __LINE__, (off), _v);		\
+		_v;							\
+	})
 
 /* error code and state */
 enum {
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* mtd: pxa3xx_nand: issue with command time out
@ 2016-01-13  7:47   ` Robert Jarzmik
  0 siblings, 0 replies; 30+ messages in thread
From: Robert Jarzmik @ 2016-01-13  7:47 UTC (permalink / raw)
  To: linux-arm-kernel

Michael Wang <Michael.Wang@alliedtelesis.co.nz> writes:

> Hi Robert,
>
> I am currently working with Marvell 98DX4251 switchchip with control and
> management subsystem SOC based on ARMADA XP, and after upgrading kernel from
...zip...

> 3.16.7 to 4.3, the nand controller command have a high chance of timing out.
> -#define        CHIP_DELAY_TIMEOUT      msecs_to_jiffies(200)
> +#define        CHIP_DELAY_TIMEOUT      msecs_to_jiffies(300)
>   #define NAND_STOP_DELAY                msecs_to_jiffies(40)
>   #define PAGE_CHUNK_SIZE                (2048)
>
> So I guess my question is whether the CHIP_DELAY_TIMEOUT should be increased
> for 98DX4251, or is this likely to be caused by some other issue?

Hi Michael,

> Any suggestion, help or hint will be very much appreciated.
Actually I have one.
Before we change the timeout, I'd like to see the sequence of register accesses.

As I owe a patch to Brian on this topic, I'd like to have you as a tester. In
return, I'll analyse the output of debug traces it generates for you.

The patch is at the bottom of this mail, and a "git am" on the mail should apply
it to your kernel. It leaves up to you to add the "#define VERBOSE_DEBUG 1" in
pxa3xx-nand.c, recompile, and take the traces.

Once I have the traces, I will be able to compare with my traces on pxa3xx
platforms and spot if there is a difference. Then we'll be able to see if
increasing the timeout is the only option.

Cheers.

-- 
Robert

--->8---
>From 2ab54517c4162e0041d11dd9ae376818d9a3a3e3 Mon Sep 17 00:00:00 2001
From: Robert Jarzmik <robert.jarzmik@free.fr>
Date: Tue, 11 Aug 2015 21:54:11 +0200
Subject: [PATCH v2] mtd: nand: pxa3xx_nand: add register access debug

Add verbose debug for register accesses. This enables easier debugging
by following where and how hardware is stimulated, and how it answers.

Signed-off-by: Robert Jarzmik <robert.jarzmik@free.fr>
---
 drivers/mtd/nand/pxa3xx_nand.c | 22 +++++++++++++++++-----
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/drivers/mtd/nand/pxa3xx_nand.c b/drivers/mtd/nand/pxa3xx_nand.c
index e453ae9a17fa..0845d09e14c1 100644
--- a/drivers/mtd/nand/pxa3xx_nand.c
+++ b/drivers/mtd/nand/pxa3xx_nand.c
@@ -136,11 +136,23 @@
 #define READ_ID_BYTES		7
 
 /* macros for registers read/write */
-#define nand_writel(info, off, val)	\
-	writel_relaxed((val), (info)->mmio_base + (off))
-
-#define nand_readl(info, off)		\
-	readl_relaxed((info)->mmio_base + (off))
+#define nand_writel(info, off, val)					\
+	do {								\
+		dev_vdbg(&info->pdev->dev,				\
+			 "%s():%d nand_writel(0x%x, 0x%04x)\n",		\
+			 __func__, __LINE__, (val), (off));		\
+		writel_relaxed((val), (info)->mmio_base + (off));	\
+	} while (0)
+
+#define nand_readl(info, off)						\
+	({								\
+		unsigned int _v;					\
+		_v = readl_relaxed((info)->mmio_base + (off));		\
+		dev_vdbg(&info->pdev->dev,				\
+			 "%s():%d nand_readl(0x%04x): 0x%x\n",		\
+			 __func__, __LINE__, (off), _v);		\
+		_v;							\
+	})
 
 /* error code and state */
 enum {
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: mtd: pxa3xx_nand: issue with command time out
  2016-01-13  7:47   ` Robert Jarzmik
  (?)
@ 2016-01-14 22:57   ` Michael Wang
  2016-01-16 23:53       ` Robert Jarzmik
  -1 siblings, 1 reply; 30+ messages in thread
From: Michael Wang @ 2016-01-14 22:57 UTC (permalink / raw)
  To: Robert Jarzmik; +Cc: linux-mtd, linux-arm-kernel, ezequiel, computersforpeace

[-- Attachment #1: Type: text/plain, Size: 1042 bytes --]


On 01/13/2016 08:47 PM, Robert Jarzmik wrote:
> Hi Michael,
>> Any suggestion, help or hint will be very much appreciated.
> Actually I have one.
> Before we change the timeout, I'd like to see the sequence of register accesses.
>
> As I owe a patch to Brian on this topic, I'd like to have you as a tester. In
> return, I'll analyse the output of debug traces it generates for you.
>
> The patch is at the bottom of this mail, and a "git am" on the mail should apply
> it to your kernel. It leaves up to you to add the "#define VERBOSE_DEBUG 1" in
> pxa3xx-nand.c, recompile, and take the traces.
>
> Once I have the traces, I will be able to compare with my traces on pxa3xx
> platforms and spot if there is a difference. Then we'll be able to see if
> increasing the timeout is the only option.
>
> Cheers.
>
Hi Robert,

Thank you for your reply.

I am happy to be the tester for this issue.

The complete output of debug traces and boot log from your debug patch
is in the attachment.

Thanks,
Michael



[-- Attachment #2: pxa3xx_debug.zip --]
[-- Type: application/zip, Size: 194242 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: mtd: pxa3xx_nand: issue with command time out
  2016-01-14 22:57   ` Michael Wang
@ 2016-01-16 23:53       ` Robert Jarzmik
  0 siblings, 0 replies; 30+ messages in thread
From: Robert Jarzmik @ 2016-01-16 23:53 UTC (permalink / raw)
  To: Michael Wang; +Cc: linux-mtd, linux-arm-kernel, ezequiel, computersforpeace

Michael Wang <Michael.Wang@alliedtelesis.co.nz> writes:

> The complete output of debug traces and boot log from your debug patch
> is in the attachment.

Okay, thanks, that's great.
Might a ask another capture, but with :
 1) the timestamps this time
    (ie. CONFIG_PRINTK_TIME=y) ? This will give me the timing of your failure,
    in the dmesg you will provide me.
 2) on the first line of drain_fifo(), add a single :
    nand_readl(info, NDSR)

This will enable me to see how your board behaves on the temporal aspect, as
today I only have this, which looks correct AFAICT:

pxa3xx-nand f10d0000.nand: pxa3xx_nand_start():445 nand_writel(0x1, NDECCCTRL)
pxa3xx-nand f10d0000.nand: pxa3xx_nand_start():465 nand_writel(0xfff, NDSR)
pxa3xx-nand f10d0000.nand: pxa3xx_nand_start():466 nand_writel(0x0, NDCR)
pxa3xx-nand f10d0000.nand: pxa3xx_nand_start():467 nand_writel(0xd104b000, NDCR)

pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():656 nand_readl(NDSR): 0x1
pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():706 nand_writel(0x1, NDSR)
pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():720 nand_writel(0x100d3000, NDCB0)
pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():721 nand_writel(0x4ff0000, NDCB0)
pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():722 nand_writel(0x2, NDCB0)
pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():726 nand_writel(0x820, NDCB0)
  => RJK: here the read command is submitted

pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():656 nand_readl(NDSR): 0x1800
pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():706 nand_writel(0x1800, NDSR)
  => RJK: here the read command is completed

pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():656 nand_readl(NDSR): 0x2
pxa3xx-nand f10d0000.nand: Wait time out!!!
  => RJK: here data is available in NAND controller, and yet it's too late

pxa3xx-nand f10d0000.nand: pxa3xx_nand_stop():476 nand_readl(0x0000): 0xd104b000
pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq_thread():636 nand_writel(0x6, 0x0014)
  => RJK: here we finish the read data transfer operation, too late

Cheers.

--
Robert

^ permalink raw reply	[flat|nested] 30+ messages in thread

* mtd: pxa3xx_nand: issue with command time out
@ 2016-01-16 23:53       ` Robert Jarzmik
  0 siblings, 0 replies; 30+ messages in thread
From: Robert Jarzmik @ 2016-01-16 23:53 UTC (permalink / raw)
  To: linux-arm-kernel

Michael Wang <Michael.Wang@alliedtelesis.co.nz> writes:

> The complete output of debug traces and boot log from your debug patch
> is in the attachment.

Okay, thanks, that's great.
Might a ask another capture, but with :
 1) the timestamps this time
    (ie. CONFIG_PRINTK_TIME=y) ? This will give me the timing of your failure,
    in the dmesg you will provide me.
 2) on the first line of drain_fifo(), add a single :
    nand_readl(info, NDSR)

This will enable me to see how your board behaves on the temporal aspect, as
today I only have this, which looks correct AFAICT:

pxa3xx-nand f10d0000.nand: pxa3xx_nand_start():445 nand_writel(0x1, NDECCCTRL)
pxa3xx-nand f10d0000.nand: pxa3xx_nand_start():465 nand_writel(0xfff, NDSR)
pxa3xx-nand f10d0000.nand: pxa3xx_nand_start():466 nand_writel(0x0, NDCR)
pxa3xx-nand f10d0000.nand: pxa3xx_nand_start():467 nand_writel(0xd104b000, NDCR)

pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():656 nand_readl(NDSR): 0x1
pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():706 nand_writel(0x1, NDSR)
pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():720 nand_writel(0x100d3000, NDCB0)
pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():721 nand_writel(0x4ff0000, NDCB0)
pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():722 nand_writel(0x2, NDCB0)
pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():726 nand_writel(0x820, NDCB0)
  => RJK: here the read command is submitted

pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():656 nand_readl(NDSR): 0x1800
pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():706 nand_writel(0x1800, NDSR)
  => RJK: here the read command is completed

pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():656 nand_readl(NDSR): 0x2
pxa3xx-nand f10d0000.nand: Wait time out!!!
  => RJK: here data is available in NAND controller, and yet it's too late

pxa3xx-nand f10d0000.nand: pxa3xx_nand_stop():476 nand_readl(0x0000): 0xd104b000
pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq_thread():636 nand_writel(0x6, 0x0014)
  => RJK: here we finish the read data transfer operation, too late

Cheers.

--
Robert

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: mtd: pxa3xx_nand: issue with command time out
  2016-01-16 23:53       ` Robert Jarzmik
  (?)
@ 2016-01-18  2:15       ` Michael Wang
  2016-01-18 21:13           ` Robert Jarzmik
  -1 siblings, 1 reply; 30+ messages in thread
From: Michael Wang @ 2016-01-18  2:15 UTC (permalink / raw)
  To: Robert Jarzmik; +Cc: linux-mtd, linux-arm-kernel, ezequiel, computersforpeace

[-- Attachment #1: Type: text/plain, Size: 1134 bytes --]

On 01/17/2016 12:53 PM, Robert Jarzmik wrote:
> Might a ask another capture, but with :
>   1) the timestamps this time
>      (ie. CONFIG_PRINTK_TIME=y) ? This will give me the timing of your failure,
>      in the dmesg you will provide me.
>   2) on the first line of drain_fifo(), add a single :
>      nand_readl(info, NDSR)
>
> This will enable me to see how your board behaves on the temporal aspect, as
> today I only have this, which looks correct AFAICT
Hi Robert,

Thank you for your analysis.

The attachment file contains the complete debug output with the 
additional info you requested.

I just want to mention that I get slightly different results on my board 
with debugs enabled/disabled.

Without your debugs, the kernel will fail to boot up properly with 
command timing out and nand
controller in invalid state:
 >pxa3xx-nand f10d0000.nand: Wait time out!!!
 >pxa3xx-nand f10d0000.nand: handle_data_pio: invalid state 0

With your debugs, commands still time out a few times, but 
handle_data_pio() doesn't get into
invalid state and the kernel boots up fine.

Thanks,
Michael



[-- Attachment #2: pxa3xx_debug2.zip --]
[-- Type: application/zip, Size: 2011927 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: mtd: pxa3xx_nand: issue with command time out
  2016-01-18  2:15       ` Michael Wang
@ 2016-01-18 21:13           ` Robert Jarzmik
  0 siblings, 0 replies; 30+ messages in thread
From: Robert Jarzmik @ 2016-01-18 21:13 UTC (permalink / raw)
  To: Michael Wang, ezequiel; +Cc: linux-mtd, linux-arm-kernel, computersforpeace

Michael Wang <Michael.Wang@alliedtelesis.co.nz> writes:

> On 01/17/2016 12:53 PM, Robert Jarzmik wrote:
>> Might a ask another capture, but with :
>>   1) the timestamps this time
>>      (ie. CONFIG_PRINTK_TIME=y) ? This will give me the timing of your failure,
>>      in the dmesg you will provide me.
>>   2) on the first line of drain_fifo(), add a single :
>>      nand_readl(info, NDSR)
>>
>> This will enable me to see how your board behaves on the temporal aspect, as
>> today I only have this, which looks correct AFAICT
> Hi Robert,
>
> Thank you for your analysis.
>
> The attachment file contains the complete debug output with the 
> additional info you requested.
>
> I just want to mention that I get slightly different results on my board 
> with debugs enabled/disabled.
>
> Without your debugs, the kernel will fail to boot up properly with 
> command timing out and nand
> controller in invalid state:
>  >pxa3xx-nand f10d0000.nand: Wait time out!!!
>  >pxa3xx-nand f10d0000.nand: handle_data_pio: invalid state 0
>
> With your debugs, commands still time out a few times, but 
> handle_data_pio() doesn't get into
> invalid state and the kernel boots up fine.

Ok, I analysed with the timings, and now I'm pretty against the timeout change
if the analysis below is correct.

See in the log extract in [1], you have a contiguous sequence of one working
read followed by one broken read :
 - in the working read, the irq thread is scheduled in 8ms, which is pretty
   decent.
 - in the broken read, the irq thread is scheduled in 160ms, which is just
   unbearable.

>From this analysis I think you have something broken in your platform, a driver
taking a lock and blocking the scheduling or something like that. Your timeouts
seem to come from there.

If you want to add 2 prinks to validate the theory (and remove my debug logs
which change the realtime behavior) :
 - put one just before the "ret = IRQ_WAKE_THREAD";
 - put the other on the first line of pxa3xx_nand_irq_thread()
 - and calculate the difference. 

Cheers.

--
Robert

[1] Extract of your log
	=> RJK: begin of the working read
[ 2172.409771] pxa3xx-nand f10d0000.nand: pxa3xx_nand_start():445 nand_writel(0x1, NDECCCTRL)
[ 2172.417844] pxa3xx-nand f10d0000.nand: pxa3xx_nand_start():465 nand_writel(0xfff, NDSR)
[ 2172.426084] pxa3xx-nand f10d0000.nand: pxa3xx_nand_start():466 nand_writel(0x0, NDCR)
[ 2172.434147] pxa3xx-nand f10d0000.nand: pxa3xx_nand_start():467 nand_writel(0xd104b000, NDCR)
[ 2172.442823] pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():658 nand_readl(NDSR): 0x1
[ 2172.450608] pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():708 nand_writel(0x1, NDSR)
[ 2172.458479] pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():722 nand_writel(0x300d3000, NDCB0)
[ 2172.466961] pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():723 nand_writel(0x7fcb0000, NDCB0)
[ 2172.475442] pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():724 nand_writel(0x3, NDCB0)
[ 2172.483312] pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():728 nand_writel(0x820, NDCB0)
[ 2172.491450] pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():658 nand_readl(NDSR): 0x2
[ 2172.499259] pxa3xx-nand f10d0000.nand: drain_fifo():509 nand_readl(NDSR): 0x2
	=> RJK: here the irq thread was scheduled in 8ms
[ 2172.506741] pxa3xx-nand f10d0000.nand: drain_fifo():509 nand_readl(NDSR): 0x2
[ 2172.514106] pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq_thread():638 nand_writel(0x6, NDSR)
[ 2172.522606] pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():658 nand_readl(NDSR): 0x500
[ 2172.530565] pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():708 nand_writel(0x500, NDSR)
	=> RJK: begin of the broken read
[ 2172.538671] pxa3xx-nand f10d0000.nand: pxa3xx_nand_start():445 nand_writel(0x1, NDECCCTRL)
[ 2172.546753] pxa3xx-nand f10d0000.nand: pxa3xx_nand_start():465 nand_writel(0xfff, NDSR)
[ 2172.554991] pxa3xx-nand f10d0000.nand: pxa3xx_nand_start():466 nand_writel(0x0, NDCR)
[ 2172.563050] pxa3xx-nand f10d0000.nand: pxa3xx_nand_start():467 nand_writel(0xd104b000, NDCR)
[ 2172.571722] pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():658 nand_readl(NDSR): 0x1
[ 2172.579508] pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():708 nand_writel(0x1, NDSR)
[ 2172.587379] pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():722 nand_writel(0x100d3000, NDCB0)
[ 2172.595860] pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():723 nand_writel(0xfefd0000, NDCB0)
[ 2172.604341] pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():724 nand_writel(0x2, NDCB0)
[ 2172.612212] pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():728 nand_writel(0x820, NDCB0)
[ 2172.620319] pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():658 nand_readl(NDSR): 0x1800
[ 2172.628367] pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():708 nand_writel(0x1800, NDSR)
[ 2172.636519] pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():658 nand_readl(NDSR): 0x2
[ 2172.798485] pxa3xx-nand f10d0000.nand: drain_fifo():509 nand_readl(NDSR): 0x2
	=> RJK: here the irq thread was scheduled in 160ms
[ 2172.805996] pxa3xx-nand f10d0000.nand: drain_fifo():509 nand_readl(NDSR): 0x2
[ 2172.813379] pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq_thread():638 nand_writel(0x6, NDSR)
[ 2172.821017] pxa3xx-nand f10d0000.nand: Wait time out!!!
[ 2172.821029] pxa3xx-nand f10d0000.nand: pxa3xx_nand_stop():476 nand_readl(NDCR): 0xc104b000
[ 2172.821038] pxa3xx-nand f10d0000.nand: pxa3xx_nand_stop():487 nand_writel(0xfff, NDSR)

^ permalink raw reply	[flat|nested] 30+ messages in thread

* mtd: pxa3xx_nand: issue with command time out
@ 2016-01-18 21:13           ` Robert Jarzmik
  0 siblings, 0 replies; 30+ messages in thread
From: Robert Jarzmik @ 2016-01-18 21:13 UTC (permalink / raw)
  To: linux-arm-kernel

Michael Wang <Michael.Wang@alliedtelesis.co.nz> writes:

> On 01/17/2016 12:53 PM, Robert Jarzmik wrote:
>> Might a ask another capture, but with :
>>   1) the timestamps this time
>>      (ie. CONFIG_PRINTK_TIME=y) ? This will give me the timing of your failure,
>>      in the dmesg you will provide me.
>>   2) on the first line of drain_fifo(), add a single :
>>      nand_readl(info, NDSR)
>>
>> This will enable me to see how your board behaves on the temporal aspect, as
>> today I only have this, which looks correct AFAICT
> Hi Robert,
>
> Thank you for your analysis.
>
> The attachment file contains the complete debug output with the 
> additional info you requested.
>
> I just want to mention that I get slightly different results on my board 
> with debugs enabled/disabled.
>
> Without your debugs, the kernel will fail to boot up properly with 
> command timing out and nand
> controller in invalid state:
>  >pxa3xx-nand f10d0000.nand: Wait time out!!!
>  >pxa3xx-nand f10d0000.nand: handle_data_pio: invalid state 0
>
> With your debugs, commands still time out a few times, but 
> handle_data_pio() doesn't get into
> invalid state and the kernel boots up fine.

Ok, I analysed with the timings, and now I'm pretty against the timeout change
if the analysis below is correct.

See in the log extract in [1], you have a contiguous sequence of one working
read followed by one broken read :
 - in the working read, the irq thread is scheduled in 8ms, which is pretty
   decent.
 - in the broken read, the irq thread is scheduled in 160ms, which is just
   unbearable.

>From this analysis I think you have something broken in your platform, a driver
taking a lock and blocking the scheduling or something like that. Your timeouts
seem to come from there.

If you want to add 2 prinks to validate the theory (and remove my debug logs
which change the realtime behavior) :
 - put one just before the "ret = IRQ_WAKE_THREAD";
 - put the other on the first line of pxa3xx_nand_irq_thread()
 - and calculate the difference. 

Cheers.

--
Robert

[1] Extract of your log
	=> RJK: begin of the working read
[ 2172.409771] pxa3xx-nand f10d0000.nand: pxa3xx_nand_start():445 nand_writel(0x1, NDECCCTRL)
[ 2172.417844] pxa3xx-nand f10d0000.nand: pxa3xx_nand_start():465 nand_writel(0xfff, NDSR)
[ 2172.426084] pxa3xx-nand f10d0000.nand: pxa3xx_nand_start():466 nand_writel(0x0, NDCR)
[ 2172.434147] pxa3xx-nand f10d0000.nand: pxa3xx_nand_start():467 nand_writel(0xd104b000, NDCR)
[ 2172.442823] pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():658 nand_readl(NDSR): 0x1
[ 2172.450608] pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():708 nand_writel(0x1, NDSR)
[ 2172.458479] pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():722 nand_writel(0x300d3000, NDCB0)
[ 2172.466961] pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():723 nand_writel(0x7fcb0000, NDCB0)
[ 2172.475442] pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():724 nand_writel(0x3, NDCB0)
[ 2172.483312] pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():728 nand_writel(0x820, NDCB0)
[ 2172.491450] pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():658 nand_readl(NDSR): 0x2
[ 2172.499259] pxa3xx-nand f10d0000.nand: drain_fifo():509 nand_readl(NDSR): 0x2
	=> RJK: here the irq thread was scheduled in 8ms
[ 2172.506741] pxa3xx-nand f10d0000.nand: drain_fifo():509 nand_readl(NDSR): 0x2
[ 2172.514106] pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq_thread():638 nand_writel(0x6, NDSR)
[ 2172.522606] pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():658 nand_readl(NDSR): 0x500
[ 2172.530565] pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():708 nand_writel(0x500, NDSR)
	=> RJK: begin of the broken read
[ 2172.538671] pxa3xx-nand f10d0000.nand: pxa3xx_nand_start():445 nand_writel(0x1, NDECCCTRL)
[ 2172.546753] pxa3xx-nand f10d0000.nand: pxa3xx_nand_start():465 nand_writel(0xfff, NDSR)
[ 2172.554991] pxa3xx-nand f10d0000.nand: pxa3xx_nand_start():466 nand_writel(0x0, NDCR)
[ 2172.563050] pxa3xx-nand f10d0000.nand: pxa3xx_nand_start():467 nand_writel(0xd104b000, NDCR)
[ 2172.571722] pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():658 nand_readl(NDSR): 0x1
[ 2172.579508] pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():708 nand_writel(0x1, NDSR)
[ 2172.587379] pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():722 nand_writel(0x100d3000, NDCB0)
[ 2172.595860] pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():723 nand_writel(0xfefd0000, NDCB0)
[ 2172.604341] pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():724 nand_writel(0x2, NDCB0)
[ 2172.612212] pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():728 nand_writel(0x820, NDCB0)
[ 2172.620319] pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():658 nand_readl(NDSR): 0x1800
[ 2172.628367] pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():708 nand_writel(0x1800, NDSR)
[ 2172.636519] pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():658 nand_readl(NDSR): 0x2
[ 2172.798485] pxa3xx-nand f10d0000.nand: drain_fifo():509 nand_readl(NDSR): 0x2
	=> RJK: here the irq thread was scheduled in 160ms
[ 2172.805996] pxa3xx-nand f10d0000.nand: drain_fifo():509 nand_readl(NDSR): 0x2
[ 2172.813379] pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq_thread():638 nand_writel(0x6, NDSR)
[ 2172.821017] pxa3xx-nand f10d0000.nand: Wait time out!!!
[ 2172.821029] pxa3xx-nand f10d0000.nand: pxa3xx_nand_stop():476 nand_readl(NDCR): 0xc104b000
[ 2172.821038] pxa3xx-nand f10d0000.nand: pxa3xx_nand_stop():487 nand_writel(0xfff, NDSR)

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: mtd: pxa3xx_nand: issue with command time out
  2016-01-18 21:13           ` Robert Jarzmik
@ 2016-01-19 22:33             ` Michael Wang
  -1 siblings, 0 replies; 30+ messages in thread
From: Michael Wang @ 2016-01-19 22:33 UTC (permalink / raw)
  To: Robert Jarzmik, ezequiel; +Cc: linux-mtd, linux-arm-kernel, computersforpeace

On 01/19/2016 10:13 AM, Robert Jarzmik wrote:
>  From this analysis I think you have something broken in your platform, a driver
> taking a lock and blocking the scheduling or something like that. Your timeouts
> seem to come from there.
>
> If you want to add 2 prinks to validate the theory (and remove my debug logs
> which change the realtime behavior) :
>   - put one just before the "ret = IRQ_WAKE_THREAD";
>   - put the other on the first line of pxa3xx_nand_irq_thread()
>   - and calculate the difference.
Hi Robert,

Thank you very much for your help.

I have added the 2 printks, and I think you are right that there is 
something
blocking the scheduling.

[  146.372745] pxa3xx_nand_irq:700 IRQ_WAKE_THREAD status=2
[  146.378106] pxa3xx_nand_irq_thread:643
[  146.411117] pxa3xx_nand_irq:700 IRQ_WAKE_THREAD status=2
[  146.537124] pxa3xx_nand_irq_thread:643
       => irq thread was scheduled in 0.126 ms
[  146.541245] pxa3xx_nand_irq:700 IRQ_WAKE_THREAD status=2
[  146.546603] pxa3xx_nand_irq_thread:643

I will continue to investigate the cause of this delay.

Thanks,
Michael

^ permalink raw reply	[flat|nested] 30+ messages in thread

* mtd: pxa3xx_nand: issue with command time out
@ 2016-01-19 22:33             ` Michael Wang
  0 siblings, 0 replies; 30+ messages in thread
From: Michael Wang @ 2016-01-19 22:33 UTC (permalink / raw)
  To: linux-arm-kernel

On 01/19/2016 10:13 AM, Robert Jarzmik wrote:
>  From this analysis I think you have something broken in your platform, a driver
> taking a lock and blocking the scheduling or something like that. Your timeouts
> seem to come from there.
>
> If you want to add 2 prinks to validate the theory (and remove my debug logs
> which change the realtime behavior) :
>   - put one just before the "ret = IRQ_WAKE_THREAD";
>   - put the other on the first line of pxa3xx_nand_irq_thread()
>   - and calculate the difference.
Hi Robert,

Thank you very much for your help.

I have added the 2 printks, and I think you are right that there is 
something
blocking the scheduling.

[  146.372745] pxa3xx_nand_irq:700 IRQ_WAKE_THREAD status=2
[  146.378106] pxa3xx_nand_irq_thread:643
[  146.411117] pxa3xx_nand_irq:700 IRQ_WAKE_THREAD status=2
[  146.537124] pxa3xx_nand_irq_thread:643
       => irq thread was scheduled in 0.126 ms
[  146.541245] pxa3xx_nand_irq:700 IRQ_WAKE_THREAD status=2
[  146.546603] pxa3xx_nand_irq_thread:643

I will continue to investigate the cause of this delay.

Thanks,
Michael

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: mtd: pxa3xx_nand: issue with command time out
  2016-01-16 23:53       ` Robert Jarzmik
@ 2016-01-25 18:30         ` Brian Norris
  -1 siblings, 0 replies; 30+ messages in thread
From: Brian Norris @ 2016-01-25 18:30 UTC (permalink / raw)
  To: Robert Jarzmik; +Cc: Michael Wang, linux-mtd, linux-arm-kernel, ezequiel

Hi Robert,

On Sun, Jan 17, 2016 at 12:53:06AM +0100, Robert Jarzmik wrote:
> Michael Wang <Michael.Wang@alliedtelesis.co.nz> writes:
> 
> > The complete output of debug traces and boot log from your debug patch
> > is in the attachment.
> 
> Okay, thanks, that's great.
> Might a ask another capture, but with :
>  1) the timestamps this time
>     (ie. CONFIG_PRINTK_TIME=y) ? This will give me the timing of your failure,
>     in the dmesg you will provide me.
>  2) on the first line of drain_fifo(), add a single :
>     nand_readl(info, NDSR)
> 
> This will enable me to see how your board behaves on the temporal aspect, as
> today I only have this, which looks correct AFAICT:
> 
> pxa3xx-nand f10d0000.nand: pxa3xx_nand_start():445 nand_writel(0x1, NDECCCTRL)
> pxa3xx-nand f10d0000.nand: pxa3xx_nand_start():465 nand_writel(0xfff, NDSR)
> pxa3xx-nand f10d0000.nand: pxa3xx_nand_start():466 nand_writel(0x0, NDCR)
> pxa3xx-nand f10d0000.nand: pxa3xx_nand_start():467 nand_writel(0xd104b000, NDCR)
> 
> pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():656 nand_readl(NDSR): 0x1
> pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():706 nand_writel(0x1, NDSR)
> pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():720 nand_writel(0x100d3000, NDCB0)
> pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():721 nand_writel(0x4ff0000, NDCB0)
> pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():722 nand_writel(0x2, NDCB0)
> pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():726 nand_writel(0x820, NDCB0)
>   => RJK: here the read command is submitted
> 
> pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():656 nand_readl(NDSR): 0x1800
> pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():706 nand_writel(0x1800, NDSR)
>   => RJK: here the read command is completed
> 
> pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():656 nand_readl(NDSR): 0x2
> pxa3xx-nand f10d0000.nand: Wait time out!!!
>   => RJK: here data is available in NAND controller, and yet it's too late
> 
> pxa3xx-nand f10d0000.nand: pxa3xx_nand_stop():476 nand_readl(0x0000): 0xd104b000
> pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq_thread():636 nand_writel(0x6, 0x0014)
>   => RJK: here we finish the read data transfer operation, too late

I'm not extremely familiar with this driver, but you've caused me to
take a second look at this commit:

commit 24542257a3b987025d4b998ec2d15e556c98ad3f
Author: Robert Jarzmik <robert.jarzmik@free.fr>
Date:   Fri Feb 20 19:36:43 2015 +0100

    mtd: pxa3xx-nand: handle PIO in threaded interrupt


And now I'm wondering: when does the completion get triggered? i.e.:

	complete(&info->cmd_complete);

?

It seems to me like you've short-circuited some of the IRQ handling
code, so that the threaded handler is buggy. AIUI, if the completion
event ever happens, it's actually happening *before* the full (threaded)
handler is actually finished, since it occurs in pxa3xx_nand_irq(),
before pxa3xx_nand_irq_thread() ever runs. Now, I'm not sure if that
causes a real problem in practice, since you used IRQF_ONESHOT, but I
would think that's also suspect.

Brian

^ permalink raw reply	[flat|nested] 30+ messages in thread

* mtd: pxa3xx_nand: issue with command time out
@ 2016-01-25 18:30         ` Brian Norris
  0 siblings, 0 replies; 30+ messages in thread
From: Brian Norris @ 2016-01-25 18:30 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Robert,

On Sun, Jan 17, 2016 at 12:53:06AM +0100, Robert Jarzmik wrote:
> Michael Wang <Michael.Wang@alliedtelesis.co.nz> writes:
> 
> > The complete output of debug traces and boot log from your debug patch
> > is in the attachment.
> 
> Okay, thanks, that's great.
> Might a ask another capture, but with :
>  1) the timestamps this time
>     (ie. CONFIG_PRINTK_TIME=y) ? This will give me the timing of your failure,
>     in the dmesg you will provide me.
>  2) on the first line of drain_fifo(), add a single :
>     nand_readl(info, NDSR)
> 
> This will enable me to see how your board behaves on the temporal aspect, as
> today I only have this, which looks correct AFAICT:
> 
> pxa3xx-nand f10d0000.nand: pxa3xx_nand_start():445 nand_writel(0x1, NDECCCTRL)
> pxa3xx-nand f10d0000.nand: pxa3xx_nand_start():465 nand_writel(0xfff, NDSR)
> pxa3xx-nand f10d0000.nand: pxa3xx_nand_start():466 nand_writel(0x0, NDCR)
> pxa3xx-nand f10d0000.nand: pxa3xx_nand_start():467 nand_writel(0xd104b000, NDCR)
> 
> pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():656 nand_readl(NDSR): 0x1
> pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():706 nand_writel(0x1, NDSR)
> pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():720 nand_writel(0x100d3000, NDCB0)
> pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():721 nand_writel(0x4ff0000, NDCB0)
> pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():722 nand_writel(0x2, NDCB0)
> pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():726 nand_writel(0x820, NDCB0)
>   => RJK: here the read command is submitted
> 
> pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():656 nand_readl(NDSR): 0x1800
> pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():706 nand_writel(0x1800, NDSR)
>   => RJK: here the read command is completed
> 
> pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq():656 nand_readl(NDSR): 0x2
> pxa3xx-nand f10d0000.nand: Wait time out!!!
>   => RJK: here data is available in NAND controller, and yet it's too late
> 
> pxa3xx-nand f10d0000.nand: pxa3xx_nand_stop():476 nand_readl(0x0000): 0xd104b000
> pxa3xx-nand f10d0000.nand: pxa3xx_nand_irq_thread():636 nand_writel(0x6, 0x0014)
>   => RJK: here we finish the read data transfer operation, too late

I'm not extremely familiar with this driver, but you've caused me to
take a second look at this commit:

commit 24542257a3b987025d4b998ec2d15e556c98ad3f
Author: Robert Jarzmik <robert.jarzmik@free.fr>
Date:   Fri Feb 20 19:36:43 2015 +0100

    mtd: pxa3xx-nand: handle PIO in threaded interrupt


And now I'm wondering: when does the completion get triggered? i.e.:

	complete(&info->cmd_complete);

?

It seems to me like you've short-circuited some of the IRQ handling
code, so that the threaded handler is buggy. AIUI, if the completion
event ever happens, it's actually happening *before* the full (threaded)
handler is actually finished, since it occurs in pxa3xx_nand_irq(),
before pxa3xx_nand_irq_thread() ever runs. Now, I'm not sure if that
causes a real problem in practice, since you used IRQF_ONESHOT, but I
would think that's also suspect.

Brian

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: mtd: pxa3xx_nand: issue with command time out
  2016-01-25 18:30         ` Brian Norris
@ 2016-01-25 20:48           ` Robert Jarzmik
  -1 siblings, 0 replies; 30+ messages in thread
From: Robert Jarzmik @ 2016-01-25 20:48 UTC (permalink / raw)
  To: Brian Norris, ezequiel; +Cc: linux-mtd, Michael Wang, linux-arm-kernel

Brian Norris <computersforpeace@gmail.com> writes:

> Hi Robert,
>
> On Sun, Jan 17, 2016 at 12:53:06AM +0100, Robert Jarzmik wrote:
> commit 24542257a3b987025d4b998ec2d15e556c98ad3f
> Author: Robert Jarzmik <robert.jarzmik@free.fr>
> Date:   Fri Feb 20 19:36:43 2015 +0100
>
>     mtd: pxa3xx-nand: handle PIO in threaded interrupt
>
>
> And now I'm wondering: when does the completion get triggered? i.e.:
>
> 	complete(&info->cmd_complete);

> It seems to me like you've short-circuited some of the IRQ handling
> code, so that the threaded handler is buggy. AIUI, if the completion
> event ever happens, it's actually happening *before* the full (threaded)
> handler is actually finished, since it occurs in pxa3xx_nand_irq(),
> before pxa3xx_nand_irq_thread() ever runs. Now, I'm not sure if that
> causes a real problem in practice, since you used IRQF_ONESHOT, but I
> would think that's also suspect.

I think you're very right, this "goto NORMAL_IRQ_EXIT" creates a
short-cirtcuit. That makes me think that I should :
 - add
static void pxa3xx_nand_data_finished(struct pxa3xx_nand_info *info)
{
	unsigned int done;

 	cmd_done = info->cs ? NDSR_CS1_CMDD : NDSR_CS0_CMDD;
	nand_writel(info, NDSR, cmd_done);
        complete(&inf0->cmd_complete);
}

 - in pxa3xx_nand_irq_thread(), I should add at the end:
	pxa3xx_nand_data_finished(info);

 - in pxa3xx_nand_data_dma_irq(), I should add at the end:
	pxa3xx_nand_data_finished(info);

Ezequiel, would you have a look at that code snippet and tell me if I'm missing
something in the driver's structure ?

Cheers.

--
Robert

^ permalink raw reply	[flat|nested] 30+ messages in thread

* mtd: pxa3xx_nand: issue with command time out
@ 2016-01-25 20:48           ` Robert Jarzmik
  0 siblings, 0 replies; 30+ messages in thread
From: Robert Jarzmik @ 2016-01-25 20:48 UTC (permalink / raw)
  To: linux-arm-kernel

Brian Norris <computersforpeace@gmail.com> writes:

> Hi Robert,
>
> On Sun, Jan 17, 2016 at 12:53:06AM +0100, Robert Jarzmik wrote:
> commit 24542257a3b987025d4b998ec2d15e556c98ad3f
> Author: Robert Jarzmik <robert.jarzmik@free.fr>
> Date:   Fri Feb 20 19:36:43 2015 +0100
>
>     mtd: pxa3xx-nand: handle PIO in threaded interrupt
>
>
> And now I'm wondering: when does the completion get triggered? i.e.:
>
> 	complete(&info->cmd_complete);

> It seems to me like you've short-circuited some of the IRQ handling
> code, so that the threaded handler is buggy. AIUI, if the completion
> event ever happens, it's actually happening *before* the full (threaded)
> handler is actually finished, since it occurs in pxa3xx_nand_irq(),
> before pxa3xx_nand_irq_thread() ever runs. Now, I'm not sure if that
> causes a real problem in practice, since you used IRQF_ONESHOT, but I
> would think that's also suspect.

I think you're very right, this "goto NORMAL_IRQ_EXIT" creates a
short-cirtcuit. That makes me think that I should :
 - add
static void pxa3xx_nand_data_finished(struct pxa3xx_nand_info *info)
{
	unsigned int done;

 	cmd_done = info->cs ? NDSR_CS1_CMDD : NDSR_CS0_CMDD;
	nand_writel(info, NDSR, cmd_done);
        complete(&inf0->cmd_complete);
}

 - in pxa3xx_nand_irq_thread(), I should add at the end:
	pxa3xx_nand_data_finished(info);

 - in pxa3xx_nand_data_dma_irq(), I should add at the end:
	pxa3xx_nand_data_finished(info);

Ezequiel, would you have a look at that code snippet and tell me if I'm missing
something in the driver's structure ?

Cheers.

--
Robert

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: mtd: pxa3xx_nand: issue with command time out
  2016-01-25 20:48           ` Robert Jarzmik
@ 2016-01-26 13:03             ` Ezequiel Garcia
  -1 siblings, 0 replies; 30+ messages in thread
From: Ezequiel Garcia @ 2016-01-26 13:03 UTC (permalink / raw)
  To: Robert Jarzmik; +Cc: Brian Norris, linux-mtd, Michael Wang, linux-arm-kernel

On 25 January 2016 at 17:48, Robert Jarzmik <robert.jarzmik@free.fr> wrote:
> Brian Norris <computersforpeace@gmail.com> writes:
>
>> Hi Robert,
>>
>> On Sun, Jan 17, 2016 at 12:53:06AM +0100, Robert Jarzmik wrote:
>> commit 24542257a3b987025d4b998ec2d15e556c98ad3f
>> Author: Robert Jarzmik <robert.jarzmik@free.fr>
>> Date:   Fri Feb 20 19:36:43 2015 +0100
>>
>>     mtd: pxa3xx-nand: handle PIO in threaded interrupt
>>
>>
>> And now I'm wondering: when does the completion get triggered? i.e.:
>>
>>       complete(&info->cmd_complete);
>
>> It seems to me like you've short-circuited some of the IRQ handling
>> code, so that the threaded handler is buggy. AIUI, if the completion
>> event ever happens, it's actually happening *before* the full (threaded)
>> handler is actually finished, since it occurs in pxa3xx_nand_irq(),
>> before pxa3xx_nand_irq_thread() ever runs. Now, I'm not sure if that
>> causes a real problem in practice, since you used IRQF_ONESHOT, but I
>> would think that's also suspect.
>
> I think you're very right, this "goto NORMAL_IRQ_EXIT" creates a
> short-cirtcuit. That makes me think that I should :
>  - add
> static void pxa3xx_nand_data_finished(struct pxa3xx_nand_info *info)
> {
>         unsigned int done;
>
>         cmd_done = info->cs ? NDSR_CS1_CMDD : NDSR_CS0_CMDD;
>         nand_writel(info, NDSR, cmd_done);
>         complete(&inf0->cmd_complete);
> }
>
>  - in pxa3xx_nand_irq_thread(), I should add at the end:
>         pxa3xx_nand_data_finished(info);
>
>  - in pxa3xx_nand_data_dma_irq(), I should add at the end:
>         pxa3xx_nand_data_finished(info);
>
> Ezequiel, would you have a look at that code snippet and tell me if I'm missing
> something in the driver's structure ?
>

Although it seems Brian is right, I'm not so sure about it. As far as
I can recall, the IRQ sequence is (please correct me if I'm wrong!):

1. Write command request (WRCMDREQ), the IRQ handler writes the command.
2. Read/write data request (RDDREQ | WRDREQ). The IRQ handler fires
the threaded handler, which does data transfer.
3. Command done IRQ (CMDD), the IRQ handler completes the completion.

I don't think data request and command done can be found
simultaneously, so I don't think this is a real issue. Ideas?

Regarding the fixed proposed up there, it seems wrong: the completion
should be completed only after a command done IRQ is found, and not as
the result of data transfer being finished.
-- 
Ezequiel García, VanguardiaSur
www.vanguardiasur.com.ar

^ permalink raw reply	[flat|nested] 30+ messages in thread

* mtd: pxa3xx_nand: issue with command time out
@ 2016-01-26 13:03             ` Ezequiel Garcia
  0 siblings, 0 replies; 30+ messages in thread
From: Ezequiel Garcia @ 2016-01-26 13:03 UTC (permalink / raw)
  To: linux-arm-kernel

On 25 January 2016 at 17:48, Robert Jarzmik <robert.jarzmik@free.fr> wrote:
> Brian Norris <computersforpeace@gmail.com> writes:
>
>> Hi Robert,
>>
>> On Sun, Jan 17, 2016 at 12:53:06AM +0100, Robert Jarzmik wrote:
>> commit 24542257a3b987025d4b998ec2d15e556c98ad3f
>> Author: Robert Jarzmik <robert.jarzmik@free.fr>
>> Date:   Fri Feb 20 19:36:43 2015 +0100
>>
>>     mtd: pxa3xx-nand: handle PIO in threaded interrupt
>>
>>
>> And now I'm wondering: when does the completion get triggered? i.e.:
>>
>>       complete(&info->cmd_complete);
>
>> It seems to me like you've short-circuited some of the IRQ handling
>> code, so that the threaded handler is buggy. AIUI, if the completion
>> event ever happens, it's actually happening *before* the full (threaded)
>> handler is actually finished, since it occurs in pxa3xx_nand_irq(),
>> before pxa3xx_nand_irq_thread() ever runs. Now, I'm not sure if that
>> causes a real problem in practice, since you used IRQF_ONESHOT, but I
>> would think that's also suspect.
>
> I think you're very right, this "goto NORMAL_IRQ_EXIT" creates a
> short-cirtcuit. That makes me think that I should :
>  - add
> static void pxa3xx_nand_data_finished(struct pxa3xx_nand_info *info)
> {
>         unsigned int done;
>
>         cmd_done = info->cs ? NDSR_CS1_CMDD : NDSR_CS0_CMDD;
>         nand_writel(info, NDSR, cmd_done);
>         complete(&inf0->cmd_complete);
> }
>
>  - in pxa3xx_nand_irq_thread(), I should add at the end:
>         pxa3xx_nand_data_finished(info);
>
>  - in pxa3xx_nand_data_dma_irq(), I should add at the end:
>         pxa3xx_nand_data_finished(info);
>
> Ezequiel, would you have a look at that code snippet and tell me if I'm missing
> something in the driver's structure ?
>

Although it seems Brian is right, I'm not so sure about it. As far as
I can recall, the IRQ sequence is (please correct me if I'm wrong!):

1. Write command request (WRCMDREQ), the IRQ handler writes the command.
2. Read/write data request (RDDREQ | WRDREQ). The IRQ handler fires
the threaded handler, which does data transfer.
3. Command done IRQ (CMDD), the IRQ handler completes the completion.

I don't think data request and command done can be found
simultaneously, so I don't think this is a real issue. Ideas?

Regarding the fixed proposed up there, it seems wrong: the completion
should be completed only after a command done IRQ is found, and not as
the result of data transfer being finished.
-- 
Ezequiel Garc?a, VanguardiaSur
www.vanguardiasur.com.ar

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Re: mtd: pxa3xx_nand: issue with command time out
  2016-01-26 13:03             ` Ezequiel Garcia
@ 2016-03-01 22:17               ` Hamish Martin
  -1 siblings, 0 replies; 30+ messages in thread
From: Hamish Martin @ 2016-03-01 22:17 UTC (permalink / raw)
  To: Ezequiel Garcia, Robert Jarzmik
  Cc: Brian Norris, linux-mtd, Michael Wang, linux-arm-kernel

Hi Robert, Ezequiel, and Brian,

I have been working with Michael Wang on this issue and wanted to update 
you on our results as we think there is a wider issue to resolve here.

I believe Ezequiel's analysis of the IRQ sequence is correct. The 
proposed patch from Brian is not needed, not does it resolve our issue.

Further, we have debugged in more detail with kernel tracing and see 
that the problem occurs because nothing calls 'schedule()' in the kernel 
within the timeout period (200ms) to actually allow the IRQ bottom-half 
thread ((pxa3xx_nand_irq_thread()) to run.

With the trace enabled we were able to see long delays between the IRQ 
thread being queued for run as a result of the return with 
IRQ_WAKE_THREAD in pxa3xx_nand_irq(), and a call to schedule() actually 
selecting the thread for execution. This delay was tied to whatever else 
was happening in our system during the boot up when this issue occurs. 
For instance, if a mutex got unlocked this would ultimately lead to a 
call to schedule() and our thread (pxa3xx_nand_irq_thread()) would run. 
If some file system ops happened schedule() would get called. These 
unrelated activities were required so that schedule() got called and the 
scheduler would let the bottom-half execute.
In our tracing we saw the issue occur whenever the call to schedule() 
didn't occur before the timeout expired. In short, if we got lucky and 
something completely unrelated happened that led to a call to 
'schedule()', then the issue didn't occur. If we got unlucky and nothing 
called schedule(), the issue was observed.

We have enabled CONFIG_PREEMPT in our system to resolve this. This has 
the effect of leading to a triggering of 'schedule()' at the end of 
processing interrupts. This means that we now see a nice tight execution 
of pxa3xx_nand threaded IRQ such that we see all ops completing with 
less that 20ms delay.

I wanted to highlight this to you because I think it shows that the 
change to a threaded IRQ is probably only safe if we can guarantee 
scheduling of the bottom-half thread within the time out period (200ms). 
I think this is not safe for systems with CONFIG_PREEMPT not configured. 
Or perhaps there is a better way to resolve this. Is this a deficiency 
in the Armada XP code?

What do you think?

Thanks,
Hamish Martin


On 01/27/2016 02:03 AM, Ezequiel Garcia wrote:
> On 25 January 2016 at 17:48, Robert Jarzmik <robert.jarzmik@free.fr> wrote:
>> Brian Norris <computersforpeace@gmail.com> writes:
>>
>>> Hi Robert,
>>>
>>> On Sun, Jan 17, 2016 at 12:53:06AM +0100, Robert Jarzmik wrote:
>>> commit 24542257a3b987025d4b998ec2d15e556c98ad3f
>>> Author: Robert Jarzmik <robert.jarzmik@free.fr>
>>> Date:   Fri Feb 20 19:36:43 2015 +0100
>>>
>>>      mtd: pxa3xx-nand: handle PIO in threaded interrupt
>>>
>>>
>>> And now I'm wondering: when does the completion get triggered? i.e.:
>>>
>>>        complete(&info->cmd_complete);
>>> It seems to me like you've short-circuited some of the IRQ handling
>>> code, so that the threaded handler is buggy. AIUI, if the completion
>>> event ever happens, it's actually happening *before* the full (threaded)
>>> handler is actually finished, since it occurs in pxa3xx_nand_irq(),
>>> before pxa3xx_nand_irq_thread() ever runs. Now, I'm not sure if that
>>> causes a real problem in practice, since you used IRQF_ONESHOT, but I
>>> would think that's also suspect.
>> I think you're very right, this "goto NORMAL_IRQ_EXIT" creates a
>> short-cirtcuit. That makes me think that I should :
>>   - add
>> static void pxa3xx_nand_data_finished(struct pxa3xx_nand_info *info)
>> {
>>          unsigned int done;
>>
>>          cmd_done =nfo->cs ? NDSR_CS1_CMDD : NDSR_CS0_CMDD;
>>          nand_writel(info, NDSR, cmd_done);
>>          complete(&inf0->cmd_complete);
>> }
>>
>>   - in pxa3xx_nand_irq_thread(), I should add at the end:
>>          pxa3xx_nand_data_finished(info);
>>
>>   - in pxa3xx_nand_data_dma_irq(), I should add at the end:
>>          pxa3xx_nand_data_finished(info);
>>
>> Ezequiel, would you have a look at that code snippet and tell me if I'm missing
>> something in the driver's structure ?
>>
> Although it seems Brian is right, I'm not so sure about it. As far as
> I can recall, the IRQ sequence is (please correct me if I'm wrong!):
>
> 1. Write command request (WRCMDREQ), the IRQ handler writes the command.
> 2. Read/write data request (RDDREQ | WRDREQ). The IRQ handler fires
> the threaded handler, which does data transfer.
> 3. Command done IRQ (CMDD), the IRQ handler completes the completion.
>
> I don't think data request and command done can be found
> simultaneously, so I don't think this is a real issue. Ideas?
>
> Regarding the fixed proposed up there, it seems wrong: the completion
> should be completed only after a command done IRQ is found, and not as
> the result of data transfer being finished.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* mtd: pxa3xx_nand: issue with command time out
@ 2016-03-01 22:17               ` Hamish Martin
  0 siblings, 0 replies; 30+ messages in thread
From: Hamish Martin @ 2016-03-01 22:17 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Robert, Ezequiel, and Brian,

I have been working with Michael Wang on this issue and wanted to update 
you on our results as we think there is a wider issue to resolve here.

I believe Ezequiel's analysis of the IRQ sequence is correct. The 
proposed patch from Brian is not needed, not does it resolve our issue.

Further, we have debugged in more detail with kernel tracing and see 
that the problem occurs because nothing calls 'schedule()' in the kernel 
within the timeout period (200ms) to actually allow the IRQ bottom-half 
thread ((pxa3xx_nand_irq_thread()) to run.

With the trace enabled we were able to see long delays between the IRQ 
thread being queued for run as a result of the return with 
IRQ_WAKE_THREAD in pxa3xx_nand_irq(), and a call to schedule() actually 
selecting the thread for execution. This delay was tied to whatever else 
was happening in our system during the boot up when this issue occurs. 
For instance, if a mutex got unlocked this would ultimately lead to a 
call to schedule() and our thread (pxa3xx_nand_irq_thread()) would run. 
If some file system ops happened schedule() would get called. These 
unrelated activities were required so that schedule() got called and the 
scheduler would let the bottom-half execute.
In our tracing we saw the issue occur whenever the call to schedule() 
didn't occur before the timeout expired. In short, if we got lucky and 
something completely unrelated happened that led to a call to 
'schedule()', then the issue didn't occur. If we got unlucky and nothing 
called schedule(), the issue was observed.

We have enabled CONFIG_PREEMPT in our system to resolve this. This has 
the effect of leading to a triggering of 'schedule()' at the end of 
processing interrupts. This means that we now see a nice tight execution 
of pxa3xx_nand threaded IRQ such that we see all ops completing with 
less that 20ms delay.

I wanted to highlight this to you because I think it shows that the 
change to a threaded IRQ is probably only safe if we can guarantee 
scheduling of the bottom-half thread within the time out period (200ms). 
I think this is not safe for systems with CONFIG_PREEMPT not configured. 
Or perhaps there is a better way to resolve this. Is this a deficiency 
in the Armada XP code?

What do you think?

Thanks,
Hamish Martin


On 01/27/2016 02:03 AM, Ezequiel Garcia wrote:
> On 25 January 2016 at 17:48, Robert Jarzmik <robert.jarzmik@free.fr> wrote:
>> Brian Norris <computersforpeace@gmail.com> writes:
>>
>>> Hi Robert,
>>>
>>> On Sun, Jan 17, 2016 at 12:53:06AM +0100, Robert Jarzmik wrote:
>>> commit 24542257a3b987025d4b998ec2d15e556c98ad3f
>>> Author: Robert Jarzmik <robert.jarzmik@free.fr>
>>> Date:   Fri Feb 20 19:36:43 2015 +0100
>>>
>>>      mtd: pxa3xx-nand: handle PIO in threaded interrupt
>>>
>>>
>>> And now I'm wondering: when does the completion get triggered? i.e.:
>>>
>>>        complete(&info->cmd_complete);
>>> It seems to me like you've short-circuited some of the IRQ handling
>>> code, so that the threaded handler is buggy. AIUI, if the completion
>>> event ever happens, it's actually happening *before* the full (threaded)
>>> handler is actually finished, since it occurs in pxa3xx_nand_irq(),
>>> before pxa3xx_nand_irq_thread() ever runs. Now, I'm not sure if that
>>> causes a real problem in practice, since you used IRQF_ONESHOT, but I
>>> would think that's also suspect.
>> I think you're very right, this "goto NORMAL_IRQ_EXIT" creates a
>> short-cirtcuit. That makes me think that I should :
>>   - add
>> static void pxa3xx_nand_data_finished(struct pxa3xx_nand_info *info)
>> {
>>          unsigned int done;
>>
>>          cmd_done =nfo->cs ? NDSR_CS1_CMDD : NDSR_CS0_CMDD;
>>          nand_writel(info, NDSR, cmd_done);
>>          complete(&inf0->cmd_complete);
>> }
>>
>>   - in pxa3xx_nand_irq_thread(), I should add at the end:
>>          pxa3xx_nand_data_finished(info);
>>
>>   - in pxa3xx_nand_data_dma_irq(), I should add at the end:
>>          pxa3xx_nand_data_finished(info);
>>
>> Ezequiel, would you have a look at that code snippet and tell me if I'm missing
>> something in the driver's structure ?
>>
> Although it seems Brian is right, I'm not so sure about it. As far as
> I can recall, the IRQ sequence is (please correct me if I'm wrong!):
>
> 1. Write command request (WRCMDREQ), the IRQ handler writes the command.
> 2. Read/write data request (RDDREQ | WRDREQ). The IRQ handler fires
> the threaded handler, which does data transfer.
> 3. Command done IRQ (CMDD), the IRQ handler completes the completion.
>
> I don't think data request and command done can be found
> simultaneously, so I don't think this is a real issue. Ideas?
>
> Regarding the fixed proposed up there, it seems wrong: the completion
> should be completed only after a command done IRQ is found, and not as
> the result of data transfer being finished.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Re: mtd: pxa3xx_nand: issue with command time out
  2016-03-01 22:17               ` Hamish Martin
@ 2016-03-03 19:24                 ` Ezequiel Garcia
  -1 siblings, 0 replies; 30+ messages in thread
From: Ezequiel Garcia @ 2016-03-03 19:24 UTC (permalink / raw)
  To: Hamish Martin
  Cc: Robert Jarzmik, Brian Norris, linux-mtd, Michael Wang,
	linux-arm-kernel, Boris Brezillon, Richard Weinberger,
	Thomas Petazzoni

+Richard, Boris and Thomas

On 1 March 2016 at 19:17, Hamish Martin
<Hamish.Martin@alliedtelesis.co.nz> wrote:
> Hi Robert, Ezequiel, and Brian,
>
> I have been working with Michael Wang on this issue and wanted to update
> you on our results as we think there is a wider issue to resolve here.
>
> I believe Ezequiel's analysis of the IRQ sequence is correct. The
> proposed patch from Brian is not needed, not does it resolve our issue.
>
> Further, we have debugged in more detail with kernel tracing and see
> that the problem occurs because nothing calls 'schedule()' in the kernel
> within the timeout period (200ms) to actually allow the IRQ bottom-half
> thread ((pxa3xx_nand_irq_thread()) to run.
>
> With the trace enabled we were able to see long delays between the IRQ
> thread being queued for run as a result of the return with
> IRQ_WAKE_THREAD in pxa3xx_nand_irq(), and a call to schedule() actually
> selecting the thread for execution. This delay was tied to whatever else
> was happening in our system during the boot up when this issue occurs.
> For instance, if a mutex got unlocked this would ultimately lead to a
> call to schedule() and our thread (pxa3xx_nand_irq_thread()) would run.
> If some file system ops happened schedule() would get called. These
> unrelated activities were required so that schedule() got called and the
> scheduler would let the bottom-half execute.
> In our tracing we saw the issue occur whenever the call to schedule()
> didn't occur before the timeout expired. In short, if we got lucky and
> something completely unrelated happened that led to a call to
> 'schedule()', then the issue didn't occur. If we got unlucky and nothing
> called schedule(), the issue was observed.
>
> We have enabled CONFIG_PREEMPT in our system to resolve this. This has
> the effect of leading to a triggering of 'schedule()' at the end of
> processing interrupts. This means that we now see a nice tight execution
> of pxa3xx_nand threaded IRQ such that we see all ops completing with
> less that 20ms delay.
>
> I wanted to highlight this to you because I think it shows that the
> change to a threaded IRQ is probably only safe if we can guarantee
> scheduling of the bottom-half thread within the time out period (200ms).
> I think this is not safe for systems with CONFIG_PREEMPT not configured.
> Or perhaps there is a better way to resolve this. Is this a deficiency
> in the Armada XP code?
>
> What do you think?
>

Thanks a lot for updating us. Richard, Boris and I have
discussed about this on IRC.

We came to the following conclusions:

1. There's no way to guarantee that the thread will be scheduled
soon enough to always do the data transfer in time.

Using CONFIG_PREEMPT could prevent a timeout in many (most?)
cases, but it's not guaranteed to work.

2. The pxa3xx-nand driver is poorly designed, and it shouldn't
do any data transfer in the IRQ (threaded or not) handler.

Instead, this should be done in {read,write}_buf, i.e. in the
context of the calling process that asks data to be read/write.
This is potentially a big rework.

3. Richard proposed to get rid of the timeout
(using wait_for_completion_interruptible instead?).

I'm wondering if that could avoid these kind of issues (false
timeouts), as a less invasive solution, until someone steps
up to rework the driver.

How does it sound?
-- 
Ezequiel García, VanguardiaSur
www.vanguardiasur.com.ar

^ permalink raw reply	[flat|nested] 30+ messages in thread

* mtd: pxa3xx_nand: issue with command time out
@ 2016-03-03 19:24                 ` Ezequiel Garcia
  0 siblings, 0 replies; 30+ messages in thread
From: Ezequiel Garcia @ 2016-03-03 19:24 UTC (permalink / raw)
  To: linux-arm-kernel

+Richard, Boris and Thomas

On 1 March 2016 at 19:17, Hamish Martin
<Hamish.Martin@alliedtelesis.co.nz> wrote:
> Hi Robert, Ezequiel, and Brian,
>
> I have been working with Michael Wang on this issue and wanted to update
> you on our results as we think there is a wider issue to resolve here.
>
> I believe Ezequiel's analysis of the IRQ sequence is correct. The
> proposed patch from Brian is not needed, not does it resolve our issue.
>
> Further, we have debugged in more detail with kernel tracing and see
> that the problem occurs because nothing calls 'schedule()' in the kernel
> within the timeout period (200ms) to actually allow the IRQ bottom-half
> thread ((pxa3xx_nand_irq_thread()) to run.
>
> With the trace enabled we were able to see long delays between the IRQ
> thread being queued for run as a result of the return with
> IRQ_WAKE_THREAD in pxa3xx_nand_irq(), and a call to schedule() actually
> selecting the thread for execution. This delay was tied to whatever else
> was happening in our system during the boot up when this issue occurs.
> For instance, if a mutex got unlocked this would ultimately lead to a
> call to schedule() and our thread (pxa3xx_nand_irq_thread()) would run.
> If some file system ops happened schedule() would get called. These
> unrelated activities were required so that schedule() got called and the
> scheduler would let the bottom-half execute.
> In our tracing we saw the issue occur whenever the call to schedule()
> didn't occur before the timeout expired. In short, if we got lucky and
> something completely unrelated happened that led to a call to
> 'schedule()', then the issue didn't occur. If we got unlucky and nothing
> called schedule(), the issue was observed.
>
> We have enabled CONFIG_PREEMPT in our system to resolve this. This has
> the effect of leading to a triggering of 'schedule()' at the end of
> processing interrupts. This means that we now see a nice tight execution
> of pxa3xx_nand threaded IRQ such that we see all ops completing with
> less that 20ms delay.
>
> I wanted to highlight this to you because I think it shows that the
> change to a threaded IRQ is probably only safe if we can guarantee
> scheduling of the bottom-half thread within the time out period (200ms).
> I think this is not safe for systems with CONFIG_PREEMPT not configured.
> Or perhaps there is a better way to resolve this. Is this a deficiency
> in the Armada XP code?
>
> What do you think?
>

Thanks a lot for updating us. Richard, Boris and I have
discussed about this on IRC.

We came to the following conclusions:

1. There's no way to guarantee that the thread will be scheduled
soon enough to always do the data transfer in time.

Using CONFIG_PREEMPT could prevent a timeout in many (most?)
cases, but it's not guaranteed to work.

2. The pxa3xx-nand driver is poorly designed, and it shouldn't
do any data transfer in the IRQ (threaded or not) handler.

Instead, this should be done in {read,write}_buf, i.e. in the
context of the calling process that asks data to be read/write.
This is potentially a big rework.

3. Richard proposed to get rid of the timeout
(using wait_for_completion_interruptible instead?).

I'm wondering if that could avoid these kind of issues (false
timeouts), as a less invasive solution, until someone steps
up to rework the driver.

How does it sound?
-- 
Ezequiel Garc?a, VanguardiaSur
www.vanguardiasur.com.ar

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: mtd: pxa3xx_nand: issue with command time out
  2016-03-03 19:24                 ` Ezequiel Garcia
@ 2016-03-03 20:16                   ` Hamish Martin
  -1 siblings, 0 replies; 30+ messages in thread
From: Hamish Martin @ 2016-03-03 20:16 UTC (permalink / raw)
  To: Ezequiel Garcia
  Cc: Robert Jarzmik, Brian Norris, linux-mtd, Michael Wang,
	linux-arm-kernel, Boris Brezillon, Richard Weinberger,
	Thomas Petazzoni, Chris Packham

Hi Ezequiel,

I'm glad my analysis was correct and we more fully understand the 
problem now. CONFIG_PREEMPT seems to work for us now, but it does seem 
like the wrong fix, and if you guys think it will not guarantee 
preventing the issue then we'd rather back that change out.

I had contemplated using wait_for_completion_interruptible() but was a 
bit unsure if it was safe. Do you have a proposed patch? We would be 
happy to apply it to our system, turn off CONFIG_PREEMPT, and see how it 
goes. Out issue is highly reproducible so I think it would be a good 
first test for a change?

Thanks,
Hamish M.

On 03/04/2016 08:24 AM, Ezequiel Garcia wrote:
> +Richard, Boris and Thomas
>
> On 1 March 2016 at 19:17, Hamish Martin
> <Hamish.Martin@alliedtelesis.co.nz> wrote:
>> Hi Robert, Ezequiel, and Brian,
>>
>> I have been working with Michael Wang on this issue and wanted to update
>> you on our results as we think there is a wider issue to resolve here.
>>
>> I believe Ezequiel's analysis of the IRQ sequence is correct. The
>> proposed patch from Brian is not needed, not does it resolve our issue.
>>
>> Further, we have debugged in more detail with kernel tracing and see
>> that the problem occurs because nothing calls 'schedule()' in the kernel
>> within the timeout period (200ms) to actually allow the IRQ bottom-half
>> thread ((pxa3xx_nand_irq_thread()) to run.
>>
>> With the trace enabled we were able to see long delays between the IRQ
>> thread being queued for run as a result of the return with
>> IRQ_WAKE_THREAD in pxa3xx_nand_irq(), and a call to schedule() actually
>> selecting the thread for execution. This delay was tied to whatever else
>> was happening in our system during the boot up when this issue occurs.
>> For instance, if a mutex got unlocked this would ultimately lead to a
>> call to schedule() and our thread (pxa3xx_nand_irq_thread()) would run.
>> If some file system ops happened schedule() would get called. These
>> unrelated activities were required so that schedule() got called and the
>> scheduler would let the bottom-half execute.
>> In our tracing we saw the issue occur whenever the call to schedule()
>> didn't occur before the timeout expired. In short, if we got lucky and
>> something completely unrelated happened that led to a call to
>> 'schedule()', then the issue didn't occur. If we got unlucky and nothing
>> called schedule(), the issue was observed.
>>
>> We have enabled CONFIG_PREEMPT in our system to resolve this. This has
>> the effect of leading to a triggering of 'schedule()' at the end of
>> processing interrupts. This means that we now see a nice tight execution
>> of pxa3xx_nand threaded IRQ such that we see all ops completing with
>> less that 20ms delay.
>>
>> I wanted to highlight this to you because I think it shows that the
>> change to a threaded IRQ is probably only safe if we can guarantee
>> scheduling of the bottom-half thread within the time out period (200ms).
>> I think this is not safe for systems with CONFIG_PREEMPT not configured.
>> Or perhaps there is a better way to resolve this. Is this a deficiency
>> in the Armada XP code?
>>
>> What do you think?
>>
> Thanks a lot for updating us. Richard, Boris and I have
> discussed about this on IRC.
>
> We came to the following conclusions:
>
> 1. There's no way to guarantee that the thread will be scheduled
> soon enough to always do the data transfer in time.
>
> Using CONFIG_PREEMPT could prevent a timeout in many (most?)
> cases, but it's not guaranteed to work.
>
> 2. The pxa3xx-nand driver is poorly designed, and it shouldn't
> do any data transfer in the IRQ (threaded or not) handler.
>
> Instead, this should be done in {read,write}_buf, i.e. in the
> context of the calling process that asks data to be read/write.
> This is potentially a big rework.
>
> 3. Richard proposed to get rid of the timeout
> (using wait_for_completion_interruptible instead?).
>
> I'm wondering if that could avoid these kind of issues (false
> timeouts), as a less invasive solution, until someone steps
> up to rework the driver.
>
> How does it sound?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* mtd: pxa3xx_nand: issue with command time out
@ 2016-03-03 20:16                   ` Hamish Martin
  0 siblings, 0 replies; 30+ messages in thread
From: Hamish Martin @ 2016-03-03 20:16 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Ezequiel,

I'm glad my analysis was correct and we more fully understand the 
problem now. CONFIG_PREEMPT seems to work for us now, but it does seem 
like the wrong fix, and if you guys think it will not guarantee 
preventing the issue then we'd rather back that change out.

I had contemplated using wait_for_completion_interruptible() but was a 
bit unsure if it was safe. Do you have a proposed patch? We would be 
happy to apply it to our system, turn off CONFIG_PREEMPT, and see how it 
goes. Out issue is highly reproducible so I think it would be a good 
first test for a change?

Thanks,
Hamish M.

On 03/04/2016 08:24 AM, Ezequiel Garcia wrote:
> +Richard, Boris and Thomas
>
> On 1 March 2016 at 19:17, Hamish Martin
> <Hamish.Martin@alliedtelesis.co.nz> wrote:
>> Hi Robert, Ezequiel, and Brian,
>>
>> I have been working with Michael Wang on this issue and wanted to update
>> you on our results as we think there is a wider issue to resolve here.
>>
>> I believe Ezequiel's analysis of the IRQ sequence is correct. The
>> proposed patch from Brian is not needed, not does it resolve our issue.
>>
>> Further, we have debugged in more detail with kernel tracing and see
>> that the problem occurs because nothing calls 'schedule()' in the kernel
>> within the timeout period (200ms) to actually allow the IRQ bottom-half
>> thread ((pxa3xx_nand_irq_thread()) to run.
>>
>> With the trace enabled we were able to see long delays between the IRQ
>> thread being queued for run as a result of the return with
>> IRQ_WAKE_THREAD in pxa3xx_nand_irq(), and a call to schedule() actually
>> selecting the thread for execution. This delay was tied to whatever else
>> was happening in our system during the boot up when this issue occurs.
>> For instance, if a mutex got unlocked this would ultimately lead to a
>> call to schedule() and our thread (pxa3xx_nand_irq_thread()) would run.
>> If some file system ops happened schedule() would get called. These
>> unrelated activities were required so that schedule() got called and the
>> scheduler would let the bottom-half execute.
>> In our tracing we saw the issue occur whenever the call to schedule()
>> didn't occur before the timeout expired. In short, if we got lucky and
>> something completely unrelated happened that led to a call to
>> 'schedule()', then the issue didn't occur. If we got unlucky and nothing
>> called schedule(), the issue was observed.
>>
>> We have enabled CONFIG_PREEMPT in our system to resolve this. This has
>> the effect of leading to a triggering of 'schedule()' at the end of
>> processing interrupts. This means that we now see a nice tight execution
>> of pxa3xx_nand threaded IRQ such that we see all ops completing with
>> less that 20ms delay.
>>
>> I wanted to highlight this to you because I think it shows that the
>> change to a threaded IRQ is probably only safe if we can guarantee
>> scheduling of the bottom-half thread within the time out period (200ms).
>> I think this is not safe for systems with CONFIG_PREEMPT not configured.
>> Or perhaps there is a better way to resolve this. Is this a deficiency
>> in the Armada XP code?
>>
>> What do you think?
>>
> Thanks a lot for updating us. Richard, Boris and I have
> discussed about this on IRC.
>
> We came to the following conclusions:
>
> 1. There's no way to guarantee that the thread will be scheduled
> soon enough to always do the data transfer in time.
>
> Using CONFIG_PREEMPT could prevent a timeout in many (most?)
> cases, but it's not guaranteed to work.
>
> 2. The pxa3xx-nand driver is poorly designed, and it shouldn't
> do any data transfer in the IRQ (threaded or not) handler.
>
> Instead, this should be done in {read,write}_buf, i.e. in the
> context of the calling process that asks data to be read/write.
> This is potentially a big rework.
>
> 3. Richard proposed to get rid of the timeout
> (using wait_for_completion_interruptible instead?).
>
> I'm wondering if that could avoid these kind of issues (false
> timeouts), as a less invasive solution, until someone steps
> up to rework the driver.
>
> How does it sound?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: mtd: pxa3xx_nand: issue with command time out
  2016-03-03 19:24                 ` Ezequiel Garcia
@ 2016-03-03 23:18                   ` Richard Weinberger
  -1 siblings, 0 replies; 30+ messages in thread
From: Richard Weinberger @ 2016-03-03 23:18 UTC (permalink / raw)
  To: Ezequiel Garcia, Hamish Martin
  Cc: Robert Jarzmik, Brian Norris, linux-mtd, Michael Wang,
	linux-arm-kernel, Boris Brezillon, Thomas Petazzoni

Am 03.03.2016 um 20:24 schrieb Ezequiel Garcia:
> +Richard, Boris and Thomas
> 
> On 1 March 2016 at 19:17, Hamish Martin
> <Hamish.Martin@alliedtelesis.co.nz> wrote:
>> Hi Robert, Ezequiel, and Brian,
>>
>> I have been working with Michael Wang on this issue and wanted to update
>> you on our results as we think there is a wider issue to resolve here.
>>
>> I believe Ezequiel's analysis of the IRQ sequence is correct. The
>> proposed patch from Brian is not needed, not does it resolve our issue.
>>
>> Further, we have debugged in more detail with kernel tracing and see
>> that the problem occurs because nothing calls 'schedule()' in the kernel
>> within the timeout period (200ms) to actually allow the IRQ bottom-half
>> thread ((pxa3xx_nand_irq_thread()) to run.
>>
>> With the trace enabled we were able to see long delays between the IRQ
>> thread being queued for run as a result of the return with
>> IRQ_WAKE_THREAD in pxa3xx_nand_irq(), and a call to schedule() actually
>> selecting the thread for execution. This delay was tied to whatever else
>> was happening in our system during the boot up when this issue occurs.
>> For instance, if a mutex got unlocked this would ultimately lead to a
>> call to schedule() and our thread (pxa3xx_nand_irq_thread()) would run.
>> If some file system ops happened schedule() would get called. These
>> unrelated activities were required so that schedule() got called and the
>> scheduler would let the bottom-half execute.
>> In our tracing we saw the issue occur whenever the call to schedule()
>> didn't occur before the timeout expired. In short, if we got lucky and
>> something completely unrelated happened that led to a call to
>> 'schedule()', then the issue didn't occur. If we got unlucky and nothing
>> called schedule(), the issue was observed.
>>
>> We have enabled CONFIG_PREEMPT in our system to resolve this. This has
>> the effect of leading to a triggering of 'schedule()' at the end of
>> processing interrupts. This means that we now see a nice tight execution
>> of pxa3xx_nand threaded IRQ such that we see all ops completing with
>> less that 20ms delay.
>>
>> I wanted to highlight this to you because I think it shows that the
>> change to a threaded IRQ is probably only safe if we can guarantee
>> scheduling of the bottom-half thread within the time out period (200ms).
>> I think this is not safe for systems with CONFIG_PREEMPT not configured.
>> Or perhaps there is a better way to resolve this. Is this a deficiency
>> in the Armada XP code?
>>
>> What do you think?
>>
> 
> Thanks a lot for updating us. Richard, Boris and I have
> discussed about this on IRC.
> 
> We came to the following conclusions:
> 
> 1. There's no way to guarantee that the thread will be scheduled
> soon enough to always do the data transfer in time.
> 
> Using CONFIG_PREEMPT could prevent a timeout in many (most?)
> cases, but it's not guaranteed to work.
> 
> 2. The pxa3xx-nand driver is poorly designed, and it shouldn't
> do any data transfer in the IRQ (threaded or not) handler.
> 
> Instead, this should be done in {read,write}_buf, i.e. in the
> context of the calling process that asks data to be read/write.
> This is potentially a big rework.
> 
> 3. Richard proposed to get rid of the timeout
> (using wait_for_completion_interruptible instead?).

Getting rid of the timeout can work but definitely needs more
clarification.
Before commit 24542257a3b987025d4b998ec2d15e556c98ad3f it made sense
but now the thread can be delayed for unknown time.

Thanks,
//richard

^ permalink raw reply	[flat|nested] 30+ messages in thread

* mtd: pxa3xx_nand: issue with command time out
@ 2016-03-03 23:18                   ` Richard Weinberger
  0 siblings, 0 replies; 30+ messages in thread
From: Richard Weinberger @ 2016-03-03 23:18 UTC (permalink / raw)
  To: linux-arm-kernel

Am 03.03.2016 um 20:24 schrieb Ezequiel Garcia:
> +Richard, Boris and Thomas
> 
> On 1 March 2016 at 19:17, Hamish Martin
> <Hamish.Martin@alliedtelesis.co.nz> wrote:
>> Hi Robert, Ezequiel, and Brian,
>>
>> I have been working with Michael Wang on this issue and wanted to update
>> you on our results as we think there is a wider issue to resolve here.
>>
>> I believe Ezequiel's analysis of the IRQ sequence is correct. The
>> proposed patch from Brian is not needed, not does it resolve our issue.
>>
>> Further, we have debugged in more detail with kernel tracing and see
>> that the problem occurs because nothing calls 'schedule()' in the kernel
>> within the timeout period (200ms) to actually allow the IRQ bottom-half
>> thread ((pxa3xx_nand_irq_thread()) to run.
>>
>> With the trace enabled we were able to see long delays between the IRQ
>> thread being queued for run as a result of the return with
>> IRQ_WAKE_THREAD in pxa3xx_nand_irq(), and a call to schedule() actually
>> selecting the thread for execution. This delay was tied to whatever else
>> was happening in our system during the boot up when this issue occurs.
>> For instance, if a mutex got unlocked this would ultimately lead to a
>> call to schedule() and our thread (pxa3xx_nand_irq_thread()) would run.
>> If some file system ops happened schedule() would get called. These
>> unrelated activities were required so that schedule() got called and the
>> scheduler would let the bottom-half execute.
>> In our tracing we saw the issue occur whenever the call to schedule()
>> didn't occur before the timeout expired. In short, if we got lucky and
>> something completely unrelated happened that led to a call to
>> 'schedule()', then the issue didn't occur. If we got unlucky and nothing
>> called schedule(), the issue was observed.
>>
>> We have enabled CONFIG_PREEMPT in our system to resolve this. This has
>> the effect of leading to a triggering of 'schedule()' at the end of
>> processing interrupts. This means that we now see a nice tight execution
>> of pxa3xx_nand threaded IRQ such that we see all ops completing with
>> less that 20ms delay.
>>
>> I wanted to highlight this to you because I think it shows that the
>> change to a threaded IRQ is probably only safe if we can guarantee
>> scheduling of the bottom-half thread within the time out period (200ms).
>> I think this is not safe for systems with CONFIG_PREEMPT not configured.
>> Or perhaps there is a better way to resolve this. Is this a deficiency
>> in the Armada XP code?
>>
>> What do you think?
>>
> 
> Thanks a lot for updating us. Richard, Boris and I have
> discussed about this on IRC.
> 
> We came to the following conclusions:
> 
> 1. There's no way to guarantee that the thread will be scheduled
> soon enough to always do the data transfer in time.
> 
> Using CONFIG_PREEMPT could prevent a timeout in many (most?)
> cases, but it's not guaranteed to work.
> 
> 2. The pxa3xx-nand driver is poorly designed, and it shouldn't
> do any data transfer in the IRQ (threaded or not) handler.
> 
> Instead, this should be done in {read,write}_buf, i.e. in the
> context of the calling process that asks data to be read/write.
> This is potentially a big rework.
> 
> 3. Richard proposed to get rid of the timeout
> (using wait_for_completion_interruptible instead?).

Getting rid of the timeout can work but definitely needs more
clarification.
Before commit 24542257a3b987025d4b998ec2d15e556c98ad3f it made sense
but now the thread can be delayed for unknown time.

Thanks,
//richard

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: mtd: pxa3xx_nand: issue with command time out
  2016-03-03 20:16                   ` Hamish Martin
@ 2016-03-04  0:03                     ` Ezequiel Garcia
  -1 siblings, 0 replies; 30+ messages in thread
From: Ezequiel Garcia @ 2016-03-04  0:03 UTC (permalink / raw)
  To: Hamish Martin
  Cc: Robert Jarzmik, Brian Norris, linux-mtd, Michael Wang,
	linux-arm-kernel, Boris Brezillon, Richard Weinberger,
	Thomas Petazzoni, Chris Packham

On 3 March 2016 at 17:16, Hamish Martin
<Hamish.Martin@alliedtelesis.co.nz> wrote:
> Hi Ezequiel,
>
> I'm glad my analysis was correct and we more fully understand the
> problem now. CONFIG_PREEMPT seems to work for us now, but it does seem
> like the wrong fix, and if you guys think it will not guarantee
> preventing the issue then we'd rather back that change out.
>

Do you have any good reason for NOT enabling CONFIG_PREEMPT?

I've had very bad experiences on embedded systems when not
using CONFIG_PREEMPT, and came to the conclusion that
enable full preemption, unless there's a good reason not to.

Enabling CONFIG_PREEMPT would result in slightly lower throughput,
but maximum latency will be greatly reduced.

For instance, in a kernel without preemption, a UBI static volume check
(which happens on the first UBI static volume open call), can take
several seconds.

> I had contemplated using wait_for_completion_interruptible() but was a
> bit unsure if it was safe. Do you have a proposed patch? We would be
> happy to apply it to our system, turn off CONFIG_PREEMPT, and see how it
> goes. Out issue is highly reproducible so I think it would be a good
> first test for a change?
>

Sure, I can try to cook a patch for you.
-- 
Ezequiel García, VanguardiaSur
www.vanguardiasur.com.ar

^ permalink raw reply	[flat|nested] 30+ messages in thread

* mtd: pxa3xx_nand: issue with command time out
@ 2016-03-04  0:03                     ` Ezequiel Garcia
  0 siblings, 0 replies; 30+ messages in thread
From: Ezequiel Garcia @ 2016-03-04  0:03 UTC (permalink / raw)
  To: linux-arm-kernel

On 3 March 2016 at 17:16, Hamish Martin
<Hamish.Martin@alliedtelesis.co.nz> wrote:
> Hi Ezequiel,
>
> I'm glad my analysis was correct and we more fully understand the
> problem now. CONFIG_PREEMPT seems to work for us now, but it does seem
> like the wrong fix, and if you guys think it will not guarantee
> preventing the issue then we'd rather back that change out.
>

Do you have any good reason for NOT enabling CONFIG_PREEMPT?

I've had very bad experiences on embedded systems when not
using CONFIG_PREEMPT, and came to the conclusion that
enable full preemption, unless there's a good reason not to.

Enabling CONFIG_PREEMPT would result in slightly lower throughput,
but maximum latency will be greatly reduced.

For instance, in a kernel without preemption, a UBI static volume check
(which happens on the first UBI static volume open call), can take
several seconds.

> I had contemplated using wait_for_completion_interruptible() but was a
> bit unsure if it was safe. Do you have a proposed patch? We would be
> happy to apply it to our system, turn off CONFIG_PREEMPT, and see how it
> goes. Out issue is highly reproducible so I think it would be a good
> first test for a change?
>

Sure, I can try to cook a patch for you.
-- 
Ezequiel Garc?a, VanguardiaSur
www.vanguardiasur.com.ar

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: mtd: pxa3xx_nand: issue with command time out
  2016-03-04  0:03                     ` Ezequiel Garcia
@ 2016-03-07 20:04                       ` Hamish Martin
  -1 siblings, 0 replies; 30+ messages in thread
From: Hamish Martin @ 2016-03-07 20:04 UTC (permalink / raw)
  To: Ezequiel Garcia
  Cc: Robert Jarzmik, Brian Norris, linux-mtd, Michael Wang,
	linux-arm-kernel, Boris Brezillon, Richard Weinberger,
	Thomas Petazzoni, Chris Packham

Hi Ezequiel,

On 03/04/2016 01:03 PM, Ezequiel Garcia wrote:
> On 3 March 2016 at 17:16, Hamish Martin
> <Hamish.Martin@alliedtelesis.co.nz> wrote:
>> Hi Ezequiel,
>>
>> I'm glad my analysis was correct and we more fully understand the
>> problem now. CONFIG_PREEMPT seems to work for us now, but it does seem
>> like the wrong fix, and if you guys think it will not guarantee
>> preventing the issue then we'd rather back that change out.
>>
> Do you have any good reason for NOT enabling CONFIG_PREEMPT?
>
> I've had very bad experiences on embedded systems when not
> using CONFIG_PREEMPT, and came to the conclusion that
> enable full preemption, unless there's a good reason not to.
>
> Enabling CONFIG_PREEMPT would result in slightly lower throughput,
> but maximum latency will be greatly reduced.
>
> For instance, in a kernel without preemption, a UBI static volume check
> (which happens on the first UBI static volume open call), can take
> several seconds.
We don't have a good reason, it is mainly an historic artifact. We are 
happy to go with that as a fix and we are investigating it for other 
platforms too. I think it is an oversight that hasn't been given the 
thought it should.
I think we still need to consider operation of the driver without 
CONFIG_PREEMPT enabled.

>
>> I had contemplated using wait_for_completion_interruptible() but was a
>> bit unsure if it was safe. Do you have a proposed patch? We would be
>> happy to apply it to our system, turn off CONFIG_PREEMPT, and see how it
>> goes. Out issue is highly reproducible so I think it would be a good
>> first test for a change?
>>
> Sure, I can try to cook a patch for you.
Thanks for your help with this guys. Send me the patch when you have one 
you want me to test out for you.

Cheers,
Hamish M.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* mtd: pxa3xx_nand: issue with command time out
@ 2016-03-07 20:04                       ` Hamish Martin
  0 siblings, 0 replies; 30+ messages in thread
From: Hamish Martin @ 2016-03-07 20:04 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Ezequiel,

On 03/04/2016 01:03 PM, Ezequiel Garcia wrote:
> On 3 March 2016 at 17:16, Hamish Martin
> <Hamish.Martin@alliedtelesis.co.nz> wrote:
>> Hi Ezequiel,
>>
>> I'm glad my analysis was correct and we more fully understand the
>> problem now. CONFIG_PREEMPT seems to work for us now, but it does seem
>> like the wrong fix, and if you guys think it will not guarantee
>> preventing the issue then we'd rather back that change out.
>>
> Do you have any good reason for NOT enabling CONFIG_PREEMPT?
>
> I've had very bad experiences on embedded systems when not
> using CONFIG_PREEMPT, and came to the conclusion that
> enable full preemption, unless there's a good reason not to.
>
> Enabling CONFIG_PREEMPT would result in slightly lower throughput,
> but maximum latency will be greatly reduced.
>
> For instance, in a kernel without preemption, a UBI static volume check
> (which happens on the first UBI static volume open call), can take
> several seconds.
We don't have a good reason, it is mainly an historic artifact. We are 
happy to go with that as a fix and we are investigating it for other 
platforms too. I think it is an oversight that hasn't been given the 
thought it should.
I think we still need to consider operation of the driver without 
CONFIG_PREEMPT enabled.

>
>> I had contemplated using wait_for_completion_interruptible() but was a
>> bit unsure if it was safe. Do you have a proposed patch? We would be
>> happy to apply it to our system, turn off CONFIG_PREEMPT, and see how it
>> goes. Out issue is highly reproducible so I think it would be a good
>> first test for a change?
>>
> Sure, I can try to cook a patch for you.
Thanks for your help with this guys. Send me the patch when you have one 
you want me to test out for you.

Cheers,
Hamish M.

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2016-03-07 20:05 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-01-13  4:13 mtd: pxa3xx_nand: issue with command time out Michael Wang
2016-01-13  4:13 ` Michael Wang
2016-01-13  7:47 ` Robert Jarzmik
2016-01-13  7:47   ` Robert Jarzmik
2016-01-14 22:57   ` Michael Wang
2016-01-16 23:53     ` Robert Jarzmik
2016-01-16 23:53       ` Robert Jarzmik
2016-01-18  2:15       ` Michael Wang
2016-01-18 21:13         ` Robert Jarzmik
2016-01-18 21:13           ` Robert Jarzmik
2016-01-19 22:33           ` Michael Wang
2016-01-19 22:33             ` Michael Wang
2016-01-25 18:30       ` Brian Norris
2016-01-25 18:30         ` Brian Norris
2016-01-25 20:48         ` Robert Jarzmik
2016-01-25 20:48           ` Robert Jarzmik
2016-01-26 13:03           ` Ezequiel Garcia
2016-01-26 13:03             ` Ezequiel Garcia
2016-03-01 22:17             ` Hamish Martin
2016-03-01 22:17               ` Hamish Martin
2016-03-03 19:24               ` Ezequiel Garcia
2016-03-03 19:24                 ` Ezequiel Garcia
2016-03-03 20:16                 ` Hamish Martin
2016-03-03 20:16                   ` Hamish Martin
2016-03-04  0:03                   ` Ezequiel Garcia
2016-03-04  0:03                     ` Ezequiel Garcia
2016-03-07 20:04                     ` Hamish Martin
2016-03-07 20:04                       ` Hamish Martin
2016-03-03 23:18                 ` Richard Weinberger
2016-03-03 23:18                   ` Richard Weinberger

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.