[PATCH v2] mtd: rawnand: marvell: check for RDY bits after enabling the IRQ

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v2] mtd: rawnand: marvell: check for RDY bits after enabling the IRQ
@ 2018-09-27  7:17 Daniel Mack
  2018-09-27  8:11 ` Miquel Raynal
  2018-10-01 22:44 ` Boris Brezillon
  0 siblings, 2 replies; 28+ messages in thread
From: Daniel Mack @ 2018-09-27  7:17 UTC (permalink / raw)
  To: miquel.raynal, boris.brezillon
  Cc: linux-mtd, chris.packham, Daniel Mack, stable

At least on PXA3xx platforms, enabling RDY interrupts in the NDCR register
will only cause the IRQ to latch when the RDY lanes are changing, and not
in case they are already asserted.

This means that if the controller finished the command in flight before
marvell_nfc_wait_op() is called, that function will wait for a change in
the bit that can't ever happen as it is already set.

To address this race, check for the RDY bits after the IRQ was enabled,
and complete the completion immediately if the condition is already met.

This fixes a bug that was observed with a NAND chip that holds a UBIFS
parition on which file system stress tests were executed. When
marvell_nfc_wait_op() reports an error, UBI/UBIFS will eventually mount
the filesystem read-only, reporting lots of warnings along the way.

Fixes: 02f26ecf8c77 mtd: nand: add reworked Marvell NAND controller driver
Cc: stable@vger.kernel.org
Signed-off-by: Daniel Mack <daniel@zonque.org>
---
v1 → v2:

 * Use complete(&nfc->complete) when the condition is met, and do
   wait_for_completion_timeout() in all cases. Suggested by Boris
   Brezillon.

 drivers/mtd/nand/raw/marvell_nand.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/drivers/mtd/nand/raw/marvell_nand.c b/drivers/mtd/nand/raw/marvell_nand.c
index 666f34b58dec..4870b5bae296 100644
--- a/drivers/mtd/nand/raw/marvell_nand.c
+++ b/drivers/mtd/nand/raw/marvell_nand.c
@@ -614,6 +614,7 @@ static int marvell_nfc_wait_op(struct nand_chip *chip, unsigned int timeout_ms)
 {
 	struct marvell_nfc *nfc = to_marvell_nfc(chip->controller);
 	int ret;
+	u32 st;

 	/* Timeout is expressed in ms */
 	if (!timeout_ms)
@@ -622,6 +623,15 @@ static int marvell_nfc_wait_op(struct nand_chip *chip, unsigned int timeout_ms)
 	init_completion(&nfc->complete);

 	marvell_nfc_enable_int(nfc, NDCR_RDYM);
+
+	/*
+	 * Check if the NDSR_RDY bits have already been set before the
+	 * interrupt was enabled.
+	 */
+	st = readl_relaxed(nfc->regs + NDSR);
+	if (st & (NDSR_RDY(0) | NDSR_RDY(1)))
+		complete(&nfc->complete);
+
 	ret = wait_for_completion_timeout(&nfc->complete,
 					  msecs_to_jiffies(timeout_ms));
 	marvell_nfc_disable_int(nfc, NDCR_RDYM);
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v2] mtd: rawnand: marvell: check for RDY bits after enabling the IRQ
  2018-09-27  7:17 [PATCH v2] mtd: rawnand: marvell: check for RDY bits after enabling the IRQ Daniel Mack
@ 2018-09-27  8:11 ` Miquel Raynal
  2018-09-27  8:56   ` Boris Brezillon
  2018-10-01 22:44 ` Boris Brezillon
  1 sibling, 1 reply; 28+ messages in thread
From: Miquel Raynal @ 2018-09-27  8:11 UTC (permalink / raw)
  To: Daniel Mack; +Cc: boris.brezillon, linux-mtd, chris.packham, stable

Hi Daniel,

Daniel Mack <daniel@zonque.org> wrote on Thu, 27 Sep 2018 09:17:51
+0200:

> At least on PXA3xx platforms, enabling RDY interrupts in the NDCR register
> will only cause the IRQ to latch when the RDY lanes are changing, and not
> in case they are already asserted.
> 
> This means that if the controller finished the command in flight before
> marvell_nfc_wait_op() is called, that function will wait for a change in
> the bit that can't ever happen as it is already set.
> 
> To address this race, check for the RDY bits after the IRQ was enabled,
> and complete the completion immediately if the condition is already met.
> 
> This fixes a bug that was observed with a NAND chip that holds a UBIFS
> parition on which file system stress tests were executed. When
> marvell_nfc_wait_op() reports an error, UBI/UBIFS will eventually mount
> the filesystem read-only, reporting lots of warnings along the way.
> 
> Fixes: 02f26ecf8c77 mtd: nand: add reworked Marvell NAND controller driver
> Cc: stable@vger.kernel.org
> Signed-off-by: Daniel Mack <daniel@zonque.org>
> ---

Sorry I haven't had the time to check on my Armada, but you figured it
out, and the fix looks good to me!

Acked-by: Miquel Raynal <miquel.raynal@bootlin.com>

Boris, do you plan to send another fixes PR of can I take it into
the nand/next branch?

Thanks,
Miquèl

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2] mtd: rawnand: marvell: check for RDY bits after enabling the IRQ
  2018-09-27  8:11 ` Miquel Raynal
@ 2018-09-27  8:56   ` Boris Brezillon
  2018-09-27 21:55     ` Chris Packham
  0 siblings, 1 reply; 28+ messages in thread
From: Boris Brezillon @ 2018-09-27  8:56 UTC (permalink / raw)
  To: Miquel Raynal; +Cc: Daniel Mack, linux-mtd, chris.packham, stable

On Thu, 27 Sep 2018 10:11:45 +0200
Miquel Raynal <miquel.raynal@bootlin.com> wrote:

> Hi Daniel,
> 
> Daniel Mack <daniel@zonque.org> wrote on Thu, 27 Sep 2018 09:17:51
> +0200:
> 
> > At least on PXA3xx platforms, enabling RDY interrupts in the NDCR register
> > will only cause the IRQ to latch when the RDY lanes are changing, and not
> > in case they are already asserted.
> > 
> > This means that if the controller finished the command in flight before
> > marvell_nfc_wait_op() is called, that function will wait for a change in
> > the bit that can't ever happen as it is already set.
> > 
> > To address this race, check for the RDY bits after the IRQ was enabled,
> > and complete the completion immediately if the condition is already met.
> > 
> > This fixes a bug that was observed with a NAND chip that holds a UBIFS
> > parition on which file system stress tests were executed. When
> > marvell_nfc_wait_op() reports an error, UBI/UBIFS will eventually mount
> > the filesystem read-only, reporting lots of warnings along the way.
> > 
> > Fixes: 02f26ecf8c77 mtd: nand: add reworked Marvell NAND controller driver
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Daniel Mack <daniel@zonque.org>
> > ---  
> 
> Sorry I haven't had the time to check on my Armada, but you figured it
> out, and the fix looks good to me!
> 
> Acked-by: Miquel Raynal <miquel.raynal@bootlin.com>
> 
> Boris, do you plan to send another fixes PR of can I take it into
> the nand/next branch?

Queued to mtd/master.

Thanks,

Boris

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2] mtd: rawnand: marvell: check for RDY bits after enabling the IRQ
  2018-09-27  8:56   ` Boris Brezillon
@ 2018-09-27 21:55     ` Chris Packham
  2018-09-28  6:40       ` Boris Brezillon
  2018-09-28  7:43       ` Daniel Mack
  0 siblings, 2 replies; 28+ messages in thread
From: Chris Packham @ 2018-09-27 21:55 UTC (permalink / raw)
  To: Boris Brezillon, Miquel Raynal; +Cc: Daniel Mack, linux-mtd, stable

Hi All,

On 27/09/18 20:56, Boris Brezillon wrote:
> On Thu, 27 Sep 2018 10:11:45 +0200
> Miquel Raynal <miquel.raynal@bootlin.com> wrote:
> 
>> Hi Daniel,
>>
>> Daniel Mack <daniel@zonque.org> wrote on Thu, 27 Sep 2018 09:17:51
>> +0200:
>>
>>> At least on PXA3xx platforms, enabling RDY interrupts in the NDCR register
>>> will only cause the IRQ to latch when the RDY lanes are changing, and not
>>> in case they are already asserted.
>>>
>>> This means that if the controller finished the command in flight before
>>> marvell_nfc_wait_op() is called, that function will wait for a change in
>>> the bit that can't ever happen as it is already set.
>>>
>>> To address this race, check for the RDY bits after the IRQ was enabled,
>>> and complete the completion immediately if the condition is already met.
>>>
>>> This fixes a bug that was observed with a NAND chip that holds a UBIFS
>>> parition on which file system stress tests were executed. When
>>> marvell_nfc_wait_op() reports an error, UBI/UBIFS will eventually mount
>>> the filesystem read-only, reporting lots of warnings along the way.
>>>
>>> Fixes: 02f26ecf8c77 mtd: nand: add reworked Marvell NAND controller driver
>>> Cc: stable@vger.kernel.org
>>> Signed-off-by: Daniel Mack <daniel@zonque.org>
>>> ---
>>
>> Sorry I haven't had the time to check on my Armada, but you figured it
>> out, and the fix looks good to me!
>>
>> Acked-by: Miquel Raynal <miquel.raynal@bootlin.com>
>>
>> Boris, do you plan to send another fixes PR of can I take it into
>> the nand/next branch?
> 
> Queued to mtd/master.

After fixing my R/B configuration I get a new error with this patch when 
running stress_1 from mtd-utils-2.0.0. I don't see this without the patch.

My board is a custom design using an Armada-385 SoC with Macronix NAND.

# stress_1
marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ/WRDREQ while 
draining raw data (NDSR: 0x00000000)
ubi0 error: ubi_io_write: error -5 while writing 4096 bytes to PEB 
1858:110592, written 0 bytes
CPU: 1 PID: 1170 Comm: stress_1 Not tainted 4.19.0-rc5-at1+ #8
Hardware name: Marvell Armada 380/385 (Device Tree)
[<8011143c>] (unwind_backtrace) from [<8010c17c>] (show_stack+0x10/0x14)
[<8010c17c>] (show_stack) from [<805ec28c>] (dump_stack+0x88/0x9c)
[<805ec28c>] (dump_stack) from [<80418a28>] (ubi_io_write+0x55c/0x6c0)
[<80418a28>] (ubi_io_write) from [<80415b4c>] (ubi_eba_write_leb+0x80/0x780)
[<80415b4c>] (ubi_eba_write_leb) from [<80414580>] (ubi_leb_write+0xbc/0xe0)
[<80414580>] (ubi_leb_write) from [<802d46b4>] (ubifs_leb_write+0xa0/0x118)
[<802d46b4>] (ubifs_leb_write) from [<802d5620>] 
(ubifs_wbuf_write_nolock+0x184/0x6ac)
[<802d5620>] (ubifs_wbuf_write_nolock) from [<802c8a18>] 
(ubifs_jnl_write_data+0x1c0/0x2bc)
[<802c8a18>] (ubifs_jnl_write_data) from [<802caed8>] 
(do_writepage+0xa4/0x1b0)
[<802caed8>] (do_writepage) from [<801aa160>] (__writepage+0x14/0x48)
[<801aa160>] (__writepage) from [<801aa900>] (write_cache_pages+0x1d0/0x3e4)
[<801aa900>] (write_cache_pages) from [<801aab68>] 
(generic_writepages+0x54/0x80)
[<801aab68>] (generic_writepages) from [<801ac9a0>] 
(do_writepages+0x68/0x8c)
[<801ac9a0>] (do_writepages) from [<801a0ac8>] 
(__filemap_fdatawrite_range+0x88/0xc0)
[<801a0ac8>] (__filemap_fdatawrite_range) from [<801a0cc4>] 
(file_write_and_wait_range+0x3c/0x98)
[<801a0cc4>] (file_write_and_wait_range) from [<802cb600>] 
(ubifs_fsync+0x3c/0xb0)
[<802cb600>] (ubifs_fsync) from [<801a2828>] 
(generic_file_write_iter+0x198/0x24c)
[<801a2828>] (generic_file_write_iter) from [<802ccb84>] 
(ubifs_write_iter+0xf0/0x158)
[<802ccb84>] (ubifs_write_iter) from [<801ef854>] (__vfs_write+0xfc/0x160)
[<801ef854>] (__vfs_write) from [<801efa60>] (vfs_write+0xa4/0x1ac)
[<801efa60>] (vfs_write) from [<801efcac>] (ksys_write+0x54/0xb8)
[<801efcac>] (ksys_write) from [<80101000>] (ret_fast_syscall+0x0/0x54)
Exception stack(0xbd789fa8 to 0xbd789ff0)
9fa0:                   0ca5d000 00000000 00000003 7e9f2900 00008000 
ffffffff
9fc0: 0ca5d000 00000000 00008000 00000004 00000003 00000000 76f24fb4 
00000000
9fe0: 00000000 7e9f27fc 00010fd8 76e775ec
marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ while draining 
FIFO (data) (NDSR: 0x00000810)
ttyS ttyS1: tty_port_close_start: tty->count = 1 port count = 2
marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ while draining 
FIFO (data) (NDSR: 0x00000810)
marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ while draining 
FIFO (data) (NDSR: 0x00000810)
marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ while draining 
FIFO (data) (NDSR: 0x00000810)
marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ while draining 
FIFO (data) (NDSR: 0x00000810)
marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ while draining 
FIFO (data) (NDSR: 0x00000810)
marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ while draining 
FIFO (data) (NDSR: 0x00000810)

... (RDDREQ messages repeat).

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2] mtd: rawnand: marvell: check for RDY bits after enabling the IRQ
  2018-09-27 21:55     ` Chris Packham
@ 2018-09-28  6:40       ` Boris Brezillon
  2018-09-28  6:56         ` Boris Brezillon
  2018-09-28  8:12         ` Miquel Raynal
  2018-09-28  7:43       ` Daniel Mack
  1 sibling, 2 replies; 28+ messages in thread
From: Boris Brezillon @ 2018-09-28  6:40 UTC (permalink / raw)
  To: Chris Packham; +Cc: Miquel Raynal, Daniel Mack, linux-mtd, stable

On Thu, 27 Sep 2018 21:55:57 +0000
Chris Packham <Chris.Packham@alliedtelesis.co.nz> wrote:

> Hi All,
> 
> On 27/09/18 20:56, Boris Brezillon wrote:
> > On Thu, 27 Sep 2018 10:11:45 +0200
> > Miquel Raynal <miquel.raynal@bootlin.com> wrote:
> >   
> >> Hi Daniel,
> >>
> >> Daniel Mack <daniel@zonque.org> wrote on Thu, 27 Sep 2018 09:17:51
> >> +0200:
> >>  
> >>> At least on PXA3xx platforms, enabling RDY interrupts in the NDCR register
> >>> will only cause the IRQ to latch when the RDY lanes are changing, and not
> >>> in case they are already asserted.
> >>>
> >>> This means that if the controller finished the command in flight before
> >>> marvell_nfc_wait_op() is called, that function will wait for a change in
> >>> the bit that can't ever happen as it is already set.
> >>>
> >>> To address this race, check for the RDY bits after the IRQ was enabled,
> >>> and complete the completion immediately if the condition is already met.
> >>>
> >>> This fixes a bug that was observed with a NAND chip that holds a UBIFS
> >>> parition on which file system stress tests were executed. When
> >>> marvell_nfc_wait_op() reports an error, UBI/UBIFS will eventually mount
> >>> the filesystem read-only, reporting lots of warnings along the way.
> >>>
> >>> Fixes: 02f26ecf8c77 mtd: nand: add reworked Marvell NAND controller driver
> >>> Cc: stable@vger.kernel.org
> >>> Signed-off-by: Daniel Mack <daniel@zonque.org>
> >>> ---  
> >>
> >> Sorry I haven't had the time to check on my Armada, but you figured it
> >> out, and the fix looks good to me!
> >>
> >> Acked-by: Miquel Raynal <miquel.raynal@bootlin.com>
> >>
> >> Boris, do you plan to send another fixes PR of can I take it into
> >> the nand/next branch?  
> > 
> > Queued to mtd/master.  
> 
> After fixing my R/B configuration I get a new error with this patch when 
> running stress_1 from mtd-utils-2.0.0. I don't see this without the patch.
> 
> My board is a custom design using an Armada-385 SoC with Macronix NAND.
> 
> # stress_1
> marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ/WRDREQ while 
> draining raw data (NDSR: 0x00000000)
> ubi0 error: ubi_io_write: error -5 while writing 4096 bytes to PEB 
> 1858:110592, written 0 bytes
> CPU: 1 PID: 1170 Comm: stress_1 Not tainted 4.19.0-rc5-at1+ #8
> Hardware name: Marvell Armada 380/385 (Device Tree)
> [<8011143c>] (unwind_backtrace) from [<8010c17c>] (show_stack+0x10/0x14)
> [<8010c17c>] (show_stack) from [<805ec28c>] (dump_stack+0x88/0x9c)
> [<805ec28c>] (dump_stack) from [<80418a28>] (ubi_io_write+0x55c/0x6c0)
> [<80418a28>] (ubi_io_write) from [<80415b4c>] (ubi_eba_write_leb+0x80/0x780)
> [<80415b4c>] (ubi_eba_write_leb) from [<80414580>] (ubi_leb_write+0xbc/0xe0)
> [<80414580>] (ubi_leb_write) from [<802d46b4>] (ubifs_leb_write+0xa0/0x118)
> [<802d46b4>] (ubifs_leb_write) from [<802d5620>] 
> (ubifs_wbuf_write_nolock+0x184/0x6ac)
> [<802d5620>] (ubifs_wbuf_write_nolock) from [<802c8a18>] 
> (ubifs_jnl_write_data+0x1c0/0x2bc)
> [<802c8a18>] (ubifs_jnl_write_data) from [<802caed8>] 
> (do_writepage+0xa4/0x1b0)
> [<802caed8>] (do_writepage) from [<801aa160>] (__writepage+0x14/0x48)
> [<801aa160>] (__writepage) from [<801aa900>] (write_cache_pages+0x1d0/0x3e4)
> [<801aa900>] (write_cache_pages) from [<801aab68>] 
> (generic_writepages+0x54/0x80)
> [<801aab68>] (generic_writepages) from [<801ac9a0>] 
> (do_writepages+0x68/0x8c)
> [<801ac9a0>] (do_writepages) from [<801a0ac8>] 
> (__filemap_fdatawrite_range+0x88/0xc0)
> [<801a0ac8>] (__filemap_fdatawrite_range) from [<801a0cc4>] 
> (file_write_and_wait_range+0x3c/0x98)
> [<801a0cc4>] (file_write_and_wait_range) from [<802cb600>] 
> (ubifs_fsync+0x3c/0xb0)
> [<802cb600>] (ubifs_fsync) from [<801a2828>] 
> (generic_file_write_iter+0x198/0x24c)
> [<801a2828>] (generic_file_write_iter) from [<802ccb84>] 
> (ubifs_write_iter+0xf0/0x158)
> [<802ccb84>] (ubifs_write_iter) from [<801ef854>] (__vfs_write+0xfc/0x160)
> [<801ef854>] (__vfs_write) from [<801efa60>] (vfs_write+0xa4/0x1ac)
> [<801efa60>] (vfs_write) from [<801efcac>] (ksys_write+0x54/0xb8)
> [<801efcac>] (ksys_write) from [<80101000>] (ret_fast_syscall+0x0/0x54)
> Exception stack(0xbd789fa8 to 0xbd789ff0)
> 9fa0:                   0ca5d000 00000000 00000003 7e9f2900 00008000 
> ffffffff
> 9fc0: 0ca5d000 00000000 00008000 00000004 00000003 00000000 76f24fb4 
> 00000000
> 9fe0: 00000000 7e9f27fc 00010fd8 76e775ec
> marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ while draining 
> FIFO (data) (NDSR: 0x00000810)
> ttyS ttyS1: tty_port_close_start: tty->count = 1 port count = 2
> marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ while draining 
> FIFO (data) (NDSR: 0x00000810)
> marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ while draining 
> FIFO (data) (NDSR: 0x00000810)
> marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ while draining 
> FIFO (data) (NDSR: 0x00000810)
> marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ while draining 
> FIFO (data) (NDSR: 0x00000810)
> marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ while draining 
> FIFO (data) (NDSR: 0x00000810)
> marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ while draining 
> FIFO (data) (NDSR: 0x00000810)
> 
> ... (RDDREQ messages repeat).

Hm, that's weird, unless RDDREQ is a 'clear-on-read' bit, that
shouldn't happen.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2] mtd: rawnand: marvell: check for RDY bits after enabling the IRQ
  2018-09-28  6:40       ` Boris Brezillon
@ 2018-09-28  6:56         ` Boris Brezillon
  2018-09-28  8:12         ` Miquel Raynal
  1 sibling, 0 replies; 28+ messages in thread
From: Boris Brezillon @ 2018-09-28  6:56 UTC (permalink / raw)
  To: Chris Packham; +Cc: linux-mtd, stable, Daniel Mack, Miquel Raynal

On Fri, 28 Sep 2018 08:40:46 +0200
Boris Brezillon <boris.brezillon@bootlin.com> wrote:

> > marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ while draining 
> > FIFO (data) (NDSR: 0x00000810)
> > ttyS ttyS1: tty_port_close_start: tty->count = 1 port count = 2
> > marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ while draining 
> > FIFO (data) (NDSR: 0x00000810)
> > marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ while draining 
> > FIFO (data) (NDSR: 0x00000810)
> > marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ while draining 
> > FIFO (data) (NDSR: 0x00000810)
> > marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ while draining 
> > FIFO (data) (NDSR: 0x00000810)
> > marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ while draining 
> > FIFO (data) (NDSR: 0x00000810)
> > marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ while draining 
> > FIFO (data) (NDSR: 0x00000810)
> > 
> > ... (RDDREQ messages repeat).  
> 
> Hm, that's weird, unless RDDREQ is a 'clear-on-read' bit, that
> shouldn't happen.

BTW, I dropped the patch.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2] mtd: rawnand: marvell: check for RDY bits after enabling the IRQ
  2018-09-27 21:55     ` Chris Packham
  2018-09-28  6:40       ` Boris Brezillon
@ 2018-09-28  7:43       ` Daniel Mack
  2018-09-28  8:24         ` Miquel Raynal
  1 sibling, 1 reply; 28+ messages in thread
From: Daniel Mack @ 2018-09-28  7:43 UTC (permalink / raw)
  To: Chris Packham, Boris Brezillon, Miquel Raynal; +Cc: linux-mtd, stable

Hi Chris,

On 27/9/2018 11:55 PM, Chris Packham wrote:
> On 27/09/18 20:56, Boris Brezillon wrote:
>> On Thu, 27 Sep 2018 10:11:45 +0200
>> Miquel Raynal <miquel.raynal@bootlin.com> wrote:
>>
>>> Hi Daniel,
>>>
>>> Daniel Mack <daniel@zonque.org> wrote on Thu, 27 Sep 2018 09:17:51
>>> +0200:
>>>
>>>> At least on PXA3xx platforms, enabling RDY interrupts in the NDCR register
>>>> will only cause the IRQ to latch when the RDY lanes are changing, and not
>>>> in case they are already asserted.
>>>>
>>>> This means that if the controller finished the command in flight before
>>>> marvell_nfc_wait_op() is called, that function will wait for a change in
>>>> the bit that can't ever happen as it is already set.
>>>>
>>>> To address this race, check for the RDY bits after the IRQ was enabled,
>>>> and complete the completion immediately if the condition is already met.
>>>>
>>>> This fixes a bug that was observed with a NAND chip that holds a UBIFS
>>>> parition on which file system stress tests were executed. When
>>>> marvell_nfc_wait_op() reports an error, UBI/UBIFS will eventually mount
>>>> the filesystem read-only, reporting lots of warnings along the way.
>>>>
>>>> Fixes: 02f26ecf8c77 mtd: nand: add reworked Marvell NAND controller driver
>>>> Cc: stable@vger.kernel.org
>>>> Signed-off-by: Daniel Mack <daniel@zonque.org>
>>>> ---
>>>
>>> Sorry I haven't had the time to check on my Armada, but you figured it
>>> out, and the fix looks good to me!
>>>
>>> Acked-by: Miquel Raynal <miquel.raynal@bootlin.com>
>>>
>>> Boris, do you plan to send another fixes PR of can I take it into
>>> the nand/next branch?
>>
>> Queued to mtd/master.
> 
> After fixing my R/B configuration I get a new error with this patch when
> running stress_1 from mtd-utils-2.0.0. I don't see this without the patch.

That's strange. So your controller sets the RDY bits before it is ready? 
Could you check whether only checking for NDSR_RDY(0) changes anything? 
Not sure about the handling of NDSR_RDY(1) in this driver anyway ...

Also, does my .EALREADY approach (v1) make any difference?


Thanks,
Daniel

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2] mtd: rawnand: marvell: check for RDY bits after enabling the IRQ
  2018-09-28  6:40       ` Boris Brezillon
  2018-09-28  6:56         ` Boris Brezillon
@ 2018-09-28  8:12         ` Miquel Raynal
  1 sibling, 0 replies; 28+ messages in thread
From: Miquel Raynal @ 2018-09-28  8:12 UTC (permalink / raw)
  To: Boris Brezillon; +Cc: Chris Packham, Daniel Mack, linux-mtd, stable

Hi Boris,

Boris Brezillon <boris.brezillon@bootlin.com> wrote on Fri, 28 Sep 2018
08:40:46 +0200:

> On Thu, 27 Sep 2018 21:55:57 +0000
> Chris Packham <Chris.Packham@alliedtelesis.co.nz> wrote:
> 
> > Hi All,
> > 
> > On 27/09/18 20:56, Boris Brezillon wrote:  
> > > On Thu, 27 Sep 2018 10:11:45 +0200
> > > Miquel Raynal <miquel.raynal@bootlin.com> wrote:
> > >     
> > >> Hi Daniel,
> > >>
> > >> Daniel Mack <daniel@zonque.org> wrote on Thu, 27 Sep 2018 09:17:51
> > >> +0200:
> > >>    
> > >>> At least on PXA3xx platforms, enabling RDY interrupts in the NDCR register
> > >>> will only cause the IRQ to latch when the RDY lanes are changing, and not
> > >>> in case they are already asserted.
> > >>>
> > >>> This means that if the controller finished the command in flight before
> > >>> marvell_nfc_wait_op() is called, that function will wait for a change in
> > >>> the bit that can't ever happen as it is already set.
> > >>>
> > >>> To address this race, check for the RDY bits after the IRQ was enabled,
> > >>> and complete the completion immediately if the condition is already met.
> > >>>
> > >>> This fixes a bug that was observed with a NAND chip that holds a UBIFS
> > >>> parition on which file system stress tests were executed. When
> > >>> marvell_nfc_wait_op() reports an error, UBI/UBIFS will eventually mount
> > >>> the filesystem read-only, reporting lots of warnings along the way.
> > >>>
> > >>> Fixes: 02f26ecf8c77 mtd: nand: add reworked Marvell NAND controller driver
> > >>> Cc: stable@vger.kernel.org
> > >>> Signed-off-by: Daniel Mack <daniel@zonque.org>
> > >>> ---    
> > >>
> > >> Sorry I haven't had the time to check on my Armada, but you figured it
> > >> out, and the fix looks good to me!
> > >>
> > >> Acked-by: Miquel Raynal <miquel.raynal@bootlin.com>
> > >>
> > >> Boris, do you plan to send another fixes PR of can I take it into
> > >> the nand/next branch?    
> > > 
> > > Queued to mtd/master.    
> > 
> > After fixing my R/B configuration I get a new error with this patch when 
> > running stress_1 from mtd-utils-2.0.0. I don't see this without the patch.
> > 
> > My board is a custom design using an Armada-385 SoC with Macronix NAND.
> > 
> > # stress_1
> > marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ/WRDREQ while 
> > draining raw data (NDSR: 0x00000000)
> > ubi0 error: ubi_io_write: error -5 while writing 4096 bytes to PEB 
> > 1858:110592, written 0 bytes
> > CPU: 1 PID: 1170 Comm: stress_1 Not tainted 4.19.0-rc5-at1+ #8
> > Hardware name: Marvell Armada 380/385 (Device Tree)
> > [<8011143c>] (unwind_backtrace) from [<8010c17c>] (show_stack+0x10/0x14)
> > [<8010c17c>] (show_stack) from [<805ec28c>] (dump_stack+0x88/0x9c)
> > [<805ec28c>] (dump_stack) from [<80418a28>] (ubi_io_write+0x55c/0x6c0)
> > [<80418a28>] (ubi_io_write) from [<80415b4c>] (ubi_eba_write_leb+0x80/0x780)
> > [<80415b4c>] (ubi_eba_write_leb) from [<80414580>] (ubi_leb_write+0xbc/0xe0)
> > [<80414580>] (ubi_leb_write) from [<802d46b4>] (ubifs_leb_write+0xa0/0x118)
> > [<802d46b4>] (ubifs_leb_write) from [<802d5620>] 
> > (ubifs_wbuf_write_nolock+0x184/0x6ac)
> > [<802d5620>] (ubifs_wbuf_write_nolock) from [<802c8a18>] 
> > (ubifs_jnl_write_data+0x1c0/0x2bc)
> > [<802c8a18>] (ubifs_jnl_write_data) from [<802caed8>] 
> > (do_writepage+0xa4/0x1b0)
> > [<802caed8>] (do_writepage) from [<801aa160>] (__writepage+0x14/0x48)
> > [<801aa160>] (__writepage) from [<801aa900>] (write_cache_pages+0x1d0/0x3e4)
> > [<801aa900>] (write_cache_pages) from [<801aab68>] 
> > (generic_writepages+0x54/0x80)
> > [<801aab68>] (generic_writepages) from [<801ac9a0>] 
> > (do_writepages+0x68/0x8c)
> > [<801ac9a0>] (do_writepages) from [<801a0ac8>] 
> > (__filemap_fdatawrite_range+0x88/0xc0)
> > [<801a0ac8>] (__filemap_fdatawrite_range) from [<801a0cc4>] 
> > (file_write_and_wait_range+0x3c/0x98)
> > [<801a0cc4>] (file_write_and_wait_range) from [<802cb600>] 
> > (ubifs_fsync+0x3c/0xb0)
> > [<802cb600>] (ubifs_fsync) from [<801a2828>] 
> > (generic_file_write_iter+0x198/0x24c)
> > [<801a2828>] (generic_file_write_iter) from [<802ccb84>] 
> > (ubifs_write_iter+0xf0/0x158)
> > [<802ccb84>] (ubifs_write_iter) from [<801ef854>] (__vfs_write+0xfc/0x160)
> > [<801ef854>] (__vfs_write) from [<801efa60>] (vfs_write+0xa4/0x1ac)
> > [<801efa60>] (vfs_write) from [<801efcac>] (ksys_write+0x54/0xb8)
> > [<801efcac>] (ksys_write) from [<80101000>] (ret_fast_syscall+0x0/0x54)
> > Exception stack(0xbd789fa8 to 0xbd789ff0)
> > 9fa0:                   0ca5d000 00000000 00000003 7e9f2900 00008000 
> > ffffffff
> > 9fc0: 0ca5d000 00000000 00008000 00000004 00000003 00000000 76f24fb4 
> > 00000000
> > 9fe0: 00000000 7e9f27fc 00010fd8 76e775ec
> > marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ while draining 
> > FIFO (data) (NDSR: 0x00000810)
> > ttyS ttyS1: tty_port_close_start: tty->count = 1 port count = 2
> > marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ while draining 
> > FIFO (data) (NDSR: 0x00000810)
> > marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ while draining 
> > FIFO (data) (NDSR: 0x00000810)
> > marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ while draining 
> > FIFO (data) (NDSR: 0x00000810)
> > marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ while draining 
> > FIFO (data) (NDSR: 0x00000810)
> > marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ while draining 
> > FIFO (data) (NDSR: 0x00000810)
> > marvell-nfc f10d0000.nand-controller: Timeout on RDDREQ while draining 
> > FIFO (data) (NDSR: 0x00000810)
> > 
> > ... (RDDREQ messages repeat).  
> 
> Hm, that's weird, unless RDDREQ is a 'clear-on-read' bit, that
> shouldn't happen.
> 

The spec clearly states it's not. In the status register (NDSR), you
must write a bit to clear it.

The RDDREQ and RDY0/1 bit checks do not share the same path. Would it
be possible that in some situations all the bits are not actually
cleared, and so reading the NDSR register tells us the event happened
while it has not in reality?

IIRC this patch fixes pxa implementation (NFCv1) while the errors
happen on an Armada 385 (NFCv2). It would be great to understand (and
document) this behavior, otherwise it is still possible to do the
manual check only for NFCv1 if this is the only IP suffering about the
race.

Thanks,
Miquèl

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2] mtd: rawnand: marvell: check for RDY bits after enabling the IRQ
  2018-09-28  7:43       ` Daniel Mack
@ 2018-09-28  8:24         ` Miquel Raynal
  2018-09-28  8:29           ` Daniel Mack
  0 siblings, 1 reply; 28+ messages in thread
From: Miquel Raynal @ 2018-09-28  8:24 UTC (permalink / raw)
  To: Daniel Mack; +Cc: Chris Packham, Boris Brezillon, linux-mtd, stable

Hi Daniel,

Daniel Mack <daniel@zonque.org> wrote on Fri, 28 Sep 2018 09:43:18
+0200:

> Hi Chris,
> 
> On 27/9/2018 11:55 PM, Chris Packham wrote:
> > On 27/09/18 20:56, Boris Brezillon wrote:  
> >> On Thu, 27 Sep 2018 10:11:45 +0200
> >> Miquel Raynal <miquel.raynal@bootlin.com> wrote:
> >>  
> >>> Hi Daniel,
> >>>
> >>> Daniel Mack <daniel@zonque.org> wrote on Thu, 27 Sep 2018 09:17:51
> >>> +0200:
> >>>  
> >>>> At least on PXA3xx platforms, enabling RDY interrupts in the NDCR register
> >>>> will only cause the IRQ to latch when the RDY lanes are changing, and not
> >>>> in case they are already asserted.
> >>>>
> >>>> This means that if the controller finished the command in flight before
> >>>> marvell_nfc_wait_op() is called, that function will wait for a change in
> >>>> the bit that can't ever happen as it is already set.
> >>>>
> >>>> To address this race, check for the RDY bits after the IRQ was enabled,
> >>>> and complete the completion immediately if the condition is already met.
> >>>>
> >>>> This fixes a bug that was observed with a NAND chip that holds a UBIFS
> >>>> parition on which file system stress tests were executed. When
> >>>> marvell_nfc_wait_op() reports an error, UBI/UBIFS will eventually mount
> >>>> the filesystem read-only, reporting lots of warnings along the way.
> >>>>
> >>>> Fixes: 02f26ecf8c77 mtd: nand: add reworked Marvell NAND controller driver
> >>>> Cc: stable@vger.kernel.org
> >>>> Signed-off-by: Daniel Mack <daniel@zonque.org>
> >>>> ---  
> >>>
> >>> Sorry I haven't had the time to check on my Armada, but you figured it
> >>> out, and the fix looks good to me!
> >>>
> >>> Acked-by: Miquel Raynal <miquel.raynal@bootlin.com>
> >>>
> >>> Boris, do you plan to send another fixes PR of can I take it into
> >>> the nand/next branch?  
> >>
> >> Queued to mtd/master.
> > > After fixing my R/B configuration I get a new error with this patch when  
> > running stress_1 from mtd-utils-2.0.0. I don't see this without the patch.  
> 
> That's strange. So your controller sets the RDY bits before it is ready? Could
> you check whether only checking for NDSR_RDY(0) changes anything? Not sure
> about the handling of NDSR_RDY(1) in this driver anyway ...
> 

I suppose you mean this portion of code is not clear enough?


        u32 st = readl_relaxed(nfc->regs + NDSR);
        u32 ien = (~readl_relaxed(nfc->regs + NDCR)) & NDCR_ALL_INT;

        /*
         * RDY interrupt mask is one bit in NDCR while there are two status
         * bit in NDSR (RDY[cs0/cs2] and RDY[cs1/cs3]).
         */
        if (st & NDSR_RDY(1))
                st |= NDSR_RDY(0);

        if (!(st & ien))
                return IRQ_NONE;


-> st is the status in the NDSR register which has two RDY bits, one
   for each RDY line.
-> ien is a view of the NDCR register which commands the interrupts
   and has one bit to enable both interrupt lines (let's call it
   NDCR_RDYM).

The trick is that NDSR_RDY(0) is the same bit as NDCR_RDYM.
So whenever NDSR_RDY(1) is set, we fake NDSR_RDY(0) to be set (by
setting manually the bit in 'st') so that the (st & ien) comparison
can be true if NDSR_RDY(1) is valid and RDY interrupts are enabled.

With this in mind, I don't see why this

+	st = readl_relaxed(nfc->regs + NDSR);
+	if (st & (NDSR_RDY(0) | NDSR_RDY(1)))
+		complete(&nfc->complete);

would break the driver.

Thanks,
Miquèl

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2] mtd: rawnand: marvell: check for RDY bits after enabling the IRQ
  2018-09-28  8:24         ` Miquel Raynal
@ 2018-09-28  8:29           ` Daniel Mack
  2018-09-30 21:10             ` Chris Packham
  0 siblings, 1 reply; 28+ messages in thread
From: Daniel Mack @ 2018-09-28  8:29 UTC (permalink / raw)
  To: Miquel Raynal; +Cc: Chris Packham, Boris Brezillon, linux-mtd, stable

On 28/9/2018 10:24 AM, Miquel Raynal wrote:
> Hi Daniel,
> 
> Daniel Mack <daniel@zonque.org> wrote on Fri, 28 Sep 2018 09:43:18
> +0200:
> 
>> Hi Chris,
>>
>> On 27/9/2018 11:55 PM, Chris Packham wrote:
>>> On 27/09/18 20:56, Boris Brezillon wrote:
>>>> On Thu, 27 Sep 2018 10:11:45 +0200
>>>> Miquel Raynal <miquel.raynal@bootlin.com> wrote:
>>>>   
>>>>> Hi Daniel,
>>>>>
>>>>> Daniel Mack <daniel@zonque.org> wrote on Thu, 27 Sep 2018 09:17:51
>>>>> +0200:
>>>>>   
>>>>>> At least on PXA3xx platforms, enabling RDY interrupts in the NDCR register
>>>>>> will only cause the IRQ to latch when the RDY lanes are changing, and not
>>>>>> in case they are already asserted.
>>>>>>
>>>>>> This means that if the controller finished the command in flight before
>>>>>> marvell_nfc_wait_op() is called, that function will wait for a change in
>>>>>> the bit that can't ever happen as it is already set.
>>>>>>
>>>>>> To address this race, check for the RDY bits after the IRQ was enabled,
>>>>>> and complete the completion immediately if the condition is already met.
>>>>>>
>>>>>> This fixes a bug that was observed with a NAND chip that holds a UBIFS
>>>>>> parition on which file system stress tests were executed. When
>>>>>> marvell_nfc_wait_op() reports an error, UBI/UBIFS will eventually mount
>>>>>> the filesystem read-only, reporting lots of warnings along the way.
>>>>>>
>>>>>> Fixes: 02f26ecf8c77 mtd: nand: add reworked Marvell NAND controller driver
>>>>>> Cc: stable@vger.kernel.org
>>>>>> Signed-off-by: Daniel Mack <daniel@zonque.org>
>>>>>> ---
>>>>>
>>>>> Sorry I haven't had the time to check on my Armada, but you figured it
>>>>> out, and the fix looks good to me!
>>>>>
>>>>> Acked-by: Miquel Raynal <miquel.raynal@bootlin.com>
>>>>>
>>>>> Boris, do you plan to send another fixes PR of can I take it into
>>>>> the nand/next branch?
>>>>
>>>> Queued to mtd/master.
>>>> After fixing my R/B configuration I get a new error with this patch when
>>> running stress_1 from mtd-utils-2.0.0. I don't see this without the patch.
>>
>> That's strange. So your controller sets the RDY bits before it is ready? Could
>> you check whether only checking for NDSR_RDY(0) changes anything? Not sure
>> about the handling of NDSR_RDY(1) in this driver anyway ...
>>
> 
> I suppose you mean this portion of code is not clear enough?
> 
> 
>          u32 st = readl_relaxed(nfc->regs + NDSR);
>          u32 ien = (~readl_relaxed(nfc->regs + NDCR)) & NDCR_ALL_INT;
> 
>          /*
>           * RDY interrupt mask is one bit in NDCR while there are two status
>           * bit in NDSR (RDY[cs0/cs2] and RDY[cs1/cs3]).
>           */
>          if (st & NDSR_RDY(1))
>                  st |= NDSR_RDY(0);
> 
>          if (!(st & ien))
>                  return IRQ_NONE;
> 
> 
> -> st is the status in the NDSR register which has two RDY bits, one
>     for each RDY line.
> -> ien is a view of the NDCR register which commands the interrupts
>     and has one bit to enable both interrupt lines (let's call it
>     NDCR_RDYM).
> 
> The trick is that NDSR_RDY(0) is the same bit as NDCR_RDYM.
> So whenever NDSR_RDY(1) is set, we fake NDSR_RDY(0) to be set (by
> setting manually the bit in 'st') so that the (st & ien) comparison
> can be true if NDSR_RDY(1) is valid and RDY interrupts are enabled.
> 
> With this in mind, I don't see why this
> 
> +	st = readl_relaxed(nfc->regs + NDSR);
> +	if (st & (NDSR_RDY(0) | NDSR_RDY(1)))
> +		complete(&nfc->complete);

Yeah, me neither. Chris, are you absolutely sure this is the reason? I'm 
asking because it took me several tries sometimes to trigger the bug, so 
is there a chance that you see an error at all times, regardless of 
whether my patch is applied?


Thanks,
Daniel

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2] mtd: rawnand: marvell: check for RDY bits after enabling the IRQ
  2018-09-28  8:29           ` Daniel Mack
@ 2018-09-30 21:10             ` Chris Packham
  2018-10-01  5:31               ` Daniel Mack
  0 siblings, 1 reply; 28+ messages in thread
From: Chris Packham @ 2018-09-30 21:10 UTC (permalink / raw)
  To: Daniel Mack, Miquel Raynal; +Cc: Boris Brezillon, linux-mtd, stable

On 28/09/18 20:29, Daniel Mack wrote:
> On 28/9/2018 10:24 AM, Miquel Raynal wrote:
>> Hi Daniel,
>>
>> Daniel Mack <daniel@zonque.org> wrote on Fri, 28 Sep 2018 09:43:18
>> +0200:
>>
>>> Hi Chris,
>>>
>>> On 27/9/2018 11:55 PM, Chris Packham wrote:
>>>> On 27/09/18 20:56, Boris Brezillon wrote:
>>>>> On Thu, 27 Sep 2018 10:11:45 +0200
>>>>> Miquel Raynal <miquel.raynal@bootlin.com> wrote:
>>>>>    
>>>>>> Hi Daniel,
>>>>>>
>>>>>> Daniel Mack <daniel@zonque.org> wrote on Thu, 27 Sep 2018 09:17:51
>>>>>> +0200:
>>>>>>    
>>>>>>> At least on PXA3xx platforms, enabling RDY interrupts in the NDCR register
>>>>>>> will only cause the IRQ to latch when the RDY lanes are changing, and not
>>>>>>> in case they are already asserted.
>>>>>>>
>>>>>>> This means that if the controller finished the command in flight before
>>>>>>> marvell_nfc_wait_op() is called, that function will wait for a change in
>>>>>>> the bit that can't ever happen as it is already set.
>>>>>>>
>>>>>>> To address this race, check for the RDY bits after the IRQ was enabled,
>>>>>>> and complete the completion immediately if the condition is already met.
>>>>>>>
>>>>>>> This fixes a bug that was observed with a NAND chip that holds a UBIFS
>>>>>>> parition on which file system stress tests were executed. When
>>>>>>> marvell_nfc_wait_op() reports an error, UBI/UBIFS will eventually mount
>>>>>>> the filesystem read-only, reporting lots of warnings along the way.
>>>>>>>
>>>>>>> Fixes: 02f26ecf8c77 mtd: nand: add reworked Marvell NAND controller driver
>>>>>>> Cc: stable@vger.kernel.org
>>>>>>> Signed-off-by: Daniel Mack <daniel@zonque.org>
>>>>>>> ---
>>>>>>
>>>>>> Sorry I haven't had the time to check on my Armada, but you figured it
>>>>>> out, and the fix looks good to me!
>>>>>>
>>>>>> Acked-by: Miquel Raynal <miquel.raynal@bootlin.com>
>>>>>>
>>>>>> Boris, do you plan to send another fixes PR of can I take it into
>>>>>> the nand/next branch?
>>>>>
>>>>> Queued to mtd/master.
>>>>> After fixing my R/B configuration I get a new error with this patch when
>>>> running stress_1 from mtd-utils-2.0.0. I don't see this without the patch.
>>>
>>> That's strange. So your controller sets the RDY bits before it is ready? Could
>>> you check whether only checking for NDSR_RDY(0) changes anything? Not sure
>>> about the handling of NDSR_RDY(1) in this driver anyway ...
>>>
>>
>> I suppose you mean this portion of code is not clear enough?
>>
>>
>>           u32 st = readl_relaxed(nfc->regs + NDSR);
>>           u32 ien = (~readl_relaxed(nfc->regs + NDCR)) & NDCR_ALL_INT;
>>
>>           /*
>>            * RDY interrupt mask is one bit in NDCR while there are two status
>>            * bit in NDSR (RDY[cs0/cs2] and RDY[cs1/cs3]).
>>            */
>>           if (st & NDSR_RDY(1))
>>                   st |= NDSR_RDY(0);
>>
>>           if (!(st & ien))
>>                   return IRQ_NONE;
>>
>>
>> -> st is the status in the NDSR register which has two RDY bits, one
>>      for each RDY line.
>> -> ien is a view of the NDCR register which commands the interrupts
>>      and has one bit to enable both interrupt lines (let's call it
>>      NDCR_RDYM).
>>
>> The trick is that NDSR_RDY(0) is the same bit as NDCR_RDYM.
>> So whenever NDSR_RDY(1) is set, we fake NDSR_RDY(0) to be set (by
>> setting manually the bit in 'st') so that the (st & ien) comparison
>> can be true if NDSR_RDY(1) is valid and RDY interrupts are enabled.
>>
>> With this in mind, I don't see why this
>>
>> +	st = readl_relaxed(nfc->regs + NDSR);
>> +	if (st & (NDSR_RDY(0) | NDSR_RDY(1)))
>> +		complete(&nfc->complete);
> 
> Yeah, me neither. Chris, are you absolutely sure this is the reason? I'm
> asking because it took me several tries sometimes to trigger the bug, so
> is there a chance that you see an error at all times, regardless of
> whether my patch is applied?

It seems pretty consistent. Without this patch there seems to be no 
problem. With this patch it triggers pretty much straight away. I can't 
discount that there might be something wrong with my dts (the R/B 
configuration was missing initially).

I've also been able to run this on the DB-88F6820-AMC board with the 
same result (the dts for this is in the for-next branch of 
git://git.infradead.org/linux-mvebu.git).

The really odd thing is the following seems to avoid the problem

+        st = readl_relaxed(nfc->regs + NDSR);
+        udelay(1000);
+        if (st & (NDSR_RDY(0) | NDSR_RDY(1)))
+                complete(&nfc->complete);

Which is weird because the st value has already been read so the udelay 
should have no effect.

On 28/09/18 19:43, Daniel Mack wrote:
 >
 > Also, does my .EALREADY approach (v1) make any difference?
 >

The v1 of this patch doesn't show the problem.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2] mtd: rawnand: marvell: check for RDY bits after enabling the IRQ
  2018-09-30 21:10             ` Chris Packham
@ 2018-10-01  5:31               ` Daniel Mack
  2018-10-01 19:59                 ` Chris Packham
  0 siblings, 1 reply; 28+ messages in thread
From: Daniel Mack @ 2018-10-01  5:31 UTC (permalink / raw)
  To: Chris Packham, Miquel Raynal; +Cc: Boris Brezillon, linux-mtd, stable

On 30/9/2018 11:10 PM, Chris Packham wrote:
>>> With this in mind, I don't see why this
>>>
>>> +	st = readl_relaxed(nfc->regs + NDSR);
>>> +	if (st & (NDSR_RDY(0) | NDSR_RDY(1)))
>>> +		complete(&nfc->complete);
>> Yeah, me neither. Chris, are you absolutely sure this is the reason? I'm
>> asking because it took me several tries sometimes to trigger the bug, so
>> is there a chance that you see an error at all times, regardless of
>> whether my patch is applied?
> It seems pretty consistent. Without this patch there seems to be no
> problem. With this patch it triggers pretty much straight away. I can't
> discount that there might be something wrong with my dts (the R/B
> configuration was missing initially).
> 
> I've also been able to run this on the DB-88F6820-AMC board with the
> same result (the dts for this is in the for-next branch of
> git://git.infradead.org/linux-mvebu.git).
> 
> The really odd thing is the following seems to avoid the problem
> 
> +        st = readl_relaxed(nfc->regs + NDSR);
> +        udelay(1000);
> +        if (st & (NDSR_RDY(0) | NDSR_RDY(1)))
> +                complete(&nfc->complete);
> 
> Which is weird because the st value has already been read so the udelay
> should have no effect.

Erm, yes. That's totally weird. Which gcc are you using for this?

Could you please try and use readl() here instead of readl_relaxed()? 
That will place a memory barrier after the read to enforce ordering.

But if this is a problem, many other parts of that driver should be 
equally affected.

> On 28/09/18 19:43, Daniel Mack wrote:
>   >
>   > Also, does my .EALREADY approach (v1) make any difference?
>   >
> 
> The v1 of this patch doesn't show the problem.

That's also very strange because the condition it triggers on is exactly 
the same.


Thanks,
Daniel

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2] mtd: rawnand: marvell: check for RDY bits after enabling the IRQ
  2018-10-01  5:31               ` Daniel Mack
@ 2018-10-01 19:59                 ` Chris Packham
  2018-10-01 20:34                   ` Boris Brezillon
  0 siblings, 1 reply; 28+ messages in thread
From: Chris Packham @ 2018-10-01 19:59 UTC (permalink / raw)
  To: Daniel Mack, Miquel Raynal; +Cc: Boris Brezillon, linux-mtd, stable

On 01/10/18 18:31, Daniel Mack wrote:
> On 30/9/2018 11:10 PM, Chris Packham wrote:
>>>> With this in mind, I don't see why this
>>>>
>>>> +	st = readl_relaxed(nfc->regs + NDSR);
>>>> +	if (st & (NDSR_RDY(0) | NDSR_RDY(1)))
>>>> +		complete(&nfc->complete);
>>> Yeah, me neither. Chris, are you absolutely sure this is the reason? I'm
>>> asking because it took me several tries sometimes to trigger the bug, so
>>> is there a chance that you see an error at all times, regardless of
>>> whether my patch is applied?
>> It seems pretty consistent. Without this patch there seems to be no
>> problem. With this patch it triggers pretty much straight away. I can't
>> discount that there might be something wrong with my dts (the R/B
>> configuration was missing initially).
>>
>> I've also been able to run this on the DB-88F6820-AMC board with the
>> same result (the dts for this is in the for-next branch of
>> git://git.infradead.org/linux-mvebu.git).
>>
>> The really odd thing is the following seems to avoid the problem
>>
>> +        st = readl_relaxed(nfc->regs + NDSR);
>> +        udelay(1000);
>> +        if (st & (NDSR_RDY(0) | NDSR_RDY(1)))
>> +                complete(&nfc->complete);
>>
>> Which is weird because the st value has already been read so the udelay
>> should have no effect.
> 
> Erm, yes. That's totally weird. Which gcc are you using for this?

arm-softfloat-linux-gnueabi-gcc (crosstool-NG crosstool-ng-1.22.0) 4.9.3

> Could you please try and use readl() here instead of readl_relaxed()?
> That will place a memory barrier after the read to enforce ordering.

I'd previously tried readl() based on the same hunch. No change.

I think my snippet above might be misleading. While a delay between 
readl_relaxed() and the if should not change the outcome, this is also a 
delay between marvell_nfc_enable_int() and marvell_nfc_disable_int() 
which is probably more significant. Sure enough if I move the delay to 
just before the marvell_nfc_disable_int() the error is not seen.

> But if this is a problem, many other parts of that driver should be
> equally affected.
> 
>> On 28/09/18 19:43, Daniel Mack wrote:
>>    >
>>    > Also, does my .EALREADY approach (v1) make any difference?
>>    >
>>
>> The v1 of this patch doesn't show the problem.
> 
> That's also very strange because the condition it triggers on is exactly
> the same.

One difference is that by calling complete() interrupts will be disabled 
in the spinlock.

> 
> 
> Thanks,
> Daniel
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2] mtd: rawnand: marvell: check for RDY bits after enabling the IRQ
  2018-10-01 19:59                 ` Chris Packham
@ 2018-10-01 20:34                   ` Boris Brezillon
  2018-10-01 21:41                     ` Boris Brezillon
  0 siblings, 1 reply; 28+ messages in thread
From: Boris Brezillon @ 2018-10-01 20:34 UTC (permalink / raw)
  To: Chris Packham; +Cc: Daniel Mack, Miquel Raynal, linux-mtd, stable

On Mon, 1 Oct 2018 19:59:11 +0000
Chris Packham <Chris.Packham@alliedtelesis.co.nz> wrote:

> On 01/10/18 18:31, Daniel Mack wrote:
> > On 30/9/2018 11:10 PM, Chris Packham wrote:  
> >>>> With this in mind, I don't see why this
> >>>>
> >>>> +	st = readl_relaxed(nfc->regs + NDSR);
> >>>> +	if (st & (NDSR_RDY(0) | NDSR_RDY(1)))
> >>>> +		complete(&nfc->complete);  
> >>> Yeah, me neither. Chris, are you absolutely sure this is the reason? I'm
> >>> asking because it took me several tries sometimes to trigger the bug, so
> >>> is there a chance that you see an error at all times, regardless of
> >>> whether my patch is applied?  
> >> It seems pretty consistent. Without this patch there seems to be no
> >> problem. With this patch it triggers pretty much straight away. I can't
> >> discount that there might be something wrong with my dts (the R/B
> >> configuration was missing initially).
> >>
> >> I've also been able to run this on the DB-88F6820-AMC board with the
> >> same result (the dts for this is in the for-next branch of
> >> git://git.infradead.org/linux-mvebu.git).
> >>
> >> The really odd thing is the following seems to avoid the problem
> >>
> >> +        st = readl_relaxed(nfc->regs + NDSR);
> >> +        udelay(1000);
> >> +        if (st & (NDSR_RDY(0) | NDSR_RDY(1)))
> >> +                complete(&nfc->complete);
> >>
> >> Which is weird because the st value has already been read so the udelay
> >> should have no effect.  
> > 
> > Erm, yes. That's totally weird. Which gcc are you using for this?  
> 
> arm-softfloat-linux-gnueabi-gcc (crosstool-NG crosstool-ng-1.22.0) 4.9.3
> 
> > Could you please try and use readl() here instead of readl_relaxed()?
> > That will place a memory barrier after the read to enforce ordering.  
> 
> I'd previously tried readl() based on the same hunch. No change.
> 
> I think my snippet above might be misleading. While a delay between 
> readl_relaxed() and the if should not change the outcome, this is also a 
> delay between marvell_nfc_enable_int() and marvell_nfc_disable_int() 
> which is probably more significant. Sure enough if I move the delay to 
> just before the marvell_nfc_disable_int() the error is not seen.

AFAICT, your timeout always happens when waiting for RDREQ, not RDYM.
So maybe disabling MRDY too early has a side-effect on the RDREQ event.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2] mtd: rawnand: marvell: check for RDY bits after enabling the IRQ
  2018-10-01 20:34                   ` Boris Brezillon
@ 2018-10-01 21:41                     ` Boris Brezillon
  2018-10-01 22:01                       ` Chris Packham
  0 siblings, 1 reply; 28+ messages in thread
From: Boris Brezillon @ 2018-10-01 21:41 UTC (permalink / raw)
  To: Chris Packham; +Cc: Daniel Mack, Miquel Raynal, linux-mtd, stable

On Mon, 1 Oct 2018 22:34:38 +0200
Boris Brezillon <boris.brezillon@bootlin.com> wrote:
  
> > 
> > I'd previously tried readl() based on the same hunch. No change.
> > 
> > I think my snippet above might be misleading. While a delay between 
> > readl_relaxed() and the if should not change the outcome, this is also a 
> > delay between marvell_nfc_enable_int() and marvell_nfc_disable_int() 
> > which is probably more significant. Sure enough if I move the delay to 
> > just before the marvell_nfc_disable_int() the error is not seen.  
> 
> AFAICT, your timeout always happens when waiting for RDREQ, not RDYM.
> So maybe disabling MRDY too early has a side-effect on the RDREQ event.

Can you try with this patch [1]? It should ensure that NDSR_RDY bits
are cleared before starting an operation.

[1]http://code.bulix.org/lgs30c-468205

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2] mtd: rawnand: marvell: check for RDY bits after enabling the IRQ
  2018-10-01 21:41                     ` Boris Brezillon
@ 2018-10-01 22:01                       ` Chris Packham
  2018-10-01 22:13                         ` Boris Brezillon
  0 siblings, 1 reply; 28+ messages in thread
From: Chris Packham @ 2018-10-01 22:01 UTC (permalink / raw)
  To: Boris Brezillon; +Cc: Daniel Mack, Miquel Raynal, linux-mtd, stable

On 02/10/18 10:41, Boris Brezillon wrote:
> On Mon, 1 Oct 2018 22:34:38 +0200
> Boris Brezillon <boris.brezillon@bootlin.com> wrote:
>    
>>>
>>> I'd previously tried readl() based on the same hunch. No change.
>>>
>>> I think my snippet above might be misleading. While a delay between
>>> readl_relaxed() and the if should not change the outcome, this is also a
>>> delay between marvell_nfc_enable_int() and marvell_nfc_disable_int()
>>> which is probably more significant. Sure enough if I move the delay to
>>> just before the marvell_nfc_disable_int() the error is not seen.
>>
>> AFAICT, your timeout always happens when waiting for RDREQ, not RDYM.
>> So maybe disabling MRDY too early has a side-effect on the RDREQ event.
> 
> Can you try with this patch [1]? It should ensure that NDSR_RDY bits
> are cleared before starting an operation.
> 
> [1]http://code.bulix.org/lgs30c-468205
> 

No luck. I applied that on top of Daniel's and got the same result.

One thing that does look promising is the following modification of 
Daniel's patch[1]. Which moves the RDY check to before where the 
interrupts are enabled.

[1] - http://code.bulix.org/gc7a6d-468226

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2] mtd: rawnand: marvell: check for RDY bits after enabling the IRQ
  2018-10-01 22:01                       ` Chris Packham
@ 2018-10-01 22:13                         ` Boris Brezillon
  2018-10-01 22:15                           ` Chris Packham
  2018-10-02  6:46                           ` Miquel Raynal
  0 siblings, 2 replies; 28+ messages in thread
From: Boris Brezillon @ 2018-10-01 22:13 UTC (permalink / raw)
  To: Chris Packham; +Cc: Daniel Mack, Miquel Raynal, linux-mtd, stable

On Mon, 1 Oct 2018 22:01:27 +0000
Chris Packham <Chris.Packham@alliedtelesis.co.nz> wrote:

> On 02/10/18 10:41, Boris Brezillon wrote:
> > On Mon, 1 Oct 2018 22:34:38 +0200
> > Boris Brezillon <boris.brezillon@bootlin.com> wrote:
> >      
> >>>
> >>> I'd previously tried readl() based on the same hunch. No change.
> >>>
> >>> I think my snippet above might be misleading. While a delay between
> >>> readl_relaxed() and the if should not change the outcome, this is also a
> >>> delay between marvell_nfc_enable_int() and marvell_nfc_disable_int()
> >>> which is probably more significant. Sure enough if I move the delay to
> >>> just before the marvell_nfc_disable_int() the error is not seen.  
> >>
> >> AFAICT, your timeout always happens when waiting for RDREQ, not RDYM.
> >> So maybe disabling MRDY too early has a side-effect on the RDREQ event.  
> > 
> > Can you try with this patch [1]? It should ensure that NDSR_RDY bits
> > are cleared before starting an operation.
> > 
> > [1]http://code.bulix.org/lgs30c-468205
> >   
> 
> No luck. I applied that on top of Daniel's and got the same result.
> 
> One thing that does look promising is the following modification of 
> Daniel's patch[1]. Which moves the RDY check to before where the 
> interrupts are enabled.

Except we still don't know why this is happening, and I'm not sure I
want to take a fix without understanding why it does fix the problem.

Also, it looks like complete() is not called until the RDDREQ, WRDREQ
and WRCMDREQ are cleared in the interrupt handler [1], which is weird.
Miquel, do you happen to remember why you had to do that?

[1]https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/mtd/nand/raw/marvell_nand.c?h=v4.19-rc6#n689

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2] mtd: rawnand: marvell: check for RDY bits after enabling the IRQ
  2018-10-01 22:13                         ` Boris Brezillon
@ 2018-10-01 22:15                           ` Chris Packham
  2018-10-02  9:36                             ` Boris Brezillon
  2018-10-02  6:46                           ` Miquel Raynal
  1 sibling, 1 reply; 28+ messages in thread
From: Chris Packham @ 2018-10-01 22:15 UTC (permalink / raw)
  To: Boris Brezillon; +Cc: Daniel Mack, Miquel Raynal, linux-mtd, stable

On 02/10/18 11:13, Boris Brezillon wrote:
> On Mon, 1 Oct 2018 22:01:27 +0000
> Chris Packham <Chris.Packham@alliedtelesis.co.nz> wrote:
> 
>> On 02/10/18 10:41, Boris Brezillon wrote:
>>> On Mon, 1 Oct 2018 22:34:38 +0200
>>> Boris Brezillon <boris.brezillon@bootlin.com> wrote:
>>>       
>>>>>
>>>>> I'd previously tried readl() based on the same hunch. No change.
>>>>>
>>>>> I think my snippet above might be misleading. While a delay between
>>>>> readl_relaxed() and the if should not change the outcome, this is also a
>>>>> delay between marvell_nfc_enable_int() and marvell_nfc_disable_int()
>>>>> which is probably more significant. Sure enough if I move the delay to
>>>>> just before the marvell_nfc_disable_int() the error is not seen.
>>>>
>>>> AFAICT, your timeout always happens when waiting for RDREQ, not RDYM.
>>>> So maybe disabling MRDY too early has a side-effect on the RDREQ event.
>>>
>>> Can you try with this patch [1]? It should ensure that NDSR_RDY bits
>>> are cleared before starting an operation.
>>>
>>> [1]http://code.bulix.org/lgs30c-468205
>>>    
>>
>> No luck. I applied that on top of Daniel's and got the same result.
>>
>> One thing that does look promising is the following modification of
>> Daniel's patch[1]. Which moves the RDY check to before where the
>> interrupts are enabled.
> 
> Except we still don't know why this is happening, and I'm not sure I
> want to take a fix without understanding why it does fix the problem.

Agreed. My only guess is that there is some interrupt that is missed in 
the short period they are disabled when calling complete().

> Also, it looks like complete() is not called until the RDDREQ, WRDREQ
> and WRCMDREQ are cleared in the interrupt handler [1], which is weird.
> Miquel, do you happen to remember why you had to do that?
> 
> [1]https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/mtd/nand/raw/marvell_nand.c?h=v4.19-rc6#n689
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2] mtd: rawnand: marvell: check for RDY bits after enabling the IRQ
  2018-09-27  7:17 [PATCH v2] mtd: rawnand: marvell: check for RDY bits after enabling the IRQ Daniel Mack
  2018-09-27  8:11 ` Miquel Raynal
@ 2018-10-01 22:44 ` Boris Brezillon
  2018-10-02  7:42   ` Daniel Mack
  1 sibling, 1 reply; 28+ messages in thread
From: Boris Brezillon @ 2018-10-01 22:44 UTC (permalink / raw)
  To: Daniel Mack; +Cc: miquel.raynal, linux-mtd, chris.packham, stable

Hi Daniel,

On Thu, 27 Sep 2018 09:17:51 +0200
Daniel Mack <daniel@zonque.org> wrote:

> At least on PXA3xx platforms, enabling RDY interrupts in the NDCR register
> will only cause the IRQ to latch when the RDY lanes are changing, and not
> in case they are already asserted.
> 
> This means that if the controller finished the command in flight before
> marvell_nfc_wait_op() is called, that function will wait for a change in
> the bit that can't ever happen as it is already set.
> 
> To address this race, check for the RDY bits after the IRQ was enabled,
> and complete the completion immediately if the condition is already met.
> 
> This fixes a bug that was observed with a NAND chip that holds a UBIFS
> parition on which file system stress tests were executed. When
> marvell_nfc_wait_op() reports an error, UBI/UBIFS will eventually mount
> the filesystem read-only, reporting lots of warnings along the way.
> 
> Fixes: 02f26ecf8c77 mtd: nand: add reworked Marvell NAND controller driver
> Cc: stable@vger.kernel.org
> Signed-off-by: Daniel Mack <daniel@zonque.org>

Can you try to replace your patch by the following one and let me know if
it solves your problem?

Thanks,

Boris

--->8---
diff --git a/drivers/mtd/nand/raw/marvell_nand.c b/drivers/mtd/nand/raw/marvell_nand.c
index e63f714f7639..295a86a5545f 100644
--- a/drivers/mtd/nand/raw/marvell_nand.c
+++ b/drivers/mtd/nand/raw/marvell_nand.c
@@ -1123,7 +1123,7 @@ static int marvell_nfc_hw_ecc_hmg_do_write_page(struct nand_chip *chip,
                memcpy(nfc->dma_buf, data_buf, lt->data_bytes);
                memcpy(nfc->dma_buf + lt->data_bytes, oob_buf, oob_bytes);
                marvell_nfc_xfer_data_dma(nfc, DMA_TO_DEVICE, lt->data_bytes +
-                                         lt->ecc_bytes + lt->spare_bytes);
+                                         oob_bytes);
        } else {
                marvell_nfc_xfer_data_out_pio(nfc, data_buf, lt->data_bytes);
                marvell_nfc_xfer_data_out_pio(nfc, oob_buf, oob_bytes);

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v2] mtd: rawnand: marvell: check for RDY bits after enabling the IRQ
  2018-10-01 22:13                         ` Boris Brezillon
  2018-10-01 22:15                           ` Chris Packham
@ 2018-10-02  6:46                           ` Miquel Raynal
  2018-10-02  7:25                             ` Miquel Raynal
  1 sibling, 1 reply; 28+ messages in thread
From: Miquel Raynal @ 2018-10-02  6:46 UTC (permalink / raw)
  To: Boris Brezillon; +Cc: Chris Packham, Daniel Mack, linux-mtd, stable

Hi Boris,

Boris Brezillon <boris.brezillon@bootlin.com> wrote on Tue, 2 Oct 2018
00:13:28 +0200:

> On Mon, 1 Oct 2018 22:01:27 +0000
> Chris Packham <Chris.Packham@alliedtelesis.co.nz> wrote:
> 
> > On 02/10/18 10:41, Boris Brezillon wrote:  
> > > On Mon, 1 Oct 2018 22:34:38 +0200
> > > Boris Brezillon <boris.brezillon@bootlin.com> wrote:
> > >        
> > >>>
> > >>> I'd previously tried readl() based on the same hunch. No change.
> > >>>
> > >>> I think my snippet above might be misleading. While a delay between
> > >>> readl_relaxed() and the if should not change the outcome, this is also a
> > >>> delay between marvell_nfc_enable_int() and marvell_nfc_disable_int()
> > >>> which is probably more significant. Sure enough if I move the delay to
> > >>> just before the marvell_nfc_disable_int() the error is not seen.    
> > >>
> > >> AFAICT, your timeout always happens when waiting for RDREQ, not RDYM.
> > >> So maybe disabling MRDY too early has a side-effect on the RDREQ event.    
> > > 
> > > Can you try with this patch [1]? It should ensure that NDSR_RDY bits
> > > are cleared before starting an operation.
> > > 
> > > [1]http://code.bulix.org/lgs30c-468205
> > >     
> > 
> > No luck. I applied that on top of Daniel's and got the same result.
> > 
> > One thing that does look promising is the following modification of 
> > Daniel's patch[1]. Which moves the RDY check to before where the 
> > interrupts are enabled.  
> 
> Except we still don't know why this is happening, and I'm not sure I
> want to take a fix without understanding why it does fix the problem.
> 
> Also, it looks like complete() is not called until the RDDREQ, WRDREQ
> and WRCMDREQ are cleared in the interrupt handler [1], which is weird.
> Miquel, do you happen to remember why you had to do that?

The RDDREQ, WRDREQ and WRCMDREQ events might potentially happen while
the interrupts are enabled while we only wait for R/B signalling. This
check is to avoid calling complete() on these situations.

> 
> [1]https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/mtd/nand/raw/marvell_nand.c?h=v4.19-rc6#n689


Thanks,
Miquèl

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2] mtd: rawnand: marvell: check for RDY bits after enabling the IRQ
  2018-10-02  6:46                           ` Miquel Raynal
@ 2018-10-02  7:25                             ` Miquel Raynal
  2018-10-02  8:22                               ` Daniel Mack
  0 siblings, 1 reply; 28+ messages in thread
From: Miquel Raynal @ 2018-10-02  7:25 UTC (permalink / raw)
  To: Boris Brezillon; +Cc: Chris Packham, Daniel Mack, linux-mtd, stable

Hi Daniel, Chris,

Miquel Raynal <miquel.raynal@bootlin.com> wrote on Tue, 2 Oct 2018
08:46:01 +0200:

> Hi Boris,
> 
> Boris Brezillon <boris.brezillon@bootlin.com> wrote on Tue, 2 Oct 2018
> 00:13:28 +0200:
> 
> > On Mon, 1 Oct 2018 22:01:27 +0000
> > Chris Packham <Chris.Packham@alliedtelesis.co.nz> wrote:
> >   
> > > On 02/10/18 10:41, Boris Brezillon wrote:    
> > > > On Mon, 1 Oct 2018 22:34:38 +0200
> > > > Boris Brezillon <boris.brezillon@bootlin.com> wrote:
> > > >          
> > > >>>
> > > >>> I'd previously tried readl() based on the same hunch. No change.
> > > >>>
> > > >>> I think my snippet above might be misleading. While a delay between
> > > >>> readl_relaxed() and the if should not change the outcome, this is also a
> > > >>> delay between marvell_nfc_enable_int() and marvell_nfc_disable_int()
> > > >>> which is probably more significant. Sure enough if I move the delay to
> > > >>> just before the marvell_nfc_disable_int() the error is not seen.      
> > > >>
> > > >> AFAICT, your timeout always happens when waiting for RDREQ, not RDYM.
> > > >> So maybe disabling MRDY too early has a side-effect on the RDREQ event.      
> > > > 
> > > > Can you try with this patch [1]? It should ensure that NDSR_RDY bits
> > > > are cleared before starting an operation.
> > > > 
> > > > [1]http://code.bulix.org/lgs30c-468205
> > > >       
> > > 
> > > No luck. I applied that on top of Daniel's and got the same result.
> > > 
> > > One thing that does look promising is the following modification of 
> > > Daniel's patch[1]. Which moves the RDY check to before where the 
> > > interrupts are enabled.    
> > 
> > Except we still don't know why this is happening, and I'm not sure I
> > want to take a fix without understanding why it does fix the problem.
> > 
> > Also, it looks like complete() is not called until the RDDREQ, WRDREQ
> > and WRCMDREQ are cleared in the interrupt handler [1], which is weird.
> > Miquel, do you happen to remember why you had to do that?  
> 
> The RDDREQ, WRDREQ and WRCMDREQ events might potentially happen while
> the interrupts are enabled while we only wait for R/B signalling. This
> check is to avoid calling complete() on these situations.

Actually Boris is right on the fact that while the intention is good,
the writing of [1] is not accurate. Daniel, could you please test if
the following diff changes something with your setup, without your
patch?


diff --git a/drivers/mtd/nand/raw/marvell_nand.c b/drivers/mtd/nand/raw/marvell_nand.c
index bc2ef5209783..c7573ccdbacd 100644
--- a/drivers/mtd/nand/raw/marvell_nand.c
+++ b/drivers/mtd/nand/raw/marvell_nand.c
@@ -686,7 +686,7 @@ static irqreturn_t marvell_nfc_isr(int irq, void *dev_id)
 
        marvell_nfc_disable_int(nfc, st & NDCR_ALL_INT);
 
-       if (!(st & (NDSR_RDDREQ | NDSR_WRDREQ | NDSR_WRCMDREQ)))
+       if (st & (NDSR_RDY(0) | NDSR_RDY(1)))
                complete(&nfc->complete);
 
        return IRQ_HANDLED;

> 
> > 
> > [1]https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/mtd/nand/raw/marvell_nand.c?h=v4.19-rc6#n689  
> 


Thanks,
Miquèl

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v2] mtd: rawnand: marvell: check for RDY bits after enabling the IRQ
  2018-10-01 22:44 ` Boris Brezillon
@ 2018-10-02  7:42   ` Daniel Mack
  0 siblings, 0 replies; 28+ messages in thread
From: Daniel Mack @ 2018-10-02  7:42 UTC (permalink / raw)
  To: Boris Brezillon; +Cc: miquel.raynal, linux-mtd, chris.packham, stable

Hi Boris,

On 2/10/2018 12:44 AM, Boris Brezillon wrote:
> On Thu, 27 Sep 2018 09:17:51 +0200
> Daniel Mack <daniel@zonque.org> wrote:
> 
>> At least on PXA3xx platforms, enabling RDY interrupts in the NDCR register
>> will only cause the IRQ to latch when the RDY lanes are changing, and not
>> in case they are already asserted.
>>
>> This means that if the controller finished the command in flight before
>> marvell_nfc_wait_op() is called, that function will wait for a change in
>> the bit that can't ever happen as it is already set.
>>
>> To address this race, check for the RDY bits after the IRQ was enabled,
>> and complete the completion immediately if the condition is already met.
>>
>> This fixes a bug that was observed with a NAND chip that holds a UBIFS
>> parition on which file system stress tests were executed. When
>> marvell_nfc_wait_op() reports an error, UBI/UBIFS will eventually mount
>> the filesystem read-only, reporting lots of warnings along the way.
>>
>> Fixes: 02f26ecf8c77 mtd: nand: add reworked Marvell NAND controller driver
>> Cc: stable@vger.kernel.org
>> Signed-off-by: Daniel Mack <daniel@zonque.org>
> 
> Can you try to replace your patch by the following one and let me know if
> it solves your problem?

Nope, this doesn't fix it. Same problem as before.


Thanks,
Daniel


> --->8---
> diff --git a/drivers/mtd/nand/raw/marvell_nand.c b/drivers/mtd/nand/raw/marvell_nand.c
> index e63f714f7639..295a86a5545f 100644
> --- a/drivers/mtd/nand/raw/marvell_nand.c
> +++ b/drivers/mtd/nand/raw/marvell_nand.c
> @@ -1123,7 +1123,7 @@ static int marvell_nfc_hw_ecc_hmg_do_write_page(struct nand_chip *chip,
>                  memcpy(nfc->dma_buf, data_buf, lt->data_bytes);
>                  memcpy(nfc->dma_buf + lt->data_bytes, oob_buf, oob_bytes);
>                  marvell_nfc_xfer_data_dma(nfc, DMA_TO_DEVICE, lt->data_bytes +
> -                                         lt->ecc_bytes + lt->spare_bytes);
> +                                         oob_bytes);
>          } else {
>                  marvell_nfc_xfer_data_out_pio(nfc, data_buf, lt->data_bytes);
>                  marvell_nfc_xfer_data_out_pio(nfc, oob_buf, oob_bytes);
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2] mtd: rawnand: marvell: check for RDY bits after enabling the IRQ
  2018-10-02  7:25                             ` Miquel Raynal
@ 2018-10-02  8:22                               ` Daniel Mack
  2018-10-02 20:53                                 ` Chris Packham
  0 siblings, 1 reply; 28+ messages in thread
From: Daniel Mack @ 2018-10-02  8:22 UTC (permalink / raw)
  To: Miquel Raynal, Boris Brezillon; +Cc: Chris Packham, linux-mtd, stable

Hi Miquel,

On 2/10/2018 9:25 AM, Miquel Raynal wrote:
> Miquel Raynal <miquel.raynal@bootlin.com> wrote on Tue, 2 Oct 2018
> 08:46:01 +0200:
>> Boris Brezillon <boris.brezillon@bootlin.com> wrote on Tue, 2 Oct 2018
>> 00:13:28 +0200:

>>> Also, it looks like complete() is not called until the RDDREQ, WRDREQ
>>> and WRCMDREQ are cleared in the interrupt handler [1], which is weird.
>>> Miquel, do you happen to remember why you had to do that?
>>
>> The RDDREQ, WRDREQ and WRCMDREQ events might potentially happen while
>> the interrupts are enabled while we only wait for R/B signalling. This
>> check is to avoid calling complete() on these situations.
> 
> Actually Boris is right on the fact that while the intention is good,
> the writing of [1] is not accurate. Daniel, could you please test if
> the following diff changes something with your setup, without your
> patch?
> 
> 
> diff --git a/drivers/mtd/nand/raw/marvell_nand.c b/drivers/mtd/nand/raw/marvell_nand.c
> index bc2ef5209783..c7573ccdbacd 100644
> --- a/drivers/mtd/nand/raw/marvell_nand.c
> +++ b/drivers/mtd/nand/raw/marvell_nand.c
> @@ -686,7 +686,7 @@ static irqreturn_t marvell_nfc_isr(int irq, void *dev_id)
>   
>          marvell_nfc_disable_int(nfc, st & NDCR_ALL_INT);
>   
> -       if (!(st & (NDSR_RDDREQ | NDSR_WRDREQ | NDSR_WRCMDREQ)))
> +       if (st & (NDSR_RDY(0) | NDSR_RDY(1)))
>                  complete(&nfc->complete);
>   
>          return IRQ_HANDLED;

Yes! That seems to work nicely as a replacement for my patch.

Chris, how is that going on your board?


Thanks,
Daniel

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2] mtd: rawnand: marvell: check for RDY bits after enabling the IRQ
  2018-10-01 22:15                           ` Chris Packham
@ 2018-10-02  9:36                             ` Boris Brezillon
  2018-10-02  9:37                               ` Boris Brezillon
  0 siblings, 1 reply; 28+ messages in thread
From: Boris Brezillon @ 2018-10-02  9:36 UTC (permalink / raw)
  To: Chris Packham; +Cc: Daniel Mack, Miquel Raynal, linux-mtd, stable

On Mon, 1 Oct 2018 22:15:28 +0000
Chris Packham <Chris.Packham@alliedtelesis.co.nz> wrote:

> On 02/10/18 11:13, Boris Brezillon wrote:
> > On Mon, 1 Oct 2018 22:01:27 +0000
> > Chris Packham <Chris.Packham@alliedtelesis.co.nz> wrote:
> >   
> >> On 02/10/18 10:41, Boris Brezillon wrote:  
> >>> On Mon, 1 Oct 2018 22:34:38 +0200
> >>> Boris Brezillon <boris.brezillon@bootlin.com> wrote:
> >>>         
> >>>>>
> >>>>> I'd previously tried readl() based on the same hunch. No change.
> >>>>>
> >>>>> I think my snippet above might be misleading. While a delay between
> >>>>> readl_relaxed() and the if should not change the outcome, this is also a
> >>>>> delay between marvell_nfc_enable_int() and marvell_nfc_disable_int()
> >>>>> which is probably more significant. Sure enough if I move the delay to
> >>>>> just before the marvell_nfc_disable_int() the error is not seen.  
> >>>>
> >>>> AFAICT, your timeout always happens when waiting for RDREQ, not RDYM.
> >>>> So maybe disabling MRDY too early has a side-effect on the RDREQ event.  
> >>>
> >>> Can you try with this patch [1]? It should ensure that NDSR_RDY bits
> >>> are cleared before starting an operation.
> >>>
> >>> [1]http://code.bulix.org/lgs30c-468205
> >>>      
> >>
> >> No luck. I applied that on top of Daniel's and got the same result.
> >>
> >> One thing that does look promising is the following modification of
> >> Daniel's patch[1]. Which moves the RDY check to before where the
> >> interrupts are enabled.  
> > 
> > Except we still don't know why this is happening, and I'm not sure I
> > want to take a fix without understanding why it does fix the problem.  
> 
> Agreed. My only guess is that there is some interrupt that is missed in 
> the short period they are disabled when calling complete().

Disabling interrupts when taking a spinlock means masking the IRQ line,
but the interrupt still exists and should be there when linux unmasks
the IRQ line. I don't think this is the problem we're chasing.

Looks more like something 

> 
> > Also, it looks like complete() is not called until the RDDREQ, WRDREQ
> > and WRCMDREQ are cleared in the interrupt handler [1], which is weird.
> > Miquel, do you happen to remember why you had to do that?
> > 
> > [1]https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/mtd/nand/raw/marvell_nand.c?h=v4.19-rc6#n689
> >   
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2] mtd: rawnand: marvell: check for RDY bits after enabling the IRQ
  2018-10-02  9:36                             ` Boris Brezillon
@ 2018-10-02  9:37                               ` Boris Brezillon
  0 siblings, 0 replies; 28+ messages in thread
From: Boris Brezillon @ 2018-10-02  9:37 UTC (permalink / raw)
  To: Chris Packham, Daniel Mack, linux-mtd, stable; +Cc: Miquel Raynal

On Tue, 2 Oct 2018 11:36:10 +0200
Boris Brezillon <boris.brezillon@bootlin.com> wrote:

> On Mon, 1 Oct 2018 22:15:28 +0000
> Chris Packham <Chris.Packham@alliedtelesis.co.nz> wrote:
> 
> > On 02/10/18 11:13, Boris Brezillon wrote:  
> > > On Mon, 1 Oct 2018 22:01:27 +0000
> > > Chris Packham <Chris.Packham@alliedtelesis.co.nz> wrote:
> > >     
> > >> On 02/10/18 10:41, Boris Brezillon wrote:    
> > >>> On Mon, 1 Oct 2018 22:34:38 +0200
> > >>> Boris Brezillon <boris.brezillon@bootlin.com> wrote:
> > >>>           
> > >>>>>
> > >>>>> I'd previously tried readl() based on the same hunch. No change.
> > >>>>>
> > >>>>> I think my snippet above might be misleading. While a delay between
> > >>>>> readl_relaxed() and the if should not change the outcome, this is also a
> > >>>>> delay between marvell_nfc_enable_int() and marvell_nfc_disable_int()
> > >>>>> which is probably more significant. Sure enough if I move the delay to
> > >>>>> just before the marvell_nfc_disable_int() the error is not seen.    
> > >>>>
> > >>>> AFAICT, your timeout always happens when waiting for RDREQ, not RDYM.
> > >>>> So maybe disabling MRDY too early has a side-effect on the RDREQ event.    
> > >>>
> > >>> Can you try with this patch [1]? It should ensure that NDSR_RDY bits
> > >>> are cleared before starting an operation.
> > >>>
> > >>> [1]http://code.bulix.org/lgs30c-468205
> > >>>        
> > >>
> > >> No luck. I applied that on top of Daniel's and got the same result.
> > >>
> > >> One thing that does look promising is the following modification of
> > >> Daniel's patch[1]. Which moves the RDY check to before where the
> > >> interrupts are enabled.    
> > > 
> > > Except we still don't know why this is happening, and I'm not sure I
> > > want to take a fix without understanding why it does fix the problem.    
> > 
> > Agreed. My only guess is that there is some interrupt that is missed in 
> > the short period they are disabled when calling complete().  
> 
> Disabling interrupts when taking a spinlock means masking the IRQ line,
> but the interrupt still exists and should be there when linux unmasks
> the IRQ line. I don't think this is the problem we're chasing.
> 
> Looks more like something 

Please ignore this email, I inadvertently hit the send button on a
draft I started yesterday :-)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2] mtd: rawnand: marvell: check for RDY bits after enabling the IRQ
  2018-10-02  8:22                               ` Daniel Mack
@ 2018-10-02 20:53                                 ` Chris Packham
  2018-10-03  7:33                                   ` Miquel Raynal
  0 siblings, 1 reply; 28+ messages in thread
From: Chris Packham @ 2018-10-02 20:53 UTC (permalink / raw)
  To: Daniel Mack, Miquel Raynal, Boris Brezillon; +Cc: linux-mtd, stable

On 02/10/18 21:23, Daniel Mack wrote:
> Hi Miquel,
> 
> On 2/10/2018 9:25 AM, Miquel Raynal wrote:
>> Miquel Raynal <miquel.raynal@bootlin.com> wrote on Tue, 2 Oct 2018
>> 08:46:01 +0200:
>>> Boris Brezillon <boris.brezillon@bootlin.com> wrote on Tue, 2 Oct 2018
>>> 00:13:28 +0200:
> 
>>>> Also, it looks like complete() is not called until the RDDREQ, WRDREQ
>>>> and WRCMDREQ are cleared in the interrupt handler [1], which is weird.
>>>> Miquel, do you happen to remember why you had to do that?
>>>
>>> The RDDREQ, WRDREQ and WRCMDREQ events might potentially happen while
>>> the interrupts are enabled while we only wait for R/B signalling. This
>>> check is to avoid calling complete() on these situations.
>>
>> Actually Boris is right on the fact that while the intention is good,
>> the writing of [1] is not accurate. Daniel, could you please test if
>> the following diff changes something with your setup, without your
>> patch?
>>
>>
>> diff --git a/drivers/mtd/nand/raw/marvell_nand.c b/drivers/mtd/nand/raw/marvell_nand.c
>> index bc2ef5209783..c7573ccdbacd 100644
>> --- a/drivers/mtd/nand/raw/marvell_nand.c
>> +++ b/drivers/mtd/nand/raw/marvell_nand.c
>> @@ -686,7 +686,7 @@ static irqreturn_t marvell_nfc_isr(int irq, void *dev_id)
>>    
>>           marvell_nfc_disable_int(nfc, st & NDCR_ALL_INT);
>>    
>> -       if (!(st & (NDSR_RDDREQ | NDSR_WRDREQ | NDSR_WRCMDREQ)))
>> +       if (st & (NDSR_RDY(0) | NDSR_RDY(1)))
>>                   complete(&nfc->complete);
>>    
>>           return IRQ_HANDLED;
> 
> Yes! That seems to work nicely as a replacement for my patch.
> 
> Chris, how is that going on your board?

Looks good to me. I've just re-run stress_{1,2,3} on the custom board 
and DB-88F6820-AMC.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2] mtd: rawnand: marvell: check for RDY bits after enabling the IRQ
  2018-10-02 20:53                                 ` Chris Packham
@ 2018-10-03  7:33                                   ` Miquel Raynal
  2018-10-03  7:54                                     ` Daniel Mack
  0 siblings, 1 reply; 28+ messages in thread
From: Miquel Raynal @ 2018-10-03  7:33 UTC (permalink / raw)
  To: Chris Packham; +Cc: Daniel Mack, Boris Brezillon, linux-mtd, stable

Hi Chris,

Chris Packham <Chris.Packham@alliedtelesis.co.nz> wrote on Tue, 2 Oct
2018 20:53:14 +0000:

> On 02/10/18 21:23, Daniel Mack wrote:
> > Hi Miquel,
> > 
> > On 2/10/2018 9:25 AM, Miquel Raynal wrote:  
> >> Miquel Raynal <miquel.raynal@bootlin.com> wrote on Tue, 2 Oct 2018
> >> 08:46:01 +0200:  
> >>> Boris Brezillon <boris.brezillon@bootlin.com> wrote on Tue, 2 Oct 2018
> >>> 00:13:28 +0200:  
> >   
> >>>> Also, it looks like complete() is not called until the RDDREQ, WRDREQ
> >>>> and WRCMDREQ are cleared in the interrupt handler [1], which is weird.
> >>>> Miquel, do you happen to remember why you had to do that?  
> >>>
> >>> The RDDREQ, WRDREQ and WRCMDREQ events might potentially happen while
> >>> the interrupts are enabled while we only wait for R/B signalling. This
> >>> check is to avoid calling complete() on these situations.  
> >>
> >> Actually Boris is right on the fact that while the intention is good,
> >> the writing of [1] is not accurate. Daniel, could you please test if
> >> the following diff changes something with your setup, without your
> >> patch?
> >>
> >>
> >> diff --git a/drivers/mtd/nand/raw/marvell_nand.c b/drivers/mtd/nand/raw/marvell_nand.c
> >> index bc2ef5209783..c7573ccdbacd 100644
> >> --- a/drivers/mtd/nand/raw/marvell_nand.c
> >> +++ b/drivers/mtd/nand/raw/marvell_nand.c
> >> @@ -686,7 +686,7 @@ static irqreturn_t marvell_nfc_isr(int irq, void *dev_id)
> >>    
> >>           marvell_nfc_disable_int(nfc, st & NDCR_ALL_INT);
> >>    
> >> -       if (!(st & (NDSR_RDDREQ | NDSR_WRDREQ | NDSR_WRCMDREQ)))
> >> +       if (st & (NDSR_RDY(0) | NDSR_RDY(1)))
> >>                   complete(&nfc->complete);
> >>    
> >>           return IRQ_HANDLED;  
> > 
> > Yes! That seems to work nicely as a replacement for my patch.
> > 
> > Chris, how is that going on your board?  
> 
> Looks good to me. I've just re-run stress_{1,2,3} on the custom board 
> and DB-88F6820-AMC.

Great! There is still something weird about the complete() though.

Daniel, do you plan to send the above change or do you want me to do
it? I would like to integrate the fix in nand/next before sending the
PR.


Thanks,
Miquèl

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2] mtd: rawnand: marvell: check for RDY bits after enabling the IRQ
  2018-10-03  7:33                                   ` Miquel Raynal
@ 2018-10-03  7:54                                     ` Daniel Mack
  0 siblings, 0 replies; 28+ messages in thread
From: Daniel Mack @ 2018-10-03  7:54 UTC (permalink / raw)
  To: Miquel Raynal, Chris Packham; +Cc: Boris Brezillon, linux-mtd, stable

Hi Miquel,

On 3/10/2018 9:33 AM, Miquel Raynal wrote:
>>> Yes! That seems to work nicely as a replacement for my patch.
>>>
>>> Chris, how is that going on your board?
>> Looks good to me. I've just re-run stress_{1,2,3} on the custom board
>> and DB-88F6820-AMC.
> Great! There is still something weird about the complete() though.
> 
> Daniel, do you plan to send the above change or do you want me to do
> it? I would like to integrate the fix in nand/next before sending the
> PR.

Nah, your version has barely anything to do with my patch anymore, 
except that it fixes the same effect, so go ahead please :)

You can add my Reported-and-tested-by though.


Thanks for chasing this, everyone!

Daniel

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2018-10-03 14:41 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-09-27  7:17 [PATCH v2] mtd: rawnand: marvell: check for RDY bits after enabling the IRQ Daniel Mack
2018-09-27  8:11 ` Miquel Raynal
2018-09-27  8:56   ` Boris Brezillon
2018-09-27 21:55     ` Chris Packham
2018-09-28  6:40       ` Boris Brezillon
2018-09-28  6:56         ` Boris Brezillon
2018-09-28  8:12         ` Miquel Raynal
2018-09-28  7:43       ` Daniel Mack
2018-09-28  8:24         ` Miquel Raynal
2018-09-28  8:29           ` Daniel Mack
2018-09-30 21:10             ` Chris Packham
2018-10-01  5:31               ` Daniel Mack
2018-10-01 19:59                 ` Chris Packham
2018-10-01 20:34                   ` Boris Brezillon
2018-10-01 21:41                     ` Boris Brezillon
2018-10-01 22:01                       ` Chris Packham
2018-10-01 22:13                         ` Boris Brezillon
2018-10-01 22:15                           ` Chris Packham
2018-10-02  9:36                             ` Boris Brezillon
2018-10-02  9:37                               ` Boris Brezillon
2018-10-02  6:46                           ` Miquel Raynal
2018-10-02  7:25                             ` Miquel Raynal
2018-10-02  8:22                               ` Daniel Mack
2018-10-02 20:53                                 ` Chris Packham
2018-10-03  7:33                                   ` Miquel Raynal
2018-10-03  7:54                                     ` Daniel Mack
2018-10-01 22:44 ` Boris Brezillon
2018-10-02  7:42   ` Daniel Mack

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.