All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 1/3] mtd: cfi_cmdset_0002: remove unnecessary initialization
@ 2013-06-04  1:46 Brian Norris
  2013-06-04  1:46 ` [PATCH 2/3] mtd: cfi_cmdset_0002: use msecs_to_jiffies() macro Brian Norris
  2013-06-04  1:46 ` [PATCH 3/3] mtd: cfi_cmdset_0002: increase do_write_buffer() timeout Brian Norris
  0 siblings, 2 replies; 12+ messages in thread
From: Brian Norris @ 2013-06-04  1:46 UTC (permalink / raw)
  To: Artem Bityutskiy; +Cc: Huang Shijie, Brian Norris, linux-mtd

These timeout initializations are arbitrary and pointless. We simply
overwrite them with new values anyway. (In one case, the variable is not
ever even used, so we kill it.)

Signed-off-by: Brian Norris <computersforpeace@gmail.com>
---
 drivers/mtd/chips/cfi_cmdset_0002.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/drivers/mtd/chips/cfi_cmdset_0002.c b/drivers/mtd/chips/cfi_cmdset_0002.c
index 89b9d68..6c85f61 100644
--- a/drivers/mtd/chips/cfi_cmdset_0002.c
+++ b/drivers/mtd/chips/cfi_cmdset_0002.c
@@ -1146,7 +1146,6 @@ static int cfi_amdstd_read (struct mtd_info *mtd, loff_t from, size_t len, size_
 static inline int do_read_secsi_onechip(struct map_info *map, struct flchip *chip, loff_t adr, size_t len, u_char *buf)
 {
 	DECLARE_WAITQUEUE(wait, current);
-	unsigned long timeo = jiffies + HZ;
 	struct cfi_private *cfi = map->fldrv_priv;
 
  retry:
@@ -1160,7 +1159,6 @@ static inline int do_read_secsi_onechip(struct map_info *map, struct flchip *chi
 
 		schedule();
 		remove_wait_queue(&chip->wq, &wait);
-		timeo = jiffies + HZ;
 
 		goto retry;
 	}
@@ -1228,7 +1226,7 @@ static int cfi_amdstd_secsi_read (struct mtd_info *mtd, loff_t from, size_t len,
 static int __xipram do_write_oneword(struct map_info *map, struct flchip *chip, unsigned long adr, map_word datum)
 {
 	struct cfi_private *cfi = map->fldrv_priv;
-	unsigned long timeo = jiffies + HZ;
+	unsigned long timeo;
 	/*
 	 * We use a 1ms + 1 jiffies generic timeout for writes (most devices
 	 * have a max write time of a few hundreds usec). However, we should
@@ -1466,7 +1464,7 @@ static int __xipram do_write_buffer(struct map_info *map, struct flchip *chip,
 				    int len)
 {
 	struct cfi_private *cfi = map->fldrv_priv;
-	unsigned long timeo = jiffies + HZ;
+	unsigned long timeo;
 	/* see comments in do_write_oneword() regarding uWriteTimeo. */
 	unsigned long uWriteTimeout = ( HZ / 1000 ) + 1;
 	int ret = -EIO;
@@ -1900,7 +1898,7 @@ static int cfi_amdstd_panic_write(struct mtd_info *mtd, loff_t to, size_t len,
 static int __xipram do_erase_chip(struct map_info *map, struct flchip *chip)
 {
 	struct cfi_private *cfi = map->fldrv_priv;
-	unsigned long timeo = jiffies + HZ;
+	unsigned long timeo;
 	unsigned long int adr;
 	DECLARE_WAITQUEUE(wait, current);
 	int ret = 0;
@@ -1990,7 +1988,7 @@ static int __xipram do_erase_chip(struct map_info *map, struct flchip *chip)
 static int __xipram do_erase_oneblock(struct map_info *map, struct flchip *chip, unsigned long adr, int len, void *thunk)
 {
 	struct cfi_private *cfi = map->fldrv_priv;
-	unsigned long timeo = jiffies + HZ;
+	unsigned long timeo;
 	DECLARE_WAITQUEUE(wait, current);
 	int ret = 0;
 
-- 
1.8.2.3

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 2/3] mtd: cfi_cmdset_0002: use msecs_to_jiffies() macro
  2013-06-04  1:46 [PATCH 1/3] mtd: cfi_cmdset_0002: remove unnecessary initialization Brian Norris
@ 2013-06-04  1:46 ` Brian Norris
  2013-06-04  1:46 ` [PATCH 3/3] mtd: cfi_cmdset_0002: increase do_write_buffer() timeout Brian Norris
  1 sibling, 0 replies; 12+ messages in thread
From: Brian Norris @ 2013-06-04  1:46 UTC (permalink / raw)
  To: Artem Bityutskiy; +Cc: Huang Shijie, Brian Norris, linux-mtd

Rather than a bunch of calculations based on HZ, which are error-prone
and subject to "+ 1" (to avoid a 0 result), just use the
msecs_to_jiffies() macro. This makes both the intent (to wait some
approximate number of milliseconds) and the calculation clearer.

During the conversion, I noticed two places where we convert from
milliseconds to jiffies to microseconds, to use in a udelay(1) loop; I
dropped this in one place, but I left it in another, where the code
notes that it is "intentionally similar to do_write_oneword()".

Signed-off-by: Brian Norris <computersforpeace@gmail.com>
---
 drivers/mtd/chips/cfi_cmdset_0002.c | 41 +++++++++++++++++--------------------
 1 file changed, 19 insertions(+), 22 deletions(-)

diff --git a/drivers/mtd/chips/cfi_cmdset_0002.c b/drivers/mtd/chips/cfi_cmdset_0002.c
index 6c85f61..4e28081 100644
--- a/drivers/mtd/chips/cfi_cmdset_0002.c
+++ b/drivers/mtd/chips/cfi_cmdset_0002.c
@@ -754,7 +754,7 @@ static int get_chip(struct map_info *map, struct flchip *chip, unsigned long adr
 	struct cfi_pri_amdstd *cfip = (struct cfi_pri_amdstd *)cfi->cmdset_priv;
 
  resettime:
-	timeo = jiffies + HZ;
+	timeo = jiffies + msecs_to_jiffies(1000);
  retry:
 	switch (chip->state) {
 
@@ -1228,15 +1228,14 @@ static int __xipram do_write_oneword(struct map_info *map, struct flchip *chip,
 	struct cfi_private *cfi = map->fldrv_priv;
 	unsigned long timeo;
 	/*
-	 * We use a 1ms + 1 jiffies generic timeout for writes (most devices
-	 * have a max write time of a few hundreds usec). However, we should
-	 * use the maximum timeout value given by the chip at probe time
-	 * instead.  Unfortunately, struct flchip does have a field for
-	 * maximum timeout, only for typical which can be far too short
-	 * depending of the conditions.	 The ' + 1' is to avoid having a
-	 * timeout of 0 jiffies if HZ is smaller than 1000.
+	 * We use a 1ms generic timeout for writes (most devices have a max
+	 * write time of a few hundreds usec). However, we should use the
+	 * maximum timeout value given by the chip at probe time instead.
+	 * Unfortunately, struct flchip does have a field for maximum timeout,
+	 * only for typical which can be far too short depending of the
+	 * conditions.
 	 */
-	unsigned long uWriteTimeout = ( HZ / 1000 ) + 1;
+	unsigned long uWriteTimeout = msecs_to_jiffies(1);
 	int ret = 0;
 	map_word oldd;
 	int retry_cnt = 0;
@@ -1292,7 +1291,7 @@ static int __xipram do_write_oneword(struct map_info *map, struct flchip *chip,
 			mutex_unlock(&chip->mutex);
 			schedule();
 			remove_wait_queue(&chip->wq, &wait);
-			timeo = jiffies + (HZ / 2); /* FIXME */
+			timeo = jiffies + msecs_to_jiffies(500); /* FIXME */
 			mutex_lock(&chip->mutex);
 			continue;
 		}
@@ -1465,8 +1464,8 @@ static int __xipram do_write_buffer(struct map_info *map, struct flchip *chip,
 {
 	struct cfi_private *cfi = map->fldrv_priv;
 	unsigned long timeo;
-	/* see comments in do_write_oneword() regarding uWriteTimeo. */
-	unsigned long uWriteTimeout = ( HZ / 1000 ) + 1;
+	/* see comments in do_write_oneword() regarding uWriteTimeout */
+	unsigned long uWriteTimeout = msecs_to_jiffies(1);
 	int ret = -EIO;
 	unsigned long cmd_adr;
 	int z, words;
@@ -1535,7 +1534,7 @@ static int __xipram do_write_buffer(struct map_info *map, struct flchip *chip,
 			mutex_unlock(&chip->mutex);
 			schedule();
 			remove_wait_queue(&chip->wq, &wait);
-			timeo = jiffies + (HZ / 2); /* FIXME */
+			timeo = jiffies + msecs_to_jiffies(500); /* FIXME */
 			mutex_lock(&chip->mutex);
 			continue;
 		}
@@ -1687,13 +1686,11 @@ static int cfi_amdstd_panic_wait(struct map_info *map, struct flchip *chip,
 	 * is more important to save the messages.
 	 */
 	while (retries > 0) {
-		const unsigned long timeo = (HZ / 1000) + 1;
-
 		/* send the reset command */
 		map_write(map, CMD(0xF0), chip->start);
 
-		/* wait for the chip to become ready */
-		for (i = 0; i < jiffies_to_usecs(timeo); i++) {
+		/* wait for the chip to become ready (approx. 1ms) */
+		for (i = 0; i < 1000; i++) {
 			if (chip_ready(map, adr))
 				return 0;
 
@@ -1719,7 +1716,7 @@ static int cfi_amdstd_panic_wait(struct map_info *map, struct flchip *chip,
 static int do_panic_write_oneword(struct map_info *map, struct flchip *chip,
 				  unsigned long adr, map_word datum)
 {
-	const unsigned long uWriteTimeout = (HZ / 1000) + 1;
+	const unsigned long uWriteTimeout = msecs_to_jiffies(1);
 	struct cfi_private *cfi = map->fldrv_priv;
 	int retry_cnt = 0;
 	map_word oldd;
@@ -1934,7 +1931,7 @@ static int __xipram do_erase_chip(struct map_info *map, struct flchip *chip)
 				adr, map->size,
 				chip->erase_time*500);
 
-	timeo = jiffies + (HZ*20);
+	timeo = jiffies + msecs_to_jiffies(20000);
 
 	for (;;) {
 		if (chip->state != FL_ERASING) {
@@ -1950,7 +1947,7 @@ static int __xipram do_erase_chip(struct map_info *map, struct flchip *chip)
 		if (chip->erase_suspended) {
 			/* This erase was suspended and resumed.
 			   Adjust the timeout */
-			timeo = jiffies + (HZ*20); /* FIXME */
+			timeo = jiffies + msecs_to_jiffies(20000); /* FIXME */
 			chip->erase_suspended = 0;
 		}
 
@@ -2023,7 +2020,7 @@ static int __xipram do_erase_oneblock(struct map_info *map, struct flchip *chip,
 				adr, len,
 				chip->erase_time*500);
 
-	timeo = jiffies + (HZ*20);
+	timeo = jiffies + msecs_to_jiffies(20000);
 
 	for (;;) {
 		if (chip->state != FL_ERASING) {
@@ -2039,7 +2036,7 @@ static int __xipram do_erase_oneblock(struct map_info *map, struct flchip *chip,
 		if (chip->erase_suspended) {
 			/* This erase was suspended and resumed.
 			   Adjust the timeout */
-			timeo = jiffies + (HZ*20); /* FIXME */
+			timeo = jiffies + msecs_to_jiffies(20000); /* FIXME */
 			chip->erase_suspended = 0;
 		}
 
-- 
1.8.2.3

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 3/3] mtd: cfi_cmdset_0002: increase do_write_buffer() timeout
  2013-06-04  1:46 [PATCH 1/3] mtd: cfi_cmdset_0002: remove unnecessary initialization Brian Norris
  2013-06-04  1:46 ` [PATCH 2/3] mtd: cfi_cmdset_0002: use msecs_to_jiffies() macro Brian Norris
@ 2013-06-04  1:46 ` Brian Norris
  2013-06-04  7:03   ` Huang Shijie
  1 sibling, 1 reply; 12+ messages in thread
From: Brian Norris @ 2013-06-04  1:46 UTC (permalink / raw)
  To: Artem Bityutskiy; +Cc: Huang Shijie, Brian Norris, linux-mtd

I have seen various failures like this while doing a simple reboot +
mount UBIFS test:

  UBIFS: recovery needed
  MTD do_write_buffer(): software timeout
  UBI error: ubi_io_write: error -5 while writing 512 bytes to PEB 365:43648, written 192 bytes

(This message doens't exactly match the message in the current head of
tree, as Huang edited the timeout message.)

I'm using a 64Mbyte Spansion S29GL512 NOR flash. I've also seen reports
of the same failure on Micron 128Mbyte NOR flash AM29EWLD.

After various tests, it seems simply that the timeout is not long enough
for my system; increasing it by a few jiffies prevented all failures
(testing for 12+ hours). There is no harm in increasing the timeout, but
there is harm in having it too short, as evidenced here.

This patch increases the timeout value in 3 locations, as they all are
referencing the do_write_oneword function. I have only ever seen the
timeout in do_write_buffer().

Signed-off-by: Brian Norris <computersforpeace@gmail.com>
---
 drivers/mtd/chips/cfi_cmdset_0002.c | 15 ++++++---------
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/drivers/mtd/chips/cfi_cmdset_0002.c b/drivers/mtd/chips/cfi_cmdset_0002.c
index 4e28081..45de025 100644
--- a/drivers/mtd/chips/cfi_cmdset_0002.c
+++ b/drivers/mtd/chips/cfi_cmdset_0002.c
@@ -1228,14 +1228,11 @@ static int __xipram do_write_oneword(struct map_info *map, struct flchip *chip,
 	struct cfi_private *cfi = map->fldrv_priv;
 	unsigned long timeo;
 	/*
-	 * We use a 1ms generic timeout for writes (most devices have a max
-	 * write time of a few hundreds usec). However, we should use the
-	 * maximum timeout value given by the chip at probe time instead.
-	 * Unfortunately, struct flchip does have a field for maximum timeout,
-	 * only for typical which can be far too short depending of the
-	 * conditions.
+	 * We use a 10ms generic timeout for writes. Most devices have a max
+	 * write time of a few hundreds usec, but timeouts have been seen with
+	 * too few jiffies.
 	 */
-	unsigned long uWriteTimeout = msecs_to_jiffies(1);
+	unsigned long uWriteTimeout = msecs_to_jiffies(10);
 	int ret = 0;
 	map_word oldd;
 	int retry_cnt = 0;
@@ -1465,7 +1462,7 @@ static int __xipram do_write_buffer(struct map_info *map, struct flchip *chip,
 	struct cfi_private *cfi = map->fldrv_priv;
 	unsigned long timeo;
 	/* see comments in do_write_oneword() regarding uWriteTimeout */
-	unsigned long uWriteTimeout = msecs_to_jiffies(1);
+	unsigned long uWriteTimeout = msecs_to_jiffies(10);
 	int ret = -EIO;
 	unsigned long cmd_adr;
 	int z, words;
@@ -1716,7 +1713,7 @@ static int cfi_amdstd_panic_wait(struct map_info *map, struct flchip *chip,
 static int do_panic_write_oneword(struct map_info *map, struct flchip *chip,
 				  unsigned long adr, map_word datum)
 {
-	const unsigned long uWriteTimeout = msecs_to_jiffies(1);
+	const unsigned long uWriteTimeout = msecs_to_jiffies(10);
 	struct cfi_private *cfi = map->fldrv_priv;
 	int retry_cnt = 0;
 	map_word oldd;
-- 
1.8.2.3

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 3/3] mtd: cfi_cmdset_0002: increase do_write_buffer() timeout
  2013-06-04  1:46 ` [PATCH 3/3] mtd: cfi_cmdset_0002: increase do_write_buffer() timeout Brian Norris
@ 2013-06-04  7:03   ` Huang Shijie
  2013-06-05 18:01       ` Brian Norris
  0 siblings, 1 reply; 12+ messages in thread
From: Huang Shijie @ 2013-06-04  7:03 UTC (permalink / raw)
  To: Brian Norris; +Cc: linux-mtd, Artem Bityutskiy

于 2013年06月04日 09:46, Brian Norris 写道:
> After various tests, it seems simply that the timeout is not long enough
> for my system; increasing it by a few jiffies prevented all failures
> (testing for 12+ hours). There is no harm in increasing the timeout, but
> there is harm in having it too short, as evidenced here.
>
I like the patch1 and patch 2.

But extending the timeout from 1ms to 10ms is like a workaround. :)

>From the NOR's spec, even the maximum write-to-buffer only costs several
hundreds us,
such as 200us.

I GUESS your problem is caused by the timer system, not the MTD code. I
ever met this type of bug.
The bug is in the kernel 3.5.7, but the latest kernel has fixed it with
NO_HZ_IDLE/NO_HZ_COMMON features.
I do not meet the issue the latest linux-next tree.

I try to describe the jiffies bug with my poor english:

[1] background:
CONFIG_HZ=100, CONFIG_NO_HZ=y

[2] call nand_wait() when we write a nand page.

[3] The jiffies was not updated at a _even_ speed.

In the nand_wait(), you wait for 20ms(2 jiffies) for a page write,
and the timeout occurs during the page write. Of course, you think that
we have already waited for 20ms.
But in actually, we only waited for 1ms or less!
How do i know this? I use the gettimeofday to check the real time when
the timeout occur.

[4] if i disable the local timer, the bug disappears.


So, could you check the real time when the timeout occurs?



Btw: My NOR's timeout is proved to be a silicon bug by Micron.

thanks
Huang Shijie

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 3/3] mtd: cfi_cmdset_0002: increase do_write_buffer() timeout
  2013-06-04  7:03   ` Huang Shijie
@ 2013-06-05 18:01       ` Brian Norris
  0 siblings, 0 replies; 12+ messages in thread
From: Brian Norris @ 2013-06-05 18:01 UTC (permalink / raw)
  To: Huang Shijie; +Cc: Artem Bityutskiy, linux-mtd, Linux Kernel, Kevin Cernekee

On Tue, Jun 4, 2013 at 12:03 AM, Huang Shijie <b32955@freescale.com> wrote:
> 于 2013年06月04日 09:46, Brian Norris 写道:
>> After various tests, it seems simply that the timeout is not long enough
>> for my system; increasing it by a few jiffies prevented all failures
>> (testing for 12+ hours). There is no harm in increasing the timeout, but
>> there is harm in having it too short, as evidenced here.
>>
> I like the patch1 and patch 2.
>
> But extending the timeout from 1ms to 10ms is like a workaround. :)

I was afraid you might say that; that's why I stuck the first two
patches first ;)

> From the NOR's spec, even the maximum write-to-buffer only costs several
> hundreds us,
> such as 200us.
>
> I GUESS your problem is caused by the timer system, not the MTD code. I
> ever met this type of bug.

I suspected similarly, but I didn't (until now) believe that's the
case here. See below.

> The bug is in the kernel 3.5.7, but the latest kernel has fixed it with
> NO_HZ_IDLE/NO_HZ_COMMON features.

Did you track your bug down to a particular commit? 3.5.7 is the
stable kernel; do you know what mainline rev it showed up in? I'm not
quite interested in backporting all of the new 3.10 features!

> I do not meet the issue the latest linux-next tree.
>
> I try to describe the jiffies bug with my poor english:
>
> [1] background:
> CONFIG_HZ=100, CONFIG_NO_HZ=y
>
> [2] call nand_wait() when we write a nand page.
>
> [3] The jiffies was not updated at a _even_ speed.
>
> In the nand_wait(), you wait for 20ms(2 jiffies) for a page write,
> and the timeout occurs during the page write. Of course, you think that
> we have already waited for 20ms.
> But in actually, we only waited for 1ms or less!
> How do i know this? I use the gettimeofday to check the real time when
> the timeout occur.

I suspected this very type of thing, since this has come up in a few
different contexts. And for some time, with a number of different
checks, it appeared that this *wasn't* the case. But while writing
this very email, I had the bright idea that my time checkpoint was in
slightly the wrong place; so sure enough, I found that I was timing
out after only 72519 ns! (That is, 72 us, or well below the max write
buffer time.)

I'm testing on MIPS with a 3.3 kernel, by the way, but I believe this
sort of bug has been around a while.

> [4] if i disable the local timer, the bug disappears.
>
> So, could you check the real time when the timeout occurs?
>
>
>
> Btw: My NOR's timeout is proved to be a silicon bug by Micron.

Interesting.

Brian

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 3/3] mtd: cfi_cmdset_0002: increase do_write_buffer() timeout
@ 2013-06-05 18:01       ` Brian Norris
  0 siblings, 0 replies; 12+ messages in thread
From: Brian Norris @ 2013-06-05 18:01 UTC (permalink / raw)
  To: Huang Shijie; +Cc: Kevin Cernekee, linux-mtd, Linux Kernel, Artem Bityutskiy

On Tue, Jun 4, 2013 at 12:03 AM, Huang Shijie <b32955@freescale.com> wrote:
> 于 2013年06月04日 09:46, Brian Norris 写道:
>> After various tests, it seems simply that the timeout is not long enough
>> for my system; increasing it by a few jiffies prevented all failures
>> (testing for 12+ hours). There is no harm in increasing the timeout, but
>> there is harm in having it too short, as evidenced here.
>>
> I like the patch1 and patch 2.
>
> But extending the timeout from 1ms to 10ms is like a workaround. :)

I was afraid you might say that; that's why I stuck the first two
patches first ;)

> From the NOR's spec, even the maximum write-to-buffer only costs several
> hundreds us,
> such as 200us.
>
> I GUESS your problem is caused by the timer system, not the MTD code. I
> ever met this type of bug.

I suspected similarly, but I didn't (until now) believe that's the
case here. See below.

> The bug is in the kernel 3.5.7, but the latest kernel has fixed it with
> NO_HZ_IDLE/NO_HZ_COMMON features.

Did you track your bug down to a particular commit? 3.5.7 is the
stable kernel; do you know what mainline rev it showed up in? I'm not
quite interested in backporting all of the new 3.10 features!

> I do not meet the issue the latest linux-next tree.
>
> I try to describe the jiffies bug with my poor english:
>
> [1] background:
> CONFIG_HZ=100, CONFIG_NO_HZ=y
>
> [2] call nand_wait() when we write a nand page.
>
> [3] The jiffies was not updated at a _even_ speed.
>
> In the nand_wait(), you wait for 20ms(2 jiffies) for a page write,
> and the timeout occurs during the page write. Of course, you think that
> we have already waited for 20ms.
> But in actually, we only waited for 1ms or less!
> How do i know this? I use the gettimeofday to check the real time when
> the timeout occur.

I suspected this very type of thing, since this has come up in a few
different contexts. And for some time, with a number of different
checks, it appeared that this *wasn't* the case. But while writing
this very email, I had the bright idea that my time checkpoint was in
slightly the wrong place; so sure enough, I found that I was timing
out after only 72519 ns! (That is, 72 us, or well below the max write
buffer time.)

I'm testing on MIPS with a 3.3 kernel, by the way, but I believe this
sort of bug has been around a while.

> [4] if i disable the local timer, the bug disappears.
>
> So, could you check the real time when the timeout occurs?
>
>
>
> Btw: My NOR's timeout is proved to be a silicon bug by Micron.

Interesting.

Brian

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 3/3] mtd: cfi_cmdset_0002: increase do_write_buffer() timeout
  2013-06-05 18:01       ` Brian Norris
@ 2013-06-05 21:08         ` Brian Norris
  -1 siblings, 0 replies; 12+ messages in thread
From: Brian Norris @ 2013-06-05 21:08 UTC (permalink / raw)
  To: Huang Shijie
  Cc: Artem Bityutskiy, linux-mtd, Linux Kernel, Kevin Cernekee,
	Arnd Bergmann, Imre Deak

Adding a few others

For reference, this thread started with this patch:

http://lists.infradead.org/pipermail/linux-mtd/2013-June/047164.html

On Wed, Jun 5, 2013 at 11:01 AM, Brian Norris
<computersforpeace@gmail.com> wrote:
> On Tue, Jun 4, 2013 at 12:03 AM, Huang Shijie <b32955@freescale.com> wrote:
>> 于 2013年06月04日 09:46, Brian Norris 写道:
>>> After various tests, it seems simply that the timeout is not long enough
>>> for my system; increasing it by a few jiffies prevented all failures
>>> (testing for 12+ hours). There is no harm in increasing the timeout, but
>>> there is harm in having it too short, as evidenced here.
>>>
>> I like the patch1 and patch 2.
>>
>> But extending the timeout from 1ms to 10ms is like a workaround. :)
>
> I was afraid you might say that; that's why I stuck the first two
> patches first ;)
...
>> I GUESS your problem is caused by the timer system, not the MTD code. I
>> ever met this type of bug.
...
>> I try to describe the jiffies bug with my poor english:
>>
>> [1] background:
>> CONFIG_HZ=100, CONFIG_NO_HZ=y
>>
>> [2] call nand_wait() when we write a nand page.
>>
>> [3] The jiffies was not updated at a _even_ speed.
>>
>> In the nand_wait(), you wait for 20ms(2 jiffies) for a page write,
>> and the timeout occurs during the page write. Of course, you think that
>> we have already waited for 20ms.
>> But in actually, we only waited for 1ms or less!
>> How do i know this? I use the gettimeofday to check the real time when
>> the timeout occur.
>
> I suspected this very type of thing, since this has come up in a few
> different contexts. And for some time, with a number of different
> checks, it appeared that this *wasn't* the case. But while writing
> this very email, I had the bright idea that my time checkpoint was in
> slightly the wrong place; so sure enough, I found that I was timing
> out after only 72519 ns! (That is, 72 us, or well below the max write
> buffer time.)

So I can confirm that with the 1ms timeout, I actually am sometimes
timing out at 40 to 70 microseconds. I think this may have multiple
causes:
(1) uneven timer interrupts, as suggested by Huang?
(2) a jiffies timeout of 1 is two short (with HZ=1000, msecs_to_jiffies(1) is 1)

Regarding reason (2):

My thought (which matches with Imre's comments from his [1]) is that
one problem here is that we do not know how long it will be until the
*next* timer tick -- "waiting 1 jiffy" is really just waiting until
the next timer tick, which very well might be in 40us! So the correct
timeout calculation is something like:

uWriteTimeout = msecs_to_jiffies(1) + 1;

or with Imre's proposed methods (not merged upstream yet), just:

uWriteTimeout = msecs_to_jiffies_timeout(1);

Thoughts?

Note that a 2-jiffy timeout does not, in fact, totally resolve my
problems; with a timeout of 2 jiffies, I still get a timeout that
(according to getnstimeofday()) occurs after only 56us. It does
decrease its rate of occurrence, but Huang may still be right that
reason (1) is involved.

Brian

[1] http://marc.info/?l=linux-kernel&m=136854294730957

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 3/3] mtd: cfi_cmdset_0002: increase do_write_buffer() timeout
@ 2013-06-05 21:08         ` Brian Norris
  0 siblings, 0 replies; 12+ messages in thread
From: Brian Norris @ 2013-06-05 21:08 UTC (permalink / raw)
  To: Huang Shijie
  Cc: Arnd Bergmann, Artem Bityutskiy, Imre Deak, Kevin Cernekee,
	Linux Kernel, linux-mtd

Adding a few others

For reference, this thread started with this patch:

http://lists.infradead.org/pipermail/linux-mtd/2013-June/047164.html

On Wed, Jun 5, 2013 at 11:01 AM, Brian Norris
<computersforpeace@gmail.com> wrote:
> On Tue, Jun 4, 2013 at 12:03 AM, Huang Shijie <b32955@freescale.com> wrote:
>> 于 2013年06月04日 09:46, Brian Norris 写道:
>>> After various tests, it seems simply that the timeout is not long enough
>>> for my system; increasing it by a few jiffies prevented all failures
>>> (testing for 12+ hours). There is no harm in increasing the timeout, but
>>> there is harm in having it too short, as evidenced here.
>>>
>> I like the patch1 and patch 2.
>>
>> But extending the timeout from 1ms to 10ms is like a workaround. :)
>
> I was afraid you might say that; that's why I stuck the first two
> patches first ;)
...
>> I GUESS your problem is caused by the timer system, not the MTD code. I
>> ever met this type of bug.
...
>> I try to describe the jiffies bug with my poor english:
>>
>> [1] background:
>> CONFIG_HZ=100, CONFIG_NO_HZ=y
>>
>> [2] call nand_wait() when we write a nand page.
>>
>> [3] The jiffies was not updated at a _even_ speed.
>>
>> In the nand_wait(), you wait for 20ms(2 jiffies) for a page write,
>> and the timeout occurs during the page write. Of course, you think that
>> we have already waited for 20ms.
>> But in actually, we only waited for 1ms or less!
>> How do i know this? I use the gettimeofday to check the real time when
>> the timeout occur.
>
> I suspected this very type of thing, since this has come up in a few
> different contexts. And for some time, with a number of different
> checks, it appeared that this *wasn't* the case. But while writing
> this very email, I had the bright idea that my time checkpoint was in
> slightly the wrong place; so sure enough, I found that I was timing
> out after only 72519 ns! (That is, 72 us, or well below the max write
> buffer time.)

So I can confirm that with the 1ms timeout, I actually am sometimes
timing out at 40 to 70 microseconds. I think this may have multiple
causes:
(1) uneven timer interrupts, as suggested by Huang?
(2) a jiffies timeout of 1 is two short (with HZ=1000, msecs_to_jiffies(1) is 1)

Regarding reason (2):

My thought (which matches with Imre's comments from his [1]) is that
one problem here is that we do not know how long it will be until the
*next* timer tick -- "waiting 1 jiffy" is really just waiting until
the next timer tick, which very well might be in 40us! So the correct
timeout calculation is something like:

uWriteTimeout = msecs_to_jiffies(1) + 1;

or with Imre's proposed methods (not merged upstream yet), just:

uWriteTimeout = msecs_to_jiffies_timeout(1);

Thoughts?

Note that a 2-jiffy timeout does not, in fact, totally resolve my
problems; with a timeout of 2 jiffies, I still get a timeout that
(according to getnstimeofday()) occurs after only 56us. It does
decrease its rate of occurrence, but Huang may still be right that
reason (1) is involved.

Brian

[1] http://marc.info/?l=linux-kernel&m=136854294730957

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 3/3] mtd: cfi_cmdset_0002: increase do_write_buffer() timeout
  2013-06-05 21:08         ` Brian Norris
@ 2013-06-05 22:39           ` Imre Deak
  -1 siblings, 0 replies; 12+ messages in thread
From: Imre Deak @ 2013-06-05 22:39 UTC (permalink / raw)
  To: Brian Norris
  Cc: Huang Shijie, Artem Bityutskiy, linux-mtd, Linux Kernel,
	Kevin Cernekee, Arnd Bergmann

On Wed, 2013-06-05 at 14:08 -0700, Brian Norris wrote:
> Adding a few others
> 
> For reference, this thread started with this patch:
> 
> http://lists.infradead.org/pipermail/linux-mtd/2013-June/047164.html
> 
> On Wed, Jun 5, 2013 at 11:01 AM, Brian Norris
> <computersforpeace@gmail.com> wrote:
> > On Tue, Jun 4, 2013 at 12:03 AM, Huang Shijie <b32955@freescale.com> wrote:
> >> 于 2013年06月04日 09:46, Brian Norris 写道:
> >>> After various tests, it seems simply that the timeout is not long enough
> >>> for my system; increasing it by a few jiffies prevented all failures
> >>> (testing for 12+ hours). There is no harm in increasing the timeout, but
> >>> there is harm in having it too short, as evidenced here.
> >>>
> >> I like the patch1 and patch 2.
> >>
> >> But extending the timeout from 1ms to 10ms is like a workaround. :)
> >
> > I was afraid you might say that; that's why I stuck the first two
> > patches first ;)
> ...
> >> I GUESS your problem is caused by the timer system, not the MTD code. I
> >> ever met this type of bug.
> ...
> >> I try to describe the jiffies bug with my poor english:
> >>
> >> [1] background:
> >> CONFIG_HZ=100, CONFIG_NO_HZ=y
> >>
> >> [2] call nand_wait() when we write a nand page.
> >>
> >> [3] The jiffies was not updated at a _even_ speed.
> >>
> >> In the nand_wait(), you wait for 20ms(2 jiffies) for a page write,
> >> and the timeout occurs during the page write. Of course, you think that
> >> we have already waited for 20ms.
> >> But in actually, we only waited for 1ms or less!
> >> How do i know this? I use the gettimeofday to check the real time when
> >> the timeout occur.
> >
> > I suspected this very type of thing, since this has come up in a few
> > different contexts. And for some time, with a number of different
> > checks, it appeared that this *wasn't* the case. But while writing
> > this very email, I had the bright idea that my time checkpoint was in
> > slightly the wrong place; so sure enough, I found that I was timing
> > out after only 72519 ns! (That is, 72 us, or well below the max write
> > buffer time.)
> 
> So I can confirm that with the 1ms timeout, I actually am sometimes
> timing out at 40 to 70 microseconds. I think this may have multiple
> causes:
> (1) uneven timer interrupts, as suggested by Huang?
> (2) a jiffies timeout of 1 is two short (with HZ=1000, msecs_to_jiffies(1) is 1)
> 
> Regarding reason (2):
> 
> My thought (which matches with Imre's comments from his [1]) is that
> one problem here is that we do not know how long it will be until the
> *next* timer tick -- "waiting 1 jiffy" is really just waiting until
> the next timer tick, which very well might be in 40us! So the correct
> timeout calculation is something like:
> 
> uWriteTimeout = msecs_to_jiffies(1) + 1;
> 
> or with Imre's proposed methods (not merged upstream yet), just:
> 
> uWriteTimeout = msecs_to_jiffies_timeout(1);
> 
> Thoughts?

I think what you describe at (2) wouldn't cause a premature timeout in
your case. The driver uses the returned jiffy value something like the
following in all cases (before applying the patch with the +1 change):

uWriteTimeout = msecs_to_jiffies(1);
timeout = jiffies + uWriteTimeout;
while (!condition)
	if (time_after(jiffies, timeout))
		return -ETIMEDOUT;

Here using time_after() as opposed to time_after_eq() serves as an
implicit +1 and thus guarantees that you wait at least 1 msec.

A bit off-topic:
Though using msecs_to_jiffies() is not a problem here, I think in this
case and almost always it would need less thinking and thus be safer to
still use msecs_to_jiffies_timeout(). A rare exception would be when the
+1 adjustment would accumulate to a significant error, like in the
following polling loop:

for (i = 0; i <= 50; i++) {
	if (poll_condition)
		return 0;
	schedule_timeout(msecs_to_jiffies(1));
}
return -ETIMEDOUT;

Here on HZ=1000 we would time out in average after 100 msec using
msecs_to_jiffies_timeout(1), whereas the intention was 50 msecs. 

--Imre

> Note that a 2-jiffy timeout does not, in fact, totally resolve my
> problems; with a timeout of 2 jiffies, I still get a timeout that
> (according to getnstimeofday()) occurs after only 56us. It does
> decrease its rate of occurrence, but Huang may still be right that
> reason (1) is involved.
> 
> Brian
> 
> [1] http://marc.info/?l=linux-kernel&m=136854294730957



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 3/3] mtd: cfi_cmdset_0002: increase do_write_buffer() timeout
@ 2013-06-05 22:39           ` Imre Deak
  0 siblings, 0 replies; 12+ messages in thread
From: Imre Deak @ 2013-06-05 22:39 UTC (permalink / raw)
  To: Brian Norris
  Cc: Arnd Bergmann, Artem Bityutskiy, Kevin Cernekee, Linux Kernel,
	Huang Shijie, linux-mtd

On Wed, 2013-06-05 at 14:08 -0700, Brian Norris wrote:
> Adding a few others
> 
> For reference, this thread started with this patch:
> 
> http://lists.infradead.org/pipermail/linux-mtd/2013-June/047164.html
> 
> On Wed, Jun 5, 2013 at 11:01 AM, Brian Norris
> <computersforpeace@gmail.com> wrote:
> > On Tue, Jun 4, 2013 at 12:03 AM, Huang Shijie <b32955@freescale.com> wrote:
> >> 于 2013年06月04日 09:46, Brian Norris 写道:
> >>> After various tests, it seems simply that the timeout is not long enough
> >>> for my system; increasing it by a few jiffies prevented all failures
> >>> (testing for 12+ hours). There is no harm in increasing the timeout, but
> >>> there is harm in having it too short, as evidenced here.
> >>>
> >> I like the patch1 and patch 2.
> >>
> >> But extending the timeout from 1ms to 10ms is like a workaround. :)
> >
> > I was afraid you might say that; that's why I stuck the first two
> > patches first ;)
> ...
> >> I GUESS your problem is caused by the timer system, not the MTD code. I
> >> ever met this type of bug.
> ...
> >> I try to describe the jiffies bug with my poor english:
> >>
> >> [1] background:
> >> CONFIG_HZ=100, CONFIG_NO_HZ=y
> >>
> >> [2] call nand_wait() when we write a nand page.
> >>
> >> [3] The jiffies was not updated at a _even_ speed.
> >>
> >> In the nand_wait(), you wait for 20ms(2 jiffies) for a page write,
> >> and the timeout occurs during the page write. Of course, you think that
> >> we have already waited for 20ms.
> >> But in actually, we only waited for 1ms or less!
> >> How do i know this? I use the gettimeofday to check the real time when
> >> the timeout occur.
> >
> > I suspected this very type of thing, since this has come up in a few
> > different contexts. And for some time, with a number of different
> > checks, it appeared that this *wasn't* the case. But while writing
> > this very email, I had the bright idea that my time checkpoint was in
> > slightly the wrong place; so sure enough, I found that I was timing
> > out after only 72519 ns! (That is, 72 us, or well below the max write
> > buffer time.)
> 
> So I can confirm that with the 1ms timeout, I actually am sometimes
> timing out at 40 to 70 microseconds. I think this may have multiple
> causes:
> (1) uneven timer interrupts, as suggested by Huang?
> (2) a jiffies timeout of 1 is two short (with HZ=1000, msecs_to_jiffies(1) is 1)
> 
> Regarding reason (2):
> 
> My thought (which matches with Imre's comments from his [1]) is that
> one problem here is that we do not know how long it will be until the
> *next* timer tick -- "waiting 1 jiffy" is really just waiting until
> the next timer tick, which very well might be in 40us! So the correct
> timeout calculation is something like:
> 
> uWriteTimeout = msecs_to_jiffies(1) + 1;
> 
> or with Imre's proposed methods (not merged upstream yet), just:
> 
> uWriteTimeout = msecs_to_jiffies_timeout(1);
> 
> Thoughts?

I think what you describe at (2) wouldn't cause a premature timeout in
your case. The driver uses the returned jiffy value something like the
following in all cases (before applying the patch with the +1 change):

uWriteTimeout = msecs_to_jiffies(1);
timeout = jiffies + uWriteTimeout;
while (!condition)
	if (time_after(jiffies, timeout))
		return -ETIMEDOUT;

Here using time_after() as opposed to time_after_eq() serves as an
implicit +1 and thus guarantees that you wait at least 1 msec.

A bit off-topic:
Though using msecs_to_jiffies() is not a problem here, I think in this
case and almost always it would need less thinking and thus be safer to
still use msecs_to_jiffies_timeout(). A rare exception would be when the
+1 adjustment would accumulate to a significant error, like in the
following polling loop:

for (i = 0; i <= 50; i++) {
	if (poll_condition)
		return 0;
	schedule_timeout(msecs_to_jiffies(1));
}
return -ETIMEDOUT;

Here on HZ=1000 we would time out in average after 100 msec using
msecs_to_jiffies_timeout(1), whereas the intention was 50 msecs. 

--Imre

> Note that a 2-jiffy timeout does not, in fact, totally resolve my
> problems; with a timeout of 2 jiffies, I still get a timeout that
> (according to getnstimeofday()) occurs after only 56us. It does
> decrease its rate of occurrence, but Huang may still be right that
> reason (1) is involved.
> 
> Brian
> 
> [1] http://marc.info/?l=linux-kernel&m=136854294730957

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 3/3] mtd: cfi_cmdset_0002: increase do_write_buffer() timeout
  2013-06-05 21:08         ` Brian Norris
@ 2013-06-06  2:20           ` Huang Shijie
  -1 siblings, 0 replies; 12+ messages in thread
From: Huang Shijie @ 2013-06-06  2:20 UTC (permalink / raw)
  To: Brian Norris
  Cc: Artem Bityutskiy, linux-mtd, Linux Kernel, Kevin Cernekee,
	Arnd Bergmann, Imre Deak

于 2013年06月06日 05:08, Brian Norris 写道:
> Note that a 2-jiffy timeout does not, in fact, totally resolve my
> problems; with a timeout of 2 jiffies, I still get a timeout that
> (according to getnstimeofday()) occurs after only 56us. It does
since the 2-jiffy does not resolve your problem, i suggest you try the
latest linux-next
tree.


thanks
Huang Shijie


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 3/3] mtd: cfi_cmdset_0002: increase do_write_buffer() timeout
@ 2013-06-06  2:20           ` Huang Shijie
  0 siblings, 0 replies; 12+ messages in thread
From: Huang Shijie @ 2013-06-06  2:20 UTC (permalink / raw)
  To: Brian Norris
  Cc: Arnd Bergmann, Artem Bityutskiy, Imre Deak, Kevin Cernekee,
	Linux Kernel, linux-mtd

于 2013年06月06日 05:08, Brian Norris 写道:
> Note that a 2-jiffy timeout does not, in fact, totally resolve my
> problems; with a timeout of 2 jiffies, I still get a timeout that
> (according to getnstimeofday()) occurs after only 56us. It does
since the 2-jiffy does not resolve your problem, i suggest you try the
latest linux-next
tree.


thanks
Huang Shijie

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2013-06-06  2:18 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-06-04  1:46 [PATCH 1/3] mtd: cfi_cmdset_0002: remove unnecessary initialization Brian Norris
2013-06-04  1:46 ` [PATCH 2/3] mtd: cfi_cmdset_0002: use msecs_to_jiffies() macro Brian Norris
2013-06-04  1:46 ` [PATCH 3/3] mtd: cfi_cmdset_0002: increase do_write_buffer() timeout Brian Norris
2013-06-04  7:03   ` Huang Shijie
2013-06-05 18:01     ` Brian Norris
2013-06-05 18:01       ` Brian Norris
2013-06-05 21:08       ` Brian Norris
2013-06-05 21:08         ` Brian Norris
2013-06-05 22:39         ` Imre Deak
2013-06-05 22:39           ` Imre Deak
2013-06-06  2:20         ` Huang Shijie
2013-06-06  2:20           ` Huang Shijie

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.