All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] mtd: Add locking to cfi_cmdset_001:do_getlockstatus_oneblock
@ 2015-11-06 13:49 Mark Marshall
  2015-11-06 14:49 ` Joakim Tjernlund
  2015-11-07 10:11 ` Joakim Tjernlund
  0 siblings, 2 replies; 5+ messages in thread
From: Mark Marshall @ 2015-11-06 13:49 UTC (permalink / raw)
  To: linux-mtd; +Cc: dwmw2

The function (do_getlockstatus_oneblock) switches the flash out of
array-read mode and into query mode.  It should not run in parallel to
another function that uses array-read mode.  We therefore need to
acquire the chip mutex and call get_chip(..., FL_JEDEC_QUERY).

I also assume that we should invalidate the cache in the same way that
we do for do_otp_read.

Signed-off-by: Mark Marshall <Mark.Marshall@omicron.at>
---

Hi all,

We have had a device in production (and operation) for a long time
running Linux and using a CFI NOR flash chip.  We have recently
started to get devices come back from the field with strange errors in
the JFFS2 file system that is on this flash.  It appeared that blocks
of files on the image were being dropped (not flash blocks, just 4K
sections from the middle of the files were being replaced with 0's).

After some debugging I found that sometimes, when the JFFS2 FS was
reading the JFFS2 inode block it would get an CRC error.  After adding
some debug code I could see that in the error cases the "second half"
of the read data was replaced with a fixed repeating pattern, and that
this pattern was part of the CFI "QRY" data (It contained the flash
manufacturer and device ID's).

This led me to think that maybe the problem was that some part of the
MTD code was switching the device out of the normal "array-read" mode
and into the query mode.  After looking through the mtd code I found
that the function 'do_getlockstatus_oneblock' does switch the mode of
the flash without taking the chip mutex.  Adding code to take the
mutex has fixed our problem.

I am slightly surprised that no one else has seen this bug - the only
thing I can think of is that reading the lock status of the flash is
not something that happens in normal operation very often?  We have a
process on the device that generates a hardware check report at boot
up, and one of the things that it does is report on the 'locked'
status of the flash.  This is always running while the JFFS2 code is
performing it's initial scan.

The bug is fairly easy to reproduce if you have a system with the
correct type of flash.  In one terminal run something like:

 while true ; do ( hd -n 65536 /dev/mtd0 | md5sum ) ; done

And in a second terminal run a program or utility that will check the
lock status of the flash (mtdinfo should work).  Assuming nothing else
is writing to the flash partition mtd0 the md5sum will remain
constant.  When the bug bites you can see the md5sum change (but then
recover).

Best Regards,

Mark Marshall

 drivers/mtd/chips/cfi_cmdset_0001.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/drivers/mtd/chips/cfi_cmdset_0001.c
b/drivers/mtd/chips/cfi_cmdset_0001.c
index 286b97a..d675efb 100644
--- a/drivers/mtd/chips/cfi_cmdset_0001.c
+++ b/drivers/mtd/chips/cfi_cmdset_0001.c
@@ -2051,13 +2051,32 @@ static int __xipram
do_getlockstatus_oneblock(struct map_info *map,
 {
 	struct cfi_private *cfi = map->fldrv_priv;
 	int status, ofs_factor = cfi->interleave * cfi->device_type;
+	int ret;

 	adr += chip->start;
+
+	mutex_lock(&chip->mutex);
+	ret = get_chip(map, chip, adr, FL_JEDEC_QUERY);
+	if (ret) {
+		mutex_unlock(&chip->mutex);
+		return ret;
+	}
+
+	/* let's ensure we're not reading back cached data from array mode */
+	INVALIDATE_CACHED_RANGE(map, adr+(2*ofs_factor), 1);
+
 	xip_disable(map, chip, adr+(2*ofs_factor));
 	map_write(map, CMD(0x90), adr+(2*ofs_factor));
 	chip->state = FL_JEDEC_QUERY;
 	status = cfi_read_query(map, adr+(2*ofs_factor));
 	xip_enable(map, chip, 0);
+
+	/* then ensure we don't keep query data in the cache */
+	INVALIDATE_CACHED_RANGE(map, adr+(2*ofs_factor), 1);
+
+	put_chip(map, chip, adr);
+	mutex_unlock(&chip->mutex);
+
 	return status;
 }

-- 
1.9.1

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH] mtd: Add locking to cfi_cmdset_001:do_getlockstatus_oneblock
  2015-11-06 13:49 [PATCH] mtd: Add locking to cfi_cmdset_001:do_getlockstatus_oneblock Mark Marshall
@ 2015-11-06 14:49 ` Joakim Tjernlund
  2015-11-07 10:11 ` Joakim Tjernlund
  1 sibling, 0 replies; 5+ messages in thread
From: Joakim Tjernlund @ 2015-11-06 14:49 UTC (permalink / raw)
  To: markmarshall14, linux-mtd; +Cc: dwmw2

On Fri, 2015-11-06 at 14:49 +0100, Mark Marshall wrote:
> The function (do_getlockstatus_oneblock) switches the flash out of
> array-read mode and into query mode.  It should not run in parallel to
> another function that uses array-read mode.  We therefore need to
> acquire the chip mutex and call get_chip(..., FL_JEDEC_QUERY).
> 
> I also assume that we should invalidate the cache in the same way that
> we do for do_otp_read.
> 
> Signed-off-by: Mark Marshall <Mark.Marshall@omicron.at>
> ---
> 
> Hi all,
> 
> We have had a device in production (and operation) for a long time
> running Linux and using a CFI NOR flash chip.  We have recently
> started to get devices come back from the field with strange errors in
> the JFFS2 file system that is on this flash.  It appeared that blocks
> of files on the image were being dropped (not flash blocks, just 4K
> sections from the middle of the files were being replaced with 0's).
> 
> After some debugging I found that sometimes, when the JFFS2 FS was
> reading the JFFS2 inode block it would get an CRC error.  After adding
> some debug code I could see that in the error cases the "second half"
> of the read data was replaced with a fixed repeating pattern, and that
> this pattern was part of the CFI "QRY" data (It contained the flash
> manufacturer and device ID's).
> 
> This led me to think that maybe the problem was that some part of the
> MTD code was switching the device out of the normal "array-read" mode
> and into the query mode.  After looking through the mtd code I found
> that the function 'do_getlockstatus_oneblock' does switch the mode of
> the flash without taking the chip mutex.  Adding code to take the
> mutex has fixed our problem.
> 
> I am slightly surprised that no one else has seen this bug - the only
> thing I can think of is that reading the lock status of the flash is
> not something that happens in normal operation very often?  We have a
> process on the device that generates a hardware check report at boot
> up, and one of the things that it does is report on the 'locked'
> status of the flash.  This is always running while the JFFS2 code is
> performing it's initial scan.

We have! Been looking at it for a while now, thinking it is an bug
in erase suspend for small blocks. We even contacted Micron for confirmation.
Even got at few patches to address som other issues we found during troubleshooting this.

I will test your patch 

> 
> The bug is fairly easy to reproduce if you have a system with the
> correct type of flash.  In one terminal run something like:
> 
>  while true ; do ( hd -n 65536 /dev/mtd0 | md5sum ) ; done
> 
> And in a second terminal run a program or utility that will check the
> lock status of the flash (mtdinfo should work).  Assuming nothing else
> is writing to the flash partition mtd0 the md5sum will remain
> constant.  When the bug bites you can see the md5sum change (but then
> recover).
> 
> Best Regards,
> 
> Mark Marshall
> 
>  drivers/mtd/chips/cfi_cmdset_0001.c | 19 +++++++++++++++++++
>  1 file changed, 19 insertions(+)
> 
> diff --git a/drivers/mtd/chips/cfi_cmdset_0001.c
> b/drivers/mtd/chips/cfi_cmdset_0001.c
> index 286b97a..d675efb 100644
> --- a/drivers/mtd/chips/cfi_cmdset_0001.c
> +++ b/drivers/mtd/chips/cfi_cmdset_0001.c
> @@ -2051,13 +2051,32 @@ static int __xipram
> do_getlockstatus_oneblock(struct map_info *map,
>  {
>  	struct cfi_private *cfi = map->fldrv_priv;
>  	int status, ofs_factor = cfi->interleave * cfi->device_type;
> +	int ret;
> 
>  	adr += chip->start;
> +
> +	mutex_lock(&chip->mutex);
> +	ret = get_chip(map, chip, adr, FL_JEDEC_QUERY);
> +	if (ret) {
> +		mutex_unlock(&chip->mutex);
> +		return ret;
> +	}
> +
> +	/* let's ensure we're not reading back cached data from array mode */
> +	INVALIDATE_CACHED_RANGE(map, adr+(2*ofs_factor), 1);
> +
>  	xip_disable(map, chip, adr+(2*ofs_factor));
>  	map_write(map, CMD(0x90), adr+(2*ofs_factor));
>  	chip->state = FL_JEDEC_QUERY;
>  	status = cfi_read_query(map, adr+(2*ofs_factor));
>  	xip_enable(map, chip, 0);
> +
> +	/* then ensure we don't keep query data in the cache */
> +	INVALIDATE_CACHED_RANGE(map, adr+(2*ofs_factor), 1);
> +
> +	put_chip(map, chip, adr);
> +	mutex_unlock(&chip->mutex);
> +
>  	return status;
>  }
> 

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] mtd: Add locking to cfi_cmdset_001:do_getlockstatus_oneblock
  2015-11-06 13:49 [PATCH] mtd: Add locking to cfi_cmdset_001:do_getlockstatus_oneblock Mark Marshall
  2015-11-06 14:49 ` Joakim Tjernlund
@ 2015-11-07 10:11 ` Joakim Tjernlund
  2015-11-07 21:52   ` Mark Marshall
  1 sibling, 1 reply; 5+ messages in thread
From: Joakim Tjernlund @ 2015-11-07 10:11 UTC (permalink / raw)
  To: markmarshall14, linux-mtd; +Cc: dwmw2

On Fri, 2015-11-06 at 14:49 +0100, Mark Marshall wrote:
> The function (do_getlockstatus_oneblock) switches the flash out of
> array-read mode and into query mode.  It should not run in parallel to
> another function that uses array-read mode.  We therefore need to
> acquire the chip mutex and call get_chip(..., FL_JEDEC_QUERY).
> 
Hmm, this mail has NL breaks in it for long lines and won't apply.
Is it on my end or yours?

>  drivers/mtd/chips/cfi_cmdset_0001.c | 19 +++++++++++++++++++
>  1 file changed, 19 insertions(+)
> 
> diff --git a/drivers/mtd/chips/cfi_cmdset_0001.c
> b/drivers/mtd/chips/cfi_cmdset_0001.c
> index 286b97a..d675efb 100644
> --- a/drivers/mtd/chips/cfi_cmdset_0001.c
> +++ b/drivers/mtd/chips/cfi_cmdset_0001.c
> @@ -2051,13 +2051,32 @@ static int __xipram
> do_getlockstatus_oneblock(struct map_info *map,
>  {
>  > 	> struct cfi_private *cfi = map->fldrv_priv;
>  > 	> int status, ofs_factor = cfi->interleave * cfi->device_type;
> +> 	> int ret;
> 
>  > 	> adr += chip->start;
> +
> +> 	> mutex_lock(&chip->mutex);
> +> 	> ret = get_chip(map, chip, adr, FL_JEDEC_QUERY);
> +> 	> if (ret) {
> +> 	> 	> mutex_unlock(&chip->mutex);
> +> 	> 	> return ret;
> +> 	> }
> +
> +> 	> /* let's ensure we're not reading back cached data from array mode */
> +> 	> INVALIDATE_CACHED_RANGE(map, adr+(2*ofs_factor), 1);
> +
>  > 	> xip_disable(map, chip, adr+(2*ofs_factor));
>  > 	> map_write(map, CMD(0x90), adr+(2*ofs_factor));
>  > 	> chip->state = FL_JEDEC_QUERY;
>  > 	> status = cfi_read_query(map, adr+(2*ofs_factor));
>  > 	> xip_enable(map, chip, 0);
> +
> +> 	> /* then ensure we don't keep query data in the cache */
> +> 	> INVALIDATE_CACHED_RANGE(map, adr+(2*ofs_factor), 1);
> +
> +> 	> put_chip(map, chip, adr);
> +> 	> mutex_unlock(&chip->mutex);
> +
>  > 	> return status;
>  }
> 

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] mtd: Add locking to cfi_cmdset_001:do_getlockstatus_oneblock
  2015-11-07 10:11 ` Joakim Tjernlund
@ 2015-11-07 21:52   ` Mark Marshall
  2015-11-09 17:37     ` Joakim Tjernlund
  0 siblings, 1 reply; 5+ messages in thread
From: Mark Marshall @ 2015-11-07 21:52 UTC (permalink / raw)
  To: Joakim Tjernlund, linux-mtd; +Cc: dwmw2



On 07/11/15 11:11, Joakim Tjernlund wrote:
> On Fri, 2015-11-06 at 14:49 +0100, Mark Marshall wrote:
>> The function (do_getlockstatus_oneblock) switches the flash out of
>> array-read mode and into query mode.  It should not run in parallel to
>> another function that uses array-read mode.  We therefore need to
>> acquire the chip mutex and call get_chip(..., FL_JEDEC_QUERY).
>>
> Hmm, this mail has NL breaks in it for long lines and won't apply.
> Is it on my end or yours?
> 

Sorry, it was on my end, here it is again.

>From 49a50d75c7f4eed7a367d35f11afcdbd4573d193 Mon Sep 17 00:00:00 2001
From: Mark Marshall <mark.marshall@omicron.at>
Date: Fri, 6 Nov 2015 14:09:49 +0100
Subject: [PATCH] mtd: Add locking to cfi_cmdset_001:do_getlockstatus_oneblock

The function (do_getlockstatus_oneblock) switches the flash out of
array-read mode and into query mode.  It should not run in parallel to
another function that uses array-read mode.  We therefore need to
acquire the chip mutex and call get_chip(..., FL_JEDEC_QUERY).

I also assume that we should invalidate the cache in the same way that
we do for do_otp_read.

Signed-off-by: Mark Marshall <Mark.Marshall@omicron.at>
---

Hi all,

We have had a device in production (and operation) for a long time
running Linux and using a CFI NOR flash chip.  We have recently
started to get devices come back from the field with strange errors in
the JFFS2 file system that is on this flash.  It appeared that blocks
of files on the image were being dropped (not flash blocks, just 4K
sections from the middle of the files were being replaced with 0's).

After some debugging I found that sometimes, when the JFFS2 FS was
reading the JFFS2 inode block it would get an CRC error.  After adding
some debug code I could see that in the error cases the "second half"
of the read data was replaced with a fixed repeating pattern, and that
this pattern was part of the CFI "QRY" data (It contained the flash
manufacturer and device ID's).

This led me to think that maybe the problem was that some part of the
MTD code was switching the device out of the normal "array-read" mode
and into the query mode.  After looking through the mtd code I found
that the function 'do_getlockstatus_oneblock' does switch the mode of
the flash without taking the chip mutex.  Adding code to take the
mutex has fixed our problem.

I am slightly surprised that no one else has seen this bug - the only
thing I can think of is that reading the lock status of the flash is
not something that happens in normal operation very often?  We have a
process on the device that generates a hardware check report at boot
up, and one of the things that it does is report on the 'locked'
status of the flash.  This is always running while the JFFS2 code is
performing it's initial scan.

The bug is fairly easy to reproduce if you have a system with the
correct type of flash.  In one terminal run something like:

 while true ; do ( hd -n 65536 /dev/mtd0 | md5sum ) ; done

And in a second terminal run a program or utility that will check the
lock status of the flash (mtdinfo should work).  Assuming nothing else
is writing to the flash partition mtd0 the md5sum will remain
constant.  When the bug bites you can see the md5sum change (but then
recover).

Best Regards,

Mark Marshall

(mark dot marshall at omicron dot at)

 drivers/mtd/chips/cfi_cmdset_0001.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/drivers/mtd/chips/cfi_cmdset_0001.c b/drivers/mtd/chips/cfi_cmdset_0001.c
index 286b97a..d675efb 100644
--- a/drivers/mtd/chips/cfi_cmdset_0001.c
+++ b/drivers/mtd/chips/cfi_cmdset_0001.c
@@ -2051,13 +2051,32 @@ static int __xipram do_getlockstatus_oneblock(struct map_info *map,
 {
 	struct cfi_private *cfi = map->fldrv_priv;
 	int status, ofs_factor = cfi->interleave * cfi->device_type;
+	int ret;
 
 	adr += chip->start;
+
+	mutex_lock(&chip->mutex);
+	ret = get_chip(map, chip, adr, FL_JEDEC_QUERY);
+	if (ret) {
+		mutex_unlock(&chip->mutex);
+		return ret;
+	}
+
+	/* let's ensure we're not reading back cached data from array mode */
+	INVALIDATE_CACHED_RANGE(map, adr+(2*ofs_factor), 1);
+
 	xip_disable(map, chip, adr+(2*ofs_factor));
 	map_write(map, CMD(0x90), adr+(2*ofs_factor));
 	chip->state = FL_JEDEC_QUERY;
 	status = cfi_read_query(map, adr+(2*ofs_factor));
 	xip_enable(map, chip, 0);
+
+	/* then ensure we don't keep query data in the cache */
+	INVALIDATE_CACHED_RANGE(map, adr+(2*ofs_factor), 1);
+
+	put_chip(map, chip, adr);
+	mutex_unlock(&chip->mutex);
+
 	return status;
 }
 
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH] mtd: Add locking to cfi_cmdset_001:do_getlockstatus_oneblock
  2015-11-07 21:52   ` Mark Marshall
@ 2015-11-09 17:37     ` Joakim Tjernlund
  0 siblings, 0 replies; 5+ messages in thread
From: Joakim Tjernlund @ 2015-11-09 17:37 UTC (permalink / raw)
  To: markmarshall14, linux-mtd; +Cc: dwmw2

On Sat, 2015-11-07 at 22:52 +0100, Mark Marshall wrote:
> 
> On 07/11/15 11:11, Joakim Tjernlund wrote:
> > On Fri, 2015-11-06 at 14:49 +0100, Mark Marshall wrote:
> > > The function (do_getlockstatus_oneblock) switches the flash out of
> > > array-read mode and into query mode.  It should not run in parallel to
> > > another function that uses array-read mode.  We therefore need to
> > > acquire the chip mutex and call get_chip(..., FL_JEDEC_QUERY).
> > > 
> > Hmm, this mail has NL breaks in it for long lines and won't apply.
> > Is it on my end or yours?
> > 
> 
> Sorry, it was on my end, here it is again.
> 
> From 49a50d75c7f4eed7a367d35f11afcdbd4573d193 Mon Sep 17 00:00:00 2001
> From: Mark Marshall <mark.marshall@omicron.at>
> Date: Fri, 6 Nov 2015 14:09:49 +0100
> Subject: [PATCH] mtd: Add locking to cfi_cmdset_001:do_getlockstatus_oneblock
> 
> The function (do_getlockstatus_oneblock) switches the flash out of
> array-read mode and into query mode.  It should not run in parallel to
> another function that uses array-read mode.  We therefore need to
> acquire the chip mutex and call get_chip(..., FL_JEDEC_QUERY).
> 
> I also assume that we should invalidate the cache in the same way that
> we do for do_otp_read.
> 
> Signed-off-by: Mark Marshall <Mark.Marshall@omicron.at>

Acked-by: Joakim Tjernlund <joakim.tjernlund@transmode.se>

> ---

I tested and reviewed this patch.
Unfortunately it didn't fix our problem but I see now that we have a different bug.

 Jocke

> 
> Hi all,
> 
> We have had a device in production (and operation) for a long time
> running Linux and using a CFI NOR flash chip.  We have recently
> started to get devices come back from the field with strange errors in
> the JFFS2 file system that is on this flash.  It appeared that blocks
> of files on the image were being dropped (not flash blocks, just 4K
> sections from the middle of the files were being replaced with 0's).
> 
> After some debugging I found that sometimes, when the JFFS2 FS was
> reading the JFFS2 inode block it would get an CRC error.  After adding
> some debug code I could see that in the error cases the "second half"
> of the read data was replaced with a fixed repeating pattern, and that
> this pattern was part of the CFI "QRY" data (It contained the flash
> manufacturer and device ID's).
> 
> This led me to think that maybe the problem was that some part of the
> MTD code was switching the device out of the normal "array-read" mode
> and into the query mode.  After looking through the mtd code I found
> that the function 'do_getlockstatus_oneblock' does switch the mode of
> the flash without taking the chip mutex.  Adding code to take the
> mutex has fixed our problem.
> 
> I am slightly surprised that no one else has seen this bug - the only
> thing I can think of is that reading the lock status of the flash is
> not something that happens in normal operation very often?  We have a
> process on the device that generates a hardware check report at boot
> up, and one of the things that it does is report on the 'locked'
> status of the flash.  This is always running while the JFFS2 code is
> performing it's initial scan.
> 
> The bug is fairly easy to reproduce if you have a system with the
> correct type of flash.  In one terminal run something like:
> 
>  while true ; do ( hd -n 65536 /dev/mtd0 | md5sum ) ; done
> 
> And in a second terminal run a program or utility that will check the
> lock status of the flash (mtdinfo should work).  Assuming nothing else
> is writing to the flash partition mtd0 the md5sum will remain
> constant.  When the bug bites you can see the md5sum change (but then
> recover).
> 
> Best Regards,
> 
> Mark Marshall
> 
> (mark dot marshall at omicron dot at)
> 
>  drivers/mtd/chips/cfi_cmdset_0001.c | 19 +++++++++++++++++++
>  1 file changed, 19 insertions(+)
> 
> diff --git a/drivers/mtd/chips/cfi_cmdset_0001.c b/drivers/mtd/chips/cfi_cmdset_0001.c
> index 286b97a..d675efb 100644
> --- a/drivers/mtd/chips/cfi_cmdset_0001.c
> +++ b/drivers/mtd/chips/cfi_cmdset_0001.c
> @@ -2051,13 +2051,32 @@ static int __xipram do_getlockstatus_oneblock(struct map_info *map,
>  {
>  	struct cfi_private *cfi = map->fldrv_priv;
>  	int status, ofs_factor = cfi->interleave * cfi->device_type;
> +	int ret;
>  
>  	adr += chip->start;
> +
> +	mutex_lock(&chip->mutex);
> +	ret = get_chip(map, chip, adr, FL_JEDEC_QUERY);
> +	if (ret) {
> +		mutex_unlock(&chip->mutex);
> +		return ret;
> +	}
> +
> +	/* let's ensure we're not reading back cached data from array mode */
> +	INVALIDATE_CACHED_RANGE(map, adr+(2*ofs_factor), 1);
> +
>  	xip_disable(map, chip, adr+(2*ofs_factor));
>  	map_write(map, CMD(0x90), adr+(2*ofs_factor));
>  	chip->state = FL_JEDEC_QUERY;
>  	status = cfi_read_query(map, adr+(2*ofs_factor));
>  	xip_enable(map, chip, 0);
> +
> +	/* then ensure we don't keep query data in the cache */
> +	INVALIDATE_CACHED_RANGE(map, adr+(2*ofs_factor), 1);
> +
> +	put_chip(map, chip, adr);
> +	mutex_unlock(&chip->mutex);
> +
>  	return status;
>  }
>  

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2015-11-09 17:38 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-11-06 13:49 [PATCH] mtd: Add locking to cfi_cmdset_001:do_getlockstatus_oneblock Mark Marshall
2015-11-06 14:49 ` Joakim Tjernlund
2015-11-07 10:11 ` Joakim Tjernlund
2015-11-07 21:52   ` Mark Marshall
2015-11-09 17:37     ` Joakim Tjernlund

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.