RE: Re: [PATCH] cfi: fix deadloop in cfi_cmdset_0002.c do_write_buffer

From: "Sobon, Przemyslaw" <psobon@amazon.com>
To: "ikegami_to@yahoo.co.jp" <ikegami_to@yahoo.co.jp>,
	Boris Brezillon <boris.brezillon@collabora.com>
Cc: "keescook@chromium.org" <keescook@chromium.org>,
	"marek.vasut@gmail.com" <marek.vasut@gmail.com>,
	"ikegami@allied-telesis.co.jp" <ikegami@allied-telesis.co.jp>,
	"richard@nod.at" <richard@nod.at>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"joakim.tjernlund@infinera.com" <joakim.tjernlund@infinera.com>,
	"linux-mtd@lists.infradead.org" <linux-mtd@lists.infradead.org>,
	"computersforpeace@gmail.com" <computersforpeace@gmail.com>,
	"dwmw2@infradead.org" <dwmw2@infradead.org>,
	Liu Jian <liujian56@huawei.com>
Subject: RE: Re: [PATCH] cfi: fix deadloop in cfi_cmdset_0002.c do_write_buffer
Date: Thu, 7 Feb 2019 23:50:34 +0000
Message-ID: <632ed76bd3844ceab75066d1f30a7115@EX13D07UWA001.ant.amazon.com>
In-Reply-To: <193621849.44066.1549580387922.JavaMail.yahoo@mail.yahoo.co.jp>

Hi Ikegami,

I have seen a case myself where a value was written and the chip changed
state to "ready", but the value I read back was incorrect.
This can happen as a result of an intermittent issue with the flash. It is
hard to hit this scenario when testing on a limited number of devices,
but with a large enough population you can see it. Another situation
is when a flash chip reaches its maximum number of writes. For
example, say a chip is designed for 100k writes to a page. Once you
reach that number of writes, invalid data can be written to
flash while the chip itself reports that everything was good and switches
to the "ready" state.
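
To make the difference concrete, here is a minimal sketch of the two
polling strategies discussed in this thread. It is not the driver code:
struct mock_flash and every mock_*/poll_* name are hypothetical, the
"chip" is an in-memory stand-in, and polls stand in for the 1us delay
steps. It assumes the data is stable as soon as the chip reports ready.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical in-memory stand-in for one flash word (not struct map_info). */
struct mock_flash {
	uint16_t cell;   /* value the cell ends up holding after the write */
	int busy_ticks;  /* delay steps until the chip reports "ready" */
};

/* "ready": the program operation has finished, whatever the result. */
static bool mock_chip_ready(const struct mock_flash *f)
{
	return f->busy_ticks == 0;
}

/* "good": the chip is ready AND the cell holds the expected value. */
static bool mock_chip_good(const struct mock_flash *f, uint16_t expected)
{
	return mock_chip_ready(f) && f->cell == expected;
}

/* Stand-in for a short delay between polls: one tick of time passes. */
static void mock_delay(struct mock_flash *f)
{
	if (f->busy_ticks > 0)
		f->busy_ticks--;
}

/*
 * Strategy A: poll only for "good" until the poll budget runs out.
 * A write that finished with wrong data burns the whole budget.
 */
static int poll_good_only(struct mock_flash *f, uint16_t expected,
			  int max_polls, int *polls_used)
{
	for (int polls = 1; polls <= max_polls; polls++) {
		if (mock_chip_good(f, expected)) {
			*polls_used = polls;
			return 0;
		}
		mock_delay(f);
	}
	*polls_used = max_polls;
	return -1;
}

/*
 * Strategy B: wait for "ready", then verify the data once.
 * A write that finished with wrong data fails right away.
 */
static int poll_ready_then_verify(struct mock_flash *f, uint16_t expected,
				  int max_polls, int *polls_used)
{
	for (int polls = 1; polls <= max_polls; polls++) {
		if (mock_chip_ready(f)) {
			*polls_used = polls;
			return f->cell == expected ? 0 : -1;
		}
		mock_delay(f);
	}
	*polls_used = max_polls;
	return -1;
}
```

With a mock chip that becomes ready after one tick but holds the wrong
value, strategy A uses the full poll budget before failing, while
strategy B fails on the second poll. If real chips need extra time after
"ready" before the data reads back correctly, strategy B could report a
failure too early; the sketch does not model that.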

I hope this explanation is clear. Please let me know if it is not.

Regards,
Przemek

> -----Original Message-----
> From: ikegami_to@yahoo.co.jp <ikegami_to@yahoo.co.jp> 
> Sent: Thursday, February 7, 2019 3:00 PM
> 
> Hi Przemek-san,
> 
> Could you please explain in detail the case where the value is written incorrectly?
> I think the value is always written correctly unless there is a bug.
> 
> Regards,
> Ikegami
> 
> --- boris.brezillon@collabora.com wrote --- :
> > Hi Sobon,
> > 
> > On Tue, 5 Feb 2019 22:28:44 +0000
> > "Sobon, Przemyslaw" <psobon@amazon.com> wrote:
> > 
> > > > From: Boris Brezillon <bbrezillon@kernel.org>
> > > > Sent: Sunday, February 3, 2019 12:35 AM
> > > > > +Przemyslaw
> > > > > 
> > > > > On Fri, 1 Feb 2019 07:30:39 +0800 Liu Jian 
> > > > > <liujian56@huawei.com> wrote:
> > > > >   
> > > > > > > In function do_write_buffer(), in the for loop, there is a case
> > > > > > > where chip_ready() returns 1 while chip_good() returns 0, so the
> > > > > > > loop never breaks.
> > > > > > > To fix this, checking chip_good() is enough, and the loop should
> > > > > > > time out if the chip stays bad for a while.
> > > > > 
> > > > > Looks like Przemyslaw reported and fixed the same problem.
> > > > >   
> > > > > > 
> > > > > > Fixes: dfeae1073583(mtd: cfi_cmdset_0002: Change write buffer 
> > > > > > to check correct value)
> > > > > 
> > > > > > Can you put the Fixes tag on a single line? The format is
> > > > > 
> > > > > Fixes: <hash> ("message")
> > > > >   
> > > > > > Signed-off-by: Yi Huaijie <yihuaijie@huawei.com>
> > > > > > Signed-off-by: Liu Jian <liujian56@huawei.com>
> > > > > 
> > > > > [1]http://patchwork.ozlabs.org/patch/1025566/
> > > > >   
> > > > > > ---
> > > > > >  drivers/mtd/chips/cfi_cmdset_0002.c | 6 +++---
> > > > > >  1 file changed, 3 insertions(+), 3 deletions(-)
> > > > > > 
> > > > > > diff --git a/drivers/mtd/chips/cfi_cmdset_0002.c
> > > > > > b/drivers/mtd/chips/cfi_cmdset_0002.c
> > > > > > index 72428b6..818e94b 100644
> > > > > > --- a/drivers/mtd/chips/cfi_cmdset_0002.c
> > > > > > +++ b/drivers/mtd/chips/cfi_cmdset_0002.c
> > > > > > @@ -1876,14 +1876,14 @@ static int __xipram do_write_buffer(struct map_info *map, struct flchip *chip,
> > > > > >              continue;
> > > > > >          }
> > > > > >  
> > > > > > -        if (time_after(jiffies, timeo) && !chip_ready(map, adr))
> > > > > > -            break;
> > > > > > -
> > > > > >          if (chip_good(map, adr, datum)) {
> > > > > >              xip_enable(map, chip, adr);
> > > > > >              goto op_done;
> > > > > >          }
> > > > > >  
> > > > > > +        if (time_after(jiffies, timeo))
> > > > > > +            break;
> > > > > > +
> > > > > >          /* Latency issues. Drop the lock, wait a while and retry */
> > > > > >          UDELAY(map, chip, adr, 1);
> > > > > >      }
> > > > >   
> > > > 
> > > > BTW, the patch itself looks good to me. Ikegami, can you confirm it does the right thing?
> > > > 
> > > > Thanks,
> > > > 
> > > > Boris
> > > >   
> > > 
> > > One comment on this patch. If the value is written incorrectly early on,
> > > we will be stuck in the loop even though nothing is going to change.
> > > For example, if a value was written incorrectly after 1us and the loop
> > > timeout was set to 1ms, the function will return only after 1ms. This
> > > solution is not optimized for performance. I considered the same approach
> > > when working on this change and decided to do it a different way.
> > 
> > Seems like you're right if we assume that checking for GOOD state does 
> > not require a delay after the READY check, but if that's not the case 
> > and an extra delay is actually required, you might end up with a BAD 
> > status while it could have turned GOOD at some point with the 'check 
> > only for GOOD state until we timeout' approach.
> > 
> > TBH, I don't know how CFI flashes work, so I'll let you guys sort this 
> > out.
> > 
> > Regards,
> > 
> > Boris
> > 
> > ______________________________________________________
> > Linux MTD discussion mailing list
> > http://lists.infradead.org/mailman/listinfo/linux-mtd/
> > 
> 
>