All of lore.kernel.org
 help / color / mirror / Atom feed
* [f2fs-dev] [PATCH] fsck: Fix data loss issue observed during SPO
@ 2019-07-05  5:01 Sahitya Tummala
  2019-07-05 10:11 ` Chao Yu
  0 siblings, 1 reply; 14+ messages in thread
From: Sahitya Tummala @ 2019-07-05  5:01 UTC (permalink / raw)
  To: Jaegeuk Kim, Chao Yu, linux-f2fs-devel

With the commit <979b25727800> ("fsck.f2fs: check validity of nat journal"),
there is a serious data loss issue observed during sudden power off tests.

The reason is due to incorrect update of journal->n_nats in that patch.
When fsck encounters a nat journal entry with invalid blkaddr/nid, it
updates journal->n_nats with the index of such first entry and then continues
the for loop which scans all nats in journal in f2fs_init_nid_bitmap().
But that loop doesn't continue and doesn't update nm_i->nid_bitmap as well,
as it fails the for loop condition i < nats_in_cursum(journal).

There could be several other valid entries in the journal after this
invalid entry. Due to this invalid journal->n_nats, build_nat_area_bitmap()
could not be build properly for valid nat journal entries. It further results
into these fsck asserts/errors and causes data loss due to incorrect fsck fix.

[ASSERT] (sanity_check_nid: 372)  --> nid[0xfb01] ino is 0\x0a
[FIX] (__chk_dentries:1438)  --> Unlink [0xfb01] - _JeNcl242yB3Apz2MW,VOh6WmjsVht1W len[0x1c], type[0x1]\x0a
[ASSERT] (sanity_check_nid: 372)  --> nid[0xfa2d] ino is 0\x0a
[FIX] (__chk_dentries:1438)  --> Unlink [0xfa2d] - _++vN7axccjjGNI,TQEVmcOemY1EAAAA len[0x14], type[0x1]\x0a
...
NID[0xba5] is unreachable\x0a
NID[0xfa9] is unreachable\x0a
...
[FIX] (nullify_nat_entry:2089)  --> Remove nid [0xba5] in NAT\x0a
[FIX] (nullify_nat_entry:2089)  --> Remove nid [0xfa9] in NAT\x0a
...

Fixes: 979b25727800 ("fsck.f2fs: check validity of nat journal")
Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
---
 fsck/mount.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/fsck/mount.c b/fsck/mount.c
index 60e0e4a..1b93f49 100644
--- a/fsck/mount.c
+++ b/fsck/mount.c
@@ -1160,7 +1160,6 @@ static int f2fs_init_nid_bitmap(struct f2fs_sb_info *sbi)
 		addr = le32_to_cpu(nat_in_journal(journal, i).block_addr);
 		if (!IS_VALID_BLK_ADDR(sbi, addr)) {
 			MSG(0, "\tError: f2fs_init_nid_bitmap: addr(%u) is invalid!!!\n", addr);
-			journal->n_nats = cpu_to_le16(i);
 			c.fix_on = 1;
 			continue;
 		}
@@ -1168,7 +1167,6 @@ static int f2fs_init_nid_bitmap(struct f2fs_sb_info *sbi)
 		nid = le32_to_cpu(nid_in_journal(journal, i));
 		if (!IS_VALID_NID(sbi, nid)) {
 			MSG(0, "\tError: f2fs_init_nid_bitmap: nid(%u) is invalid!!!\n", nid);
-			journal->n_nats = cpu_to_le16(i);
 			c.fix_on = 1;
 			continue;
 		}
-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.



_______________________________________________
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [f2fs-dev] [PATCH] fsck: Fix data loss issue observed during SPO
  2019-07-05  5:01 [f2fs-dev] [PATCH] fsck: Fix data loss issue observed during SPO Sahitya Tummala
@ 2019-07-05 10:11 ` Chao Yu
  2019-07-05 11:03   ` Sahitya Tummala
  0 siblings, 1 reply; 14+ messages in thread
From: Chao Yu @ 2019-07-05 10:11 UTC (permalink / raw)
  To: Sahitya Tummala, Jaegeuk Kim, linux-f2fs-devel

Hi Sahitya,

On 2019/7/5 13:01, Sahitya Tummala wrote:
> With the commit <979b25727800> ("fsck.f2fs: check validity of nat journal"),
> there is a serious data loss issue observed during sudden power off tests.

In that patch, since journal's data is untrusty, I just truncated n_nats to last
valid one to fix original reported issue.

Actually, the problem here is we don't know which data is correct and which data
is corrupted in nat or sit journal, so even we skip broken entry without
modifying journal->n_{nats,sits}, latter entry may still contain broken data(but
it may pass validation check), and those entries may cause potential data
loss/corruption...

It looks no matter how we change the codes, we still may face data
loos/corruption issues.

IMO, we need to find out the root cause of data corruption, So my question is
that why sudden power off tests can make nat journal being corrupted? I doubt
that in-place update method on checkpoint area of fsck may cause this... any
other thoughts?

Thanks,

> 
> The reason is due to incorrect update of journal->n_nats in that patch.
> When fsck encounters a nat journal entry with invalid blkaddr/nid, it
> updates journal->n_nats with the index of such first entry and then continues
> the for loop which scans all nats in journal in f2fs_init_nid_bitmap().
> But that loop doesn't continue and doesn't update nm_i->nid_bitmap as well,
> as it fails the for loop condition i < nats_in_cursum(journal).
> 
> There could be several other valid entries in the journal after this
> invalid entry. Due to this invalid journal->n_nats, build_nat_area_bitmap()
> could not be build properly for valid nat journal entries. It further results
> into these fsck asserts/errors and causes data loss due to incorrect fsck fix.
> 
> [ASSERT] (sanity_check_nid: 372)  --> nid[0xfb01] ino is 0\x0a
> [FIX] (__chk_dentries:1438)  --> Unlink [0xfb01] - _JeNcl242yB3Apz2MW,VOh6WmjsVht1W len[0x1c], type[0x1]\x0a
> [ASSERT] (sanity_check_nid: 372)  --> nid[0xfa2d] ino is 0\x0a
> [FIX] (__chk_dentries:1438)  --> Unlink [0xfa2d] - _++vN7axccjjGNI,TQEVmcOemY1EAAAA len[0x14], type[0x1]\x0a
> ...
> NID[0xba5] is unreachable\x0a
> NID[0xfa9] is unreachable\x0a
> ...
> [FIX] (nullify_nat_entry:2089)  --> Remove nid [0xba5] in NAT\x0a
> [FIX] (nullify_nat_entry:2089)  --> Remove nid [0xfa9] in NAT\x0a
> ...
> 
> Fixes: 979b25727800 ("fsck.f2fs: check validity of nat journal")
> Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
> ---
>  fsck/mount.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/fsck/mount.c b/fsck/mount.c
> index 60e0e4a..1b93f49 100644
> --- a/fsck/mount.c
> +++ b/fsck/mount.c
> @@ -1160,7 +1160,6 @@ static int f2fs_init_nid_bitmap(struct f2fs_sb_info *sbi)
>  		addr = le32_to_cpu(nat_in_journal(journal, i).block_addr);
>  		if (!IS_VALID_BLK_ADDR(sbi, addr)) {
>  			MSG(0, "\tError: f2fs_init_nid_bitmap: addr(%u) is invalid!!!\n", addr);
> -			journal->n_nats = cpu_to_le16(i);
>  			c.fix_on = 1;
>  			continue;
>  		}
> @@ -1168,7 +1167,6 @@ static int f2fs_init_nid_bitmap(struct f2fs_sb_info *sbi)
>  		nid = le32_to_cpu(nid_in_journal(journal, i));
>  		if (!IS_VALID_NID(sbi, nid)) {
>  			MSG(0, "\tError: f2fs_init_nid_bitmap: nid(%u) is invalid!!!\n", nid);
> -			journal->n_nats = cpu_to_le16(i);
>  			c.fix_on = 1;
>  			continue;
>  		}
> 


_______________________________________________
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [f2fs-dev] [PATCH] fsck: Fix data loss issue observed during SPO
  2019-07-05 10:11 ` Chao Yu
@ 2019-07-05 11:03   ` Sahitya Tummala
  2019-07-05 13:48     ` Sahitya Tummala
  2019-07-06  1:12     ` Chao Yu
  0 siblings, 2 replies; 14+ messages in thread
From: Sahitya Tummala @ 2019-07-05 11:03 UTC (permalink / raw)
  To: Chao Yu; +Cc: Jaegeuk Kim, linux-f2fs-devel

Hi Chao,

On Fri, Jul 05, 2019 at 06:11:47PM +0800, Chao Yu wrote:
> Hi Sahitya,
> 
> On 2019/7/5 13:01, Sahitya Tummala wrote:
> > With the commit <979b25727800> ("fsck.f2fs: check validity of nat journal"),
> > there is a serious data loss issue observed during sudden power off tests.
> 
> In that patch, since journal's data is untrusty, I just truncated n_nats to last
> valid one to fix original reported issue.

The original issue will be fixed even without updating/truncating the n_nats
as we still have sanity checks on nid and blkaddr, right?

> 
> Actually, the problem here is we don't know which data is correct and which data
> is corrupted in nat or sit journal, so even we skip broken entry without
> modifying journal->n_{nats,sits}, latter entry may still contain broken data(but
> it may pass validation check), and those entries may cause potential data
> loss/corruption...
> 

This issue was reported by a customer in the following test scenario -

Update the password in Android and then force crash. It is verified that
all the files related to password data has fsync() already.

After some iterations (3-4) of this test, it is found password related files are
lost, failing the phone to bootup.

Without this patch ("fsck.f2fs: check validity of nat journal"), they don't see
such issue. It means that the data present in journal seems to be valid atleast
in this case and truncating is causing to loose the valid data/files.

> It looks no matter how we change the codes, we still may face data
> loos/corruption issues.
> 

Yes, maybe in some cases, the data loss is still expected, but there are
chances that data may be valid as well, which is confirmed by the test case above.

> IMO, we need to find out the root cause of data corruption, So my question is
> that why sudden power off tests can make nat journal being corrupted? I doubt
> that in-place update method on checkpoint area of fsck may cause this... any
> other thoughts?
> 

In case of sudden power off, I have printed the all the journal entries in the
NAT and I several entries with valid blkaddr and nid as below -

[   21.007649] fsck.f2fs: withfix: 0: addr 1c373 nid 45b\x0a
[   21.014399] fsck.f2fs: withfix: 1: addr 1c2c1 nid 1b5\x0a
[   21.021151] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 2/37 is invalid!!!\x0a
[   21.030860] fsck.f2fs: withfix: 3: addr 1c1f7 nid 19b2\x0a
[   21.037674] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 4/37 is invalid!!!\x0a
[   21.047353] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 5/37 is invalid!!!\x0a
[   21.057062] fsck.f2fs: withfix: 6: addr 1c377 nid ac3\x0a
[   21.063841] fsck.f2fs: withfix: 7: addr 1c378 nid ac9\x0a
[   21.070634] fsck.f2fs: withfix: 8: addr 1b2fe nid 807\x0a
[   21.077356] fsck.f2fs: withfix: 9: addr 17aeb nid 805\x0a
[   21.084102] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 10/37 is invalid!!!\x0a
[   21.093891] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 11/37 is invalid!!!\x0a
[   21.103662] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 12/37 is invalid!!!\x0a
[   21.113450] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 13/37 is invalid!!!\x0a
[   21.123227] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 14/37 is invalid!!!\x0a
[   21.133022] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 15/37 is invalid!!!\x0a
[   21.142792] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 16/37 is invalid!!!\x0a
[   21.152580] fsck.f2fs: withfix: 17: addr 1c37f nid ff4\x0a
[   21.159415] fsck.f2fs: withfix: 18: addr 1c36c nid ff9\x0a
[   21.166229] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 19/37 is invalid!!!\x0a
[   21.176039] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 20/37 is invalid!!!\x0a
[   21.185831] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 21/37 is invalid!!!\x0a
[   21.195631] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 22/37 is invalid!!!\x0a
[   21.205394] fsck.f2fs: withfix: 23: addr 1c2f3 nid 61c\x0a
[   21.212257] fsck.f2fs: withfix: 24: addr 1c391 nid 641\x0a
[   21.219093] fsck.f2fs: withfix: 25: addr 1c386 nid 33b\x0a
[   21.225905] fsck.f2fs: withfix: 26: addr 1c356 nid 258\x0a
[   21.232767] fsck.f2fs: withfix: 27: addr 1c2f2 nid 382\x0a
[   21.239601] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 28/37 is invalid!!!\x0a
[   21.249398] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 29/37 is invalid!!!\x0a
[   21.259160] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 30/37 is invalid!!!\x0a
[   21.268970] fsck.f2fs: withfix: 31: addr 1ac03 nid 1ccd\x0a
[   21.275874] fsck.f2fs: withfix: 32: addr 1c395 nid a57\x0a
[   21.282709] fsck.f2fs: withfix: 33: addr 1c392 nid 92e\x0a
[   21.289564] fsck.f2fs: withfix: 34: addr 1c38b nid aa1\x0a
[   21.296377] fsck.f2fs: withfix: 35: addr 1c393 nid 931\x0a
[   21.303240] fsck.f2fs: withfix: 36: addr 1c394 nid a35\x0a

Thanks,

> Thanks,
> 
> > 
> > The reason is due to incorrect update of journal->n_nats in that patch.
> > When fsck encounters a nat journal entry with invalid blkaddr/nid, it
> > updates journal->n_nats with the index of such first entry and then continues
> > the for loop which scans all nats in journal in f2fs_init_nid_bitmap().
> > But that loop doesn't continue and doesn't update nm_i->nid_bitmap as well,
> > as it fails the for loop condition i < nats_in_cursum(journal).
> > 
> > There could be several other valid entries in the journal after this
> > invalid entry. Due to this invalid journal->n_nats, build_nat_area_bitmap()
> > could not be build properly for valid nat journal entries. It further results
> > into these fsck asserts/errors and causes data loss due to incorrect fsck fix.
> > 
> > [ASSERT] (sanity_check_nid: 372)  --> nid[0xfb01] ino is 0\x0a
> > [FIX] (__chk_dentries:1438)  --> Unlink [0xfb01] - _JeNcl242yB3Apz2MW,VOh6WmjsVht1W len[0x1c], type[0x1]\x0a
> > [ASSERT] (sanity_check_nid: 372)  --> nid[0xfa2d] ino is 0\x0a
> > [FIX] (__chk_dentries:1438)  --> Unlink [0xfa2d] - _++vN7axccjjGNI,TQEVmcOemY1EAAAA len[0x14], type[0x1]\x0a
> > ...
> > NID[0xba5] is unreachable\x0a
> > NID[0xfa9] is unreachable\x0a
> > ...
> > [FIX] (nullify_nat_entry:2089)  --> Remove nid [0xba5] in NAT\x0a
> > [FIX] (nullify_nat_entry:2089)  --> Remove nid [0xfa9] in NAT\x0a
> > ...
> > 
> > Fixes: 979b25727800 ("fsck.f2fs: check validity of nat journal")
> > Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
> > ---
> >  fsck/mount.c | 2 --
> >  1 file changed, 2 deletions(-)
> > 
> > diff --git a/fsck/mount.c b/fsck/mount.c
> > index 60e0e4a..1b93f49 100644
> > --- a/fsck/mount.c
> > +++ b/fsck/mount.c
> > @@ -1160,7 +1160,6 @@ static int f2fs_init_nid_bitmap(struct f2fs_sb_info *sbi)
> >  		addr = le32_to_cpu(nat_in_journal(journal, i).block_addr);
> >  		if (!IS_VALID_BLK_ADDR(sbi, addr)) {
> >  			MSG(0, "\tError: f2fs_init_nid_bitmap: addr(%u) is invalid!!!\n", addr);
> > -			journal->n_nats = cpu_to_le16(i);
> >  			c.fix_on = 1;
> >  			continue;
> >  		}
> > @@ -1168,7 +1167,6 @@ static int f2fs_init_nid_bitmap(struct f2fs_sb_info *sbi)
> >  		nid = le32_to_cpu(nid_in_journal(journal, i));
> >  		if (!IS_VALID_NID(sbi, nid)) {
> >  			MSG(0, "\tError: f2fs_init_nid_bitmap: nid(%u) is invalid!!!\n", nid);
> > -			journal->n_nats = cpu_to_le16(i);
> >  			c.fix_on = 1;
> >  			continue;
> >  		}
> > 

-- 
--
Sent by a consultant of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.


_______________________________________________
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [f2fs-dev] [PATCH] fsck: Fix data loss issue observed during SPO
  2019-07-05 11:03   ` Sahitya Tummala
@ 2019-07-05 13:48     ` Sahitya Tummala
  2019-07-06  1:14       ` Chao Yu
  2019-07-06  1:12     ` Chao Yu
  1 sibling, 1 reply; 14+ messages in thread
From: Sahitya Tummala @ 2019-07-05 13:48 UTC (permalink / raw)
  To: Chao Yu; +Cc: Jaegeuk Kim, linux-f2fs-devel

Hi Chao,

The journal doesn't look to be corrupted, but only block_addr
in f2fs_nat_entry is 0 whereas nid and version seems to be valid.
These could be just truncated nodes, right?

Here is the full dump of nat journal entries after a SPO (this is not a 
issue case, but just a normal case with SPO) -

[   21.121325] fsck.f2fs: 0: addr 0 nid 5c5a ver 1\x0a
[   21.127509] fsck.f2fs: 1: addr 0 nid 4c1d ver 1\x0a
[   21.133760] fsck.f2fs: 2: addr 12075 nid 444f ver 0\x0a
[   21.140325] fsck.f2fs: 3: addr 0 nid 4307 ver 0\x0a
[   21.146510] fsck.f2fs: 4: addr 0 nid 4059 ver 1\x0a
[   21.152745] fsck.f2fs: 5: addr 0 nid 3e69 ver 1\x0a
[   21.158944] fsck.f2fs: 6: addr 0 nid 3cf9 ver 1\x0a
[   21.165128] fsck.f2fs: 7: addr 0 nid 3b7b ver 1\x0a
[   21.171362] fsck.f2fs: 8: addr 18f65 nid 34e2 ver 0\x0a
[   21.177903] fsck.f2fs: 9: addr 12126 nid 2c87 ver 0\x0a
[   21.184466] fsck.f2fs: 10: addr 0 nid 2a0a ver 1\x0a
[   21.190795] fsck.f2fs: 11: addr 0 nid 2565 ver 1\x0a
[   21.197070] fsck.f2fs: 12: addr 0 nid 1f48 ver 0\x0a
[   21.203369] fsck.f2fs: 13: addr 155bf nid 1d76 ver 0\x0a
[   21.210042] fsck.f2fs: 14: addr 19538 nid 18da ver 0\x0a
[   21.216675] fsck.f2fs: 15: addr 0 nid 157f ver 0\x0a
[   21.223037] fsck.f2fs: 16: addr d117 nid ac3 ver 0\x0a
[   21.229547] fsck.f2fs: 17: addr 11e48 nid 805 ver 0\x0a
[   21.236088] fsck.f2fs: 18: addr d112 nid 61c ver 0\x0a
[   21.242585] fsck.f2fs: 19: addr d00b nid 45b ver 0\x0a
[   21.249086] fsck.f2fs: 20: addr d111 nid 30d ver 0\x0a
[   21.255544] fsck.f2fs: 21: addr d110 nid 1b5 ver 0\x0a
[   21.262025] fsck.f2fs: 22: addr 0 nid 5e41 ver 1\x0a
[   21.268342] fsck.f2fs: 23: addr 0 nid 5e46 ver 0\x0a
[   21.274618] fsck.f2fs: 24: addr 0 nid 5a34 ver 1\x0a
[   21.280916] fsck.f2fs: 25: addr 0 nid 59fa ver 1\x0a
[   21.287192] fsck.f2fs: 26: addr 0 nid 57c3 ver 1\x0a
[   21.293526] fsck.f2fs: 27: addr 0 nid 5776 ver 1\x0a
[   21.299816] fsck.f2fs: 28: addr 0 nid 4f20 ver 1\x0a
[   21.306091] fsck.f2fs: 29: addr 0 nid 4ff7 ver 0\x0a
[   21.312416] fsck.f2fs: 30: addr 0 nid 47e1 ver 1\x0a
[   21.318715] fsck.f2fs: 31: addr 0 nid 4880 ver 1\x0a
[   21.324990] fsck.f2fs: 32: addr d118 nid ff4 ver 0\x0a
[   21.331498] fsck.f2fs: 33: addr d10c nid ff9 ver 0\x0a
[   21.337957] fsck.f2fs: 34: addr 0 nid 49ed ver 1\x0a
[   21.344290] fsck.f2fs: 35: addr 0 nid 497e ver 0\x0a
[   21.350612] fsck.f2fs: 36: addr fc5a nid 49d2 ver 0\x0a

Thanks,

On Fri, Jul 05, 2019 at 04:33:57PM +0530, Sahitya Tummala wrote:
> Hi Chao,
> 
> On Fri, Jul 05, 2019 at 06:11:47PM +0800, Chao Yu wrote:
> > Hi Sahitya,
> > 
> > On 2019/7/5 13:01, Sahitya Tummala wrote:
> > > With the commit <979b25727800> ("fsck.f2fs: check validity of nat journal"),
> > > there is a serious data loss issue observed during sudden power off tests.
> > 
> > In that patch, since journal's data is untrusty, I just truncated n_nats to last
> > valid one to fix original reported issue.
> 
> The original issue will be fixed even without updating/truncating the n_nats
> as we still have sanity checks on nid and blkaddr, right?
> 
> > 
> > Actually, the problem here is we don't know which data is correct and which data
> > is corrupted in nat or sit journal, so even we skip broken entry without
> > modifying journal->n_{nats,sits}, latter entry may still contain broken data(but
> > it may pass validation check), and those entries may cause potential data
> > loss/corruption...
> > 
> 
> This issue was reported by a customer in the following test scenario -
> 
> Update the password in Android and then force crash. It is verified that
> all the files related to password data has fsync() already.
> 
> After some iterations (3-4) of this test, it is found password related files are
> lost, failing the phone to bootup.
> 
> Without this patch ("fsck.f2fs: check validity of nat journal"), they don't see
> such issue. It means that the data present in journal seems to be valid atleast
> in this case and truncating is causing to loose the valid data/files.
> 
> > It looks no matter how we change the codes, we still may face data
> > loos/corruption issues.
> > 
> 
> Yes, maybe in some cases, the data loss is still expected, but there are
> chances that data may be valid as well, which is confirmed by the test case above.
> 
> > IMO, we need to find out the root cause of data corruption, So my question is
> > that why sudden power off tests can make nat journal being corrupted? I doubt
> > that in-place update method on checkpoint area of fsck may cause this... any
> > other thoughts?
> > 
> 
> In case of sudden power off, I have printed the all the journal entries in the
> NAT and I several entries with valid blkaddr and nid as below -
> 
> [   21.007649] fsck.f2fs: withfix: 0: addr 1c373 nid 45b\x0a
> [   21.014399] fsck.f2fs: withfix: 1: addr 1c2c1 nid 1b5\x0a
> [   21.021151] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 2/37 is invalid!!!\x0a
> [   21.030860] fsck.f2fs: withfix: 3: addr 1c1f7 nid 19b2\x0a
> [   21.037674] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 4/37 is invalid!!!\x0a
> [   21.047353] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 5/37 is invalid!!!\x0a
> [   21.057062] fsck.f2fs: withfix: 6: addr 1c377 nid ac3\x0a
> [   21.063841] fsck.f2fs: withfix: 7: addr 1c378 nid ac9\x0a
> [   21.070634] fsck.f2fs: withfix: 8: addr 1b2fe nid 807\x0a
> [   21.077356] fsck.f2fs: withfix: 9: addr 17aeb nid 805\x0a
> [   21.084102] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 10/37 is invalid!!!\x0a
> [   21.093891] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 11/37 is invalid!!!\x0a
> [   21.103662] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 12/37 is invalid!!!\x0a
> [   21.113450] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 13/37 is invalid!!!\x0a
> [   21.123227] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 14/37 is invalid!!!\x0a
> [   21.133022] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 15/37 is invalid!!!\x0a
> [   21.142792] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 16/37 is invalid!!!\x0a
> [   21.152580] fsck.f2fs: withfix: 17: addr 1c37f nid ff4\x0a
> [   21.159415] fsck.f2fs: withfix: 18: addr 1c36c nid ff9\x0a
> [   21.166229] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 19/37 is invalid!!!\x0a
> [   21.176039] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 20/37 is invalid!!!\x0a
> [   21.185831] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 21/37 is invalid!!!\x0a
> [   21.195631] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 22/37 is invalid!!!\x0a
> [   21.205394] fsck.f2fs: withfix: 23: addr 1c2f3 nid 61c\x0a
> [   21.212257] fsck.f2fs: withfix: 24: addr 1c391 nid 641\x0a
> [   21.219093] fsck.f2fs: withfix: 25: addr 1c386 nid 33b\x0a
> [   21.225905] fsck.f2fs: withfix: 26: addr 1c356 nid 258\x0a
> [   21.232767] fsck.f2fs: withfix: 27: addr 1c2f2 nid 382\x0a
> [   21.239601] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 28/37 is invalid!!!\x0a
> [   21.249398] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 29/37 is invalid!!!\x0a
> [   21.259160] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 30/37 is invalid!!!\x0a
> [   21.268970] fsck.f2fs: withfix: 31: addr 1ac03 nid 1ccd\x0a
> [   21.275874] fsck.f2fs: withfix: 32: addr 1c395 nid a57\x0a
> [   21.282709] fsck.f2fs: withfix: 33: addr 1c392 nid 92e\x0a
> [   21.289564] fsck.f2fs: withfix: 34: addr 1c38b nid aa1\x0a
> [   21.296377] fsck.f2fs: withfix: 35: addr 1c393 nid 931\x0a
> [   21.303240] fsck.f2fs: withfix: 36: addr 1c394 nid a35\x0a
> 
> Thanks,
> 
> > Thanks,
> > 
> > > 
> > > The reason is due to incorrect update of journal->n_nats in that patch.
> > > When fsck encounters a nat journal entry with invalid blkaddr/nid, it
> > > updates journal->n_nats with the index of such first entry and then continues
> > > the for loop which scans all nats in journal in f2fs_init_nid_bitmap().
> > > But that loop doesn't continue and doesn't update nm_i->nid_bitmap as well,
> > > as it fails the for loop condition i < nats_in_cursum(journal).
> > > 
> > > There could be several other valid entries in the journal after this
> > > invalid entry. Due to this invalid journal->n_nats, build_nat_area_bitmap()
> > > could not be build properly for valid nat journal entries. It further results
> > > into these fsck asserts/errors and causes data loss due to incorrect fsck fix.
> > > 
> > > [ASSERT] (sanity_check_nid: 372)  --> nid[0xfb01] ino is 0\x0a
> > > [FIX] (__chk_dentries:1438)  --> Unlink [0xfb01] - _JeNcl242yB3Apz2MW,VOh6WmjsVht1W len[0x1c], type[0x1]\x0a
> > > [ASSERT] (sanity_check_nid: 372)  --> nid[0xfa2d] ino is 0\x0a
> > > [FIX] (__chk_dentries:1438)  --> Unlink [0xfa2d] - _++vN7axccjjGNI,TQEVmcOemY1EAAAA len[0x14], type[0x1]\x0a
> > > ...
> > > NID[0xba5] is unreachable\x0a
> > > NID[0xfa9] is unreachable\x0a
> > > ...
> > > [FIX] (nullify_nat_entry:2089)  --> Remove nid [0xba5] in NAT\x0a
> > > [FIX] (nullify_nat_entry:2089)  --> Remove nid [0xfa9] in NAT\x0a
> > > ...
> > > 
> > > Fixes: 979b25727800 ("fsck.f2fs: check validity of nat journal")
> > > Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
> > > ---
> > >  fsck/mount.c | 2 --
> > >  1 file changed, 2 deletions(-)
> > > 
> > > diff --git a/fsck/mount.c b/fsck/mount.c
> > > index 60e0e4a..1b93f49 100644
> > > --- a/fsck/mount.c
> > > +++ b/fsck/mount.c
> > > @@ -1160,7 +1160,6 @@ static int f2fs_init_nid_bitmap(struct f2fs_sb_info *sbi)
> > >  		addr = le32_to_cpu(nat_in_journal(journal, i).block_addr);
> > >  		if (!IS_VALID_BLK_ADDR(sbi, addr)) {
> > >  			MSG(0, "\tError: f2fs_init_nid_bitmap: addr(%u) is invalid!!!\n", addr);
> > > -			journal->n_nats = cpu_to_le16(i);
> > >  			c.fix_on = 1;
> > >  			continue;
> > >  		}
> > > @@ -1168,7 +1167,6 @@ static int f2fs_init_nid_bitmap(struct f2fs_sb_info *sbi)
> > >  		nid = le32_to_cpu(nid_in_journal(journal, i));
> > >  		if (!IS_VALID_NID(sbi, nid)) {
> > >  			MSG(0, "\tError: f2fs_init_nid_bitmap: nid(%u) is invalid!!!\n", nid);
> > > -			journal->n_nats = cpu_to_le16(i);
> > >  			c.fix_on = 1;
> > >  			continue;
> > >  		}
> > > 
> 
> -- 
> --
> Sent by a consultant of the Qualcomm Innovation Center, Inc.
> The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.
> 
> 
> _______________________________________________
> Linux-f2fs-devel mailing list
> Linux-f2fs-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

-- 
--
Sent by a consultant of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.


_______________________________________________
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [f2fs-dev] [PATCH] fsck: Fix data loss issue observed during SPO
  2019-07-05 11:03   ` Sahitya Tummala
  2019-07-05 13:48     ` Sahitya Tummala
@ 2019-07-06  1:12     ` Chao Yu
  1 sibling, 0 replies; 14+ messages in thread
From: Chao Yu @ 2019-07-06  1:12 UTC (permalink / raw)
  To: Sahitya Tummala, Chao Yu; +Cc: Jaegeuk Kim, linux-f2fs-devel

Hi Sahitya,

On 2019-7-5 19:03, Sahitya Tummala wrote:
> Hi Chao,
> 
> On Fri, Jul 05, 2019 at 06:11:47PM +0800, Chao Yu wrote:
>> Hi Sahitya,
>>
>> On 2019/7/5 13:01, Sahitya Tummala wrote:
>>> With the commit <979b25727800> ("fsck.f2fs: check validity of nat journal"),
>>> there is a serious data loss issue observed during sudden power off tests.
>>
>> In that patch, since journal's data is untrusty, I just truncated n_nats to last
>> valid one to fix original reported issue.
> 
> The original issue will be fixed even without updating/truncating the n_nats
> as we still have sanity checks on nid and blkaddr, right?

No, actually that's v2 patch did in below link, but as Aravind reported: fsck
will still crash in latter build_nat_area_bitmap() where we access entries[nid]
out-of-range.

https://lore.kernel.org/patchwork/patch/1031172/

Quoted:

Program received signal SIGSEGV, Segmentation fault.
0x00005555555673de in build_nat_area_bitmap (sbi=<optimized out>) at
mount.c:2469
2469            if (fsck->entries[nid].block_addr != 0x0) {
(gdb) backtrace
#0  0x00005555555673de in build_nat_area_bitmap (sbi=<optimized out>)
    at mount.c:2469
#1  0x000055555555d85c in fsck_init (sbi=0x55555557db40 <gfsck>) at fsck.c:1904
#2  0x000055555555682a in do_fsck (sbi=0x55555557db40 <gfsck>) at main.c:578
#3  main (argc=<optimized out>, argv=<optimized out>) at main.c:781

Maybe we need just remove invalid entry directly to avoid this rather than just
skipping it.

> 
>>
>> Actually, the problem here is we don't know which data is correct and which data
>> is corrupted in nat or sit journal, so even we skip broken entry without
>> modifying journal->n_{nats,sits}, latter entry may still contain broken data(but
>> it may pass validation check), and those entries may cause potential data
>> loss/corruption...
>>
> 
> This issue was reported by a customer in the following test scenario -
> 
> Update the password in Android and then force crash. It is verified that
> all the files related to password data has fsync() already.
> 
> After some iterations (3-4) of this test, it is found password related files are
> lost, failing the phone to bootup.
> 
> Without this patch ("fsck.f2fs: check validity of nat journal"), they don't see
> such issue. It means that the data present in journal seems to be valid atleast
> in this case and truncating is causing to loose the valid data/files.
> 
>> It looks no matter how we change the codes, we still may face data
>> loos/corruption issues.
>>
> 
> Yes, maybe in some cases, the data loss is still expected, but there are
> chances that data may be valid as well, which is confirmed by the test case above.

Yes, I'm not against this, but still that was not the root cause...

> 
>> IMO, we need to find out the root cause of data corruption, So my question is
>> that why sudden power off tests can make nat journal being corrupted? I doubt
>> that in-place update method on checkpoint area of fsck may cause this... any
>> other thoughts?
>>
> 
> In case of sudden power off, I have printed the all the journal entries in the
> NAT and I several entries with valid blkaddr and nid as below -
> 
> [   21.007649] fsck.f2fs: withfix: 0: addr 1c373 nid 45b\x0a
> [   21.014399] fsck.f2fs: withfix: 1: addr 1c2c1 nid 1b5\x0a
> [   21.021151] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 2/37 is invalid!!!\x0a

addr(0) can pass the below condtion, why we will print above error message and
do the truncation...

if (!IS_VALID_BLK_ADDR(sbi, addr)) {

Thanks,

> [   21.030860] fsck.f2fs: withfix: 3: addr 1c1f7 nid 19b2\x0a
> [   21.037674] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 4/37 is invalid!!!\x0a
> [   21.047353] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 5/37 is invalid!!!\x0a
> [   21.057062] fsck.f2fs: withfix: 6: addr 1c377 nid ac3\x0a
> [   21.063841] fsck.f2fs: withfix: 7: addr 1c378 nid ac9\x0a
> [   21.070634] fsck.f2fs: withfix: 8: addr 1b2fe nid 807\x0a
> [   21.077356] fsck.f2fs: withfix: 9: addr 17aeb nid 805\x0a
> [   21.084102] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 10/37 is invalid!!!\x0a
> [   21.093891] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 11/37 is invalid!!!\x0a
> [   21.103662] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 12/37 is invalid!!!\x0a
> [   21.113450] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 13/37 is invalid!!!\x0a
> [   21.123227] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 14/37 is invalid!!!\x0a
> [   21.133022] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 15/37 is invalid!!!\x0a
> [   21.142792] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 16/37 is invalid!!!\x0a
> [   21.152580] fsck.f2fs: withfix: 17: addr 1c37f nid ff4\x0a
> [   21.159415] fsck.f2fs: withfix: 18: addr 1c36c nid ff9\x0a
> [   21.166229] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 19/37 is invalid!!!\x0a
> [   21.176039] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 20/37 is invalid!!!\x0a
> [   21.185831] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 21/37 is invalid!!!\x0a
> [   21.195631] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 22/37 is invalid!!!\x0a
> [   21.205394] fsck.f2fs: withfix: 23: addr 1c2f3 nid 61c\x0a
> [   21.212257] fsck.f2fs: withfix: 24: addr 1c391 nid 641\x0a
> [   21.219093] fsck.f2fs: withfix: 25: addr 1c386 nid 33b\x0a
> [   21.225905] fsck.f2fs: withfix: 26: addr 1c356 nid 258\x0a
> [   21.232767] fsck.f2fs: withfix: 27: addr 1c2f2 nid 382\x0a
> [   21.239601] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 28/37 is invalid!!!\x0a
> [   21.249398] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 29/37 is invalid!!!\x0a
> [   21.259160] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 30/37 is invalid!!!\x0a
> [   21.268970] fsck.f2fs: withfix: 31: addr 1ac03 nid 1ccd\x0a
> [   21.275874] fsck.f2fs: withfix: 32: addr 1c395 nid a57\x0a
> [   21.282709] fsck.f2fs: withfix: 33: addr 1c392 nid 92e\x0a
> [   21.289564] fsck.f2fs: withfix: 34: addr 1c38b nid aa1\x0a
> [   21.296377] fsck.f2fs: withfix: 35: addr 1c393 nid 931\x0a
> [   21.303240] fsck.f2fs: withfix: 36: addr 1c394 nid a35\x0a
> 
> Thanks,
> 
>> Thanks,
>>
>>>
>>> The reason is due to incorrect update of journal->n_nats in that patch.
>>> When fsck encounters a nat journal entry with invalid blkaddr/nid, it
>>> updates journal->n_nats with the index of such first entry and then continues
>>> the for loop which scans all nats in journal in f2fs_init_nid_bitmap().
>>> But that loop doesn't continue and doesn't update nm_i->nid_bitmap as well,
>>> as it fails the for loop condition i < nats_in_cursum(journal).
>>>
>>> There could be several other valid entries in the journal after this
>>> invalid entry. Due to this invalid journal->n_nats, build_nat_area_bitmap()
>>> could not be build properly for valid nat journal entries. It further results
>>> into these fsck asserts/errors and causes data loss due to incorrect fsck fix.
>>>
>>> [ASSERT] (sanity_check_nid: 372)  --> nid[0xfb01] ino is 0\x0a
>>> [FIX] (__chk_dentries:1438)  --> Unlink [0xfb01] - _JeNcl242yB3Apz2MW,VOh6WmjsVht1W len[0x1c], type[0x1]\x0a
>>> [ASSERT] (sanity_check_nid: 372)  --> nid[0xfa2d] ino is 0\x0a
>>> [FIX] (__chk_dentries:1438)  --> Unlink [0xfa2d] - _++vN7axccjjGNI,TQEVmcOemY1EAAAA len[0x14], type[0x1]\x0a
>>> ...
>>> NID[0xba5] is unreachable\x0a
>>> NID[0xfa9] is unreachable\x0a
>>> ...
>>> [FIX] (nullify_nat_entry:2089)  --> Remove nid [0xba5] in NAT\x0a
>>> [FIX] (nullify_nat_entry:2089)  --> Remove nid [0xfa9] in NAT\x0a
>>> ...
>>>
>>> Fixes: 979b25727800 ("fsck.f2fs: check validity of nat journal")
>>> Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
>>> ---
>>>  fsck/mount.c | 2 --
>>>  1 file changed, 2 deletions(-)
>>>
>>> diff --git a/fsck/mount.c b/fsck/mount.c
>>> index 60e0e4a..1b93f49 100644
>>> --- a/fsck/mount.c
>>> +++ b/fsck/mount.c
>>> @@ -1160,7 +1160,6 @@ static int f2fs_init_nid_bitmap(struct f2fs_sb_info *sbi)
>>>  		addr = le32_to_cpu(nat_in_journal(journal, i).block_addr);
>>>  		if (!IS_VALID_BLK_ADDR(sbi, addr)) {
>>>  			MSG(0, "\tError: f2fs_init_nid_bitmap: addr(%u) is invalid!!!\n", addr);
>>> -			journal->n_nats = cpu_to_le16(i);
>>>  			c.fix_on = 1;
>>>  			continue;
>>>  		}
>>> @@ -1168,7 +1167,6 @@ static int f2fs_init_nid_bitmap(struct f2fs_sb_info *sbi)
>>>  		nid = le32_to_cpu(nid_in_journal(journal, i));
>>>  		if (!IS_VALID_NID(sbi, nid)) {
>>>  			MSG(0, "\tError: f2fs_init_nid_bitmap: nid(%u) is invalid!!!\n", nid);
>>> -			journal->n_nats = cpu_to_le16(i);
>>>  			c.fix_on = 1;
>>>  			continue;
>>>  		}
>>>
> 


_______________________________________________
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [f2fs-dev] [PATCH] fsck: Fix data loss issue observed during SPO
  2019-07-05 13:48     ` Sahitya Tummala
@ 2019-07-06  1:14       ` Chao Yu
  2019-07-06  1:31         ` Sahitya Tummala
  0 siblings, 1 reply; 14+ messages in thread
From: Chao Yu @ 2019-07-06  1:14 UTC (permalink / raw)
  To: Sahitya Tummala, Chao Yu; +Cc: Jaegeuk Kim, linux-f2fs-devel

Hi Sahitya,

On 2019-7-5 21:48, Sahitya Tummala wrote:
> Hi Chao,
> 
> The journal doesn't look to be corrupted, but only block_addr
> in f2fs_nat_entry is 0 whereas nid and version seems to be valid.
> These could be just truncated nodes, right?

Yes, but that should not fail the condition of journal check and result in
journal size truncation, right?

Thanks,

> 
> Here is the full dump of nat journal entries after a SPO (this is not a 
> issue case, but just a normal case with SPO) -
> 
> [   21.121325] fsck.f2fs: 0: addr 0 nid 5c5a ver 1\x0a
> [   21.127509] fsck.f2fs: 1: addr 0 nid 4c1d ver 1\x0a
> [   21.133760] fsck.f2fs: 2: addr 12075 nid 444f ver 0\x0a
> [   21.140325] fsck.f2fs: 3: addr 0 nid 4307 ver 0\x0a
> [   21.146510] fsck.f2fs: 4: addr 0 nid 4059 ver 1\x0a
> [   21.152745] fsck.f2fs: 5: addr 0 nid 3e69 ver 1\x0a
> [   21.158944] fsck.f2fs: 6: addr 0 nid 3cf9 ver 1\x0a
> [   21.165128] fsck.f2fs: 7: addr 0 nid 3b7b ver 1\x0a
> [   21.171362] fsck.f2fs: 8: addr 18f65 nid 34e2 ver 0\x0a
> [   21.177903] fsck.f2fs: 9: addr 12126 nid 2c87 ver 0\x0a
> [   21.184466] fsck.f2fs: 10: addr 0 nid 2a0a ver 1\x0a
> [   21.190795] fsck.f2fs: 11: addr 0 nid 2565 ver 1\x0a
> [   21.197070] fsck.f2fs: 12: addr 0 nid 1f48 ver 0\x0a
> [   21.203369] fsck.f2fs: 13: addr 155bf nid 1d76 ver 0\x0a
> [   21.210042] fsck.f2fs: 14: addr 19538 nid 18da ver 0\x0a
> [   21.216675] fsck.f2fs: 15: addr 0 nid 157f ver 0\x0a
> [   21.223037] fsck.f2fs: 16: addr d117 nid ac3 ver 0\x0a
> [   21.229547] fsck.f2fs: 17: addr 11e48 nid 805 ver 0\x0a
> [   21.236088] fsck.f2fs: 18: addr d112 nid 61c ver 0\x0a
> [   21.242585] fsck.f2fs: 19: addr d00b nid 45b ver 0\x0a
> [   21.249086] fsck.f2fs: 20: addr d111 nid 30d ver 0\x0a
> [   21.255544] fsck.f2fs: 21: addr d110 nid 1b5 ver 0\x0a
> [   21.262025] fsck.f2fs: 22: addr 0 nid 5e41 ver 1\x0a
> [   21.268342] fsck.f2fs: 23: addr 0 nid 5e46 ver 0\x0a
> [   21.274618] fsck.f2fs: 24: addr 0 nid 5a34 ver 1\x0a
> [   21.280916] fsck.f2fs: 25: addr 0 nid 59fa ver 1\x0a
> [   21.287192] fsck.f2fs: 26: addr 0 nid 57c3 ver 1\x0a
> [   21.293526] fsck.f2fs: 27: addr 0 nid 5776 ver 1\x0a
> [   21.299816] fsck.f2fs: 28: addr 0 nid 4f20 ver 1\x0a
> [   21.306091] fsck.f2fs: 29: addr 0 nid 4ff7 ver 0\x0a
> [   21.312416] fsck.f2fs: 30: addr 0 nid 47e1 ver 1\x0a
> [   21.318715] fsck.f2fs: 31: addr 0 nid 4880 ver 1\x0a
> [   21.324990] fsck.f2fs: 32: addr d118 nid ff4 ver 0\x0a
> [   21.331498] fsck.f2fs: 33: addr d10c nid ff9 ver 0\x0a
> [   21.337957] fsck.f2fs: 34: addr 0 nid 49ed ver 1\x0a
> [   21.344290] fsck.f2fs: 35: addr 0 nid 497e ver 0\x0a
> [   21.350612] fsck.f2fs: 36: addr fc5a nid 49d2 ver 0\x0a
> 
> Thanks,
> 
> On Fri, Jul 05, 2019 at 04:33:57PM +0530, Sahitya Tummala wrote:
>> Hi Chao,
>>
>> On Fri, Jul 05, 2019 at 06:11:47PM +0800, Chao Yu wrote:
>>> Hi Sahitya,
>>>
>>> On 2019/7/5 13:01, Sahitya Tummala wrote:
>>>> With the commit <979b25727800> ("fsck.f2fs: check validity of nat journal"),
>>>> there is a serious data loss issue observed during sudden power off tests.
>>>
>>> In that patch, since journal's data is untrusty, I just truncated n_nats to last
>>> valid one to fix original reported issue.
>>
>> The original issue will be fixed even without updating/truncating the n_nats
>> as we still have sanity checks on nid and blkaddr, right?
>>
>>>
>>> Actually, the problem here is we don't know which data is correct and which data
>>> is corrupted in nat or sit journal, so even we skip broken entry without
>>> modifying journal->n_{nats,sits}, latter entry may still contain broken data(but
>>> it may pass validation check), and those entries may cause potential data
>>> loss/corruption...
>>>
>>
>> This issue was reported by a customer in the following test scenario -
>>
>> Update the password in Android and then force crash. It is verified that
>> all the files related to password data has fsync() already.
>>
>> After some iterations (3-4) of this test, it is found password related files are
>> lost, failing the phone to bootup.
>>
>> Without this patch ("fsck.f2fs: check validity of nat journal"), they don't see
>> such issue. It means that the data present in journal seems to be valid atleast
>> in this case and truncating is causing to loose the valid data/files.
>>
>>> It looks no matter how we change the codes, we still may face data
>>> loos/corruption issues.
>>>
>>
>> Yes, maybe in some cases, the data loss is still expected, but there are
>> chances that data may be valid as well, which is confirmed by the test case above.
>>
>>> IMO, we need to find out the root cause of data corruption, So my question is
>>> that why sudden power off tests can make nat journal being corrupted? I doubt
>>> that in-place update method on checkpoint area of fsck may cause this... any
>>> other thoughts?
>>>
>>
>> In case of sudden power off, I have printed the all the journal entries in the
>> NAT and I several entries with valid blkaddr and nid as below -
>>
>> [   21.007649] fsck.f2fs: withfix: 0: addr 1c373 nid 45b\x0a
>> [   21.014399] fsck.f2fs: withfix: 1: addr 1c2c1 nid 1b5\x0a
>> [   21.021151] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 2/37 is invalid!!!\x0a
>> [   21.030860] fsck.f2fs: withfix: 3: addr 1c1f7 nid 19b2\x0a
>> [   21.037674] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 4/37 is invalid!!!\x0a
>> [   21.047353] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 5/37 is invalid!!!\x0a
>> [   21.057062] fsck.f2fs: withfix: 6: addr 1c377 nid ac3\x0a
>> [   21.063841] fsck.f2fs: withfix: 7: addr 1c378 nid ac9\x0a
>> [   21.070634] fsck.f2fs: withfix: 8: addr 1b2fe nid 807\x0a
>> [   21.077356] fsck.f2fs: withfix: 9: addr 17aeb nid 805\x0a
>> [   21.084102] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 10/37 is invalid!!!\x0a
>> [   21.093891] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 11/37 is invalid!!!\x0a
>> [   21.103662] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 12/37 is invalid!!!\x0a
>> [   21.113450] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 13/37 is invalid!!!\x0a
>> [   21.123227] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 14/37 is invalid!!!\x0a
>> [   21.133022] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 15/37 is invalid!!!\x0a
>> [   21.142792] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 16/37 is invalid!!!\x0a
>> [   21.152580] fsck.f2fs: withfix: 17: addr 1c37f nid ff4\x0a
>> [   21.159415] fsck.f2fs: withfix: 18: addr 1c36c nid ff9\x0a
>> [   21.166229] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 19/37 is invalid!!!\x0a
>> [   21.176039] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 20/37 is invalid!!!\x0a
>> [   21.185831] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 21/37 is invalid!!!\x0a
>> [   21.195631] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 22/37 is invalid!!!\x0a
>> [   21.205394] fsck.f2fs: withfix: 23: addr 1c2f3 nid 61c\x0a
>> [   21.212257] fsck.f2fs: withfix: 24: addr 1c391 nid 641\x0a
>> [   21.219093] fsck.f2fs: withfix: 25: addr 1c386 nid 33b\x0a
>> [   21.225905] fsck.f2fs: withfix: 26: addr 1c356 nid 258\x0a
>> [   21.232767] fsck.f2fs: withfix: 27: addr 1c2f2 nid 382\x0a
>> [   21.239601] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 28/37 is invalid!!!\x0a
>> [   21.249398] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 29/37 is invalid!!!\x0a
>> [   21.259160] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 30/37 is invalid!!!\x0a
>> [   21.268970] fsck.f2fs: withfix: 31: addr 1ac03 nid 1ccd\x0a
>> [   21.275874] fsck.f2fs: withfix: 32: addr 1c395 nid a57\x0a
>> [   21.282709] fsck.f2fs: withfix: 33: addr 1c392 nid 92e\x0a
>> [   21.289564] fsck.f2fs: withfix: 34: addr 1c38b nid aa1\x0a
>> [   21.296377] fsck.f2fs: withfix: 35: addr 1c393 nid 931\x0a
>> [   21.303240] fsck.f2fs: withfix: 36: addr 1c394 nid a35\x0a
>>
>> Thanks,
>>
>>> Thanks,
>>>
>>>>
>>>> The reason is due to incorrect update of journal->n_nats in that patch.
>>>> When fsck encounters a nat journal entry with invalid blkaddr/nid, it
>>>> updates journal->n_nats with the index of such first entry and then continues
>>>> the for loop which scans all nats in journal in f2fs_init_nid_bitmap().
>>>> But that loop doesn't continue and doesn't update nm_i->nid_bitmap as well,
>>>> as it fails the for loop condition i < nats_in_cursum(journal).
>>>>
>>>> There could be several other valid entries in the journal after this
>>>> invalid entry. Due to this invalid journal->n_nats, build_nat_area_bitmap()
>>>> could not be build properly for valid nat journal entries. It further results
>>>> into these fsck asserts/errors and causes data loss due to incorrect fsck fix.
>>>>
>>>> [ASSERT] (sanity_check_nid: 372)  --> nid[0xfb01] ino is 0\x0a
>>>> [FIX] (__chk_dentries:1438)  --> Unlink [0xfb01] - _JeNcl242yB3Apz2MW,VOh6WmjsVht1W len[0x1c], type[0x1]\x0a
>>>> [ASSERT] (sanity_check_nid: 372)  --> nid[0xfa2d] ino is 0\x0a
>>>> [FIX] (__chk_dentries:1438)  --> Unlink [0xfa2d] - _++vN7axccjjGNI,TQEVmcOemY1EAAAA len[0x14], type[0x1]\x0a
>>>> ...
>>>> NID[0xba5] is unreachable\x0a
>>>> NID[0xfa9] is unreachable\x0a
>>>> ...
>>>> [FIX] (nullify_nat_entry:2089)  --> Remove nid [0xba5] in NAT\x0a
>>>> [FIX] (nullify_nat_entry:2089)  --> Remove nid [0xfa9] in NAT\x0a
>>>> ...
>>>>
>>>> Fixes: 979b25727800 ("fsck.f2fs: check validity of nat journal")
>>>> Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
>>>> ---
>>>>  fsck/mount.c | 2 --
>>>>  1 file changed, 2 deletions(-)
>>>>
>>>> diff --git a/fsck/mount.c b/fsck/mount.c
>>>> index 60e0e4a..1b93f49 100644
>>>> --- a/fsck/mount.c
>>>> +++ b/fsck/mount.c
>>>> @@ -1160,7 +1160,6 @@ static int f2fs_init_nid_bitmap(struct f2fs_sb_info *sbi)
>>>>  		addr = le32_to_cpu(nat_in_journal(journal, i).block_addr);
>>>>  		if (!IS_VALID_BLK_ADDR(sbi, addr)) {
>>>>  			MSG(0, "\tError: f2fs_init_nid_bitmap: addr(%u) is invalid!!!\n", addr);
>>>> -			journal->n_nats = cpu_to_le16(i);
>>>>  			c.fix_on = 1;
>>>>  			continue;
>>>>  		}
>>>> @@ -1168,7 +1167,6 @@ static int f2fs_init_nid_bitmap(struct f2fs_sb_info *sbi)
>>>>  		nid = le32_to_cpu(nid_in_journal(journal, i));
>>>>  		if (!IS_VALID_NID(sbi, nid)) {
>>>>  			MSG(0, "\tError: f2fs_init_nid_bitmap: nid(%u) is invalid!!!\n", nid);
>>>> -			journal->n_nats = cpu_to_le16(i);
>>>>  			c.fix_on = 1;
>>>>  			continue;
>>>>  		}
>>>>
>>
>> -- 
>> --
>> Sent by a consultant of the Qualcomm Innovation Center, Inc.
>> The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.
>>
>>
>> _______________________________________________
>> Linux-f2fs-devel mailing list
>> Linux-f2fs-devel@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
> 


_______________________________________________
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [f2fs-dev] [PATCH] fsck: Fix data loss issue observed during SPO
  2019-07-06  1:14       ` Chao Yu
@ 2019-07-06  1:31         ` Sahitya Tummala
  2019-07-06  1:37           ` Sahitya Tummala
  0 siblings, 1 reply; 14+ messages in thread
From: Sahitya Tummala @ 2019-07-06  1:31 UTC (permalink / raw)
  To: Chao Yu; +Cc: Jaegeuk Kim, linux-f2fs-devel

Hi Chao,

On Sat, Jul 06, 2019 at 09:14:59AM +0800, Chao Yu wrote:
> Hi Sahitya,
> 
> On 2019-7-5 21:48, Sahitya Tummala wrote:
> > Hi Chao,
> > 
> > The journal doesn't look to be corrupted, but only block_addr
> > in f2fs_nat_entry is 0 whereas nid and version seems to be valid.
> > These could be just truncated nodes, right?
> 
> Yes, but that should not fail the condition of journal check and result in
> journal size truncation, right?
> 

Yes, the patch fails for this condition as well and results into
truncation of journal, due to this check  addr < SM_I(sbi)->main_blkaddr
in IS_VALID_BLK_ADDR().

Do you think we can correct it this way?

diff --git a/fsck/f2fs.h b/fsck/f2fs.h
index 417ca0b..00fb740 100644
--- a/fsck/f2fs.h
+++ b/fsck/f2fs.h
@@ -364,8 +364,8 @@ static inline bool IS_VALID_NID(struct f2fs_sb_info *sbi, u32 nid)

 static inline bool IS_VALID_BLK_ADDR(struct f2fs_sb_info *sbi, u32 addr)
 {
-       if (addr >= le64_to_cpu(F2FS_RAW_SUPER(sbi)->block_count) ||
-                               addr < SM_I(sbi)->main_blkaddr) {
+       if (addr && (addr >= le64_to_cpu(F2FS_RAW_SUPER(sbi)->block_count) ||
+                               addr < SM_I(sbi)->main_blkaddr)) {
                DBG(1, "block addr [0x%x]\n", addr);
                return 0;
        }

Thanks,

> Thanks,
> 
> > 
> > Here is the full dump of nat journal entries after a SPO (this is not a 
> > issue case, but just a normal case with SPO) -
> > 
> > [   21.121325] fsck.f2fs: 0: addr 0 nid 5c5a ver 1\x0a
> > [   21.127509] fsck.f2fs: 1: addr 0 nid 4c1d ver 1\x0a
> > [   21.133760] fsck.f2fs: 2: addr 12075 nid 444f ver 0\x0a
> > [   21.140325] fsck.f2fs: 3: addr 0 nid 4307 ver 0\x0a
> > [   21.146510] fsck.f2fs: 4: addr 0 nid 4059 ver 1\x0a
> > [   21.152745] fsck.f2fs: 5: addr 0 nid 3e69 ver 1\x0a
> > [   21.158944] fsck.f2fs: 6: addr 0 nid 3cf9 ver 1\x0a
> > [   21.165128] fsck.f2fs: 7: addr 0 nid 3b7b ver 1\x0a
> > [   21.171362] fsck.f2fs: 8: addr 18f65 nid 34e2 ver 0\x0a
> > [   21.177903] fsck.f2fs: 9: addr 12126 nid 2c87 ver 0\x0a
> > [   21.184466] fsck.f2fs: 10: addr 0 nid 2a0a ver 1\x0a
> > [   21.190795] fsck.f2fs: 11: addr 0 nid 2565 ver 1\x0a
> > [   21.197070] fsck.f2fs: 12: addr 0 nid 1f48 ver 0\x0a
> > [   21.203369] fsck.f2fs: 13: addr 155bf nid 1d76 ver 0\x0a
> > [   21.210042] fsck.f2fs: 14: addr 19538 nid 18da ver 0\x0a
> > [   21.216675] fsck.f2fs: 15: addr 0 nid 157f ver 0\x0a
> > [   21.223037] fsck.f2fs: 16: addr d117 nid ac3 ver 0\x0a
> > [   21.229547] fsck.f2fs: 17: addr 11e48 nid 805 ver 0\x0a
> > [   21.236088] fsck.f2fs: 18: addr d112 nid 61c ver 0\x0a
> > [   21.242585] fsck.f2fs: 19: addr d00b nid 45b ver 0\x0a
> > [   21.249086] fsck.f2fs: 20: addr d111 nid 30d ver 0\x0a
> > [   21.255544] fsck.f2fs: 21: addr d110 nid 1b5 ver 0\x0a
> > [   21.262025] fsck.f2fs: 22: addr 0 nid 5e41 ver 1\x0a
> > [   21.268342] fsck.f2fs: 23: addr 0 nid 5e46 ver 0\x0a
> > [   21.274618] fsck.f2fs: 24: addr 0 nid 5a34 ver 1\x0a
> > [   21.280916] fsck.f2fs: 25: addr 0 nid 59fa ver 1\x0a
> > [   21.287192] fsck.f2fs: 26: addr 0 nid 57c3 ver 1\x0a
> > [   21.293526] fsck.f2fs: 27: addr 0 nid 5776 ver 1\x0a
> > [   21.299816] fsck.f2fs: 28: addr 0 nid 4f20 ver 1\x0a
> > [   21.306091] fsck.f2fs: 29: addr 0 nid 4ff7 ver 0\x0a
> > [   21.312416] fsck.f2fs: 30: addr 0 nid 47e1 ver 1\x0a
> > [   21.318715] fsck.f2fs: 31: addr 0 nid 4880 ver 1\x0a
> > [   21.324990] fsck.f2fs: 32: addr d118 nid ff4 ver 0\x0a
> > [   21.331498] fsck.f2fs: 33: addr d10c nid ff9 ver 0\x0a
> > [   21.337957] fsck.f2fs: 34: addr 0 nid 49ed ver 1\x0a
> > [   21.344290] fsck.f2fs: 35: addr 0 nid 497e ver 0\x0a
> > [   21.350612] fsck.f2fs: 36: addr fc5a nid 49d2 ver 0\x0a
> > 
> > Thanks,
> > 
> > On Fri, Jul 05, 2019 at 04:33:57PM +0530, Sahitya Tummala wrote:
> >> Hi Chao,
> >>
> >> On Fri, Jul 05, 2019 at 06:11:47PM +0800, Chao Yu wrote:
> >>> Hi Sahitya,
> >>>
> >>> On 2019/7/5 13:01, Sahitya Tummala wrote:
> >>>> With the commit <979b25727800> ("fsck.f2fs: check validity of nat journal"),
> >>>> there is a serious data loss issue observed during sudden power off tests.
> >>>
> >>> In that patch, since journal's data is untrusty, I just truncated n_nats to last
> >>> valid one to fix original reported issue.
> >>
> >> The original issue will be fixed even without updating/truncating the n_nats
> >> as we still have sanity checks on nid and blkaddr, right?
> >>
> >>>
> >>> Actually, the problem here is we don't know which data is correct and which data
> >>> is corrupted in nat or sit journal, so even we skip broken entry without
> >>> modifying journal->n_{nats,sits}, latter entry may still contain broken data(but
> >>> it may pass validation check), and those entries may cause potential data
> >>> loss/corruption...
> >>>
> >>
> >> This issue was reported by a customer in the following test scenario -
> >>
> >> Update the password in Android and then force crash. It is verified that
> >> all the files related to password data has fsync() already.
> >>
> >> After some iterations (3-4) of this test, it is found password related files are
> >> lost, failing the phone to bootup.
> >>
> >> Without this patch ("fsck.f2fs: check validity of nat journal"), they don't see
> >> such issue. It means that the data present in journal seems to be valid atleast
> >> in this case and truncating is causing to loose the valid data/files.
> >>
> >>> It looks no matter how we change the codes, we still may face data
> >>> loos/corruption issues.
> >>>
> >>
> >> Yes, maybe in some cases, the data loss is still expected, but there are
> >> chances that data may be valid as well, which is confirmed by the test case above.
> >>
> >>> IMO, we need to find out the root cause of data corruption, So my question is
> >>> that why sudden power off tests can make nat journal being corrupted? I doubt
> >>> that in-place update method on checkpoint area of fsck may cause this... any
> >>> other thoughts?
> >>>
> >>
> >> In case of sudden power off, I have printed the all the journal entries in the
> >> NAT and I several entries with valid blkaddr and nid as below -
> >>
> >> [   21.007649] fsck.f2fs: withfix: 0: addr 1c373 nid 45b\x0a
> >> [   21.014399] fsck.f2fs: withfix: 1: addr 1c2c1 nid 1b5\x0a
> >> [   21.021151] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 2/37 is invalid!!!\x0a
> >> [   21.030860] fsck.f2fs: withfix: 3: addr 1c1f7 nid 19b2\x0a
> >> [   21.037674] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 4/37 is invalid!!!\x0a
> >> [   21.047353] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 5/37 is invalid!!!\x0a
> >> [   21.057062] fsck.f2fs: withfix: 6: addr 1c377 nid ac3\x0a
> >> [   21.063841] fsck.f2fs: withfix: 7: addr 1c378 nid ac9\x0a
> >> [   21.070634] fsck.f2fs: withfix: 8: addr 1b2fe nid 807\x0a
> >> [   21.077356] fsck.f2fs: withfix: 9: addr 17aeb nid 805\x0a
> >> [   21.084102] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 10/37 is invalid!!!\x0a
> >> [   21.093891] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 11/37 is invalid!!!\x0a
> >> [   21.103662] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 12/37 is invalid!!!\x0a
> >> [   21.113450] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 13/37 is invalid!!!\x0a
> >> [   21.123227] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 14/37 is invalid!!!\x0a
> >> [   21.133022] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 15/37 is invalid!!!\x0a
> >> [   21.142792] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 16/37 is invalid!!!\x0a
> >> [   21.152580] fsck.f2fs: withfix: 17: addr 1c37f nid ff4\x0a
> >> [   21.159415] fsck.f2fs: withfix: 18: addr 1c36c nid ff9\x0a
> >> [   21.166229] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 19/37 is invalid!!!\x0a
> >> [   21.176039] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 20/37 is invalid!!!\x0a
> >> [   21.185831] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 21/37 is invalid!!!\x0a
> >> [   21.195631] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 22/37 is invalid!!!\x0a
> >> [   21.205394] fsck.f2fs: withfix: 23: addr 1c2f3 nid 61c\x0a
> >> [   21.212257] fsck.f2fs: withfix: 24: addr 1c391 nid 641\x0a
> >> [   21.219093] fsck.f2fs: withfix: 25: addr 1c386 nid 33b\x0a
> >> [   21.225905] fsck.f2fs: withfix: 26: addr 1c356 nid 258\x0a
> >> [   21.232767] fsck.f2fs: withfix: 27: addr 1c2f2 nid 382\x0a
> >> [   21.239601] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 28/37 is invalid!!!\x0a
> >> [   21.249398] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 29/37 is invalid!!!\x0a
> >> [   21.259160] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 30/37 is invalid!!!\x0a
> >> [   21.268970] fsck.f2fs: withfix: 31: addr 1ac03 nid 1ccd\x0a
> >> [   21.275874] fsck.f2fs: withfix: 32: addr 1c395 nid a57\x0a
> >> [   21.282709] fsck.f2fs: withfix: 33: addr 1c392 nid 92e\x0a
> >> [   21.289564] fsck.f2fs: withfix: 34: addr 1c38b nid aa1\x0a
> >> [   21.296377] fsck.f2fs: withfix: 35: addr 1c393 nid 931\x0a
> >> [   21.303240] fsck.f2fs: withfix: 36: addr 1c394 nid a35\x0a
> >>
> >> Thanks,
> >>
> >>> Thanks,
> >>>
> >>>>
> >>>> The reason is due to incorrect update of journal->n_nats in that patch.
> >>>> When fsck encounters a nat journal entry with invalid blkaddr/nid, it
> >>>> updates journal->n_nats with the index of such first entry and then continues
> >>>> the for loop which scans all nats in journal in f2fs_init_nid_bitmap().
> >>>> But that loop doesn't continue and doesn't update nm_i->nid_bitmap as well,
> >>>> as it fails the for loop condition i < nats_in_cursum(journal).
> >>>>
> >>>> There could be several other valid entries in the journal after this
> >>>> invalid entry. Due to this invalid journal->n_nats, build_nat_area_bitmap()
> >>>> could not be build properly for valid nat journal entries. It further results
> >>>> into these fsck asserts/errors and causes data loss due to incorrect fsck fix.
> >>>>
> >>>> [ASSERT] (sanity_check_nid: 372)  --> nid[0xfb01] ino is 0\x0a
> >>>> [FIX] (__chk_dentries:1438)  --> Unlink [0xfb01] - _JeNcl242yB3Apz2MW,VOh6WmjsVht1W len[0x1c], type[0x1]\x0a
> >>>> [ASSERT] (sanity_check_nid: 372)  --> nid[0xfa2d] ino is 0\x0a
> >>>> [FIX] (__chk_dentries:1438)  --> Unlink [0xfa2d] - _++vN7axccjjGNI,TQEVmcOemY1EAAAA len[0x14], type[0x1]\x0a
> >>>> ...
> >>>> NID[0xba5] is unreachable\x0a
> >>>> NID[0xfa9] is unreachable\x0a
> >>>> ...
> >>>> [FIX] (nullify_nat_entry:2089)  --> Remove nid [0xba5] in NAT\x0a
> >>>> [FIX] (nullify_nat_entry:2089)  --> Remove nid [0xfa9] in NAT\x0a
> >>>> ...
> >>>>
> >>>> Fixes: 979b25727800 ("fsck.f2fs: check validity of nat journal")
> >>>> Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
> >>>> ---
> >>>>  fsck/mount.c | 2 --
> >>>>  1 file changed, 2 deletions(-)
> >>>>
> >>>> diff --git a/fsck/mount.c b/fsck/mount.c
> >>>> index 60e0e4a..1b93f49 100644
> >>>> --- a/fsck/mount.c
> >>>> +++ b/fsck/mount.c
> >>>> @@ -1160,7 +1160,6 @@ static int f2fs_init_nid_bitmap(struct f2fs_sb_info *sbi)
> >>>>  		addr = le32_to_cpu(nat_in_journal(journal, i).block_addr);
> >>>>  		if (!IS_VALID_BLK_ADDR(sbi, addr)) {
> >>>>  			MSG(0, "\tError: f2fs_init_nid_bitmap: addr(%u) is invalid!!!\n", addr);
> >>>> -			journal->n_nats = cpu_to_le16(i);
> >>>>  			c.fix_on = 1;
> >>>>  			continue;
> >>>>  		}
> >>>> @@ -1168,7 +1167,6 @@ static int f2fs_init_nid_bitmap(struct f2fs_sb_info *sbi)
> >>>>  		nid = le32_to_cpu(nid_in_journal(journal, i));
> >>>>  		if (!IS_VALID_NID(sbi, nid)) {
> >>>>  			MSG(0, "\tError: f2fs_init_nid_bitmap: nid(%u) is invalid!!!\n", nid);
> >>>> -			journal->n_nats = cpu_to_le16(i);
> >>>>  			c.fix_on = 1;
> >>>>  			continue;
> >>>>  		}
> >>>>
> >>
> >> -- 
> >> --
> >> Sent by a consultant of the Qualcomm Innovation Center, Inc.
> >> The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.
> >>
> >>
> >> _______________________________________________
> >> Linux-f2fs-devel mailing list
> >> Linux-f2fs-devel@lists.sourceforge.net
> >> https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
> > 

-- 
--
Sent by a consultant of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.


_______________________________________________
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [f2fs-dev] [PATCH] fsck: Fix data loss issue observed during SPO
  2019-07-06  1:31         ` Sahitya Tummala
@ 2019-07-06  1:37           ` Sahitya Tummala
  2019-07-06  2:10             ` Chao Yu
  2019-07-06  2:10             ` Chao Yu
  0 siblings, 2 replies; 14+ messages in thread
From: Sahitya Tummala @ 2019-07-06  1:37 UTC (permalink / raw)
  To: Chao Yu; +Cc: Jaegeuk Kim, linux-f2fs-devel

Hi Chao,

I just realized that there is a patch from you to fix this already.

commit 92ea71c711681a8f575d09d57d5ceaebd6644fcc
Author: Chao Yu <yuchao0@huawei.com>
Date:   Wed Jan 9 10:48:25 2019 +0800

    fsck.f2fs: fix incorrect boundary of IS_VALID_NID()

    nid should never equal to max_nid, fix it.

    Signed-off-by: Chao Yu <yuchao0@huawei.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

In my code base, I was not having this patch but only have
"fsck.f2fs: check validity of nat journal", thus resulting into
data loss issues. I will now verify with this patch included.

Thanks for your support on this issue.

Thanks,
Sahitya.

On Sat, Jul 06, 2019 at 07:01:12AM +0530, Sahitya Tummala wrote:
> Hi Chao,
> 
> On Sat, Jul 06, 2019 at 09:14:59AM +0800, Chao Yu wrote:
> > Hi Sahitya,
> > 
> > On 2019-7-5 21:48, Sahitya Tummala wrote:
> > > Hi Chao,
> > > 
> > > The journal doesn't look to be corrupted, but only block_addr
> > > in f2fs_nat_entry is 0 whereas nid and version seems to be valid.
> > > These could be just truncated nodes, right?
> > 
> > Yes, but that should not fail the condition of journal check and result in
> > journal size truncation, right?
> > 
> 
> Yes, the patch fails for this condition as well and results into
> truncation of journal, due to this check  addr < SM_I(sbi)->main_blkaddr
> in IS_VALID_BLK_ADDR().
> 
> Do you think we can correct it this way?
> 
> diff --git a/fsck/f2fs.h b/fsck/f2fs.h
> index 417ca0b..00fb740 100644
> --- a/fsck/f2fs.h
> +++ b/fsck/f2fs.h
> @@ -364,8 +364,8 @@ static inline bool IS_VALID_NID(struct f2fs_sb_info *sbi, u32 nid)
> 
>  static inline bool IS_VALID_BLK_ADDR(struct f2fs_sb_info *sbi, u32 addr)
>  {
> -       if (addr >= le64_to_cpu(F2FS_RAW_SUPER(sbi)->block_count) ||
> -                               addr < SM_I(sbi)->main_blkaddr) {
> +       if (addr && (addr >= le64_to_cpu(F2FS_RAW_SUPER(sbi)->block_count) ||
> +                               addr < SM_I(sbi)->main_blkaddr)) {
>                 DBG(1, "block addr [0x%x]\n", addr);
>                 return 0;
>         }
> 
> Thanks,
> 
> > Thanks,
> > 
> > > 
> > > Here is the full dump of nat journal entries after a SPO (this is not a 
> > > issue case, but just a normal case with SPO) -
> > > 
> > > [   21.121325] fsck.f2fs: 0: addr 0 nid 5c5a ver 1\x0a
> > > [   21.127509] fsck.f2fs: 1: addr 0 nid 4c1d ver 1\x0a
> > > [   21.133760] fsck.f2fs: 2: addr 12075 nid 444f ver 0\x0a
> > > [   21.140325] fsck.f2fs: 3: addr 0 nid 4307 ver 0\x0a
> > > [   21.146510] fsck.f2fs: 4: addr 0 nid 4059 ver 1\x0a
> > > [   21.152745] fsck.f2fs: 5: addr 0 nid 3e69 ver 1\x0a
> > > [   21.158944] fsck.f2fs: 6: addr 0 nid 3cf9 ver 1\x0a
> > > [   21.165128] fsck.f2fs: 7: addr 0 nid 3b7b ver 1\x0a
> > > [   21.171362] fsck.f2fs: 8: addr 18f65 nid 34e2 ver 0\x0a
> > > [   21.177903] fsck.f2fs: 9: addr 12126 nid 2c87 ver 0\x0a
> > > [   21.184466] fsck.f2fs: 10: addr 0 nid 2a0a ver 1\x0a
> > > [   21.190795] fsck.f2fs: 11: addr 0 nid 2565 ver 1\x0a
> > > [   21.197070] fsck.f2fs: 12: addr 0 nid 1f48 ver 0\x0a
> > > [   21.203369] fsck.f2fs: 13: addr 155bf nid 1d76 ver 0\x0a
> > > [   21.210042] fsck.f2fs: 14: addr 19538 nid 18da ver 0\x0a
> > > [   21.216675] fsck.f2fs: 15: addr 0 nid 157f ver 0\x0a
> > > [   21.223037] fsck.f2fs: 16: addr d117 nid ac3 ver 0\x0a
> > > [   21.229547] fsck.f2fs: 17: addr 11e48 nid 805 ver 0\x0a
> > > [   21.236088] fsck.f2fs: 18: addr d112 nid 61c ver 0\x0a
> > > [   21.242585] fsck.f2fs: 19: addr d00b nid 45b ver 0\x0a
> > > [   21.249086] fsck.f2fs: 20: addr d111 nid 30d ver 0\x0a
> > > [   21.255544] fsck.f2fs: 21: addr d110 nid 1b5 ver 0\x0a
> > > [   21.262025] fsck.f2fs: 22: addr 0 nid 5e41 ver 1\x0a
> > > [   21.268342] fsck.f2fs: 23: addr 0 nid 5e46 ver 0\x0a
> > > [   21.274618] fsck.f2fs: 24: addr 0 nid 5a34 ver 1\x0a
> > > [   21.280916] fsck.f2fs: 25: addr 0 nid 59fa ver 1\x0a
> > > [   21.287192] fsck.f2fs: 26: addr 0 nid 57c3 ver 1\x0a
> > > [   21.293526] fsck.f2fs: 27: addr 0 nid 5776 ver 1\x0a
> > > [   21.299816] fsck.f2fs: 28: addr 0 nid 4f20 ver 1\x0a
> > > [   21.306091] fsck.f2fs: 29: addr 0 nid 4ff7 ver 0\x0a
> > > [   21.312416] fsck.f2fs: 30: addr 0 nid 47e1 ver 1\x0a
> > > [   21.318715] fsck.f2fs: 31: addr 0 nid 4880 ver 1\x0a
> > > [   21.324990] fsck.f2fs: 32: addr d118 nid ff4 ver 0\x0a
> > > [   21.331498] fsck.f2fs: 33: addr d10c nid ff9 ver 0\x0a
> > > [   21.337957] fsck.f2fs: 34: addr 0 nid 49ed ver 1\x0a
> > > [   21.344290] fsck.f2fs: 35: addr 0 nid 497e ver 0\x0a
> > > [   21.350612] fsck.f2fs: 36: addr fc5a nid 49d2 ver 0\x0a
> > > 
> > > Thanks,
> > > 
> > > On Fri, Jul 05, 2019 at 04:33:57PM +0530, Sahitya Tummala wrote:
> > >> Hi Chao,
> > >>
> > >> On Fri, Jul 05, 2019 at 06:11:47PM +0800, Chao Yu wrote:
> > >>> Hi Sahitya,
> > >>>
> > >>> On 2019/7/5 13:01, Sahitya Tummala wrote:
> > >>>> With the commit <979b25727800> ("fsck.f2fs: check validity of nat journal"),
> > >>>> there is a serious data loss issue observed during sudden power off tests.
> > >>>
> > >>> In that patch, since journal's data is untrusty, I just truncated n_nats to last
> > >>> valid one to fix original reported issue.
> > >>
> > >> The original issue will be fixed even without updating/truncating the n_nats
> > >> as we still have sanity checks on nid and blkaddr, right?
> > >>
> > >>>
> > >>> Actually, the problem here is we don't know which data is correct and which data
> > >>> is corrupted in nat or sit journal, so even we skip broken entry without
> > >>> modifying journal->n_{nats,sits}, latter entry may still contain broken data(but
> > >>> it may pass validation check), and those entries may cause potential data
> > >>> loss/corruption...
> > >>>
> > >>
> > >> This issue was reported by a customer in the following test scenario -
> > >>
> > >> Update the password in Android and then force crash. It is verified that
> > >> all the files related to password data has fsync() already.
> > >>
> > >> After some iterations (3-4) of this test, it is found password related files are
> > >> lost, failing the phone to bootup.
> > >>
> > >> Without this patch ("fsck.f2fs: check validity of nat journal"), they don't see
> > >> such issue. It means that the data present in journal seems to be valid atleast
> > >> in this case and truncating is causing to loose the valid data/files.
> > >>
> > >>> It looks no matter how we change the codes, we still may face data
> > >>> loos/corruption issues.
> > >>>
> > >>
> > >> Yes, maybe in some cases, the data loss is still expected, but there are
> > >> chances that data may be valid as well, which is confirmed by the test case above.
> > >>
> > >>> IMO, we need to find out the root cause of data corruption, So my question is
> > >>> that why sudden power off tests can make nat journal being corrupted? I doubt
> > >>> that in-place update method on checkpoint area of fsck may cause this... any
> > >>> other thoughts?
> > >>>
> > >>
> > >> In case of sudden power off, I have printed the all the journal entries in the
> > >> NAT and I several entries with valid blkaddr and nid as below -
> > >>
> > >> [   21.007649] fsck.f2fs: withfix: 0: addr 1c373 nid 45b\x0a
> > >> [   21.014399] fsck.f2fs: withfix: 1: addr 1c2c1 nid 1b5\x0a
> > >> [   21.021151] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 2/37 is invalid!!!\x0a
> > >> [   21.030860] fsck.f2fs: withfix: 3: addr 1c1f7 nid 19b2\x0a
> > >> [   21.037674] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 4/37 is invalid!!!\x0a
> > >> [   21.047353] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 5/37 is invalid!!!\x0a
> > >> [   21.057062] fsck.f2fs: withfix: 6: addr 1c377 nid ac3\x0a
> > >> [   21.063841] fsck.f2fs: withfix: 7: addr 1c378 nid ac9\x0a
> > >> [   21.070634] fsck.f2fs: withfix: 8: addr 1b2fe nid 807\x0a
> > >> [   21.077356] fsck.f2fs: withfix: 9: addr 17aeb nid 805\x0a
> > >> [   21.084102] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 10/37 is invalid!!!\x0a
> > >> [   21.093891] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 11/37 is invalid!!!\x0a
> > >> [   21.103662] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 12/37 is invalid!!!\x0a
> > >> [   21.113450] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 13/37 is invalid!!!\x0a
> > >> [   21.123227] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 14/37 is invalid!!!\x0a
> > >> [   21.133022] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 15/37 is invalid!!!\x0a
> > >> [   21.142792] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 16/37 is invalid!!!\x0a
> > >> [   21.152580] fsck.f2fs: withfix: 17: addr 1c37f nid ff4\x0a
> > >> [   21.159415] fsck.f2fs: withfix: 18: addr 1c36c nid ff9\x0a
> > >> [   21.166229] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 19/37 is invalid!!!\x0a
> > >> [   21.176039] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 20/37 is invalid!!!\x0a
> > >> [   21.185831] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 21/37 is invalid!!!\x0a
> > >> [   21.195631] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 22/37 is invalid!!!\x0a
> > >> [   21.205394] fsck.f2fs: withfix: 23: addr 1c2f3 nid 61c\x0a
> > >> [   21.212257] fsck.f2fs: withfix: 24: addr 1c391 nid 641\x0a
> > >> [   21.219093] fsck.f2fs: withfix: 25: addr 1c386 nid 33b\x0a
> > >> [   21.225905] fsck.f2fs: withfix: 26: addr 1c356 nid 258\x0a
> > >> [   21.232767] fsck.f2fs: withfix: 27: addr 1c2f2 nid 382\x0a
> > >> [   21.239601] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 28/37 is invalid!!!\x0a
> > >> [   21.249398] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 29/37 is invalid!!!\x0a
> > >> [   21.259160] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 30/37 is invalid!!!\x0a
> > >> [   21.268970] fsck.f2fs: withfix: 31: addr 1ac03 nid 1ccd\x0a
> > >> [   21.275874] fsck.f2fs: withfix: 32: addr 1c395 nid a57\x0a
> > >> [   21.282709] fsck.f2fs: withfix: 33: addr 1c392 nid 92e\x0a
> > >> [   21.289564] fsck.f2fs: withfix: 34: addr 1c38b nid aa1\x0a
> > >> [   21.296377] fsck.f2fs: withfix: 35: addr 1c393 nid 931\x0a
> > >> [   21.303240] fsck.f2fs: withfix: 36: addr 1c394 nid a35\x0a
> > >>
> > >> Thanks,
> > >>
> > >>> Thanks,
> > >>>
> > >>>>
> > >>>> The reason is due to incorrect update of journal->n_nats in that patch.
> > >>>> When fsck encounters a nat journal entry with invalid blkaddr/nid, it
> > >>>> updates journal->n_nats with the index of such first entry and then continues
> > >>>> the for loop which scans all nats in journal in f2fs_init_nid_bitmap().
> > >>>> But that loop doesn't continue and doesn't update nm_i->nid_bitmap as well,
> > >>>> as it fails the for loop condition i < nats_in_cursum(journal).
> > >>>>
> > >>>> There could be several other valid entries in the journal after this
> > >>>> invalid entry. Due to this invalid journal->n_nats, build_nat_area_bitmap()
> > >>>> could not be build properly for valid nat journal entries. It further results
> > >>>> into these fsck asserts/errors and causes data loss due to incorrect fsck fix.
> > >>>>
> > >>>> [ASSERT] (sanity_check_nid: 372)  --> nid[0xfb01] ino is 0\x0a
> > >>>> [FIX] (__chk_dentries:1438)  --> Unlink [0xfb01] - _JeNcl242yB3Apz2MW,VOh6WmjsVht1W len[0x1c], type[0x1]\x0a
> > >>>> [ASSERT] (sanity_check_nid: 372)  --> nid[0xfa2d] ino is 0\x0a
> > >>>> [FIX] (__chk_dentries:1438)  --> Unlink [0xfa2d] - _++vN7axccjjGNI,TQEVmcOemY1EAAAA len[0x14], type[0x1]\x0a
> > >>>> ...
> > >>>> NID[0xba5] is unreachable\x0a
> > >>>> NID[0xfa9] is unreachable\x0a
> > >>>> ...
> > >>>> [FIX] (nullify_nat_entry:2089)  --> Remove nid [0xba5] in NAT\x0a
> > >>>> [FIX] (nullify_nat_entry:2089)  --> Remove nid [0xfa9] in NAT\x0a
> > >>>> ...
> > >>>>
> > >>>> Fixes: 979b25727800 ("fsck.f2fs: check validity of nat journal")
> > >>>> Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
> > >>>> ---
> > >>>>  fsck/mount.c | 2 --
> > >>>>  1 file changed, 2 deletions(-)
> > >>>>
> > >>>> diff --git a/fsck/mount.c b/fsck/mount.c
> > >>>> index 60e0e4a..1b93f49 100644
> > >>>> --- a/fsck/mount.c
> > >>>> +++ b/fsck/mount.c
> > >>>> @@ -1160,7 +1160,6 @@ static int f2fs_init_nid_bitmap(struct f2fs_sb_info *sbi)
> > >>>>  		addr = le32_to_cpu(nat_in_journal(journal, i).block_addr);
> > >>>>  		if (!IS_VALID_BLK_ADDR(sbi, addr)) {
> > >>>>  			MSG(0, "\tError: f2fs_init_nid_bitmap: addr(%u) is invalid!!!\n", addr);
> > >>>> -			journal->n_nats = cpu_to_le16(i);
> > >>>>  			c.fix_on = 1;
> > >>>>  			continue;
> > >>>>  		}
> > >>>> @@ -1168,7 +1167,6 @@ static int f2fs_init_nid_bitmap(struct f2fs_sb_info *sbi)
> > >>>>  		nid = le32_to_cpu(nid_in_journal(journal, i));
> > >>>>  		if (!IS_VALID_NID(sbi, nid)) {
> > >>>>  			MSG(0, "\tError: f2fs_init_nid_bitmap: nid(%u) is invalid!!!\n", nid);
> > >>>> -			journal->n_nats = cpu_to_le16(i);
> > >>>>  			c.fix_on = 1;
> > >>>>  			continue;
> > >>>>  		}
> > >>>>
> > >>
> > >> -- 
> > >> --
> > >> Sent by a consultant of the Qualcomm Innovation Center, Inc.
> > >> The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.
> > >>
> > >>
> > >> _______________________________________________
> > >> Linux-f2fs-devel mailing list
> > >> Linux-f2fs-devel@lists.sourceforge.net
> > >> https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
> > > 
> 
> -- 
> --
> Sent by a consultant of the Qualcomm Innovation Center, Inc.
> The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.
> 
> 
> _______________________________________________
> Linux-f2fs-devel mailing list
> Linux-f2fs-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

-- 
--
Sent by a consultant of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.


_______________________________________________
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [f2fs-dev] [PATCH] fsck: Fix data loss issue observed during SPO
  2019-07-06  1:37           ` Sahitya Tummala
@ 2019-07-06  2:10             ` Chao Yu
  2019-07-06  2:10             ` Chao Yu
  1 sibling, 0 replies; 14+ messages in thread
From: Chao Yu @ 2019-07-06  2:10 UTC (permalink / raw)
  To: Sahitya Tummala; +Cc: Jaegeuk Kim, linux-f2fs-devel

Hi Sahitya,

On 2019-7-6 9:37, Sahitya Tummala wrote:
> Hi Chao,
> 
> I just realized that there is a patch from you to fix this already.
> 
> commit 92ea71c711681a8f575d09d57d5ceaebd6644fcc
> Author: Chao Yu <yuchao0@huawei.com>
> Date:   Wed Jan 9 10:48:25 2019 +0800
> 
>     fsck.f2fs: fix incorrect boundary of IS_VALID_NID()
> 
>     nid should never equal to max_nid, fix it.
> 
>     Signed-off-by: Chao Yu <yuchao0@huawei.com>
>     Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
> 
> In my code base, I was not having this patch but only have
> "fsck.f2fs: check validity of nat journal", thus resulting into
> data loss issues. I will now verify with this patch included.

Ah, that should be the root cause..

BTW, the diff was pointed out by Jaegeuk on patch ("fsck.f2fs: check validity of
nat journal"), but it looks it was merged to the wrong patch. :P

 static inline bool IS_VALID_BLK_ADDR(struct f2fs_sb_info *sbi, u32 addr)
 {
+	if (addr == NULL_ADDR || addr == NEW_ADDR)
+		return 1;
+
 	if (addr >= le64_to_cpu(F2FS_RAW_SUPER(sbi)->block_count) ||
 				addr < SM_I(sbi)->main_blkaddr) {
 		DBG(1, "block addr [0x%x]\n", addr);

Thanks,

> 
> Thanks for your support on this issue.
> 
> Thanks,
> Sahitya.
> 
> On Sat, Jul 06, 2019 at 07:01:12AM +0530, Sahitya Tummala wrote:
>> Hi Chao,
>>
>> On Sat, Jul 06, 2019 at 09:14:59AM +0800, Chao Yu wrote:
>>> Hi Sahitya,
>>>
>>> On 2019-7-5 21:48, Sahitya Tummala wrote:
>>>> Hi Chao,
>>>>
>>>> The journal doesn't look to be corrupted, but only block_addr
>>>> in f2fs_nat_entry is 0 whereas nid and version seems to be valid.
>>>> These could be just truncated nodes, right?
>>>
>>> Yes, but that should not fail the condition of journal check and result in
>>> journal size truncation, right?
>>>
>>
>> Yes, the patch fails for this condition as well and results into
>> truncation of journal, due to this check  addr < SM_I(sbi)->main_blkaddr
>> in IS_VALID_BLK_ADDR().
>>
>> Do you think we can correct it this way?
>>
>> diff --git a/fsck/f2fs.h b/fsck/f2fs.h
>> index 417ca0b..00fb740 100644
>> --- a/fsck/f2fs.h
>> +++ b/fsck/f2fs.h
>> @@ -364,8 +364,8 @@ static inline bool IS_VALID_NID(struct f2fs_sb_info *sbi, u32 nid)
>>
>>  static inline bool IS_VALID_BLK_ADDR(struct f2fs_sb_info *sbi, u32 addr)
>>  {
>> -       if (addr >= le64_to_cpu(F2FS_RAW_SUPER(sbi)->block_count) ||
>> -                               addr < SM_I(sbi)->main_blkaddr) {
>> +       if (addr && (addr >= le64_to_cpu(F2FS_RAW_SUPER(sbi)->block_count) ||
>> +                               addr < SM_I(sbi)->main_blkaddr)) {
>>                 DBG(1, "block addr [0x%x]\n", addr);
>>                 return 0;
>>         }
>>
>> Thanks,
>>
>>> Thanks,
>>>
>>>>
>>>> Here is the full dump of nat journal entries after a SPO (this is not a 
>>>> issue case, but just a normal case with SPO) -
>>>>
>>>> [   21.121325] fsck.f2fs: 0: addr 0 nid 5c5a ver 1\x0a
>>>> [   21.127509] fsck.f2fs: 1: addr 0 nid 4c1d ver 1\x0a
>>>> [   21.133760] fsck.f2fs: 2: addr 12075 nid 444f ver 0\x0a
>>>> [   21.140325] fsck.f2fs: 3: addr 0 nid 4307 ver 0\x0a
>>>> [   21.146510] fsck.f2fs: 4: addr 0 nid 4059 ver 1\x0a
>>>> [   21.152745] fsck.f2fs: 5: addr 0 nid 3e69 ver 1\x0a
>>>> [   21.158944] fsck.f2fs: 6: addr 0 nid 3cf9 ver 1\x0a
>>>> [   21.165128] fsck.f2fs: 7: addr 0 nid 3b7b ver 1\x0a
>>>> [   21.171362] fsck.f2fs: 8: addr 18f65 nid 34e2 ver 0\x0a
>>>> [   21.177903] fsck.f2fs: 9: addr 12126 nid 2c87 ver 0\x0a
>>>> [   21.184466] fsck.f2fs: 10: addr 0 nid 2a0a ver 1\x0a
>>>> [   21.190795] fsck.f2fs: 11: addr 0 nid 2565 ver 1\x0a
>>>> [   21.197070] fsck.f2fs: 12: addr 0 nid 1f48 ver 0\x0a
>>>> [   21.203369] fsck.f2fs: 13: addr 155bf nid 1d76 ver 0\x0a
>>>> [   21.210042] fsck.f2fs: 14: addr 19538 nid 18da ver 0\x0a
>>>> [   21.216675] fsck.f2fs: 15: addr 0 nid 157f ver 0\x0a
>>>> [   21.223037] fsck.f2fs: 16: addr d117 nid ac3 ver 0\x0a
>>>> [   21.229547] fsck.f2fs: 17: addr 11e48 nid 805 ver 0\x0a
>>>> [   21.236088] fsck.f2fs: 18: addr d112 nid 61c ver 0\x0a
>>>> [   21.242585] fsck.f2fs: 19: addr d00b nid 45b ver 0\x0a
>>>> [   21.249086] fsck.f2fs: 20: addr d111 nid 30d ver 0\x0a
>>>> [   21.255544] fsck.f2fs: 21: addr d110 nid 1b5 ver 0\x0a
>>>> [   21.262025] fsck.f2fs: 22: addr 0 nid 5e41 ver 1\x0a
>>>> [   21.268342] fsck.f2fs: 23: addr 0 nid 5e46 ver 0\x0a
>>>> [   21.274618] fsck.f2fs: 24: addr 0 nid 5a34 ver 1\x0a
>>>> [   21.280916] fsck.f2fs: 25: addr 0 nid 59fa ver 1\x0a
>>>> [   21.287192] fsck.f2fs: 26: addr 0 nid 57c3 ver 1\x0a
>>>> [   21.293526] fsck.f2fs: 27: addr 0 nid 5776 ver 1\x0a
>>>> [   21.299816] fsck.f2fs: 28: addr 0 nid 4f20 ver 1\x0a
>>>> [   21.306091] fsck.f2fs: 29: addr 0 nid 4ff7 ver 0\x0a
>>>> [   21.312416] fsck.f2fs: 30: addr 0 nid 47e1 ver 1\x0a
>>>> [   21.318715] fsck.f2fs: 31: addr 0 nid 4880 ver 1\x0a
>>>> [   21.324990] fsck.f2fs: 32: addr d118 nid ff4 ver 0\x0a
>>>> [   21.331498] fsck.f2fs: 33: addr d10c nid ff9 ver 0\x0a
>>>> [   21.337957] fsck.f2fs: 34: addr 0 nid 49ed ver 1\x0a
>>>> [   21.344290] fsck.f2fs: 35: addr 0 nid 497e ver 0\x0a
>>>> [   21.350612] fsck.f2fs: 36: addr fc5a nid 49d2 ver 0\x0a
>>>>
>>>> Thanks,
>>>>
>>>> On Fri, Jul 05, 2019 at 04:33:57PM +0530, Sahitya Tummala wrote:
>>>>> Hi Chao,
>>>>>
>>>>> On Fri, Jul 05, 2019 at 06:11:47PM +0800, Chao Yu wrote:
>>>>>> Hi Sahitya,
>>>>>>
>>>>>> On 2019/7/5 13:01, Sahitya Tummala wrote:
>>>>>>> With the commit <979b25727800> ("fsck.f2fs: check validity of nat journal"),
>>>>>>> there is a serious data loss issue observed during sudden power off tests.
>>>>>>
>>>>>> In that patch, since journal's data is untrusty, I just truncated n_nats to last
>>>>>> valid one to fix original reported issue.
>>>>>
>>>>> The original issue will be fixed even without updating/truncating the n_nats
>>>>> as we still have sanity checks on nid and blkaddr, right?
>>>>>
>>>>>>
>>>>>> Actually, the problem here is we don't know which data is correct and which data
>>>>>> is corrupted in nat or sit journal, so even we skip broken entry without
>>>>>> modifying journal->n_{nats,sits}, latter entry may still contain broken data(but
>>>>>> it may pass validation check), and those entries may cause potential data
>>>>>> loss/corruption...
>>>>>>
>>>>>
>>>>> This issue was reported by a customer in the following test scenario -
>>>>>
>>>>> Update the password in Android and then force crash. It is verified that
>>>>> all the files related to password data has fsync() already.
>>>>>
>>>>> After some iterations (3-4) of this test, it is found password related files are
>>>>> lost, failing the phone to bootup.
>>>>>
>>>>> Without this patch ("fsck.f2fs: check validity of nat journal"), they don't see
>>>>> such issue. It means that the data present in journal seems to be valid atleast
>>>>> in this case and truncating is causing to loose the valid data/files.
>>>>>
>>>>>> It looks no matter how we change the codes, we still may face data
>>>>>> loos/corruption issues.
>>>>>>
>>>>>
>>>>> Yes, maybe in some cases, the data loss is still expected, but there are
>>>>> chances that data may be valid as well, which is confirmed by the test case above.
>>>>>
>>>>>> IMO, we need to find out the root cause of data corruption, So my question is
>>>>>> that why sudden power off tests can make nat journal being corrupted? I doubt
>>>>>> that in-place update method on checkpoint area of fsck may cause this... any
>>>>>> other thoughts?
>>>>>>
>>>>>
>>>>> In case of sudden power off, I have printed the all the journal entries in the
>>>>> NAT and I several entries with valid blkaddr and nid as below -
>>>>>
>>>>> [   21.007649] fsck.f2fs: withfix: 0: addr 1c373 nid 45b\x0a
>>>>> [   21.014399] fsck.f2fs: withfix: 1: addr 1c2c1 nid 1b5\x0a
>>>>> [   21.021151] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 2/37 is invalid!!!\x0a
>>>>> [   21.030860] fsck.f2fs: withfix: 3: addr 1c1f7 nid 19b2\x0a
>>>>> [   21.037674] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 4/37 is invalid!!!\x0a
>>>>> [   21.047353] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 5/37 is invalid!!!\x0a
>>>>> [   21.057062] fsck.f2fs: withfix: 6: addr 1c377 nid ac3\x0a
>>>>> [   21.063841] fsck.f2fs: withfix: 7: addr 1c378 nid ac9\x0a
>>>>> [   21.070634] fsck.f2fs: withfix: 8: addr 1b2fe nid 807\x0a
>>>>> [   21.077356] fsck.f2fs: withfix: 9: addr 17aeb nid 805\x0a
>>>>> [   21.084102] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 10/37 is invalid!!!\x0a
>>>>> [   21.093891] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 11/37 is invalid!!!\x0a
>>>>> [   21.103662] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 12/37 is invalid!!!\x0a
>>>>> [   21.113450] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 13/37 is invalid!!!\x0a
>>>>> [   21.123227] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 14/37 is invalid!!!\x0a
>>>>> [   21.133022] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 15/37 is invalid!!!\x0a
>>>>> [   21.142792] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 16/37 is invalid!!!\x0a
>>>>> [   21.152580] fsck.f2fs: withfix: 17: addr 1c37f nid ff4\x0a
>>>>> [   21.159415] fsck.f2fs: withfix: 18: addr 1c36c nid ff9\x0a
>>>>> [   21.166229] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 19/37 is invalid!!!\x0a
>>>>> [   21.176039] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 20/37 is invalid!!!\x0a
>>>>> [   21.185831] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 21/37 is invalid!!!\x0a
>>>>> [   21.195631] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 22/37 is invalid!!!\x0a
>>>>> [   21.205394] fsck.f2fs: withfix: 23: addr 1c2f3 nid 61c\x0a
>>>>> [   21.212257] fsck.f2fs: withfix: 24: addr 1c391 nid 641\x0a
>>>>> [   21.219093] fsck.f2fs: withfix: 25: addr 1c386 nid 33b\x0a
>>>>> [   21.225905] fsck.f2fs: withfix: 26: addr 1c356 nid 258\x0a
>>>>> [   21.232767] fsck.f2fs: withfix: 27: addr 1c2f2 nid 382\x0a
>>>>> [   21.239601] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 28/37 is invalid!!!\x0a
>>>>> [   21.249398] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 29/37 is invalid!!!\x0a
>>>>> [   21.259160] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 30/37 is invalid!!!\x0a
>>>>> [   21.268970] fsck.f2fs: withfix: 31: addr 1ac03 nid 1ccd\x0a
>>>>> [   21.275874] fsck.f2fs: withfix: 32: addr 1c395 nid a57\x0a
>>>>> [   21.282709] fsck.f2fs: withfix: 33: addr 1c392 nid 92e\x0a
>>>>> [   21.289564] fsck.f2fs: withfix: 34: addr 1c38b nid aa1\x0a
>>>>> [   21.296377] fsck.f2fs: withfix: 35: addr 1c393 nid 931\x0a
>>>>> [   21.303240] fsck.f2fs: withfix: 36: addr 1c394 nid a35\x0a
>>>>>
>>>>> Thanks,
>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>>>
>>>>>>> The reason is due to incorrect update of journal->n_nats in that patch.
>>>>>>> When fsck encounters a nat journal entry with invalid blkaddr/nid, it
>>>>>>> updates journal->n_nats with the index of such first entry and then continues
>>>>>>> the for loop which scans all nats in journal in f2fs_init_nid_bitmap().
>>>>>>> But that loop doesn't continue and doesn't update nm_i->nid_bitmap as well,
>>>>>>> as it fails the for loop condition i < nats_in_cursum(journal).
>>>>>>>
>>>>>>> There could be several other valid entries in the journal after this
>>>>>>> invalid entry. Due to this invalid journal->n_nats, build_nat_area_bitmap()
>>>>>>> could not be build properly for valid nat journal entries. It further results
>>>>>>> into these fsck asserts/errors and causes data loss due to incorrect fsck fix.
>>>>>>>
>>>>>>> [ASSERT] (sanity_check_nid: 372)  --> nid[0xfb01] ino is 0\x0a
>>>>>>> [FIX] (__chk_dentries:1438)  --> Unlink [0xfb01] - _JeNcl242yB3Apz2MW,VOh6WmjsVht1W len[0x1c], type[0x1]\x0a
>>>>>>> [ASSERT] (sanity_check_nid: 372)  --> nid[0xfa2d] ino is 0\x0a
>>>>>>> [FIX] (__chk_dentries:1438)  --> Unlink [0xfa2d] - _++vN7axccjjGNI,TQEVmcOemY1EAAAA len[0x14], type[0x1]\x0a
>>>>>>> ...
>>>>>>> NID[0xba5] is unreachable\x0a
>>>>>>> NID[0xfa9] is unreachable\x0a
>>>>>>> ...
>>>>>>> [FIX] (nullify_nat_entry:2089)  --> Remove nid [0xba5] in NAT\x0a
>>>>>>> [FIX] (nullify_nat_entry:2089)  --> Remove nid [0xfa9] in NAT\x0a
>>>>>>> ...
>>>>>>>
>>>>>>> Fixes: 979b25727800 ("fsck.f2fs: check validity of nat journal")
>>>>>>> Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
>>>>>>> ---
>>>>>>>  fsck/mount.c | 2 --
>>>>>>>  1 file changed, 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/fsck/mount.c b/fsck/mount.c
>>>>>>> index 60e0e4a..1b93f49 100644
>>>>>>> --- a/fsck/mount.c
>>>>>>> +++ b/fsck/mount.c
>>>>>>> @@ -1160,7 +1160,6 @@ static int f2fs_init_nid_bitmap(struct f2fs_sb_info *sbi)
>>>>>>>  		addr = le32_to_cpu(nat_in_journal(journal, i).block_addr);
>>>>>>>  		if (!IS_VALID_BLK_ADDR(sbi, addr)) {
>>>>>>>  			MSG(0, "\tError: f2fs_init_nid_bitmap: addr(%u) is invalid!!!\n", addr);
>>>>>>> -			journal->n_nats = cpu_to_le16(i);
>>>>>>>  			c.fix_on = 1;
>>>>>>>  			continue;
>>>>>>>  		}
>>>>>>> @@ -1168,7 +1167,6 @@ static int f2fs_init_nid_bitmap(struct f2fs_sb_info *sbi)
>>>>>>>  		nid = le32_to_cpu(nid_in_journal(journal, i));
>>>>>>>  		if (!IS_VALID_NID(sbi, nid)) {
>>>>>>>  			MSG(0, "\tError: f2fs_init_nid_bitmap: nid(%u) is invalid!!!\n", nid);
>>>>>>> -			journal->n_nats = cpu_to_le16(i);
>>>>>>>  			c.fix_on = 1;
>>>>>>>  			continue;
>>>>>>>  		}
>>>>>>>
>>>>>
>>>>> -- 
>>>>> --
>>>>> Sent by a consultant of the Qualcomm Innovation Center, Inc.
>>>>> The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Linux-f2fs-devel mailing list
>>>>> Linux-f2fs-devel@lists.sourceforge.net
>>>>> https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
>>>>
>>
>> -- 
>> --
>> Sent by a consultant of the Qualcomm Innovation Center, Inc.
>> The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.
>>
>>
>> _______________________________________________
>> Linux-f2fs-devel mailing list
>> Linux-f2fs-devel@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
> 


_______________________________________________
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [f2fs-dev] [PATCH] fsck: Fix data loss issue observed during SPO
  2019-07-06  1:37           ` Sahitya Tummala
  2019-07-06  2:10             ` Chao Yu
@ 2019-07-06  2:10             ` Chao Yu
  2019-07-06  2:17               ` Chao Yu
  1 sibling, 1 reply; 14+ messages in thread
From: Chao Yu @ 2019-07-06  2:10 UTC (permalink / raw)
  To: Sahitya Tummala; +Cc: Jaegeuk Kim, linux-f2fs-devel

Hi Sahitya,

On 2019-7-6 9:37, Sahitya Tummala wrote:
> Hi Chao,
> 
> I just realized that there is a patch from you to fix this already.
> 
> commit 92ea71c711681a8f575d09d57d5ceaebd6644fcc
> Author: Chao Yu <yuchao0@huawei.com>
> Date:   Wed Jan 9 10:48:25 2019 +0800
> 
>     fsck.f2fs: fix incorrect boundary of IS_VALID_NID()
> 
>     nid should never equal to max_nid, fix it.
> 
>     Signed-off-by: Chao Yu <yuchao0@huawei.com>
>     Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
> 
> In my code base, I was not having this patch but only have
> "fsck.f2fs: check validity of nat journal", thus resulting into
> data loss issues. I will now verify with this patch included.

Ah, that should be the root cause..

BTW, the diff was pointed out by Jaegeuk on patch ("fsck.f2fs: check validity of
nat journal"), but it looks it was merged to the wrong patch. :P

 static inline bool IS_VALID_BLK_ADDR(struct f2fs_sb_info *sbi, u32 addr)
 {
+	if (addr == NULL_ADDR || addr == NEW_ADDR)
+		return 1;
+
 	if (addr >= le64_to_cpu(F2FS_RAW_SUPER(sbi)->block_count) ||
 				addr < SM_I(sbi)->main_blkaddr) {
 		DBG(1, "block addr [0x%x]\n", addr);

Thanks,

> 
> Thanks for your support on this issue.
> 
> Thanks,
> Sahitya.
> 
> On Sat, Jul 06, 2019 at 07:01:12AM +0530, Sahitya Tummala wrote:
>> Hi Chao,
>>
>> On Sat, Jul 06, 2019 at 09:14:59AM +0800, Chao Yu wrote:
>>> Hi Sahitya,
>>>
>>> On 2019-7-5 21:48, Sahitya Tummala wrote:
>>>> Hi Chao,
>>>>
>>>> The journal doesn't look to be corrupted, but only block_addr
>>>> in f2fs_nat_entry is 0 whereas nid and version seems to be valid.
>>>> These could be just truncated nodes, right?
>>>
>>> Yes, but that should not fail the condition of journal check and result in
>>> journal size truncation, right?
>>>
>>
>> Yes, the patch fails for this condition as well and results into
>> truncation of journal, due to this check  addr < SM_I(sbi)->main_blkaddr
>> in IS_VALID_BLK_ADDR().
>>
>> Do you think we can correct it this way?
>>
>> diff --git a/fsck/f2fs.h b/fsck/f2fs.h
>> index 417ca0b..00fb740 100644
>> --- a/fsck/f2fs.h
>> +++ b/fsck/f2fs.h
>> @@ -364,8 +364,8 @@ static inline bool IS_VALID_NID(struct f2fs_sb_info *sbi, u32 nid)
>>
>>  static inline bool IS_VALID_BLK_ADDR(struct f2fs_sb_info *sbi, u32 addr)
>>  {
>> -       if (addr >= le64_to_cpu(F2FS_RAW_SUPER(sbi)->block_count) ||
>> -                               addr < SM_I(sbi)->main_blkaddr) {
>> +       if (addr && (addr >= le64_to_cpu(F2FS_RAW_SUPER(sbi)->block_count) ||
>> +                               addr < SM_I(sbi)->main_blkaddr)) {
>>                 DBG(1, "block addr [0x%x]\n", addr);
>>                 return 0;
>>         }
>>
>> Thanks,
>>
>>> Thanks,
>>>
>>>>
>>>> Here is the full dump of nat journal entries after a SPO (this is not a 
>>>> issue case, but just a normal case with SPO) -
>>>>
>>>> [   21.121325] fsck.f2fs: 0: addr 0 nid 5c5a ver 1\x0a
>>>> [   21.127509] fsck.f2fs: 1: addr 0 nid 4c1d ver 1\x0a
>>>> [   21.133760] fsck.f2fs: 2: addr 12075 nid 444f ver 0\x0a
>>>> [   21.140325] fsck.f2fs: 3: addr 0 nid 4307 ver 0\x0a
>>>> [   21.146510] fsck.f2fs: 4: addr 0 nid 4059 ver 1\x0a
>>>> [   21.152745] fsck.f2fs: 5: addr 0 nid 3e69 ver 1\x0a
>>>> [   21.158944] fsck.f2fs: 6: addr 0 nid 3cf9 ver 1\x0a
>>>> [   21.165128] fsck.f2fs: 7: addr 0 nid 3b7b ver 1\x0a
>>>> [   21.171362] fsck.f2fs: 8: addr 18f65 nid 34e2 ver 0\x0a
>>>> [   21.177903] fsck.f2fs: 9: addr 12126 nid 2c87 ver 0\x0a
>>>> [   21.184466] fsck.f2fs: 10: addr 0 nid 2a0a ver 1\x0a
>>>> [   21.190795] fsck.f2fs: 11: addr 0 nid 2565 ver 1\x0a
>>>> [   21.197070] fsck.f2fs: 12: addr 0 nid 1f48 ver 0\x0a
>>>> [   21.203369] fsck.f2fs: 13: addr 155bf nid 1d76 ver 0\x0a
>>>> [   21.210042] fsck.f2fs: 14: addr 19538 nid 18da ver 0\x0a
>>>> [   21.216675] fsck.f2fs: 15: addr 0 nid 157f ver 0\x0a
>>>> [   21.223037] fsck.f2fs: 16: addr d117 nid ac3 ver 0\x0a
>>>> [   21.229547] fsck.f2fs: 17: addr 11e48 nid 805 ver 0\x0a
>>>> [   21.236088] fsck.f2fs: 18: addr d112 nid 61c ver 0\x0a
>>>> [   21.242585] fsck.f2fs: 19: addr d00b nid 45b ver 0\x0a
>>>> [   21.249086] fsck.f2fs: 20: addr d111 nid 30d ver 0\x0a
>>>> [   21.255544] fsck.f2fs: 21: addr d110 nid 1b5 ver 0\x0a
>>>> [   21.262025] fsck.f2fs: 22: addr 0 nid 5e41 ver 1\x0a
>>>> [   21.268342] fsck.f2fs: 23: addr 0 nid 5e46 ver 0\x0a
>>>> [   21.274618] fsck.f2fs: 24: addr 0 nid 5a34 ver 1\x0a
>>>> [   21.280916] fsck.f2fs: 25: addr 0 nid 59fa ver 1\x0a
>>>> [   21.287192] fsck.f2fs: 26: addr 0 nid 57c3 ver 1\x0a
>>>> [   21.293526] fsck.f2fs: 27: addr 0 nid 5776 ver 1\x0a
>>>> [   21.299816] fsck.f2fs: 28: addr 0 nid 4f20 ver 1\x0a
>>>> [   21.306091] fsck.f2fs: 29: addr 0 nid 4ff7 ver 0\x0a
>>>> [   21.312416] fsck.f2fs: 30: addr 0 nid 47e1 ver 1\x0a
>>>> [   21.318715] fsck.f2fs: 31: addr 0 nid 4880 ver 1\x0a
>>>> [   21.324990] fsck.f2fs: 32: addr d118 nid ff4 ver 0\x0a
>>>> [   21.331498] fsck.f2fs: 33: addr d10c nid ff9 ver 0\x0a
>>>> [   21.337957] fsck.f2fs: 34: addr 0 nid 49ed ver 1\x0a
>>>> [   21.344290] fsck.f2fs: 35: addr 0 nid 497e ver 0\x0a
>>>> [   21.350612] fsck.f2fs: 36: addr fc5a nid 49d2 ver 0\x0a
>>>>
>>>> Thanks,
>>>>
>>>> On Fri, Jul 05, 2019 at 04:33:57PM +0530, Sahitya Tummala wrote:
>>>>> Hi Chao,
>>>>>
>>>>> On Fri, Jul 05, 2019 at 06:11:47PM +0800, Chao Yu wrote:
>>>>>> Hi Sahitya,
>>>>>>
>>>>>> On 2019/7/5 13:01, Sahitya Tummala wrote:
>>>>>>> With the commit <979b25727800> ("fsck.f2fs: check validity of nat journal"),
>>>>>>> there is a serious data loss issue observed during sudden power off tests.
>>>>>>
>>>>>> In that patch, since journal's data is untrusty, I just truncated n_nats to last
>>>>>> valid one to fix original reported issue.
>>>>>
>>>>> The original issue will be fixed even without updating/truncating the n_nats
>>>>> as we still have sanity checks on nid and blkaddr, right?
>>>>>
>>>>>>
>>>>>> Actually, the problem here is we don't know which data is correct and which data
>>>>>> is corrupted in nat or sit journal, so even we skip broken entry without
>>>>>> modifying journal->n_{nats,sits}, latter entry may still contain broken data(but
>>>>>> it may pass validation check), and those entries may cause potential data
>>>>>> loss/corruption...
>>>>>>
>>>>>
>>>>> This issue was reported by a customer in the following test scenario -
>>>>>
>>>>> Update the password in Android and then force crash. It is verified that
>>>>> all the files related to password data has fsync() already.
>>>>>
>>>>> After some iterations (3-4) of this test, it is found password related files are
>>>>> lost, failing the phone to bootup.
>>>>>
>>>>> Without this patch ("fsck.f2fs: check validity of nat journal"), they don't see
>>>>> such issue. It means that the data present in journal seems to be valid atleast
>>>>> in this case and truncating is causing to loose the valid data/files.
>>>>>
>>>>>> It looks no matter how we change the codes, we still may face data
>>>>>> loos/corruption issues.
>>>>>>
>>>>>
>>>>> Yes, maybe in some cases, the data loss is still expected, but there are
>>>>> chances that data may be valid as well, which is confirmed by the test case above.
>>>>>
>>>>>> IMO, we need to find out the root cause of data corruption, So my question is
>>>>>> that why sudden power off tests can make nat journal being corrupted? I doubt
>>>>>> that in-place update method on checkpoint area of fsck may cause this... any
>>>>>> other thoughts?
>>>>>>
>>>>>
>>>>> In case of sudden power off, I have printed the all the journal entries in the
>>>>> NAT and I several entries with valid blkaddr and nid as below -
>>>>>
>>>>> [   21.007649] fsck.f2fs: withfix: 0: addr 1c373 nid 45b\x0a
>>>>> [   21.014399] fsck.f2fs: withfix: 1: addr 1c2c1 nid 1b5\x0a
>>>>> [   21.021151] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 2/37 is invalid!!!\x0a
>>>>> [   21.030860] fsck.f2fs: withfix: 3: addr 1c1f7 nid 19b2\x0a
>>>>> [   21.037674] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 4/37 is invalid!!!\x0a
>>>>> [   21.047353] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 5/37 is invalid!!!\x0a
>>>>> [   21.057062] fsck.f2fs: withfix: 6: addr 1c377 nid ac3\x0a
>>>>> [   21.063841] fsck.f2fs: withfix: 7: addr 1c378 nid ac9\x0a
>>>>> [   21.070634] fsck.f2fs: withfix: 8: addr 1b2fe nid 807\x0a
>>>>> [   21.077356] fsck.f2fs: withfix: 9: addr 17aeb nid 805\x0a
>>>>> [   21.084102] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 10/37 is invalid!!!\x0a
>>>>> [   21.093891] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 11/37 is invalid!!!\x0a
>>>>> [   21.103662] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 12/37 is invalid!!!\x0a
>>>>> [   21.113450] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 13/37 is invalid!!!\x0a
>>>>> [   21.123227] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 14/37 is invalid!!!\x0a
>>>>> [   21.133022] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 15/37 is invalid!!!\x0a
>>>>> [   21.142792] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 16/37 is invalid!!!\x0a
>>>>> [   21.152580] fsck.f2fs: withfix: 17: addr 1c37f nid ff4\x0a
>>>>> [   21.159415] fsck.f2fs: withfix: 18: addr 1c36c nid ff9\x0a
>>>>> [   21.166229] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 19/37 is invalid!!!\x0a
>>>>> [   21.176039] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 20/37 is invalid!!!\x0a
>>>>> [   21.185831] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 21/37 is invalid!!!\x0a
>>>>> [   21.195631] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 22/37 is invalid!!!\x0a
>>>>> [   21.205394] fsck.f2fs: withfix: 23: addr 1c2f3 nid 61c\x0a
>>>>> [   21.212257] fsck.f2fs: withfix: 24: addr 1c391 nid 641\x0a
>>>>> [   21.219093] fsck.f2fs: withfix: 25: addr 1c386 nid 33b\x0a
>>>>> [   21.225905] fsck.f2fs: withfix: 26: addr 1c356 nid 258\x0a
>>>>> [   21.232767] fsck.f2fs: withfix: 27: addr 1c2f2 nid 382\x0a
>>>>> [   21.239601] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 28/37 is invalid!!!\x0a
>>>>> [   21.249398] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 29/37 is invalid!!!\x0a
>>>>> [   21.259160] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 30/37 is invalid!!!\x0a
>>>>> [   21.268970] fsck.f2fs: withfix: 31: addr 1ac03 nid 1ccd\x0a
>>>>> [   21.275874] fsck.f2fs: withfix: 32: addr 1c395 nid a57\x0a
>>>>> [   21.282709] fsck.f2fs: withfix: 33: addr 1c392 nid 92e\x0a
>>>>> [   21.289564] fsck.f2fs: withfix: 34: addr 1c38b nid aa1\x0a
>>>>> [   21.296377] fsck.f2fs: withfix: 35: addr 1c393 nid 931\x0a
>>>>> [   21.303240] fsck.f2fs: withfix: 36: addr 1c394 nid a35\x0a
>>>>>
>>>>> Thanks,
>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>>>
>>>>>>> The reason is due to incorrect update of journal->n_nats in that patch.
>>>>>>> When fsck encounters a nat journal entry with invalid blkaddr/nid, it
>>>>>>> updates journal->n_nats with the index of such first entry and then continues
>>>>>>> the for loop which scans all nats in journal in f2fs_init_nid_bitmap().
>>>>>>> But that loop doesn't continue and doesn't update nm_i->nid_bitmap as well,
>>>>>>> as it fails the for loop condition i < nats_in_cursum(journal).
>>>>>>>
>>>>>>> There could be several other valid entries in the journal after this
>>>>>>> invalid entry. Due to this invalid journal->n_nats, build_nat_area_bitmap()
>>>>>>> could not be build properly for valid nat journal entries. It further results
>>>>>>> into these fsck asserts/errors and causes data loss due to incorrect fsck fix.
>>>>>>>
>>>>>>> [ASSERT] (sanity_check_nid: 372)  --> nid[0xfb01] ino is 0\x0a
>>>>>>> [FIX] (__chk_dentries:1438)  --> Unlink [0xfb01] - _JeNcl242yB3Apz2MW,VOh6WmjsVht1W len[0x1c], type[0x1]\x0a
>>>>>>> [ASSERT] (sanity_check_nid: 372)  --> nid[0xfa2d] ino is 0\x0a
>>>>>>> [FIX] (__chk_dentries:1438)  --> Unlink [0xfa2d] - _++vN7axccjjGNI,TQEVmcOemY1EAAAA len[0x14], type[0x1]\x0a
>>>>>>> ...
>>>>>>> NID[0xba5] is unreachable\x0a
>>>>>>> NID[0xfa9] is unreachable\x0a
>>>>>>> ...
>>>>>>> [FIX] (nullify_nat_entry:2089)  --> Remove nid [0xba5] in NAT\x0a
>>>>>>> [FIX] (nullify_nat_entry:2089)  --> Remove nid [0xfa9] in NAT\x0a
>>>>>>> ...
>>>>>>>
>>>>>>> Fixes: 979b25727800 ("fsck.f2fs: check validity of nat journal")
>>>>>>> Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
>>>>>>> ---
>>>>>>>  fsck/mount.c | 2 --
>>>>>>>  1 file changed, 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/fsck/mount.c b/fsck/mount.c
>>>>>>> index 60e0e4a..1b93f49 100644
>>>>>>> --- a/fsck/mount.c
>>>>>>> +++ b/fsck/mount.c
>>>>>>> @@ -1160,7 +1160,6 @@ static int f2fs_init_nid_bitmap(struct f2fs_sb_info *sbi)
>>>>>>>  		addr = le32_to_cpu(nat_in_journal(journal, i).block_addr);
>>>>>>>  		if (!IS_VALID_BLK_ADDR(sbi, addr)) {
>>>>>>>  			MSG(0, "\tError: f2fs_init_nid_bitmap: addr(%u) is invalid!!!\n", addr);
>>>>>>> -			journal->n_nats = cpu_to_le16(i);
>>>>>>>  			c.fix_on = 1;
>>>>>>>  			continue;
>>>>>>>  		}
>>>>>>> @@ -1168,7 +1167,6 @@ static int f2fs_init_nid_bitmap(struct f2fs_sb_info *sbi)
>>>>>>>  		nid = le32_to_cpu(nid_in_journal(journal, i));
>>>>>>>  		if (!IS_VALID_NID(sbi, nid)) {
>>>>>>>  			MSG(0, "\tError: f2fs_init_nid_bitmap: nid(%u) is invalid!!!\n", nid);
>>>>>>> -			journal->n_nats = cpu_to_le16(i);
>>>>>>>  			c.fix_on = 1;
>>>>>>>  			continue;
>>>>>>>  		}
>>>>>>>
>>>>>
>>>>> -- 
>>>>> --
>>>>> Sent by a consultant of the Qualcomm Innovation Center, Inc.
>>>>> The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Linux-f2fs-devel mailing list
>>>>> Linux-f2fs-devel@lists.sourceforge.net
>>>>> https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
>>>>
>>
>> -- 
>> --
>> Sent by a consultant of the Qualcomm Innovation Center, Inc.
>> The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.
>>
>>
>> _______________________________________________
>> Linux-f2fs-devel mailing list
>> Linux-f2fs-devel@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
> 


_______________________________________________
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [f2fs-dev] [PATCH] fsck: Fix data loss issue observed during SPO
  2019-07-06  2:10             ` Chao Yu
@ 2019-07-06  2:17               ` Chao Yu
  2019-07-06  4:50                 ` Jaegeuk Kim
  0 siblings, 1 reply; 14+ messages in thread
From: Chao Yu @ 2019-07-06  2:17 UTC (permalink / raw)
  To: Sahitya Tummala; +Cc: Jaegeuk Kim, linux-f2fs-devel

On 2019-7-6 10:10, Chao Yu wrote:
> Hi Sahitya,
> 
> On 2019-7-6 9:37, Sahitya Tummala wrote:
>> Hi Chao,
>>
>> I just realized that there is a patch from you to fix this already.
>>
>> commit 92ea71c711681a8f575d09d57d5ceaebd6644fcc
>> Author: Chao Yu <yuchao0@huawei.com>
>> Date:   Wed Jan 9 10:48:25 2019 +0800
>>
>>     fsck.f2fs: fix incorrect boundary of IS_VALID_NID()
>>
>>     nid should never equal to max_nid, fix it.
>>
>>     Signed-off-by: Chao Yu <yuchao0@huawei.com>
>>     Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
>>
>> In my code base, I was not having this patch but only have
>> "fsck.f2fs: check validity of nat journal", thus resulting into
>> data loss issues. I will now verify with this patch included.
> 
> Ah, that should be the root cause..
> 
> BTW, the diff was pointed out by Jaegeuk on patch ("fsck.f2fs: check validity of
> nat journal"), but it looks it was merged to the wrong patch. :P

Sorry about that, I didn't check it out on upstream branch.

Thanks,

> 
>  static inline bool IS_VALID_BLK_ADDR(struct f2fs_sb_info *sbi, u32 addr)
>  {
> +	if (addr == NULL_ADDR || addr == NEW_ADDR)
> +		return 1;
> +
>  	if (addr >= le64_to_cpu(F2FS_RAW_SUPER(sbi)->block_count) ||
>  				addr < SM_I(sbi)->main_blkaddr) {
>  		DBG(1, "block addr [0x%x]\n", addr);
> 
> Thanks,
> 
>>
>> Thanks for your support on this issue.
>>
>> Thanks,
>> Sahitya.
>>
>> On Sat, Jul 06, 2019 at 07:01:12AM +0530, Sahitya Tummala wrote:
>>> Hi Chao,
>>>
>>> On Sat, Jul 06, 2019 at 09:14:59AM +0800, Chao Yu wrote:
>>>> Hi Sahitya,
>>>>
>>>> On 2019-7-5 21:48, Sahitya Tummala wrote:
>>>>> Hi Chao,
>>>>>
>>>>> The journal doesn't look to be corrupted, but only block_addr
>>>>> in f2fs_nat_entry is 0 whereas nid and version seems to be valid.
>>>>> These could be just truncated nodes, right?
>>>>
>>>> Yes, but that should not fail the condition of journal check and result in
>>>> journal size truncation, right?
>>>>
>>>
>>> Yes, the patch fails for this condition as well and results into
>>> truncation of journal, due to this check  addr < SM_I(sbi)->main_blkaddr
>>> in IS_VALID_BLK_ADDR().
>>>
>>> Do you think we can correct it this way?
>>>
>>> diff --git a/fsck/f2fs.h b/fsck/f2fs.h
>>> index 417ca0b..00fb740 100644
>>> --- a/fsck/f2fs.h
>>> +++ b/fsck/f2fs.h
>>> @@ -364,8 +364,8 @@ static inline bool IS_VALID_NID(struct f2fs_sb_info *sbi, u32 nid)
>>>
>>>  static inline bool IS_VALID_BLK_ADDR(struct f2fs_sb_info *sbi, u32 addr)
>>>  {
>>> -       if (addr >= le64_to_cpu(F2FS_RAW_SUPER(sbi)->block_count) ||
>>> -                               addr < SM_I(sbi)->main_blkaddr) {
>>> +       if (addr && (addr >= le64_to_cpu(F2FS_RAW_SUPER(sbi)->block_count) ||
>>> +                               addr < SM_I(sbi)->main_blkaddr)) {
>>>                 DBG(1, "block addr [0x%x]\n", addr);
>>>                 return 0;
>>>         }
>>>
>>> Thanks,
>>>
>>>> Thanks,
>>>>
>>>>>
>>>>> Here is the full dump of nat journal entries after a SPO (this is not a 
>>>>> issue case, but just a normal case with SPO) -
>>>>>
>>>>> [   21.121325] fsck.f2fs: 0: addr 0 nid 5c5a ver 1\x0a
>>>>> [   21.127509] fsck.f2fs: 1: addr 0 nid 4c1d ver 1\x0a
>>>>> [   21.133760] fsck.f2fs: 2: addr 12075 nid 444f ver 0\x0a
>>>>> [   21.140325] fsck.f2fs: 3: addr 0 nid 4307 ver 0\x0a
>>>>> [   21.146510] fsck.f2fs: 4: addr 0 nid 4059 ver 1\x0a
>>>>> [   21.152745] fsck.f2fs: 5: addr 0 nid 3e69 ver 1\x0a
>>>>> [   21.158944] fsck.f2fs: 6: addr 0 nid 3cf9 ver 1\x0a
>>>>> [   21.165128] fsck.f2fs: 7: addr 0 nid 3b7b ver 1\x0a
>>>>> [   21.171362] fsck.f2fs: 8: addr 18f65 nid 34e2 ver 0\x0a
>>>>> [   21.177903] fsck.f2fs: 9: addr 12126 nid 2c87 ver 0\x0a
>>>>> [   21.184466] fsck.f2fs: 10: addr 0 nid 2a0a ver 1\x0a
>>>>> [   21.190795] fsck.f2fs: 11: addr 0 nid 2565 ver 1\x0a
>>>>> [   21.197070] fsck.f2fs: 12: addr 0 nid 1f48 ver 0\x0a
>>>>> [   21.203369] fsck.f2fs: 13: addr 155bf nid 1d76 ver 0\x0a
>>>>> [   21.210042] fsck.f2fs: 14: addr 19538 nid 18da ver 0\x0a
>>>>> [   21.216675] fsck.f2fs: 15: addr 0 nid 157f ver 0\x0a
>>>>> [   21.223037] fsck.f2fs: 16: addr d117 nid ac3 ver 0\x0a
>>>>> [   21.229547] fsck.f2fs: 17: addr 11e48 nid 805 ver 0\x0a
>>>>> [   21.236088] fsck.f2fs: 18: addr d112 nid 61c ver 0\x0a
>>>>> [   21.242585] fsck.f2fs: 19: addr d00b nid 45b ver 0\x0a
>>>>> [   21.249086] fsck.f2fs: 20: addr d111 nid 30d ver 0\x0a
>>>>> [   21.255544] fsck.f2fs: 21: addr d110 nid 1b5 ver 0\x0a
>>>>> [   21.262025] fsck.f2fs: 22: addr 0 nid 5e41 ver 1\x0a
>>>>> [   21.268342] fsck.f2fs: 23: addr 0 nid 5e46 ver 0\x0a
>>>>> [   21.274618] fsck.f2fs: 24: addr 0 nid 5a34 ver 1\x0a
>>>>> [   21.280916] fsck.f2fs: 25: addr 0 nid 59fa ver 1\x0a
>>>>> [   21.287192] fsck.f2fs: 26: addr 0 nid 57c3 ver 1\x0a
>>>>> [   21.293526] fsck.f2fs: 27: addr 0 nid 5776 ver 1\x0a
>>>>> [   21.299816] fsck.f2fs: 28: addr 0 nid 4f20 ver 1\x0a
>>>>> [   21.306091] fsck.f2fs: 29: addr 0 nid 4ff7 ver 0\x0a
>>>>> [   21.312416] fsck.f2fs: 30: addr 0 nid 47e1 ver 1\x0a
>>>>> [   21.318715] fsck.f2fs: 31: addr 0 nid 4880 ver 1\x0a
>>>>> [   21.324990] fsck.f2fs: 32: addr d118 nid ff4 ver 0\x0a
>>>>> [   21.331498] fsck.f2fs: 33: addr d10c nid ff9 ver 0\x0a
>>>>> [   21.337957] fsck.f2fs: 34: addr 0 nid 49ed ver 1\x0a
>>>>> [   21.344290] fsck.f2fs: 35: addr 0 nid 497e ver 0\x0a
>>>>> [   21.350612] fsck.f2fs: 36: addr fc5a nid 49d2 ver 0\x0a
>>>>>
>>>>> Thanks,
>>>>>
>>>>> On Fri, Jul 05, 2019 at 04:33:57PM +0530, Sahitya Tummala wrote:
>>>>>> Hi Chao,
>>>>>>
>>>>>> On Fri, Jul 05, 2019 at 06:11:47PM +0800, Chao Yu wrote:
>>>>>>> Hi Sahitya,
>>>>>>>
>>>>>>> On 2019/7/5 13:01, Sahitya Tummala wrote:
>>>>>>>> With the commit <979b25727800> ("fsck.f2fs: check validity of nat journal"),
>>>>>>>> there is a serious data loss issue observed during sudden power off tests.
>>>>>>>
>>>>>>> In that patch, since journal's data is untrusty, I just truncated n_nats to last
>>>>>>> valid one to fix original reported issue.
>>>>>>
>>>>>> The original issue will be fixed even without updating/truncating the n_nats
>>>>>> as we still have sanity checks on nid and blkaddr, right?
>>>>>>
>>>>>>>
>>>>>>> Actually, the problem here is we don't know which data is correct and which data
>>>>>>> is corrupted in nat or sit journal, so even we skip broken entry without
>>>>>>> modifying journal->n_{nats,sits}, latter entry may still contain broken data(but
>>>>>>> it may pass validation check), and those entries may cause potential data
>>>>>>> loss/corruption...
>>>>>>>
>>>>>>
>>>>>> This issue was reported by a customer in the following test scenario -
>>>>>>
>>>>>> Update the password in Android and then force crash. It is verified that
>>>>>> all the files related to password data has fsync() already.
>>>>>>
>>>>>> After some iterations (3-4) of this test, it is found password related files are
>>>>>> lost, failing the phone to bootup.
>>>>>>
>>>>>> Without this patch ("fsck.f2fs: check validity of nat journal"), they don't see
>>>>>> such issue. It means that the data present in journal seems to be valid atleast
>>>>>> in this case and truncating is causing to loose the valid data/files.
>>>>>>
>>>>>>> It looks no matter how we change the codes, we still may face data
>>>>>>> loos/corruption issues.
>>>>>>>
>>>>>>
>>>>>> Yes, maybe in some cases, the data loss is still expected, but there are
>>>>>> chances that data may be valid as well, which is confirmed by the test case above.
>>>>>>
>>>>>>> IMO, we need to find out the root cause of data corruption, So my question is
>>>>>>> that why sudden power off tests can make nat journal being corrupted? I doubt
>>>>>>> that in-place update method on checkpoint area of fsck may cause this... any
>>>>>>> other thoughts?
>>>>>>>
>>>>>>
>>>>>> In case of sudden power off, I have printed the all the journal entries in the
>>>>>> NAT and I several entries with valid blkaddr and nid as below -
>>>>>>
>>>>>> [   21.007649] fsck.f2fs: withfix: 0: addr 1c373 nid 45b\x0a
>>>>>> [   21.014399] fsck.f2fs: withfix: 1: addr 1c2c1 nid 1b5\x0a
>>>>>> [   21.021151] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 2/37 is invalid!!!\x0a
>>>>>> [   21.030860] fsck.f2fs: withfix: 3: addr 1c1f7 nid 19b2\x0a
>>>>>> [   21.037674] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 4/37 is invalid!!!\x0a
>>>>>> [   21.047353] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 5/37 is invalid!!!\x0a
>>>>>> [   21.057062] fsck.f2fs: withfix: 6: addr 1c377 nid ac3\x0a
>>>>>> [   21.063841] fsck.f2fs: withfix: 7: addr 1c378 nid ac9\x0a
>>>>>> [   21.070634] fsck.f2fs: withfix: 8: addr 1b2fe nid 807\x0a
>>>>>> [   21.077356] fsck.f2fs: withfix: 9: addr 17aeb nid 805\x0a
>>>>>> [   21.084102] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 10/37 is invalid!!!\x0a
>>>>>> [   21.093891] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 11/37 is invalid!!!\x0a
>>>>>> [   21.103662] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 12/37 is invalid!!!\x0a
>>>>>> [   21.113450] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 13/37 is invalid!!!\x0a
>>>>>> [   21.123227] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 14/37 is invalid!!!\x0a
>>>>>> [   21.133022] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 15/37 is invalid!!!\x0a
>>>>>> [   21.142792] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 16/37 is invalid!!!\x0a
>>>>>> [   21.152580] fsck.f2fs: withfix: 17: addr 1c37f nid ff4\x0a
>>>>>> [   21.159415] fsck.f2fs: withfix: 18: addr 1c36c nid ff9\x0a
>>>>>> [   21.166229] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 19/37 is invalid!!!\x0a
>>>>>> [   21.176039] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 20/37 is invalid!!!\x0a
>>>>>> [   21.185831] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 21/37 is invalid!!!\x0a
>>>>>> [   21.195631] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 22/37 is invalid!!!\x0a
>>>>>> [   21.205394] fsck.f2fs: withfix: 23: addr 1c2f3 nid 61c\x0a
>>>>>> [   21.212257] fsck.f2fs: withfix: 24: addr 1c391 nid 641\x0a
>>>>>> [   21.219093] fsck.f2fs: withfix: 25: addr 1c386 nid 33b\x0a
>>>>>> [   21.225905] fsck.f2fs: withfix: 26: addr 1c356 nid 258\x0a
>>>>>> [   21.232767] fsck.f2fs: withfix: 27: addr 1c2f2 nid 382\x0a
>>>>>> [   21.239601] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 28/37 is invalid!!!\x0a
>>>>>> [   21.249398] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 29/37 is invalid!!!\x0a
>>>>>> [   21.259160] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 30/37 is invalid!!!\x0a
>>>>>> [   21.268970] fsck.f2fs: withfix: 31: addr 1ac03 nid 1ccd\x0a
>>>>>> [   21.275874] fsck.f2fs: withfix: 32: addr 1c395 nid a57\x0a
>>>>>> [   21.282709] fsck.f2fs: withfix: 33: addr 1c392 nid 92e\x0a
>>>>>> [   21.289564] fsck.f2fs: withfix: 34: addr 1c38b nid aa1\x0a
>>>>>> [   21.296377] fsck.f2fs: withfix: 35: addr 1c393 nid 931\x0a
>>>>>> [   21.303240] fsck.f2fs: withfix: 36: addr 1c394 nid a35\x0a
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>>>
>>>>>>>> The reason is due to incorrect update of journal->n_nats in that patch.
>>>>>>>> When fsck encounters a nat journal entry with invalid blkaddr/nid, it
>>>>>>>> updates journal->n_nats with the index of such first entry and then continues
>>>>>>>> the for loop which scans all nats in journal in f2fs_init_nid_bitmap().
>>>>>>>> But that loop doesn't continue and doesn't update nm_i->nid_bitmap as well,
>>>>>>>> as it fails the for loop condition i < nats_in_cursum(journal).
>>>>>>>>
>>>>>>>> There could be several other valid entries in the journal after this
>>>>>>>> invalid entry. Due to this invalid journal->n_nats, build_nat_area_bitmap()
>>>>>>>> could not be build properly for valid nat journal entries. It further results
>>>>>>>> into these fsck asserts/errors and causes data loss due to incorrect fsck fix.
>>>>>>>>
>>>>>>>> [ASSERT] (sanity_check_nid: 372)  --> nid[0xfb01] ino is 0\x0a
>>>>>>>> [FIX] (__chk_dentries:1438)  --> Unlink [0xfb01] - _JeNcl242yB3Apz2MW,VOh6WmjsVht1W len[0x1c], type[0x1]\x0a
>>>>>>>> [ASSERT] (sanity_check_nid: 372)  --> nid[0xfa2d] ino is 0\x0a
>>>>>>>> [FIX] (__chk_dentries:1438)  --> Unlink [0xfa2d] - _++vN7axccjjGNI,TQEVmcOemY1EAAAA len[0x14], type[0x1]\x0a
>>>>>>>> ...
>>>>>>>> NID[0xba5] is unreachable\x0a
>>>>>>>> NID[0xfa9] is unreachable\x0a
>>>>>>>> ...
>>>>>>>> [FIX] (nullify_nat_entry:2089)  --> Remove nid [0xba5] in NAT\x0a
>>>>>>>> [FIX] (nullify_nat_entry:2089)  --> Remove nid [0xfa9] in NAT\x0a
>>>>>>>> ...
>>>>>>>>
>>>>>>>> Fixes: 979b25727800 ("fsck.f2fs: check validity of nat journal")
>>>>>>>> Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
>>>>>>>> ---
>>>>>>>>  fsck/mount.c | 2 --
>>>>>>>>  1 file changed, 2 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/fsck/mount.c b/fsck/mount.c
>>>>>>>> index 60e0e4a..1b93f49 100644
>>>>>>>> --- a/fsck/mount.c
>>>>>>>> +++ b/fsck/mount.c
>>>>>>>> @@ -1160,7 +1160,6 @@ static int f2fs_init_nid_bitmap(struct f2fs_sb_info *sbi)
>>>>>>>>  		addr = le32_to_cpu(nat_in_journal(journal, i).block_addr);
>>>>>>>>  		if (!IS_VALID_BLK_ADDR(sbi, addr)) {
>>>>>>>>  			MSG(0, "\tError: f2fs_init_nid_bitmap: addr(%u) is invalid!!!\n", addr);
>>>>>>>> -			journal->n_nats = cpu_to_le16(i);
>>>>>>>>  			c.fix_on = 1;
>>>>>>>>  			continue;
>>>>>>>>  		}
>>>>>>>> @@ -1168,7 +1167,6 @@ static int f2fs_init_nid_bitmap(struct f2fs_sb_info *sbi)
>>>>>>>>  		nid = le32_to_cpu(nid_in_journal(journal, i));
>>>>>>>>  		if (!IS_VALID_NID(sbi, nid)) {
>>>>>>>>  			MSG(0, "\tError: f2fs_init_nid_bitmap: nid(%u) is invalid!!!\n", nid);
>>>>>>>> -			journal->n_nats = cpu_to_le16(i);
>>>>>>>>  			c.fix_on = 1;
>>>>>>>>  			continue;
>>>>>>>>  		}
>>>>>>>>
>>>>>>
>>>>>> -- 
>>>>>> --
>>>>>> Sent by a consultant of the Qualcomm Innovation Center, Inc.
>>>>>> The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Linux-f2fs-devel mailing list
>>>>>> Linux-f2fs-devel@lists.sourceforge.net
>>>>>> https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
>>>>>
>>>
>>> -- 
>>> --
>>> Sent by a consultant of the Qualcomm Innovation Center, Inc.
>>> The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.
>>>
>>>
>>> _______________________________________________
>>> Linux-f2fs-devel mailing list
>>> Linux-f2fs-devel@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
>>
> 
> 
> _______________________________________________
> Linux-f2fs-devel mailing list
> Linux-f2fs-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
> 


_______________________________________________
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [f2fs-dev] [PATCH] fsck: Fix data loss issue observed during SPO
  2019-07-06  2:17               ` Chao Yu
@ 2019-07-06  4:50                 ` Jaegeuk Kim
  2019-07-07  5:25                   ` Chao Yu
  0 siblings, 1 reply; 14+ messages in thread
From: Jaegeuk Kim @ 2019-07-06  4:50 UTC (permalink / raw)
  To: Chao Yu; +Cc: linux-f2fs-devel

On 07/06, Chao Yu wrote:
> On 2019-7-6 10:10, Chao Yu wrote:
> > Hi Sahitya,
> > 
> > On 2019-7-6 9:37, Sahitya Tummala wrote:
> >> Hi Chao,
> >>
> >> I just realized that there is a patch from you to fix this already.
> >>
> >> commit 92ea71c711681a8f575d09d57d5ceaebd6644fcc
> >> Author: Chao Yu <yuchao0@huawei.com>
> >> Date:   Wed Jan 9 10:48:25 2019 +0800
> >>
> >>     fsck.f2fs: fix incorrect boundary of IS_VALID_NID()
> >>
> >>     nid should never equal to max_nid, fix it.
> >>
> >>     Signed-off-by: Chao Yu <yuchao0@huawei.com>
> >>     Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
> >>
> >> In my code base, I was not having this patch but only have
> >> "fsck.f2fs: check validity of nat journal", thus resulting into
> >> data loss issues. I will now verify with this patch included.
> > 
> > Ah, that should be the root cause..
> > 
> > BTW, the diff was pointed out by Jaegeuk on patch ("fsck.f2fs: check validity of
> > nat journal"), but it looks it was merged to the wrong patch. :P
> 
> Sorry about that, I didn't check it out on upstream branch.

I guess we need this diff?

> 
> Thanks,
> 
> > 
> >  static inline bool IS_VALID_BLK_ADDR(struct f2fs_sb_info *sbi, u32 addr)
> >  {
> > +	if (addr == NULL_ADDR || addr == NEW_ADDR)
> > +		return 1;
> > +
> >  	if (addr >= le64_to_cpu(F2FS_RAW_SUPER(sbi)->block_count) ||
> >  				addr < SM_I(sbi)->main_blkaddr) {
> >  		DBG(1, "block addr [0x%x]\n", addr);
> > 
> > Thanks,
> > 
> >>
> >> Thanks for your support on this issue.
> >>
> >> Thanks,
> >> Sahitya.
> >>
> >> On Sat, Jul 06, 2019 at 07:01:12AM +0530, Sahitya Tummala wrote:
> >>> Hi Chao,
> >>>
> >>> On Sat, Jul 06, 2019 at 09:14:59AM +0800, Chao Yu wrote:
> >>>> Hi Sahitya,
> >>>>
> >>>> On 2019-7-5 21:48, Sahitya Tummala wrote:
> >>>>> Hi Chao,
> >>>>>
> >>>>> The journal doesn't look to be corrupted, but only block_addr
> >>>>> in f2fs_nat_entry is 0 whereas nid and version seems to be valid.
> >>>>> These could be just truncated nodes, right?
> >>>>
> >>>> Yes, but that should not fail the condition of journal check and result in
> >>>> journal size truncation, right?
> >>>>
> >>>
> >>> Yes, the patch fails for this condition as well and results into
> >>> truncation of journal, due to this check  addr < SM_I(sbi)->main_blkaddr
> >>> in IS_VALID_BLK_ADDR().
> >>>
> >>> Do you think we can correct it this way?
> >>>
> >>> diff --git a/fsck/f2fs.h b/fsck/f2fs.h
> >>> index 417ca0b..00fb740 100644
> >>> --- a/fsck/f2fs.h
> >>> +++ b/fsck/f2fs.h
> >>> @@ -364,8 +364,8 @@ static inline bool IS_VALID_NID(struct f2fs_sb_info *sbi, u32 nid)
> >>>
> >>>  static inline bool IS_VALID_BLK_ADDR(struct f2fs_sb_info *sbi, u32 addr)
> >>>  {
> >>> -       if (addr >= le64_to_cpu(F2FS_RAW_SUPER(sbi)->block_count) ||
> >>> -                               addr < SM_I(sbi)->main_blkaddr) {
> >>> +       if (addr && (addr >= le64_to_cpu(F2FS_RAW_SUPER(sbi)->block_count) ||
> >>> +                               addr < SM_I(sbi)->main_blkaddr)) {
> >>>                 DBG(1, "block addr [0x%x]\n", addr);
> >>>                 return 0;
> >>>         }
> >>>
> >>> Thanks,
> >>>
> >>>> Thanks,
> >>>>
> >>>>>
> >>>>> Here is the full dump of nat journal entries after a SPO (this is not a 
> >>>>> issue case, but just a normal case with SPO) -
> >>>>>
> >>>>> [   21.121325] fsck.f2fs: 0: addr 0 nid 5c5a ver 1\x0a
> >>>>> [   21.127509] fsck.f2fs: 1: addr 0 nid 4c1d ver 1\x0a
> >>>>> [   21.133760] fsck.f2fs: 2: addr 12075 nid 444f ver 0\x0a
> >>>>> [   21.140325] fsck.f2fs: 3: addr 0 nid 4307 ver 0\x0a
> >>>>> [   21.146510] fsck.f2fs: 4: addr 0 nid 4059 ver 1\x0a
> >>>>> [   21.152745] fsck.f2fs: 5: addr 0 nid 3e69 ver 1\x0a
> >>>>> [   21.158944] fsck.f2fs: 6: addr 0 nid 3cf9 ver 1\x0a
> >>>>> [   21.165128] fsck.f2fs: 7: addr 0 nid 3b7b ver 1\x0a
> >>>>> [   21.171362] fsck.f2fs: 8: addr 18f65 nid 34e2 ver 0\x0a
> >>>>> [   21.177903] fsck.f2fs: 9: addr 12126 nid 2c87 ver 0\x0a
> >>>>> [   21.184466] fsck.f2fs: 10: addr 0 nid 2a0a ver 1\x0a
> >>>>> [   21.190795] fsck.f2fs: 11: addr 0 nid 2565 ver 1\x0a
> >>>>> [   21.197070] fsck.f2fs: 12: addr 0 nid 1f48 ver 0\x0a
> >>>>> [   21.203369] fsck.f2fs: 13: addr 155bf nid 1d76 ver 0\x0a
> >>>>> [   21.210042] fsck.f2fs: 14: addr 19538 nid 18da ver 0\x0a
> >>>>> [   21.216675] fsck.f2fs: 15: addr 0 nid 157f ver 0\x0a
> >>>>> [   21.223037] fsck.f2fs: 16: addr d117 nid ac3 ver 0\x0a
> >>>>> [   21.229547] fsck.f2fs: 17: addr 11e48 nid 805 ver 0\x0a
> >>>>> [   21.236088] fsck.f2fs: 18: addr d112 nid 61c ver 0\x0a
> >>>>> [   21.242585] fsck.f2fs: 19: addr d00b nid 45b ver 0\x0a
> >>>>> [   21.249086] fsck.f2fs: 20: addr d111 nid 30d ver 0\x0a
> >>>>> [   21.255544] fsck.f2fs: 21: addr d110 nid 1b5 ver 0\x0a
> >>>>> [   21.262025] fsck.f2fs: 22: addr 0 nid 5e41 ver 1\x0a
> >>>>> [   21.268342] fsck.f2fs: 23: addr 0 nid 5e46 ver 0\x0a
> >>>>> [   21.274618] fsck.f2fs: 24: addr 0 nid 5a34 ver 1\x0a
> >>>>> [   21.280916] fsck.f2fs: 25: addr 0 nid 59fa ver 1\x0a
> >>>>> [   21.287192] fsck.f2fs: 26: addr 0 nid 57c3 ver 1\x0a
> >>>>> [   21.293526] fsck.f2fs: 27: addr 0 nid 5776 ver 1\x0a
> >>>>> [   21.299816] fsck.f2fs: 28: addr 0 nid 4f20 ver 1\x0a
> >>>>> [   21.306091] fsck.f2fs: 29: addr 0 nid 4ff7 ver 0\x0a
> >>>>> [   21.312416] fsck.f2fs: 30: addr 0 nid 47e1 ver 1\x0a
> >>>>> [   21.318715] fsck.f2fs: 31: addr 0 nid 4880 ver 1\x0a
> >>>>> [   21.324990] fsck.f2fs: 32: addr d118 nid ff4 ver 0\x0a
> >>>>> [   21.331498] fsck.f2fs: 33: addr d10c nid ff9 ver 0\x0a
> >>>>> [   21.337957] fsck.f2fs: 34: addr 0 nid 49ed ver 1\x0a
> >>>>> [   21.344290] fsck.f2fs: 35: addr 0 nid 497e ver 0\x0a
> >>>>> [   21.350612] fsck.f2fs: 36: addr fc5a nid 49d2 ver 0\x0a
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> On Fri, Jul 05, 2019 at 04:33:57PM +0530, Sahitya Tummala wrote:
> >>>>>> Hi Chao,
> >>>>>>
> >>>>>> On Fri, Jul 05, 2019 at 06:11:47PM +0800, Chao Yu wrote:
> >>>>>>> Hi Sahitya,
> >>>>>>>
> >>>>>>> On 2019/7/5 13:01, Sahitya Tummala wrote:
> >>>>>>>> With the commit <979b25727800> ("fsck.f2fs: check validity of nat journal"),
> >>>>>>>> there is a serious data loss issue observed during sudden power off tests.
> >>>>>>>
> >>>>>>> In that patch, since journal's data is untrusty, I just truncated n_nats to last
> >>>>>>> valid one to fix original reported issue.
> >>>>>>
> >>>>>> The original issue will be fixed even without updating/truncating the n_nats
> >>>>>> as we still have sanity checks on nid and blkaddr, right?
> >>>>>>
> >>>>>>>
> >>>>>>> Actually, the problem here is we don't know which data is correct and which data
> >>>>>>> is corrupted in nat or sit journal, so even we skip broken entry without
> >>>>>>> modifying journal->n_{nats,sits}, latter entry may still contain broken data(but
> >>>>>>> it may pass validation check), and those entries may cause potential data
> >>>>>>> loss/corruption...
> >>>>>>>
> >>>>>>
> >>>>>> This issue was reported by a customer in the following test scenario -
> >>>>>>
> >>>>>> Update the password in Android and then force crash. It is verified that
> >>>>>> all the files related to password data has fsync() already.
> >>>>>>
> >>>>>> After some iterations (3-4) of this test, it is found password related files are
> >>>>>> lost, failing the phone to bootup.
> >>>>>>
> >>>>>> Without this patch ("fsck.f2fs: check validity of nat journal"), they don't see
> >>>>>> such issue. It means that the data present in journal seems to be valid atleast
> >>>>>> in this case and truncating is causing to loose the valid data/files.
> >>>>>>
> >>>>>>> It looks no matter how we change the codes, we still may face data
> >>>>>>> loos/corruption issues.
> >>>>>>>
> >>>>>>
> >>>>>> Yes, maybe in some cases, the data loss is still expected, but there are
> >>>>>> chances that data may be valid as well, which is confirmed by the test case above.
> >>>>>>
> >>>>>>> IMO, we need to find out the root cause of data corruption, So my question is
> >>>>>>> that why sudden power off tests can make nat journal being corrupted? I doubt
> >>>>>>> that in-place update method on checkpoint area of fsck may cause this... any
> >>>>>>> other thoughts?
> >>>>>>>
> >>>>>>
> >>>>>> In case of sudden power off, I have printed the all the journal entries in the
> >>>>>> NAT and I several entries with valid blkaddr and nid as below -
> >>>>>>
> >>>>>> [   21.007649] fsck.f2fs: withfix: 0: addr 1c373 nid 45b\x0a
> >>>>>> [   21.014399] fsck.f2fs: withfix: 1: addr 1c2c1 nid 1b5\x0a
> >>>>>> [   21.021151] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 2/37 is invalid!!!\x0a
> >>>>>> [   21.030860] fsck.f2fs: withfix: 3: addr 1c1f7 nid 19b2\x0a
> >>>>>> [   21.037674] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 4/37 is invalid!!!\x0a
> >>>>>> [   21.047353] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 5/37 is invalid!!!\x0a
> >>>>>> [   21.057062] fsck.f2fs: withfix: 6: addr 1c377 nid ac3\x0a
> >>>>>> [   21.063841] fsck.f2fs: withfix: 7: addr 1c378 nid ac9\x0a
> >>>>>> [   21.070634] fsck.f2fs: withfix: 8: addr 1b2fe nid 807\x0a
> >>>>>> [   21.077356] fsck.f2fs: withfix: 9: addr 17aeb nid 805\x0a
> >>>>>> [   21.084102] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 10/37 is invalid!!!\x0a
> >>>>>> [   21.093891] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 11/37 is invalid!!!\x0a
> >>>>>> [   21.103662] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 12/37 is invalid!!!\x0a
> >>>>>> [   21.113450] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 13/37 is invalid!!!\x0a
> >>>>>> [   21.123227] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 14/37 is invalid!!!\x0a
> >>>>>> [   21.133022] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 15/37 is invalid!!!\x0a
> >>>>>> [   21.142792] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 16/37 is invalid!!!\x0a
> >>>>>> [   21.152580] fsck.f2fs: withfix: 17: addr 1c37f nid ff4\x0a
> >>>>>> [   21.159415] fsck.f2fs: withfix: 18: addr 1c36c nid ff9\x0a
> >>>>>> [   21.166229] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 19/37 is invalid!!!\x0a
> >>>>>> [   21.176039] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 20/37 is invalid!!!\x0a
> >>>>>> [   21.185831] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 21/37 is invalid!!!\x0a
> >>>>>> [   21.195631] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 22/37 is invalid!!!\x0a
> >>>>>> [   21.205394] fsck.f2fs: withfix: 23: addr 1c2f3 nid 61c\x0a
> >>>>>> [   21.212257] fsck.f2fs: withfix: 24: addr 1c391 nid 641\x0a
> >>>>>> [   21.219093] fsck.f2fs: withfix: 25: addr 1c386 nid 33b\x0a
> >>>>>> [   21.225905] fsck.f2fs: withfix: 26: addr 1c356 nid 258\x0a
> >>>>>> [   21.232767] fsck.f2fs: withfix: 27: addr 1c2f2 nid 382\x0a
> >>>>>> [   21.239601] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 28/37 is invalid!!!\x0a
> >>>>>> [   21.249398] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 29/37 is invalid!!!\x0a
> >>>>>> [   21.259160] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 30/37 is invalid!!!\x0a
> >>>>>> [   21.268970] fsck.f2fs: withfix: 31: addr 1ac03 nid 1ccd\x0a
> >>>>>> [   21.275874] fsck.f2fs: withfix: 32: addr 1c395 nid a57\x0a
> >>>>>> [   21.282709] fsck.f2fs: withfix: 33: addr 1c392 nid 92e\x0a
> >>>>>> [   21.289564] fsck.f2fs: withfix: 34: addr 1c38b nid aa1\x0a
> >>>>>> [   21.296377] fsck.f2fs: withfix: 35: addr 1c393 nid 931\x0a
> >>>>>> [   21.303240] fsck.f2fs: withfix: 36: addr 1c394 nid a35\x0a
> >>>>>>
> >>>>>> Thanks,
> >>>>>>
> >>>>>>> Thanks,
> >>>>>>>
> >>>>>>>>
> >>>>>>>> The reason is due to incorrect update of journal->n_nats in that patch.
> >>>>>>>> When fsck encounters a nat journal entry with invalid blkaddr/nid, it
> >>>>>>>> updates journal->n_nats with the index of such first entry and then continues
> >>>>>>>> the for loop which scans all nats in journal in f2fs_init_nid_bitmap().
> >>>>>>>> But that loop doesn't continue and doesn't update nm_i->nid_bitmap as well,
> >>>>>>>> as it fails the for loop condition i < nats_in_cursum(journal).
> >>>>>>>>
> >>>>>>>> There could be several other valid entries in the journal after this
> >>>>>>>> invalid entry. Due to this invalid journal->n_nats, build_nat_area_bitmap()
> >>>>>>>> could not be build properly for valid nat journal entries. It further results
> >>>>>>>> into these fsck asserts/errors and causes data loss due to incorrect fsck fix.
> >>>>>>>>
> >>>>>>>> [ASSERT] (sanity_check_nid: 372)  --> nid[0xfb01] ino is 0\x0a
> >>>>>>>> [FIX] (__chk_dentries:1438)  --> Unlink [0xfb01] - _JeNcl242yB3Apz2MW,VOh6WmjsVht1W len[0x1c], type[0x1]\x0a
> >>>>>>>> [ASSERT] (sanity_check_nid: 372)  --> nid[0xfa2d] ino is 0\x0a
> >>>>>>>> [FIX] (__chk_dentries:1438)  --> Unlink [0xfa2d] - _++vN7axccjjGNI,TQEVmcOemY1EAAAA len[0x14], type[0x1]\x0a
> >>>>>>>> ...
> >>>>>>>> NID[0xba5] is unreachable\x0a
> >>>>>>>> NID[0xfa9] is unreachable\x0a
> >>>>>>>> ...
> >>>>>>>> [FIX] (nullify_nat_entry:2089)  --> Remove nid [0xba5] in NAT\x0a
> >>>>>>>> [FIX] (nullify_nat_entry:2089)  --> Remove nid [0xfa9] in NAT\x0a
> >>>>>>>> ...
> >>>>>>>>
> >>>>>>>> Fixes: 979b25727800 ("fsck.f2fs: check validity of nat journal")
> >>>>>>>> Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
> >>>>>>>> ---
> >>>>>>>>  fsck/mount.c | 2 --
> >>>>>>>>  1 file changed, 2 deletions(-)
> >>>>>>>>
> >>>>>>>> diff --git a/fsck/mount.c b/fsck/mount.c
> >>>>>>>> index 60e0e4a..1b93f49 100644
> >>>>>>>> --- a/fsck/mount.c
> >>>>>>>> +++ b/fsck/mount.c
> >>>>>>>> @@ -1160,7 +1160,6 @@ static int f2fs_init_nid_bitmap(struct f2fs_sb_info *sbi)
> >>>>>>>>  		addr = le32_to_cpu(nat_in_journal(journal, i).block_addr);
> >>>>>>>>  		if (!IS_VALID_BLK_ADDR(sbi, addr)) {
> >>>>>>>>  			MSG(0, "\tError: f2fs_init_nid_bitmap: addr(%u) is invalid!!!\n", addr);
> >>>>>>>> -			journal->n_nats = cpu_to_le16(i);
> >>>>>>>>  			c.fix_on = 1;
> >>>>>>>>  			continue;
> >>>>>>>>  		}
> >>>>>>>> @@ -1168,7 +1167,6 @@ static int f2fs_init_nid_bitmap(struct f2fs_sb_info *sbi)
> >>>>>>>>  		nid = le32_to_cpu(nid_in_journal(journal, i));
> >>>>>>>>  		if (!IS_VALID_NID(sbi, nid)) {
> >>>>>>>>  			MSG(0, "\tError: f2fs_init_nid_bitmap: nid(%u) is invalid!!!\n", nid);
> >>>>>>>> -			journal->n_nats = cpu_to_le16(i);
> >>>>>>>>  			c.fix_on = 1;
> >>>>>>>>  			continue;
> >>>>>>>>  		}
> >>>>>>>>
> >>>>>>
> >>>>>> -- 
> >>>>>> --
> >>>>>> Sent by a consultant of the Qualcomm Innovation Center, Inc.
> >>>>>> The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.
> >>>>>>
> >>>>>>
> >>>>>> _______________________________________________
> >>>>>> Linux-f2fs-devel mailing list
> >>>>>> Linux-f2fs-devel@lists.sourceforge.net
> >>>>>> https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
> >>>>>
> >>>
> >>> -- 
> >>> --
> >>> Sent by a consultant of the Qualcomm Innovation Center, Inc.
> >>> The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.
> >>>
> >>>
> >>> _______________________________________________
> >>> Linux-f2fs-devel mailing list
> >>> Linux-f2fs-devel@lists.sourceforge.net
> >>> https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
> >>
> > 
> > 
> > _______________________________________________
> > Linux-f2fs-devel mailing list
> > Linux-f2fs-devel@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
> > 


_______________________________________________
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [f2fs-dev] [PATCH] fsck: Fix data loss issue observed during SPO
  2019-07-06  4:50                 ` Jaegeuk Kim
@ 2019-07-07  5:25                   ` Chao Yu
  2019-07-08 23:43                     ` Jaegeuk Kim
  0 siblings, 1 reply; 14+ messages in thread
From: Chao Yu @ 2019-07-07  5:25 UTC (permalink / raw)
  To: Jaegeuk Kim; +Cc: linux-f2fs-devel

On 2019-7-6 12:50, Jaegeuk Kim wrote:
> On 07/06, Chao Yu wrote:
>> On 2019-7-6 10:10, Chao Yu wrote:
>>> Hi Sahitya,
>>>
>>> On 2019-7-6 9:37, Sahitya Tummala wrote:
>>>> Hi Chao,
>>>>
>>>> I just realized that there is a patch from you to fix this already.
>>>>
>>>> commit 92ea71c711681a8f575d09d57d5ceaebd6644fcc
>>>> Author: Chao Yu <yuchao0@huawei.com>
>>>> Date:   Wed Jan 9 10:48:25 2019 +0800
>>>>
>>>>     fsck.f2fs: fix incorrect boundary of IS_VALID_NID()
>>>>
>>>>     nid should never equal to max_nid, fix it.
>>>>
>>>>     Signed-off-by: Chao Yu <yuchao0@huawei.com>
>>>>     Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
>>>>
>>>> In my code base, I was not having this patch but only have
>>>> "fsck.f2fs: check validity of nat journal", thus resulting into
>>>> data loss issues. I will now verify with this patch included.
>>>
>>> Ah, that should be the root cause..
>>>
>>> BTW, the diff was pointed out by Jaegeuk on patch ("fsck.f2fs: check validity of
>>> nat journal"), but it looks it was merged to the wrong patch. :P
>>
>> Sorry about that, I didn't check it out on upstream branch.
> 
> I guess we need this diff?

Yes, this diff should be merged into ("fsck.f2fs: check validity of nat
journal") instead of ("fsck.f2fs: fix incorrect boundary of IS_VALID_NID()").

BTW, I think we'd better to use __is_valid_data_blkaddr/f2fs_is_valid_blkaddr to
check blkaddr in different condition like we did in kernel.

Thanks,

> 
>>
>> Thanks,
>>
>>>
>>>  static inline bool IS_VALID_BLK_ADDR(struct f2fs_sb_info *sbi, u32 addr)
>>>  {
>>> +	if (addr == NULL_ADDR || addr == NEW_ADDR)
>>> +		return 1;
>>> +
>>>  	if (addr >= le64_to_cpu(F2FS_RAW_SUPER(sbi)->block_count) ||
>>>  				addr < SM_I(sbi)->main_blkaddr) {
>>>  		DBG(1, "block addr [0x%x]\n", addr);
>>>
>>> Thanks,
>>>
>>>>
>>>> Thanks for your support on this issue.
>>>>
>>>> Thanks,
>>>> Sahitya.
>>>>
>>>> On Sat, Jul 06, 2019 at 07:01:12AM +0530, Sahitya Tummala wrote:
>>>>> Hi Chao,
>>>>>
>>>>> On Sat, Jul 06, 2019 at 09:14:59AM +0800, Chao Yu wrote:
>>>>>> Hi Sahitya,
>>>>>>
>>>>>> On 2019-7-5 21:48, Sahitya Tummala wrote:
>>>>>>> Hi Chao,
>>>>>>>
>>>>>>> The journal doesn't look to be corrupted, but only block_addr
>>>>>>> in f2fs_nat_entry is 0 whereas nid and version seems to be valid.
>>>>>>> These could be just truncated nodes, right?
>>>>>>
>>>>>> Yes, but that should not fail the condition of journal check and result in
>>>>>> journal size truncation, right?
>>>>>>
>>>>>
>>>>> Yes, the patch fails for this condition as well and results into
>>>>> truncation of journal, due to this check  addr < SM_I(sbi)->main_blkaddr
>>>>> in IS_VALID_BLK_ADDR().
>>>>>
>>>>> Do you think we can correct it this way?
>>>>>
>>>>> diff --git a/fsck/f2fs.h b/fsck/f2fs.h
>>>>> index 417ca0b..00fb740 100644
>>>>> --- a/fsck/f2fs.h
>>>>> +++ b/fsck/f2fs.h
>>>>> @@ -364,8 +364,8 @@ static inline bool IS_VALID_NID(struct f2fs_sb_info *sbi, u32 nid)
>>>>>
>>>>>  static inline bool IS_VALID_BLK_ADDR(struct f2fs_sb_info *sbi, u32 addr)
>>>>>  {
>>>>> -       if (addr >= le64_to_cpu(F2FS_RAW_SUPER(sbi)->block_count) ||
>>>>> -                               addr < SM_I(sbi)->main_blkaddr) {
>>>>> +       if (addr && (addr >= le64_to_cpu(F2FS_RAW_SUPER(sbi)->block_count) ||
>>>>> +                               addr < SM_I(sbi)->main_blkaddr)) {
>>>>>                 DBG(1, "block addr [0x%x]\n", addr);
>>>>>                 return 0;
>>>>>         }
>>>>>
>>>>> Thanks,
>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>>>
>>>>>>> Here is the full dump of nat journal entries after a SPO (this is not a 
>>>>>>> issue case, but just a normal case with SPO) -
>>>>>>>
>>>>>>> [   21.121325] fsck.f2fs: 0: addr 0 nid 5c5a ver 1\x0a
>>>>>>> [   21.127509] fsck.f2fs: 1: addr 0 nid 4c1d ver 1\x0a
>>>>>>> [   21.133760] fsck.f2fs: 2: addr 12075 nid 444f ver 0\x0a
>>>>>>> [   21.140325] fsck.f2fs: 3: addr 0 nid 4307 ver 0\x0a
>>>>>>> [   21.146510] fsck.f2fs: 4: addr 0 nid 4059 ver 1\x0a
>>>>>>> [   21.152745] fsck.f2fs: 5: addr 0 nid 3e69 ver 1\x0a
>>>>>>> [   21.158944] fsck.f2fs: 6: addr 0 nid 3cf9 ver 1\x0a
>>>>>>> [   21.165128] fsck.f2fs: 7: addr 0 nid 3b7b ver 1\x0a
>>>>>>> [   21.171362] fsck.f2fs: 8: addr 18f65 nid 34e2 ver 0\x0a
>>>>>>> [   21.177903] fsck.f2fs: 9: addr 12126 nid 2c87 ver 0\x0a
>>>>>>> [   21.184466] fsck.f2fs: 10: addr 0 nid 2a0a ver 1\x0a
>>>>>>> [   21.190795] fsck.f2fs: 11: addr 0 nid 2565 ver 1\x0a
>>>>>>> [   21.197070] fsck.f2fs: 12: addr 0 nid 1f48 ver 0\x0a
>>>>>>> [   21.203369] fsck.f2fs: 13: addr 155bf nid 1d76 ver 0\x0a
>>>>>>> [   21.210042] fsck.f2fs: 14: addr 19538 nid 18da ver 0\x0a
>>>>>>> [   21.216675] fsck.f2fs: 15: addr 0 nid 157f ver 0\x0a
>>>>>>> [   21.223037] fsck.f2fs: 16: addr d117 nid ac3 ver 0\x0a
>>>>>>> [   21.229547] fsck.f2fs: 17: addr 11e48 nid 805 ver 0\x0a
>>>>>>> [   21.236088] fsck.f2fs: 18: addr d112 nid 61c ver 0\x0a
>>>>>>> [   21.242585] fsck.f2fs: 19: addr d00b nid 45b ver 0\x0a
>>>>>>> [   21.249086] fsck.f2fs: 20: addr d111 nid 30d ver 0\x0a
>>>>>>> [   21.255544] fsck.f2fs: 21: addr d110 nid 1b5 ver 0\x0a
>>>>>>> [   21.262025] fsck.f2fs: 22: addr 0 nid 5e41 ver 1\x0a
>>>>>>> [   21.268342] fsck.f2fs: 23: addr 0 nid 5e46 ver 0\x0a
>>>>>>> [   21.274618] fsck.f2fs: 24: addr 0 nid 5a34 ver 1\x0a
>>>>>>> [   21.280916] fsck.f2fs: 25: addr 0 nid 59fa ver 1\x0a
>>>>>>> [   21.287192] fsck.f2fs: 26: addr 0 nid 57c3 ver 1\x0a
>>>>>>> [   21.293526] fsck.f2fs: 27: addr 0 nid 5776 ver 1\x0a
>>>>>>> [   21.299816] fsck.f2fs: 28: addr 0 nid 4f20 ver 1\x0a
>>>>>>> [   21.306091] fsck.f2fs: 29: addr 0 nid 4ff7 ver 0\x0a
>>>>>>> [   21.312416] fsck.f2fs: 30: addr 0 nid 47e1 ver 1\x0a
>>>>>>> [   21.318715] fsck.f2fs: 31: addr 0 nid 4880 ver 1\x0a
>>>>>>> [   21.324990] fsck.f2fs: 32: addr d118 nid ff4 ver 0\x0a
>>>>>>> [   21.331498] fsck.f2fs: 33: addr d10c nid ff9 ver 0\x0a
>>>>>>> [   21.337957] fsck.f2fs: 34: addr 0 nid 49ed ver 1\x0a
>>>>>>> [   21.344290] fsck.f2fs: 35: addr 0 nid 497e ver 0\x0a
>>>>>>> [   21.350612] fsck.f2fs: 36: addr fc5a nid 49d2 ver 0\x0a
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> On Fri, Jul 05, 2019 at 04:33:57PM +0530, Sahitya Tummala wrote:
>>>>>>>> Hi Chao,
>>>>>>>>
>>>>>>>> On Fri, Jul 05, 2019 at 06:11:47PM +0800, Chao Yu wrote:
>>>>>>>>> Hi Sahitya,
>>>>>>>>>
>>>>>>>>> On 2019/7/5 13:01, Sahitya Tummala wrote:
>>>>>>>>>> With the commit <979b25727800> ("fsck.f2fs: check validity of nat journal"),
>>>>>>>>>> there is a serious data loss issue observed during sudden power off tests.
>>>>>>>>>
>>>>>>>>> In that patch, since journal's data is untrusty, I just truncated n_nats to last
>>>>>>>>> valid one to fix original reported issue.
>>>>>>>>
>>>>>>>> The original issue will be fixed even without updating/truncating the n_nats
>>>>>>>> as we still have sanity checks on nid and blkaddr, right?
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Actually, the problem here is we don't know which data is correct and which data
>>>>>>>>> is corrupted in nat or sit journal, so even we skip broken entry without
>>>>>>>>> modifying journal->n_{nats,sits}, latter entry may still contain broken data(but
>>>>>>>>> it may pass validation check), and those entries may cause potential data
>>>>>>>>> loss/corruption...
>>>>>>>>>
>>>>>>>>
>>>>>>>> This issue was reported by a customer in the following test scenario -
>>>>>>>>
>>>>>>>> Update the password in Android and then force crash. It is verified that
>>>>>>>> all the files related to password data has fsync() already.
>>>>>>>>
>>>>>>>> After some iterations (3-4) of this test, it is found password related files are
>>>>>>>> lost, failing the phone to bootup.
>>>>>>>>
>>>>>>>> Without this patch ("fsck.f2fs: check validity of nat journal"), they don't see
>>>>>>>> such issue. It means that the data present in journal seems to be valid atleast
>>>>>>>> in this case and truncating is causing to loose the valid data/files.
>>>>>>>>
>>>>>>>>> It looks no matter how we change the codes, we still may face data
>>>>>>>>> loos/corruption issues.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Yes, maybe in some cases, the data loss is still expected, but there are
>>>>>>>> chances that data may be valid as well, which is confirmed by the test case above.
>>>>>>>>
>>>>>>>>> IMO, we need to find out the root cause of data corruption, So my question is
>>>>>>>>> that why sudden power off tests can make nat journal being corrupted? I doubt
>>>>>>>>> that in-place update method on checkpoint area of fsck may cause this... any
>>>>>>>>> other thoughts?
>>>>>>>>>
>>>>>>>>
>>>>>>>> In case of sudden power off, I have printed the all the journal entries in the
>>>>>>>> NAT and I several entries with valid blkaddr and nid as below -
>>>>>>>>
>>>>>>>> [   21.007649] fsck.f2fs: withfix: 0: addr 1c373 nid 45b\x0a
>>>>>>>> [   21.014399] fsck.f2fs: withfix: 1: addr 1c2c1 nid 1b5\x0a
>>>>>>>> [   21.021151] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 2/37 is invalid!!!\x0a
>>>>>>>> [   21.030860] fsck.f2fs: withfix: 3: addr 1c1f7 nid 19b2\x0a
>>>>>>>> [   21.037674] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 4/37 is invalid!!!\x0a
>>>>>>>> [   21.047353] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 5/37 is invalid!!!\x0a
>>>>>>>> [   21.057062] fsck.f2fs: withfix: 6: addr 1c377 nid ac3\x0a
>>>>>>>> [   21.063841] fsck.f2fs: withfix: 7: addr 1c378 nid ac9\x0a
>>>>>>>> [   21.070634] fsck.f2fs: withfix: 8: addr 1b2fe nid 807\x0a
>>>>>>>> [   21.077356] fsck.f2fs: withfix: 9: addr 17aeb nid 805\x0a
>>>>>>>> [   21.084102] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 10/37 is invalid!!!\x0a
>>>>>>>> [   21.093891] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 11/37 is invalid!!!\x0a
>>>>>>>> [   21.103662] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 12/37 is invalid!!!\x0a
>>>>>>>> [   21.113450] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 13/37 is invalid!!!\x0a
>>>>>>>> [   21.123227] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 14/37 is invalid!!!\x0a
>>>>>>>> [   21.133022] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 15/37 is invalid!!!\x0a
>>>>>>>> [   21.142792] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 16/37 is invalid!!!\x0a
>>>>>>>> [   21.152580] fsck.f2fs: withfix: 17: addr 1c37f nid ff4\x0a
>>>>>>>> [   21.159415] fsck.f2fs: withfix: 18: addr 1c36c nid ff9\x0a
>>>>>>>> [   21.166229] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 19/37 is invalid!!!\x0a
>>>>>>>> [   21.176039] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 20/37 is invalid!!!\x0a
>>>>>>>> [   21.185831] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 21/37 is invalid!!!\x0a
>>>>>>>> [   21.195631] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 22/37 is invalid!!!\x0a
>>>>>>>> [   21.205394] fsck.f2fs: withfix: 23: addr 1c2f3 nid 61c\x0a
>>>>>>>> [   21.212257] fsck.f2fs: withfix: 24: addr 1c391 nid 641\x0a
>>>>>>>> [   21.219093] fsck.f2fs: withfix: 25: addr 1c386 nid 33b\x0a
>>>>>>>> [   21.225905] fsck.f2fs: withfix: 26: addr 1c356 nid 258\x0a
>>>>>>>> [   21.232767] fsck.f2fs: withfix: 27: addr 1c2f2 nid 382\x0a
>>>>>>>> [   21.239601] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 28/37 is invalid!!!\x0a
>>>>>>>> [   21.249398] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 29/37 is invalid!!!\x0a
>>>>>>>> [   21.259160] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 30/37 is invalid!!!\x0a
>>>>>>>> [   21.268970] fsck.f2fs: withfix: 31: addr 1ac03 nid 1ccd\x0a
>>>>>>>> [   21.275874] fsck.f2fs: withfix: 32: addr 1c395 nid a57\x0a
>>>>>>>> [   21.282709] fsck.f2fs: withfix: 33: addr 1c392 nid 92e\x0a
>>>>>>>> [   21.289564] fsck.f2fs: withfix: 34: addr 1c38b nid aa1\x0a
>>>>>>>> [   21.296377] fsck.f2fs: withfix: 35: addr 1c393 nid 931\x0a
>>>>>>>> [   21.303240] fsck.f2fs: withfix: 36: addr 1c394 nid a35\x0a
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> The reason is due to incorrect update of journal->n_nats in that patch.
>>>>>>>>>> When fsck encounters a nat journal entry with invalid blkaddr/nid, it
>>>>>>>>>> updates journal->n_nats with the index of such first entry and then continues
>>>>>>>>>> the for loop which scans all nats in journal in f2fs_init_nid_bitmap().
>>>>>>>>>> But that loop doesn't continue and doesn't update nm_i->nid_bitmap as well,
>>>>>>>>>> as it fails the for loop condition i < nats_in_cursum(journal).
>>>>>>>>>>
>>>>>>>>>> There could be several other valid entries in the journal after this
>>>>>>>>>> invalid entry. Due to this invalid journal->n_nats, build_nat_area_bitmap()
>>>>>>>>>> could not be build properly for valid nat journal entries. It further results
>>>>>>>>>> into these fsck asserts/errors and causes data loss due to incorrect fsck fix.
>>>>>>>>>>
>>>>>>>>>> [ASSERT] (sanity_check_nid: 372)  --> nid[0xfb01] ino is 0\x0a
>>>>>>>>>> [FIX] (__chk_dentries:1438)  --> Unlink [0xfb01] - _JeNcl242yB3Apz2MW,VOh6WmjsVht1W len[0x1c], type[0x1]\x0a
>>>>>>>>>> [ASSERT] (sanity_check_nid: 372)  --> nid[0xfa2d] ino is 0\x0a
>>>>>>>>>> [FIX] (__chk_dentries:1438)  --> Unlink [0xfa2d] - _++vN7axccjjGNI,TQEVmcOemY1EAAAA len[0x14], type[0x1]\x0a
>>>>>>>>>> ...
>>>>>>>>>> NID[0xba5] is unreachable\x0a
>>>>>>>>>> NID[0xfa9] is unreachable\x0a
>>>>>>>>>> ...
>>>>>>>>>> [FIX] (nullify_nat_entry:2089)  --> Remove nid [0xba5] in NAT\x0a
>>>>>>>>>> [FIX] (nullify_nat_entry:2089)  --> Remove nid [0xfa9] in NAT\x0a
>>>>>>>>>> ...
>>>>>>>>>>
>>>>>>>>>> Fixes: 979b25727800 ("fsck.f2fs: check validity of nat journal")
>>>>>>>>>> Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
>>>>>>>>>> ---
>>>>>>>>>>  fsck/mount.c | 2 --
>>>>>>>>>>  1 file changed, 2 deletions(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/fsck/mount.c b/fsck/mount.c
>>>>>>>>>> index 60e0e4a..1b93f49 100644
>>>>>>>>>> --- a/fsck/mount.c
>>>>>>>>>> +++ b/fsck/mount.c
>>>>>>>>>> @@ -1160,7 +1160,6 @@ static int f2fs_init_nid_bitmap(struct f2fs_sb_info *sbi)
>>>>>>>>>>  		addr = le32_to_cpu(nat_in_journal(journal, i).block_addr);
>>>>>>>>>>  		if (!IS_VALID_BLK_ADDR(sbi, addr)) {
>>>>>>>>>>  			MSG(0, "\tError: f2fs_init_nid_bitmap: addr(%u) is invalid!!!\n", addr);
>>>>>>>>>> -			journal->n_nats = cpu_to_le16(i);
>>>>>>>>>>  			c.fix_on = 1;
>>>>>>>>>>  			continue;
>>>>>>>>>>  		}
>>>>>>>>>> @@ -1168,7 +1167,6 @@ static int f2fs_init_nid_bitmap(struct f2fs_sb_info *sbi)
>>>>>>>>>>  		nid = le32_to_cpu(nid_in_journal(journal, i));
>>>>>>>>>>  		if (!IS_VALID_NID(sbi, nid)) {
>>>>>>>>>>  			MSG(0, "\tError: f2fs_init_nid_bitmap: nid(%u) is invalid!!!\n", nid);
>>>>>>>>>> -			journal->n_nats = cpu_to_le16(i);
>>>>>>>>>>  			c.fix_on = 1;
>>>>>>>>>>  			continue;
>>>>>>>>>>  		}
>>>>>>>>>>
>>>>>>>>
>>>>>>>> -- 
>>>>>>>> --
>>>>>>>> Sent by a consultant of the Qualcomm Innovation Center, Inc.
>>>>>>>> The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.
>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Linux-f2fs-devel mailing list
>>>>>>>> Linux-f2fs-devel@lists.sourceforge.net
>>>>>>>> https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
>>>>>>>
>>>>>
>>>>> -- 
>>>>> --
>>>>> Sent by a consultant of the Qualcomm Innovation Center, Inc.
>>>>> The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Linux-f2fs-devel mailing list
>>>>> Linux-f2fs-devel@lists.sourceforge.net
>>>>> https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
>>>>
>>>
>>>
>>> _______________________________________________
>>> Linux-f2fs-devel mailing list
>>> Linux-f2fs-devel@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
>>>


_______________________________________________
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [f2fs-dev] [PATCH] fsck: Fix data loss issue observed during SPO
  2019-07-07  5:25                   ` Chao Yu
@ 2019-07-08 23:43                     ` Jaegeuk Kim
  0 siblings, 0 replies; 14+ messages in thread
From: Jaegeuk Kim @ 2019-07-08 23:43 UTC (permalink / raw)
  To: Chao Yu; +Cc: linux-f2fs-devel

On 07/07, Chao Yu wrote:
> On 2019-7-6 12:50, Jaegeuk Kim wrote:
> > On 07/06, Chao Yu wrote:
> >> On 2019-7-6 10:10, Chao Yu wrote:
> >>> Hi Sahitya,
> >>>
> >>> On 2019-7-6 9:37, Sahitya Tummala wrote:
> >>>> Hi Chao,
> >>>>
> >>>> I just realized that there is a patch from you to fix this already.
> >>>>
> >>>> commit 92ea71c711681a8f575d09d57d5ceaebd6644fcc
> >>>> Author: Chao Yu <yuchao0@huawei.com>
> >>>> Date:   Wed Jan 9 10:48:25 2019 +0800
> >>>>
> >>>>     fsck.f2fs: fix incorrect boundary of IS_VALID_NID()
> >>>>
> >>>>     nid should never equal to max_nid, fix it.
> >>>>
> >>>>     Signed-off-by: Chao Yu <yuchao0@huawei.com>
> >>>>     Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
> >>>>
> >>>> In my code base, I was not having this patch but only have
> >>>> "fsck.f2fs: check validity of nat journal", thus resulting into
> >>>> data loss issues. I will now verify with this patch included.
> >>>
> >>> Ah, that should be the root cause..
> >>>
> >>> BTW, the diff was pointed out by Jaegeuk on patch ("fsck.f2fs: check validity of
> >>> nat journal"), but it looks it was merged to the wrong patch. :P
> >>
> >> Sorry about that, I didn't check it out on upstream branch.
> > 
> > I guess we need this diff?
> 
> Yes, this diff should be merged into ("fsck.f2fs: check validity of nat
> journal") instead of ("fsck.f2fs: fix incorrect boundary of IS_VALID_NID()").
> 
> BTW, I think we'd better to use __is_valid_data_blkaddr/f2fs_is_valid_blkaddr to
> check blkaddr in different condition like we did in kernel.

Sorry, it is in -master. Let me keep going with this.

> 
> Thanks,
> 
> > 
> >>
> >> Thanks,
> >>
> >>>
> >>>  static inline bool IS_VALID_BLK_ADDR(struct f2fs_sb_info *sbi, u32 addr)
> >>>  {
> >>> +	if (addr == NULL_ADDR || addr == NEW_ADDR)
> >>> +		return 1;
> >>> +
> >>>  	if (addr >= le64_to_cpu(F2FS_RAW_SUPER(sbi)->block_count) ||
> >>>  				addr < SM_I(sbi)->main_blkaddr) {
> >>>  		DBG(1, "block addr [0x%x]\n", addr);
> >>>
> >>> Thanks,
> >>>
> >>>>
> >>>> Thanks for your support on this issue.
> >>>>
> >>>> Thanks,
> >>>> Sahitya.
> >>>>
> >>>> On Sat, Jul 06, 2019 at 07:01:12AM +0530, Sahitya Tummala wrote:
> >>>>> Hi Chao,
> >>>>>
> >>>>> On Sat, Jul 06, 2019 at 09:14:59AM +0800, Chao Yu wrote:
> >>>>>> Hi Sahitya,
> >>>>>>
> >>>>>> On 2019-7-5 21:48, Sahitya Tummala wrote:
> >>>>>>> Hi Chao,
> >>>>>>>
> >>>>>>> The journal doesn't look to be corrupted, but only block_addr
> >>>>>>> in f2fs_nat_entry is 0 whereas nid and version seems to be valid.
> >>>>>>> These could be just truncated nodes, right?
> >>>>>>
> >>>>>> Yes, but that should not fail the condition of journal check and result in
> >>>>>> journal size truncation, right?
> >>>>>>
> >>>>>
> >>>>> Yes, the patch fails for this condition as well and results into
> >>>>> truncation of journal, due to this check  addr < SM_I(sbi)->main_blkaddr
> >>>>> in IS_VALID_BLK_ADDR().
> >>>>>
> >>>>> Do you think we can correct it this way?
> >>>>>
> >>>>> diff --git a/fsck/f2fs.h b/fsck/f2fs.h
> >>>>> index 417ca0b..00fb740 100644
> >>>>> --- a/fsck/f2fs.h
> >>>>> +++ b/fsck/f2fs.h
> >>>>> @@ -364,8 +364,8 @@ static inline bool IS_VALID_NID(struct f2fs_sb_info *sbi, u32 nid)
> >>>>>
> >>>>>  static inline bool IS_VALID_BLK_ADDR(struct f2fs_sb_info *sbi, u32 addr)
> >>>>>  {
> >>>>> -       if (addr >= le64_to_cpu(F2FS_RAW_SUPER(sbi)->block_count) ||
> >>>>> -                               addr < SM_I(sbi)->main_blkaddr) {
> >>>>> +       if (addr && (addr >= le64_to_cpu(F2FS_RAW_SUPER(sbi)->block_count) ||
> >>>>> +                               addr < SM_I(sbi)->main_blkaddr)) {
> >>>>>                 DBG(1, "block addr [0x%x]\n", addr);
> >>>>>                 return 0;
> >>>>>         }
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>>> Thanks,
> >>>>>>
> >>>>>>>
> >>>>>>> Here is the full dump of nat journal entries after a SPO (this is not a 
> >>>>>>> issue case, but just a normal case with SPO) -
> >>>>>>>
> >>>>>>> [   21.121325] fsck.f2fs: 0: addr 0 nid 5c5a ver 1\x0a
> >>>>>>> [   21.127509] fsck.f2fs: 1: addr 0 nid 4c1d ver 1\x0a
> >>>>>>> [   21.133760] fsck.f2fs: 2: addr 12075 nid 444f ver 0\x0a
> >>>>>>> [   21.140325] fsck.f2fs: 3: addr 0 nid 4307 ver 0\x0a
> >>>>>>> [   21.146510] fsck.f2fs: 4: addr 0 nid 4059 ver 1\x0a
> >>>>>>> [   21.152745] fsck.f2fs: 5: addr 0 nid 3e69 ver 1\x0a
> >>>>>>> [   21.158944] fsck.f2fs: 6: addr 0 nid 3cf9 ver 1\x0a
> >>>>>>> [   21.165128] fsck.f2fs: 7: addr 0 nid 3b7b ver 1\x0a
> >>>>>>> [   21.171362] fsck.f2fs: 8: addr 18f65 nid 34e2 ver 0\x0a
> >>>>>>> [   21.177903] fsck.f2fs: 9: addr 12126 nid 2c87 ver 0\x0a
> >>>>>>> [   21.184466] fsck.f2fs: 10: addr 0 nid 2a0a ver 1\x0a
> >>>>>>> [   21.190795] fsck.f2fs: 11: addr 0 nid 2565 ver 1\x0a
> >>>>>>> [   21.197070] fsck.f2fs: 12: addr 0 nid 1f48 ver 0\x0a
> >>>>>>> [   21.203369] fsck.f2fs: 13: addr 155bf nid 1d76 ver 0\x0a
> >>>>>>> [   21.210042] fsck.f2fs: 14: addr 19538 nid 18da ver 0\x0a
> >>>>>>> [   21.216675] fsck.f2fs: 15: addr 0 nid 157f ver 0\x0a
> >>>>>>> [   21.223037] fsck.f2fs: 16: addr d117 nid ac3 ver 0\x0a
> >>>>>>> [   21.229547] fsck.f2fs: 17: addr 11e48 nid 805 ver 0\x0a
> >>>>>>> [   21.236088] fsck.f2fs: 18: addr d112 nid 61c ver 0\x0a
> >>>>>>> [   21.242585] fsck.f2fs: 19: addr d00b nid 45b ver 0\x0a
> >>>>>>> [   21.249086] fsck.f2fs: 20: addr d111 nid 30d ver 0\x0a
> >>>>>>> [   21.255544] fsck.f2fs: 21: addr d110 nid 1b5 ver 0\x0a
> >>>>>>> [   21.262025] fsck.f2fs: 22: addr 0 nid 5e41 ver 1\x0a
> >>>>>>> [   21.268342] fsck.f2fs: 23: addr 0 nid 5e46 ver 0\x0a
> >>>>>>> [   21.274618] fsck.f2fs: 24: addr 0 nid 5a34 ver 1\x0a
> >>>>>>> [   21.280916] fsck.f2fs: 25: addr 0 nid 59fa ver 1\x0a
> >>>>>>> [   21.287192] fsck.f2fs: 26: addr 0 nid 57c3 ver 1\x0a
> >>>>>>> [   21.293526] fsck.f2fs: 27: addr 0 nid 5776 ver 1\x0a
> >>>>>>> [   21.299816] fsck.f2fs: 28: addr 0 nid 4f20 ver 1\x0a
> >>>>>>> [   21.306091] fsck.f2fs: 29: addr 0 nid 4ff7 ver 0\x0a
> >>>>>>> [   21.312416] fsck.f2fs: 30: addr 0 nid 47e1 ver 1\x0a
> >>>>>>> [   21.318715] fsck.f2fs: 31: addr 0 nid 4880 ver 1\x0a
> >>>>>>> [   21.324990] fsck.f2fs: 32: addr d118 nid ff4 ver 0\x0a
> >>>>>>> [   21.331498] fsck.f2fs: 33: addr d10c nid ff9 ver 0\x0a
> >>>>>>> [   21.337957] fsck.f2fs: 34: addr 0 nid 49ed ver 1\x0a
> >>>>>>> [   21.344290] fsck.f2fs: 35: addr 0 nid 497e ver 0\x0a
> >>>>>>> [   21.350612] fsck.f2fs: 36: addr fc5a nid 49d2 ver 0\x0a
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>>
> >>>>>>> On Fri, Jul 05, 2019 at 04:33:57PM +0530, Sahitya Tummala wrote:
> >>>>>>>> Hi Chao,
> >>>>>>>>
> >>>>>>>> On Fri, Jul 05, 2019 at 06:11:47PM +0800, Chao Yu wrote:
> >>>>>>>>> Hi Sahitya,
> >>>>>>>>>
> >>>>>>>>> On 2019/7/5 13:01, Sahitya Tummala wrote:
> >>>>>>>>>> With the commit <979b25727800> ("fsck.f2fs: check validity of nat journal"),
> >>>>>>>>>> there is a serious data loss issue observed during sudden power off tests.
> >>>>>>>>>
> >>>>>>>>> In that patch, since journal's data is untrusty, I just truncated n_nats to last
> >>>>>>>>> valid one to fix original reported issue.
> >>>>>>>>
> >>>>>>>> The original issue will be fixed even without updating/truncating the n_nats
> >>>>>>>> as we still have sanity checks on nid and blkaddr, right?
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Actually, the problem here is we don't know which data is correct and which data
> >>>>>>>>> is corrupted in nat or sit journal, so even we skip broken entry without
> >>>>>>>>> modifying journal->n_{nats,sits}, latter entry may still contain broken data(but
> >>>>>>>>> it may pass validation check), and those entries may cause potential data
> >>>>>>>>> loss/corruption...
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> This issue was reported by a customer in the following test scenario -
> >>>>>>>>
> >>>>>>>> Update the password in Android and then force crash. It is verified that
> >>>>>>>> all the files related to password data has fsync() already.
> >>>>>>>>
> >>>>>>>> After some iterations (3-4) of this test, it is found password related files are
> >>>>>>>> lost, failing the phone to bootup.
> >>>>>>>>
> >>>>>>>> Without this patch ("fsck.f2fs: check validity of nat journal"), they don't see
> >>>>>>>> such issue. It means that the data present in journal seems to be valid atleast
> >>>>>>>> in this case and truncating is causing to loose the valid data/files.
> >>>>>>>>
> >>>>>>>>> It looks no matter how we change the codes, we still may face data
> >>>>>>>>> loos/corruption issues.
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> Yes, maybe in some cases, the data loss is still expected, but there are
> >>>>>>>> chances that data may be valid as well, which is confirmed by the test case above.
> >>>>>>>>
> >>>>>>>>> IMO, we need to find out the root cause of data corruption, So my question is
> >>>>>>>>> that why sudden power off tests can make nat journal being corrupted? I doubt
> >>>>>>>>> that in-place update method on checkpoint area of fsck may cause this... any
> >>>>>>>>> other thoughts?
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> In case of sudden power off, I have printed the all the journal entries in the
> >>>>>>>> NAT and I several entries with valid blkaddr and nid as below -
> >>>>>>>>
> >>>>>>>> [   21.007649] fsck.f2fs: withfix: 0: addr 1c373 nid 45b\x0a
> >>>>>>>> [   21.014399] fsck.f2fs: withfix: 1: addr 1c2c1 nid 1b5\x0a
> >>>>>>>> [   21.021151] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 2/37 is invalid!!!\x0a
> >>>>>>>> [   21.030860] fsck.f2fs: withfix: 3: addr 1c1f7 nid 19b2\x0a
> >>>>>>>> [   21.037674] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 4/37 is invalid!!!\x0a
> >>>>>>>> [   21.047353] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 5/37 is invalid!!!\x0a
> >>>>>>>> [   21.057062] fsck.f2fs: withfix: 6: addr 1c377 nid ac3\x0a
> >>>>>>>> [   21.063841] fsck.f2fs: withfix: 7: addr 1c378 nid ac9\x0a
> >>>>>>>> [   21.070634] fsck.f2fs: withfix: 8: addr 1b2fe nid 807\x0a
> >>>>>>>> [   21.077356] fsck.f2fs: withfix: 9: addr 17aeb nid 805\x0a
> >>>>>>>> [   21.084102] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 10/37 is invalid!!!\x0a
> >>>>>>>> [   21.093891] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 11/37 is invalid!!!\x0a
> >>>>>>>> [   21.103662] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 12/37 is invalid!!!\x0a
> >>>>>>>> [   21.113450] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 13/37 is invalid!!!\x0a
> >>>>>>>> [   21.123227] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 14/37 is invalid!!!\x0a
> >>>>>>>> [   21.133022] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 15/37 is invalid!!!\x0a
> >>>>>>>> [   21.142792] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 16/37 is invalid!!!\x0a
> >>>>>>>> [   21.152580] fsck.f2fs: withfix: 17: addr 1c37f nid ff4\x0a
> >>>>>>>> [   21.159415] fsck.f2fs: withfix: 18: addr 1c36c nid ff9\x0a
> >>>>>>>> [   21.166229] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 19/37 is invalid!!!\x0a
> >>>>>>>> [   21.176039] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 20/37 is invalid!!!\x0a
> >>>>>>>> [   21.185831] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 21/37 is invalid!!!\x0a
> >>>>>>>> [   21.195631] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 22/37 is invalid!!!\x0a
> >>>>>>>> [   21.205394] fsck.f2fs: withfix: 23: addr 1c2f3 nid 61c\x0a
> >>>>>>>> [   21.212257] fsck.f2fs: withfix: 24: addr 1c391 nid 641\x0a
> >>>>>>>> [   21.219093] fsck.f2fs: withfix: 25: addr 1c386 nid 33b\x0a
> >>>>>>>> [   21.225905] fsck.f2fs: withfix: 26: addr 1c356 nid 258\x0a
> >>>>>>>> [   21.232767] fsck.f2fs: withfix: 27: addr 1c2f2 nid 382\x0a
> >>>>>>>> [   21.239601] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 28/37 is invalid!!!\x0a
> >>>>>>>> [   21.249398] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 29/37 is invalid!!!\x0a
> >>>>>>>> [   21.259160] fsck.f2fs: \x09Error: f2fs_init_nid_bitmap: addr(0) i/nats 30/37 is invalid!!!\x0a
> >>>>>>>> [   21.268970] fsck.f2fs: withfix: 31: addr 1ac03 nid 1ccd\x0a
> >>>>>>>> [   21.275874] fsck.f2fs: withfix: 32: addr 1c395 nid a57\x0a
> >>>>>>>> [   21.282709] fsck.f2fs: withfix: 33: addr 1c392 nid 92e\x0a
> >>>>>>>> [   21.289564] fsck.f2fs: withfix: 34: addr 1c38b nid aa1\x0a
> >>>>>>>> [   21.296377] fsck.f2fs: withfix: 35: addr 1c393 nid 931\x0a
> >>>>>>>> [   21.303240] fsck.f2fs: withfix: 36: addr 1c394 nid a35\x0a
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> The reason is due to incorrect update of journal->n_nats in that patch.
> >>>>>>>>>> When fsck encounters a nat journal entry with invalid blkaddr/nid, it
> >>>>>>>>>> updates journal->n_nats with the index of such first entry and then continues
> >>>>>>>>>> the for loop which scans all nats in journal in f2fs_init_nid_bitmap().
> >>>>>>>>>> But that loop doesn't continue and doesn't update nm_i->nid_bitmap as well,
> >>>>>>>>>> as it fails the for loop condition i < nats_in_cursum(journal).
> >>>>>>>>>>
> >>>>>>>>>> There could be several other valid entries in the journal after this
> >>>>>>>>>> invalid entry. Due to this invalid journal->n_nats, build_nat_area_bitmap()
> >>>>>>>>>> could not be build properly for valid nat journal entries. It further results
> >>>>>>>>>> into these fsck asserts/errors and causes data loss due to incorrect fsck fix.
> >>>>>>>>>>
> >>>>>>>>>> [ASSERT] (sanity_check_nid: 372)  --> nid[0xfb01] ino is 0\x0a
> >>>>>>>>>> [FIX] (__chk_dentries:1438)  --> Unlink [0xfb01] - _JeNcl242yB3Apz2MW,VOh6WmjsVht1W len[0x1c], type[0x1]\x0a
> >>>>>>>>>> [ASSERT] (sanity_check_nid: 372)  --> nid[0xfa2d] ino is 0\x0a
> >>>>>>>>>> [FIX] (__chk_dentries:1438)  --> Unlink [0xfa2d] - _++vN7axccjjGNI,TQEVmcOemY1EAAAA len[0x14], type[0x1]\x0a
> >>>>>>>>>> ...
> >>>>>>>>>> NID[0xba5] is unreachable\x0a
> >>>>>>>>>> NID[0xfa9] is unreachable\x0a
> >>>>>>>>>> ...
> >>>>>>>>>> [FIX] (nullify_nat_entry:2089)  --> Remove nid [0xba5] in NAT\x0a
> >>>>>>>>>> [FIX] (nullify_nat_entry:2089)  --> Remove nid [0xfa9] in NAT\x0a
> >>>>>>>>>> ...
> >>>>>>>>>>
> >>>>>>>>>> Fixes: 979b25727800 ("fsck.f2fs: check validity of nat journal")
> >>>>>>>>>> Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
> >>>>>>>>>> ---
> >>>>>>>>>>  fsck/mount.c | 2 --
> >>>>>>>>>>  1 file changed, 2 deletions(-)
> >>>>>>>>>>
> >>>>>>>>>> diff --git a/fsck/mount.c b/fsck/mount.c
> >>>>>>>>>> index 60e0e4a..1b93f49 100644
> >>>>>>>>>> --- a/fsck/mount.c
> >>>>>>>>>> +++ b/fsck/mount.c
> >>>>>>>>>> @@ -1160,7 +1160,6 @@ static int f2fs_init_nid_bitmap(struct f2fs_sb_info *sbi)
> >>>>>>>>>>  		addr = le32_to_cpu(nat_in_journal(journal, i).block_addr);
> >>>>>>>>>>  		if (!IS_VALID_BLK_ADDR(sbi, addr)) {
> >>>>>>>>>>  			MSG(0, "\tError: f2fs_init_nid_bitmap: addr(%u) is invalid!!!\n", addr);
> >>>>>>>>>> -			journal->n_nats = cpu_to_le16(i);
> >>>>>>>>>>  			c.fix_on = 1;
> >>>>>>>>>>  			continue;
> >>>>>>>>>>  		}
> >>>>>>>>>> @@ -1168,7 +1167,6 @@ static int f2fs_init_nid_bitmap(struct f2fs_sb_info *sbi)
> >>>>>>>>>>  		nid = le32_to_cpu(nid_in_journal(journal, i));
> >>>>>>>>>>  		if (!IS_VALID_NID(sbi, nid)) {
> >>>>>>>>>>  			MSG(0, "\tError: f2fs_init_nid_bitmap: nid(%u) is invalid!!!\n", nid);
> >>>>>>>>>> -			journal->n_nats = cpu_to_le16(i);
> >>>>>>>>>>  			c.fix_on = 1;
> >>>>>>>>>>  			continue;
> >>>>>>>>>>  		}
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>> -- 
> >>>>>>>> --
> >>>>>>>> Sent by a consultant of the Qualcomm Innovation Center, Inc.
> >>>>>>>> The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> _______________________________________________
> >>>>>>>> Linux-f2fs-devel mailing list
> >>>>>>>> Linux-f2fs-devel@lists.sourceforge.net
> >>>>>>>> https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
> >>>>>>>
> >>>>>
> >>>>> -- 
> >>>>> --
> >>>>> Sent by a consultant of the Qualcomm Innovation Center, Inc.
> >>>>> The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.
> >>>>>
> >>>>>
> >>>>> _______________________________________________
> >>>>> Linux-f2fs-devel mailing list
> >>>>> Linux-f2fs-devel@lists.sourceforge.net
> >>>>> https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
> >>>>
> >>>
> >>>
> >>> _______________________________________________
> >>> Linux-f2fs-devel mailing list
> >>> Linux-f2fs-devel@lists.sourceforge.net
> >>> https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
> >>>


_______________________________________________
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2019-07-08 23:44 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-07-05  5:01 [f2fs-dev] [PATCH] fsck: Fix data loss issue observed during SPO Sahitya Tummala
2019-07-05 10:11 ` Chao Yu
2019-07-05 11:03   ` Sahitya Tummala
2019-07-05 13:48     ` Sahitya Tummala
2019-07-06  1:14       ` Chao Yu
2019-07-06  1:31         ` Sahitya Tummala
2019-07-06  1:37           ` Sahitya Tummala
2019-07-06  2:10             ` Chao Yu
2019-07-06  2:10             ` Chao Yu
2019-07-06  2:17               ` Chao Yu
2019-07-06  4:50                 ` Jaegeuk Kim
2019-07-07  5:25                   ` Chao Yu
2019-07-08 23:43                     ` Jaegeuk Kim
2019-07-06  1:12     ` Chao Yu

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.