[f2fs-dev] [PATCH v4 0/2] fsck: Check write pointers of zoned block devices

linux-f2fs-devel.lists.sourceforge.net archive mirror
 help / color / mirror / Atom feed

* [f2fs-dev] [PATCH v4 0/2] fsck: Check write pointers of zoned block devices
@ 2019-08-30 10:19 Shin'ichiro Kawasaki
  2019-08-30 10:19 ` [f2fs-dev] [PATCH v4 1/2] libf2fs_zoned: Introduce f2fs_report_zones() helper function Shin'ichiro Kawasaki
  2019-08-30 10:19 ` [f2fs-dev] [PATCH v4 2/2] fsck.f2fs: Check write pointer consistency with current segments Shin'ichiro Kawasaki
  0 siblings, 2 replies; 18+ messages in thread
From: Shin'ichiro Kawasaki @ 2019-08-30 10:19 UTC (permalink / raw)
  To: Jaegeuk Kim, Chao Yu, linux-f2fs-devel; +Cc: Damien Le Moal

On sudden f2fs shutdown, zoned block device status and f2fs meta data can be
inconsistent. When f2fs shutdown happens during write operations, write pointers
on the device go forward but the f2fs meta data does not reflect write pointer
progress. This inconsistency will eventually causes "Unaligned write command"
error when restarting write operation after the next mount. This error can be
observed with xfstests test case generic/388, which enforces sudden shutdown
during write operation and checks the file system recovery. Once the error
happens because of the inconsistency, the file system requires fix. However,
fsck.f2fs does not have a feature to check and fix it.

This patch series adds a new feature to fsck.f2fs to check and fix the
inconsistency. First patch adds a function which helps fsck to call report zone
command to zoned block devices. Second patch checks write pointers of zones that
current segments recorded in meta data point to.

This patch series depends on other patches for zoned block devices, then it
conflicts with the master branch in f2fs-tools.git as of Aug/19/2019. It can be
applied without conflict to dev and dev-test branch tips.

Changes from v3:
* Set curseg position at a new zone start when its write pointer is at zone end

Changes from v2:
* Reflected review comments by Chao Yu
* Dropped 4th patch and 2nd patch (2nd patch was required for the 4th patch)

Changes from v1:
* Fixed build failure on dev branch

Shin'ichiro Kawasaki (2):
  libf2fs_zoned: Introduce f2fs_report_zones() helper function
  fsck.f2fs: Check write pointer consistency with current segments

 fsck/f2fs.h         |   5 ++
 fsck/fsck.c         | 198 ++++++++++++++++++++++++++++++++++++++++++++
 fsck/fsck.h         |   3 +
 fsck/main.c         |   2 +
 include/f2fs_fs.h   |   5 ++
 lib/libf2fs_zoned.c |  59 +++++++++++++
 6 files changed, 272 insertions(+)

-- 
2.21.0

_______________________________________________
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [f2fs-dev] [PATCH v4 1/2] libf2fs_zoned: Introduce f2fs_report_zones() helper function
  2019-08-30 10:19 [f2fs-dev] [PATCH v4 0/2] fsck: Check write pointers of zoned block devices Shin'ichiro Kawasaki
@ 2019-08-30 10:19 ` Shin'ichiro Kawasaki
  2019-08-30 10:19 ` [f2fs-dev] [PATCH v4 2/2] fsck.f2fs: Check write pointer consistency with current segments Shin'ichiro Kawasaki
  1 sibling, 0 replies; 18+ messages in thread
From: Shin'ichiro Kawasaki @ 2019-08-30 10:19 UTC (permalink / raw)
  To: Jaegeuk Kim, Chao Yu, linux-f2fs-devel; +Cc: Damien Le Moal

To prepare for write pointer consistency check by fsck, add
f2fs_report_zones() helper function which calls REPORT ZONE command to
get write pointer status. The function is added to lib/libf2fs_zoned
which gathers zoned block device related functions.

To check write pointer consistency with f2fs meta data, fsck needs to
refer both of reported zone information and f2fs super block structure
"f2fs_sb_info". However, libf2fs_zoned does not import f2fs_sb_info. To
keep f2fs_sb_info structure out of libf2fs_zoned, provide a callback
function in fsck to f2fs_report_zones() and call it for each zone.

Add SECTOR_SHIFT definition in include/f2fs_fs.h to avoid a magic number
to convert bytes into 512B sectors.

Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
---
 include/f2fs_fs.h   |  5 ++++
 lib/libf2fs_zoned.c | 59 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 64 insertions(+)

diff --git a/include/f2fs_fs.h b/include/f2fs_fs.h
index 0d9a036..7b02bca 100644
--- a/include/f2fs_fs.h
+++ b/include/f2fs_fs.h
@@ -279,6 +279,9 @@ static inline uint64_t bswap_64(uint64_t val)
 #endif
 #define PAGE_CACHE_SIZE		4096
 #define BITS_PER_BYTE		8
+#ifndef SECTOR_SHIFT
+#define SECTOR_SHIFT		9
+#endif
 #define F2FS_SUPER_MAGIC	0xF2F52010	/* F2FS Magic Number */
 #define CP_CHKSUM_OFFSET	4092
 #define SB_CHKSUM_OFFSET	3068
@@ -1279,6 +1282,8 @@ blk_zone_cond_str(struct blk_zone *blkz)
 
 extern int f2fs_get_zoned_model(int);
 extern int f2fs_get_zone_blocks(int);
+typedef int (report_zones_cb_t)(int i, struct blk_zone *blkz, void *opaque);
+extern int f2fs_report_zones(int, report_zones_cb_t *, void *);
 extern int f2fs_check_zones(int);
 extern int f2fs_reset_zones(int);
 
diff --git a/lib/libf2fs_zoned.c b/lib/libf2fs_zoned.c
index af00b44..127fc6e 100644
--- a/lib/libf2fs_zoned.c
+++ b/lib/libf2fs_zoned.c
@@ -193,6 +193,59 @@ int f2fs_get_zone_blocks(int i)
 
 #define F2FS_REPORT_ZONES_BUFSZ	524288
 
+int f2fs_report_zones(int j, report_zones_cb_t *report_zones_cb, void *opaque)
+{
+	struct device_info *dev = c.devices + j;
+	struct blk_zone_report *rep;
+	struct blk_zone *blkz;
+	unsigned int i, n = 0;
+	u_int64_t total_sectors = (dev->total_sectors * c.sector_size)
+		>> SECTOR_SHIFT;
+	u_int64_t sector = 0;
+	int ret = -1;
+
+	rep = malloc(F2FS_REPORT_ZONES_BUFSZ);
+	if (!rep) {
+		ERR_MSG("No memory for report zones\n");
+		return -ENOMEM;
+	}
+
+	while (sector < total_sectors) {
+
+		/* Get zone info */
+		rep->sector = sector;
+		rep->nr_zones = (F2FS_REPORT_ZONES_BUFSZ - sizeof(struct blk_zone_report))
+			/ sizeof(struct blk_zone);
+
+		ret = ioctl(dev->fd, BLKREPORTZONE, rep);
+		if (ret != 0) {
+			ret = -errno;
+			ERR_MSG("ioctl BLKREPORTZONE failed: errno=%d\n",
+				errno);
+			goto out;
+		}
+
+		if (!rep->nr_zones) {
+			ret = -EIO;
+			ERR_MSG("Unexpected ioctl BLKREPORTZONE result\n");
+			goto out;
+		}
+
+		blkz = (struct blk_zone *)(rep + 1);
+		for (i = 0; i < rep->nr_zones; i++) {
+			ret = report_zones_cb(n, blkz, opaque);
+			if (ret)
+				goto out;
+			sector = blk_zone_sector(blkz) + blk_zone_length(blkz);
+			n++;
+			blkz++;
+		}
+	}
+out:
+	free(rep);
+	return ret;
+}
+
 int f2fs_check_zones(int j)
 {
 	struct device_info *dev = c.devices + j;
@@ -372,6 +425,12 @@ out:
 
 #else
 
+int f2fs_report_zones(int j, report_zones_cb_t *report_zones_cb, void *opaque)
+{
+	ERR_MSG("%d: Zoned block devices are not supported\n", i);
+	return -1;
+}
+
 int f2fs_get_zoned_model(int i)
 {
 	struct device_info *dev = c.devices + i;
-- 
2.21.0



_______________________________________________
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [f2fs-dev] [PATCH v4 2/2] fsck.f2fs: Check write pointer consistency with current segments
  2019-08-30 10:19 [f2fs-dev] [PATCH v4 0/2] fsck: Check write pointers of zoned block devices Shin'ichiro Kawasaki
  2019-08-30 10:19 ` [f2fs-dev] [PATCH v4 1/2] libf2fs_zoned: Introduce f2fs_report_zones() helper function Shin'ichiro Kawasaki
@ 2019-08-30 10:19 ` Shin'ichiro Kawasaki
  2019-09-02  7:02   ` Chao Yu
  1 sibling, 1 reply; 18+ messages in thread
From: Shin'ichiro Kawasaki @ 2019-08-30 10:19 UTC (permalink / raw)
  To: Jaegeuk Kim, Chao Yu, linux-f2fs-devel; +Cc: Damien Le Moal

On sudden f2fs shutdown, zoned block device status and f2fs current
segment positions in meta data can be inconsistent. When f2fs shutdown
happens before write operations completes, write pointers of zoned block
devices can go further but f2fs meta data keeps current segments at
positions before the write operations. After remounting the f2fs, the
inconsistency causes write operations not at write pointers and
"Unaligned write command" error is reported. This error was observed when
xfstests test case generic/388 was run with f2fs on a zoned block device.

To avoid the error, have f2fs.fsck check consistency between each current
segment's position and the write pointer of the zone the current segment
points to. If the write pointer goes advance from the current segment,
fix the current segment position setting at same as the write pointer
position. If the write pointer goes to the zone end, find a new zone and
set the current segment position at the new zone start. In case the write
pointer is behind the current segment, write zero data at the write
pointer position to make write pointer position at same as the current
segment.

When inconsistencies are found, turn on c.bug_on flag in fsck_verify() to
ask users to fix them or not. When inconsistencies get fixed, turn on
'force' flag in fsck_verify() to enforce fixes in following checks. This
position fix is done at the beginning of do_fsck() function so that other
checks reflect the current segment modification.

Also add GET_SEC_FROM_SEG and GET_SEG_FROM_SEC macros in fsck/fsck.h to
simplify the code.

Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
---
 fsck/f2fs.h |   5 ++
 fsck/fsck.c | 198 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 fsck/fsck.h |   3 +
 fsck/main.c |   2 +
 4 files changed, 208 insertions(+)

diff --git a/fsck/f2fs.h b/fsck/f2fs.h
index 4dc6698..2c1c2b3 100644
--- a/fsck/f2fs.h
+++ b/fsck/f2fs.h
@@ -337,6 +337,11 @@ static inline block_t __end_block_addr(struct f2fs_sb_info *sbi)
 #define GET_BLKOFF_FROM_SEG0(sbi, blk_addr)				\
 	(GET_SEGOFF_FROM_SEG0(sbi, blk_addr) & (sbi->blocks_per_seg - 1))
 
+#define GET_SEC_FROM_SEG(sbi, segno)					\
+	((segno) / (sbi)->segs_per_sec)
+#define GET_SEG_FROM_SEC(sbi, secno)					\
+	((secno) * (sbi)->segs_per_sec)
+
 #define FREE_I_START_SEGNO(sbi)						\
 	GET_SEGNO_FROM_SEG0(sbi, SM_I(sbi)->main_blkaddr)
 #define GET_R2L_SEGNO(sbi, segno)	(segno + FREE_I_START_SEGNO(sbi))
diff --git a/fsck/fsck.c b/fsck/fsck.c
index 8953ca1..a0f6849 100644
--- a/fsck/fsck.c
+++ b/fsck/fsck.c
@@ -2574,6 +2574,190 @@ out:
 	return cnt;
 }
 
+/*
+ * Search a free section in main area. Start search from the section specified
+ * with segno argument toward main area end. Return first segment of the found
+ * section in segno argument.
+ */
+static int find_next_free_section(struct f2fs_sb_info *sbi,
+				  unsigned int *segno)
+{
+	unsigned int i, sec, section_valid_blocks;
+	unsigned int end_segno = GET_SEGNO(sbi, SM_I(sbi)->main_blkaddr)
+		+ SM_I(sbi)->main_segments;
+	unsigned int end_sec = GET_SEC_FROM_SEG(sbi, end_segno);
+	struct seg_entry *se;
+	struct curseg_info *cs;
+
+	for (sec = GET_SEC_FROM_SEG(sbi, *segno); sec < end_sec; sec++) {
+		/* find a section without valid blocks */
+		section_valid_blocks = 0;
+		for (i = 0; i < sbi->segs_per_sec; i++) {
+			se = get_seg_entry(sbi, GET_SEG_FROM_SEC(sbi, sec) + i);
+			section_valid_blocks += se->valid_blocks;
+		}
+		if (section_valid_blocks)
+			continue;
+
+		/* check the cursegs do not use the section */
+		for (i = 0; i < NO_CHECK_TYPE; i++) {
+			cs = &SM_I(sbi)->curseg_array[i];
+			if (GET_SEC_FROM_SEG(sbi, cs->segno) == sec)
+				break;
+		}
+		if (i >= NR_CURSEG_TYPE) {
+			*segno = GET_SEG_FROM_SEC(sbi, sec);
+			return 0;
+		}
+	}
+
+	return -1;
+}
+
+struct write_pointer_check_data {
+	struct f2fs_sb_info *sbi;
+	struct device_info *dev;
+};
+
+static int fsck_chk_write_pointer(int i, struct blk_zone *blkz, void *opaque)
+{
+	struct write_pointer_check_data *wpd = opaque;
+	struct f2fs_sb_info *sbi = wpd->sbi;
+	struct device_info *dev = wpd->dev;
+	struct f2fs_fsck *fsck = F2FS_FSCK(sbi);
+	block_t zone_block, wp_block, wp_blkoff, cs_block, b;
+	unsigned int zone_segno, wp_segno, new_segno;
+	struct seg_entry *se;
+	struct curseg_info *cs;
+	int cs_index, ret;
+	int log_sectors_per_block = sbi->log_blocksize - SECTOR_SHIFT;
+	unsigned int segs_per_zone = sbi->segs_per_sec * sbi->secs_per_zone;
+	void *zero_blk;
+
+	if (blk_zone_conv(blkz))
+		return 0;
+
+	zone_block = dev->start_blkaddr
+		+ (blk_zone_sector(blkz) >> log_sectors_per_block);
+	zone_segno = GET_SEGNO(sbi, zone_block);
+	wp_block = dev->start_blkaddr
+		+ (blk_zone_wp_sector(blkz) >> log_sectors_per_block);
+	wp_segno = GET_SEGNO(sbi, wp_block);
+	wp_blkoff = wp_block - START_BLOCK(sbi, wp_segno);
+
+	/* find the curseg which points to the zone */
+	for (cs_index = 0; cs_index < NO_CHECK_TYPE; cs_index++) {
+		cs = &SM_I(sbi)->curseg_array[cs_index];
+		if (zone_segno <= cs->segno &&
+		    cs->segno < zone_segno + segs_per_zone)
+			break;
+	}
+
+	if (cs_index >= NR_CURSEG_TYPE)
+		return 0;
+
+	/* check write pointer consistency with the curseg in the zone */
+	cs_block = START_BLOCK(sbi, cs->segno) + cs->next_blkoff;
+	if (wp_block == cs_block)
+		return 0;
+
+	if (!c.fix_on) {
+		MSG(0, "Inconsistent write pointer: "
+		    "curseg %d[0x%x,0x%x] wp[0x%x,0x%x]\n",
+		    cs_index, cs->segno, cs->next_blkoff, wp_segno, wp_blkoff);
+		fsck->chk.wp_inconsistent_zones++;
+		return 0;
+	}
+
+	/*
+	 * If the curseg is in advance from the write pointer, write zero to
+	 * move the write pointer forward to the same position as the curseg.
+	 */
+	if (wp_block < cs_block) {
+		ret = 0;
+		zero_blk = calloc(BLOCK_SZ, 1);
+		if (!zero_blk)
+			return -EINVAL;
+
+		FIX_MSG("Advance write pointer to match with curseg %d: "
+			"[0x%x,0x%x]->[0x%x,0x%x]",
+			cs_index, wp_segno, wp_blkoff,
+			cs->segno, cs->next_blkoff);
+		for (b = wp_block; b < cs_block && !ret; b++)
+			ret = dev_write_block(zero_blk, b);
+
+		fsck->chk.wp_fixed_zones++;
+		free(zero_blk);
+		return ret;
+	}
+
+	if (wp_segno == zone_segno + segs_per_zone) {
+		/*
+		 * If the write pointer is in advance from the curseg and at
+		 * the zone end (section end), search a new free zone (section)
+		 * between the curseg and main area end.
+		 */
+		new_segno = wp_segno;
+		ret = find_next_free_section(sbi, &new_segno);
+		if (ret) {
+			/* search again from main area start */
+			new_segno = GET_SEGNO(sbi, SM_I(sbi)->main_blkaddr);
+			ret = find_next_free_section(sbi, &new_segno);
+		}
+		if (ret) {
+			MSG(0, "Free section not found\n");
+			return ret;
+		}
+		FIX_MSG("New section for curseg %d: [0x%x,0x%x]->[0x%x,0x%x]",
+			cs_index, cs->segno, cs->next_blkoff, new_segno, 0);
+		cs->segno = new_segno;
+		cs->next_blkoff = 0;
+	} else {
+		/*
+		 * If the write pointer is in advance from the curseg within
+		 * the zone, modify the curseg position to be same as the
+		 * write pointer.
+		 */
+		ASSERT(wp_segno < zone_segno + segs_per_zone);
+		FIX_MSG("Advance curseg %d: [0x%x,0x%x]->[0x%x,0x%x]",
+			cs_index, cs->segno, cs->next_blkoff,
+			wp_segno, wp_blkoff);
+		cs->segno = wp_segno;
+		cs->next_blkoff = wp_blkoff;
+	}
+
+	se = get_seg_entry(sbi, cs->segno);
+	se->type = cs_index;
+	fsck->chk.wp_fixed_zones++;
+
+	return 0;
+}
+
+void fsck_chk_write_pointers(struct f2fs_sb_info *sbi)
+{
+	unsigned int i;
+	struct f2fs_fsck *fsck = F2FS_FSCK(sbi);
+	struct write_pointer_check_data wpd = {	sbi, NULL };
+
+	if (c.zoned_model != F2FS_ZONED_HM)
+		return;
+
+	for (i = 0; i < MAX_DEVICES; i++) {
+		if (!c.devices[i].path)
+			break;
+
+		wpd.dev = c.devices + i;
+		if (f2fs_report_zones(i, fsck_chk_write_pointer, &wpd)) {
+			printf("[FSCK] Write pointer check failed: %s\n",
+			       c.devices[i].path);
+			return;
+		}
+	}
+
+	if (fsck->chk.wp_fixed_zones && c.fix_on)
+		write_curseg_info(sbi);
+}
+
 int fsck_chk_curseg_info(struct f2fs_sb_info *sbi)
 {
 	struct curseg_info *curseg;
@@ -2624,6 +2808,20 @@ int fsck_verify(struct f2fs_sb_info *sbi)
 
 	printf("\n");
 
+	if (c.zoned_model == F2FS_ZONED_HM) {
+		printf("[FSCK] Write pointers consistency                    ");
+		if (fsck->chk.wp_inconsistent_zones == 0x0) {
+			printf(" [Ok..]\n");
+		} else {
+			printf(" [Fail] [0x%x]\n",
+			       fsck->chk.wp_inconsistent_zones);
+			c.bug_on = 1;
+		}
+
+		if (fsck->chk.wp_fixed_zones && c.fix_on)
+			force = 1;
+	}
+
 	if (c.feature & cpu_to_le32(F2FS_FEATURE_LOST_FOUND)) {
 		for (i = 0; i < fsck->nr_nat_entries; i++)
 			if (f2fs_test_bit(i, fsck->nat_area_bitmap) != 0)
diff --git a/fsck/fsck.h b/fsck/fsck.h
index d38e8de..aa3dbe7 100644
--- a/fsck/fsck.h
+++ b/fsck/fsck.h
@@ -80,6 +80,8 @@ struct f2fs_fsck {
 		u32 multi_hard_link_files;
 		u64 sit_valid_blocks;
 		u32 sit_free_segs;
+		u32 wp_fixed_zones;
+		u32 wp_inconsistent_zones;
 	} chk;
 
 	struct hard_link_node *hard_link_list_head;
@@ -156,6 +158,7 @@ int fsck_chk_inline_dentries(struct f2fs_sb_info *, struct f2fs_node *,
 		struct child_info *);
 void fsck_chk_checkpoint(struct f2fs_sb_info *sbi);
 int fsck_chk_meta(struct f2fs_sb_info *sbi);
+void fsck_chk_write_pointers(struct f2fs_sb_info *);
 int fsck_chk_curseg_info(struct f2fs_sb_info *);
 void pretty_print_filename(const u8 *raw_name, u32 len,
 			   char out[F2FS_PRINT_NAMELEN], int enc_name);
diff --git a/fsck/main.c b/fsck/main.c
index 9aca024..4b4a789 100644
--- a/fsck/main.c
+++ b/fsck/main.c
@@ -584,6 +584,8 @@ static void do_fsck(struct f2fs_sb_info *sbi)
 
 	print_cp_state(flag);
 
+	fsck_chk_write_pointers(sbi);
+
 	fsck_chk_curseg_info(sbi);
 
 	if (!c.fix_on && !c.bug_on) {
-- 
2.21.0



_______________________________________________
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [f2fs-dev] [PATCH v4 2/2] fsck.f2fs: Check write pointer consistency with current segments
  2019-08-30 10:19 ` [f2fs-dev] [PATCH v4 2/2] fsck.f2fs: Check write pointer consistency with current segments Shin'ichiro Kawasaki
@ 2019-09-02  7:02   ` Chao Yu
  2019-09-03  8:37     ` Shinichiro Kawasaki
  0 siblings, 1 reply; 18+ messages in thread
From: Chao Yu @ 2019-09-02  7:02 UTC (permalink / raw)
  To: Shin'ichiro Kawasaki, Jaegeuk Kim, linux-f2fs-devel; +Cc: Damien Le Moal

On 2019/8/30 18:19, Shin'ichiro Kawasaki wrote:
> On sudden f2fs shutdown, zoned block device status and f2fs current
> segment positions in meta data can be inconsistent. When f2fs shutdown
> happens before write operations completes, write pointers of zoned block
> devices can go further but f2fs meta data keeps current segments at
> positions before the write operations. After remounting the f2fs, the
> inconsistency causes write operations not at write pointers and
> "Unaligned write command" error is reported. This error was observed when
> xfstests test case generic/388 was run with f2fs on a zoned block device.
> 
> To avoid the error, have f2fs.fsck check consistency between each current
> segment's position and the write pointer of the zone the current segment
> points to. If the write pointer goes advance from the current segment,
> fix the current segment position setting at same as the write pointer
> position. If the write pointer goes to the zone end, find a new zone and
> set the current segment position at the new zone start. In case the write
> pointer is behind the current segment, write zero data at the write
> pointer position to make write pointer position at same as the current
> segment.
> 
> When inconsistencies are found, turn on c.bug_on flag in fsck_verify() to
> ask users to fix them or not. When inconsistencies get fixed, turn on
> 'force' flag in fsck_verify() to enforce fixes in following checks. This
> position fix is done at the beginning of do_fsck() function so that other
> checks reflect the current segment modification.
> 
> Also add GET_SEC_FROM_SEG and GET_SEG_FROM_SEC macros in fsck/fsck.h to
> simplify the code.
> 
> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> ---
>  fsck/f2fs.h |   5 ++
>  fsck/fsck.c | 198 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  fsck/fsck.h |   3 +
>  fsck/main.c |   2 +
>  4 files changed, 208 insertions(+)
> 
> diff --git a/fsck/f2fs.h b/fsck/f2fs.h
> index 4dc6698..2c1c2b3 100644
> --- a/fsck/f2fs.h
> +++ b/fsck/f2fs.h
> @@ -337,6 +337,11 @@ static inline block_t __end_block_addr(struct f2fs_sb_info *sbi)
>  #define GET_BLKOFF_FROM_SEG0(sbi, blk_addr)				\
>  	(GET_SEGOFF_FROM_SEG0(sbi, blk_addr) & (sbi->blocks_per_seg - 1))
>  
> +#define GET_SEC_FROM_SEG(sbi, segno)					\
> +	((segno) / (sbi)->segs_per_sec)
> +#define GET_SEG_FROM_SEC(sbi, secno)					\
> +	((secno) * (sbi)->segs_per_sec)
> +
>  #define FREE_I_START_SEGNO(sbi)						\
>  	GET_SEGNO_FROM_SEG0(sbi, SM_I(sbi)->main_blkaddr)
>  #define GET_R2L_SEGNO(sbi, segno)	(segno + FREE_I_START_SEGNO(sbi))
> diff --git a/fsck/fsck.c b/fsck/fsck.c
> index 8953ca1..a0f6849 100644
> --- a/fsck/fsck.c
> +++ b/fsck/fsck.c
> @@ -2574,6 +2574,190 @@ out:
>  	return cnt;
>  }
>  
> +/*
> + * Search a free section in main area. Start search from the section specified
> + * with segno argument toward main area end. Return first segment of the found
> + * section in segno argument.
> + */
> +static int find_next_free_section(struct f2fs_sb_info *sbi,
> +				  unsigned int *segno)
> +{
> +	unsigned int i, sec, section_valid_blocks;
> +	unsigned int end_segno = GET_SEGNO(sbi, SM_I(sbi)->main_blkaddr)
> +		+ SM_I(sbi)->main_segments;
> +	unsigned int end_sec = GET_SEC_FROM_SEG(sbi, end_segno);
> +	struct seg_entry *se;
> +	struct curseg_info *cs;
> +
> +	for (sec = GET_SEC_FROM_SEG(sbi, *segno); sec < end_sec; sec++) {
> +		/* find a section without valid blocks */
> +		section_valid_blocks = 0;
> +		for (i = 0; i < sbi->segs_per_sec; i++) {
> +			se = get_seg_entry(sbi, GET_SEG_FROM_SEC(sbi, sec) + i);
> +			section_valid_blocks += se->valid_blocks;

If we want to find a 'real' free section, we'd better to use
se->ckpt_valid_blocks (wrapped with get_seg_vblocks()) in where we has recorded
fsynced data count.

> +		}
> +		if (section_valid_blocks)
> +			continue;
> +
> +		/* check the cursegs do not use the section */
> +		for (i = 0; i < NO_CHECK_TYPE; i++) {
> +			cs = &SM_I(sbi)->curseg_array[i];
> +			if (GET_SEC_FROM_SEG(sbi, cs->segno) == sec)
> +				break;
> +		}
> +		if (i >= NR_CURSEG_TYPE) {
> +			*segno = GET_SEG_FROM_SEC(sbi, sec);
> +			return 0;
> +		}
> +	}
> +
> +	return -1;
> +}
> +
> +struct write_pointer_check_data {
> +	struct f2fs_sb_info *sbi;
> +	struct device_info *dev;
> +};
> +
> +static int fsck_chk_write_pointer(int i, struct blk_zone *blkz, void *opaque)
> +{
> +	struct write_pointer_check_data *wpd = opaque;
> +	struct f2fs_sb_info *sbi = wpd->sbi;
> +	struct device_info *dev = wpd->dev;
> +	struct f2fs_fsck *fsck = F2FS_FSCK(sbi);
> +	block_t zone_block, wp_block, wp_blkoff, cs_block, b;
> +	unsigned int zone_segno, wp_segno, new_segno;
> +	struct seg_entry *se;
> +	struct curseg_info *cs;
> +	int cs_index, ret;
> +	int log_sectors_per_block = sbi->log_blocksize - SECTOR_SHIFT;
> +	unsigned int segs_per_zone = sbi->segs_per_sec * sbi->secs_per_zone;
> +	void *zero_blk;
> +
> +	if (blk_zone_conv(blkz))
> +		return 0;
> +
> +	zone_block = dev->start_blkaddr
> +		+ (blk_zone_sector(blkz) >> log_sectors_per_block);
> +	zone_segno = GET_SEGNO(sbi, zone_block);
> +	wp_block = dev->start_blkaddr
> +		+ (blk_zone_wp_sector(blkz) >> log_sectors_per_block);
> +	wp_segno = GET_SEGNO(sbi, wp_block);
> +	wp_blkoff = wp_block - START_BLOCK(sbi, wp_segno);
> +
> +	/* find the curseg which points to the zone */
> +	for (cs_index = 0; cs_index < NO_CHECK_TYPE; cs_index++) {
> +		cs = &SM_I(sbi)->curseg_array[cs_index];
> +		if (zone_segno <= cs->segno &&
> +		    cs->segno < zone_segno + segs_per_zone)
> +			break;
> +	}
> +
> +	if (cs_index >= NR_CURSEG_TYPE)
> +		return 0;
> +
> +	/* check write pointer consistency with the curseg in the zone */
> +	cs_block = START_BLOCK(sbi, cs->segno) + cs->next_blkoff;
> +	if (wp_block == cs_block)
> +		return 0;
> +
> +	if (!c.fix_on) {
> +		MSG(0, "Inconsistent write pointer: "
> +		    "curseg %d[0x%x,0x%x] wp[0x%x,0x%x]\n",
> +		    cs_index, cs->segno, cs->next_blkoff, wp_segno, wp_blkoff);
> +		fsck->chk.wp_inconsistent_zones++;
> +		return 0;
> +	}
> +
> +	/*
> +	 * If the curseg is in advance from the write pointer, write zero to
> +	 * move the write pointer forward to the same position as the curseg.
> +	 */
> +	if (wp_block < cs_block) {
> +		ret = 0;
> +		zero_blk = calloc(BLOCK_SZ, 1);
> +		if (!zero_blk)
> +			return -EINVAL;
> +
> +		FIX_MSG("Advance write pointer to match with curseg %d: "
> +			"[0x%x,0x%x]->[0x%x,0x%x]",
> +			cs_index, wp_segno, wp_blkoff,
> +			cs->segno, cs->next_blkoff);
> +		for (b = wp_block; b < cs_block && !ret; b++)
> +			ret = dev_write_block(zero_blk, b);
> +
> +		fsck->chk.wp_fixed_zones++;
> +		free(zero_blk);
> +		return ret;
> +	}
> +
> +	if (wp_segno == zone_segno + segs_per_zone) {
> +		/*
> +		 * If the write pointer is in advance from the curseg and at
> +		 * the zone end (section end), search a new free zone (section)
> +		 * between the curseg and main area end.
> +		 */
> +		new_segno = wp_segno;
> +		ret = find_next_free_section(sbi, &new_segno);
> +		if (ret) {
> +			/* search again from main area start */
> +			new_segno = GET_SEGNO(sbi, SM_I(sbi)->main_blkaddr);
> +			ret = find_next_free_section(sbi, &new_segno);
> +		}

If curseg's type is warm node, relocating curseg would ruin warm node chain,
result in losing fsynced data for ever as well.

So I guess we should migrate all dnode blocks chained with cs->next_blkoff in
current section.

> +		if (ret) {
> +			MSG(0, "Free section not found\n");
> +			return ret;
> +		}
> +		FIX_MSG("New section for curseg %d: [0x%x,0x%x]->[0x%x,0x%x]",
> +			cs_index, cs->segno, cs->next_blkoff, new_segno, 0);
> +		cs->segno = new_segno;
> +		cs->next_blkoff = 0;
> +	} else {
> +		/*
> +		 * If the write pointer is in advance from the curseg within
> +		 * the zone, modify the curseg position to be same as the
> +		 * write pointer.
> +		 */
> +		ASSERT(wp_segno < zone_segno + segs_per_zone);
> +		FIX_MSG("Advance curseg %d: [0x%x,0x%x]->[0x%x,0x%x]",
> +			cs_index, cs->segno, cs->next_blkoff,
> +			wp_segno, wp_blkoff);
> +		cs->segno = wp_segno;
> +		cs->next_blkoff = wp_blkoff;

The same data lose problem here, I guess we'd better handle it with the same way
as above.

Thoughts?

Thanks,

> +	}
> +
> +	se = get_seg_entry(sbi, cs->segno);
> +	se->type = cs_index;
> +	fsck->chk.wp_fixed_zones++;
> +
> +	return 0;
> +}
> +
> +void fsck_chk_write_pointers(struct f2fs_sb_info *sbi)
> +{
> +	unsigned int i;
> +	struct f2fs_fsck *fsck = F2FS_FSCK(sbi);
> +	struct write_pointer_check_data wpd = {	sbi, NULL };
> +
> +	if (c.zoned_model != F2FS_ZONED_HM)
> +		return;
> +
> +	for (i = 0; i < MAX_DEVICES; i++) {
> +		if (!c.devices[i].path)
> +			break;
> +
> +		wpd.dev = c.devices + i;
> +		if (f2fs_report_zones(i, fsck_chk_write_pointer, &wpd)) {
> +			printf("[FSCK] Write pointer check failed: %s\n",
> +			       c.devices[i].path);
> +			return;
> +		}
> +	}
> +
> +	if (fsck->chk.wp_fixed_zones && c.fix_on)
> +		write_curseg_info(sbi);
> +}
> +
>  int fsck_chk_curseg_info(struct f2fs_sb_info *sbi)
>  {
>  	struct curseg_info *curseg;
> @@ -2624,6 +2808,20 @@ int fsck_verify(struct f2fs_sb_info *sbi)
>  
>  	printf("\n");
>  
> +	if (c.zoned_model == F2FS_ZONED_HM) {
> +		printf("[FSCK] Write pointers consistency                    ");
> +		if (fsck->chk.wp_inconsistent_zones == 0x0) {
> +			printf(" [Ok..]\n");
> +		} else {
> +			printf(" [Fail] [0x%x]\n",
> +			       fsck->chk.wp_inconsistent_zones);
> +			c.bug_on = 1;
> +		}
> +
> +		if (fsck->chk.wp_fixed_zones && c.fix_on)
> +			force = 1;
> +	}
> +
>  	if (c.feature & cpu_to_le32(F2FS_FEATURE_LOST_FOUND)) {
>  		for (i = 0; i < fsck->nr_nat_entries; i++)
>  			if (f2fs_test_bit(i, fsck->nat_area_bitmap) != 0)
> diff --git a/fsck/fsck.h b/fsck/fsck.h
> index d38e8de..aa3dbe7 100644
> --- a/fsck/fsck.h
> +++ b/fsck/fsck.h
> @@ -80,6 +80,8 @@ struct f2fs_fsck {
>  		u32 multi_hard_link_files;
>  		u64 sit_valid_blocks;
>  		u32 sit_free_segs;
> +		u32 wp_fixed_zones;
> +		u32 wp_inconsistent_zones;
>  	} chk;
>  
>  	struct hard_link_node *hard_link_list_head;
> @@ -156,6 +158,7 @@ int fsck_chk_inline_dentries(struct f2fs_sb_info *, struct f2fs_node *,
>  		struct child_info *);
>  void fsck_chk_checkpoint(struct f2fs_sb_info *sbi);
>  int fsck_chk_meta(struct f2fs_sb_info *sbi);
> +void fsck_chk_write_pointers(struct f2fs_sb_info *);
>  int fsck_chk_curseg_info(struct f2fs_sb_info *);
>  void pretty_print_filename(const u8 *raw_name, u32 len,
>  			   char out[F2FS_PRINT_NAMELEN], int enc_name);
> diff --git a/fsck/main.c b/fsck/main.c
> index 9aca024..4b4a789 100644
> --- a/fsck/main.c
> +++ b/fsck/main.c
> @@ -584,6 +584,8 @@ static void do_fsck(struct f2fs_sb_info *sbi)
>  
>  	print_cp_state(flag);
>  
> +	fsck_chk_write_pointers(sbi);
> +
>  	fsck_chk_curseg_info(sbi);
>  
>  	if (!c.fix_on && !c.bug_on) {
> 


_______________________________________________
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [f2fs-dev] [PATCH v4 2/2] fsck.f2fs: Check write pointer consistency with current segments
  2019-09-02  7:02   ` Chao Yu
@ 2019-09-03  8:37     ` Shinichiro Kawasaki
  2019-09-05  9:58       ` Chao Yu
  0 siblings, 1 reply; 18+ messages in thread
From: Shinichiro Kawasaki @ 2019-09-03  8:37 UTC (permalink / raw)
  To: Chao Yu; +Cc: Jaegeuk Kim, Damien Le Moal, linux-f2fs-devel

On Sep 02, 2019 / 15:02, Chao Yu wrote:
> On 2019/8/30 18:19, Shin'ichiro Kawasaki wrote:
> > On sudden f2fs shutdown, zoned block device status and f2fs current
> > segment positions in meta data can be inconsistent. When f2fs shutdown
> > happens before write operations completes, write pointers of zoned block
> > devices can go further but f2fs meta data keeps current segments at
> > positions before the write operations. After remounting the f2fs, the
> > inconsistency causes write operations not at write pointers and
> > "Unaligned write command" error is reported. This error was observed when
> > xfstests test case generic/388 was run with f2fs on a zoned block device.
> > 
> > To avoid the error, have f2fs.fsck check consistency between each current
> > segment's position and the write pointer of the zone the current segment
> > points to. If the write pointer goes advance from the current segment,
> > fix the current segment position setting at same as the write pointer
> > position. If the write pointer goes to the zone end, find a new zone and
> > set the current segment position at the new zone start. In case the write
> > pointer is behind the current segment, write zero data at the write
> > pointer position to make write pointer position at same as the current
> > segment.
> > 
> > When inconsistencies are found, turn on c.bug_on flag in fsck_verify() to
> > ask users to fix them or not. When inconsistencies get fixed, turn on
> > 'force' flag in fsck_verify() to enforce fixes in following checks. This
> > position fix is done at the beginning of do_fsck() function so that other
> > checks reflect the current segment modification.
> > 
> > Also add GET_SEC_FROM_SEG and GET_SEG_FROM_SEC macros in fsck/fsck.h to
> > simplify the code.
> > 
> > Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> > ---
> >  fsck/f2fs.h |   5 ++
> >  fsck/fsck.c | 198 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  fsck/fsck.h |   3 +
> >  fsck/main.c |   2 +
> >  4 files changed, 208 insertions(+)
> > 
> > diff --git a/fsck/f2fs.h b/fsck/f2fs.h
> > index 4dc6698..2c1c2b3 100644
> > --- a/fsck/f2fs.h
> > +++ b/fsck/f2fs.h
> > @@ -337,6 +337,11 @@ static inline block_t __end_block_addr(struct f2fs_sb_info *sbi)
> >  #define GET_BLKOFF_FROM_SEG0(sbi, blk_addr)				\
> >  	(GET_SEGOFF_FROM_SEG0(sbi, blk_addr) & (sbi->blocks_per_seg - 1))
> >  
> > +#define GET_SEC_FROM_SEG(sbi, segno)					\
> > +	((segno) / (sbi)->segs_per_sec)
> > +#define GET_SEG_FROM_SEC(sbi, secno)					\
> > +	((secno) * (sbi)->segs_per_sec)
> > +
> >  #define FREE_I_START_SEGNO(sbi)						\
> >  	GET_SEGNO_FROM_SEG0(sbi, SM_I(sbi)->main_blkaddr)
> >  #define GET_R2L_SEGNO(sbi, segno)	(segno + FREE_I_START_SEGNO(sbi))
> > diff --git a/fsck/fsck.c b/fsck/fsck.c
> > index 8953ca1..a0f6849 100644
> > --- a/fsck/fsck.c
> > +++ b/fsck/fsck.c
> > @@ -2574,6 +2574,190 @@ out:
> >  	return cnt;
> >  }
> >  
> > +/*
> > + * Search a free section in main area. Start search from the section specified
> > + * with segno argument toward main area end. Return first segment of the found
> > + * section in segno argument.
> > + */
> > +static int find_next_free_section(struct f2fs_sb_info *sbi,
> > +				  unsigned int *segno)
> > +{
> > +	unsigned int i, sec, section_valid_blocks;
> > +	unsigned int end_segno = GET_SEGNO(sbi, SM_I(sbi)->main_blkaddr)
> > +		+ SM_I(sbi)->main_segments;
> > +	unsigned int end_sec = GET_SEC_FROM_SEG(sbi, end_segno);
> > +	struct seg_entry *se;
> > +	struct curseg_info *cs;
> > +
> > +	for (sec = GET_SEC_FROM_SEG(sbi, *segno); sec < end_sec; sec++) {
> > +		/* find a section without valid blocks */
> > +		section_valid_blocks = 0;
> > +		for (i = 0; i < sbi->segs_per_sec; i++) {
> > +			se = get_seg_entry(sbi, GET_SEG_FROM_SEC(sbi, sec) + i);
> > +			section_valid_blocks += se->valid_blocks;
> 
> If we want to find a 'real' free section, we'd better to use
> se->ckpt_valid_blocks (wrapped with get_seg_vblocks()) in where we has recorded
> fsynced data count.

Thanks. When I create next patch series, I will use get_seg_vblocks().
I will rebase to dev-test branch in which get_seg_vblocks() is available.

> 
> > +		}
> > +		if (section_valid_blocks)
> > +			continue;
> > +
> > +		/* check the cursegs do not use the section */
> > +		for (i = 0; i < NO_CHECK_TYPE; i++) {
> > +			cs = &SM_I(sbi)->curseg_array[i];
> > +			if (GET_SEC_FROM_SEG(sbi, cs->segno) == sec)
> > +				break;
> > +		}
> > +		if (i >= NR_CURSEG_TYPE) {
> > +			*segno = GET_SEG_FROM_SEC(sbi, sec);
> > +			return 0;
> > +		}
> > +	}
> > +
> > +	return -1;
> > +}
> > +
> > +struct write_pointer_check_data {
> > +	struct f2fs_sb_info *sbi;
> > +	struct device_info *dev;
> > +};
> > +
> > +static int fsck_chk_write_pointer(int i, struct blk_zone *blkz, void *opaque)
> > +{
> > +	struct write_pointer_check_data *wpd = opaque;
> > +	struct f2fs_sb_info *sbi = wpd->sbi;
> > +	struct device_info *dev = wpd->dev;
> > +	struct f2fs_fsck *fsck = F2FS_FSCK(sbi);
> > +	block_t zone_block, wp_block, wp_blkoff, cs_block, b;
> > +	unsigned int zone_segno, wp_segno, new_segno;
> > +	struct seg_entry *se;
> > +	struct curseg_info *cs;
> > +	int cs_index, ret;
> > +	int log_sectors_per_block = sbi->log_blocksize - SECTOR_SHIFT;
> > +	unsigned int segs_per_zone = sbi->segs_per_sec * sbi->secs_per_zone;
> > +	void *zero_blk;
> > +
> > +	if (blk_zone_conv(blkz))
> > +		return 0;
> > +
> > +	zone_block = dev->start_blkaddr
> > +		+ (blk_zone_sector(blkz) >> log_sectors_per_block);
> > +	zone_segno = GET_SEGNO(sbi, zone_block);
> > +	wp_block = dev->start_blkaddr
> > +		+ (blk_zone_wp_sector(blkz) >> log_sectors_per_block);
> > +	wp_segno = GET_SEGNO(sbi, wp_block);
> > +	wp_blkoff = wp_block - START_BLOCK(sbi, wp_segno);
> > +
> > +	/* find the curseg which points to the zone */
> > +	for (cs_index = 0; cs_index < NO_CHECK_TYPE; cs_index++) {
> > +		cs = &SM_I(sbi)->curseg_array[cs_index];
> > +		if (zone_segno <= cs->segno &&
> > +		    cs->segno < zone_segno + segs_per_zone)
> > +			break;
> > +	}
> > +
> > +	if (cs_index >= NR_CURSEG_TYPE)
> > +		return 0;
> > +
> > +	/* check write pointer consistency with the curseg in the zone */
> > +	cs_block = START_BLOCK(sbi, cs->segno) + cs->next_blkoff;
> > +	if (wp_block == cs_block)
> > +		return 0;
> > +
> > +	if (!c.fix_on) {
> > +		MSG(0, "Inconsistent write pointer: "
> > +		    "curseg %d[0x%x,0x%x] wp[0x%x,0x%x]\n",
> > +		    cs_index, cs->segno, cs->next_blkoff, wp_segno, wp_blkoff);
> > +		fsck->chk.wp_inconsistent_zones++;
> > +		return 0;
> > +	}
> > +
> > +	/*
> > +	 * If the curseg is in advance from the write pointer, write zero to
> > +	 * move the write pointer forward to the same position as the curseg.
> > +	 */
> > +	if (wp_block < cs_block) {
> > +		ret = 0;
> > +		zero_blk = calloc(BLOCK_SZ, 1);
> > +		if (!zero_blk)
> > +			return -EINVAL;
> > +
> > +		FIX_MSG("Advance write pointer to match with curseg %d: "
> > +			"[0x%x,0x%x]->[0x%x,0x%x]",
> > +			cs_index, wp_segno, wp_blkoff,
> > +			cs->segno, cs->next_blkoff);
> > +		for (b = wp_block; b < cs_block && !ret; b++)
> > +			ret = dev_write_block(zero_blk, b);
> > +
> > +		fsck->chk.wp_fixed_zones++;
> > +		free(zero_blk);
> > +		return ret;
> > +	}
> > +
> > +	if (wp_segno == zone_segno + segs_per_zone) {
> > +		/*
> > +		 * If the write pointer is in advance from the curseg and at
> > +		 * the zone end (section end), search a new free zone (section)
> > +		 * between the curseg and main area end.
> > +		 */
> > +		new_segno = wp_segno;
> > +		ret = find_next_free_section(sbi, &new_segno);
> > +		if (ret) {
> > +			/* search again from main area start */
> > +			new_segno = GET_SEGNO(sbi, SM_I(sbi)->main_blkaddr);
> > +			ret = find_next_free_section(sbi, &new_segno);
> > +		}
> 
> If curseg's type is warm node, relocating curseg would ruin warm node chain,
> result in losing fsynced data for ever as well.
> 
> So I guess we should migrate all dnode blocks chained with cs->next_blkoff in
> current section.
> 
> > +		if (ret) {
> > +			MSG(0, "Free section not found\n");
> > +			return ret;
> > +		}
> > +		FIX_MSG("New section for curseg %d: [0x%x,0x%x]->[0x%x,0x%x]",
> > +			cs_index, cs->segno, cs->next_blkoff, new_segno, 0);
> > +		cs->segno = new_segno;
> > +		cs->next_blkoff = 0;
> > +	} else {
> > +		/*
> > +		 * If the write pointer is in advance from the curseg within
> > +		 * the zone, modify the curseg position to be same as the
> > +		 * write pointer.
> > +		 */
> > +		ASSERT(wp_segno < zone_segno + segs_per_zone);
> > +		FIX_MSG("Advance curseg %d: [0x%x,0x%x]->[0x%x,0x%x]",
> > +			cs_index, cs->segno, cs->next_blkoff,
> > +			wp_segno, wp_blkoff);
> > +		cs->segno = wp_segno;
> > +		cs->next_blkoff = wp_blkoff;
> 
> The same data lose problem here, I guess we'd better handle it with the same way
> as above.
> 
> Thoughts?

I created f2fs status with fsync data and warm node chain, and confirmed the v4
implementation ruins the dnode blocks chain. Hmm.

My understanding is that when f2fs kernel module recovers the fsync data, it
sets the warm node curseg position at the start of the fsync data, and the fsync
data will be overwritten as new nodes created. Is this understanding correct?

If this is the case, I think write pointer inconsistency will happen even if
fsck does "migrate all dnode blocks" (I assume that the warm node curseg's next
blkoff points to the migrated dnode block chain start). After the fsync data fix
by kernel, warm node curseg will not point to the write pointer position.
Anyway, kernel code change will be required to fix the inconsistency after fsync
data fix.

How about to have fsck leave warm node curseg position untouched if the fsync
data exists? The kernel side change for write pointer inconsistency will be able
to cover this case.

--
Best Regards,
Shin'ichiro Kawasaki

_______________________________________________
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [f2fs-dev] [PATCH v4 2/2] fsck.f2fs: Check write pointer consistency with current segments
  2019-09-03  8:37     ` Shinichiro Kawasaki
@ 2019-09-05  9:58       ` Chao Yu
  2019-09-06  8:31         ` Shinichiro Kawasaki
  0 siblings, 1 reply; 18+ messages in thread
From: Chao Yu @ 2019-09-05  9:58 UTC (permalink / raw)
  To: Shinichiro Kawasaki; +Cc: Jaegeuk Kim, Damien Le Moal, linux-f2fs-devel

Hi Shinichiro,

Sorry for the delay.

On 2019/9/3 16:37, Shinichiro Kawasaki wrote:
> On Sep 02, 2019 / 15:02, Chao Yu wrote:
>> On 2019/8/30 18:19, Shin'ichiro Kawasaki wrote:
>>> On sudden f2fs shutdown, zoned block device status and f2fs current
>>> segment positions in meta data can be inconsistent. When f2fs shutdown
>>> happens before write operations completes, write pointers of zoned block
>>> devices can go further but f2fs meta data keeps current segments at
>>> positions before the write operations. After remounting the f2fs, the
>>> inconsistency causes write operations not at write pointers and
>>> "Unaligned write command" error is reported. This error was observed when
>>> xfstests test case generic/388 was run with f2fs on a zoned block device.
>>>
>>> To avoid the error, have f2fs.fsck check consistency between each current
>>> segment's position and the write pointer of the zone the current segment
>>> points to. If the write pointer goes advance from the current segment,
>>> fix the current segment position setting at same as the write pointer
>>> position. If the write pointer goes to the zone end, find a new zone and
>>> set the current segment position at the new zone start. In case the write
>>> pointer is behind the current segment, write zero data at the write
>>> pointer position to make write pointer position at same as the current
>>> segment.
>>>
>>> When inconsistencies are found, turn on c.bug_on flag in fsck_verify() to
>>> ask users to fix them or not. When inconsistencies get fixed, turn on
>>> 'force' flag in fsck_verify() to enforce fixes in following checks. This
>>> position fix is done at the beginning of do_fsck() function so that other
>>> checks reflect the current segment modification.
>>>
>>> Also add GET_SEC_FROM_SEG and GET_SEG_FROM_SEC macros in fsck/fsck.h to
>>> simplify the code.
>>>
>>> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
>>> ---
>>>  fsck/f2fs.h |   5 ++
>>>  fsck/fsck.c | 198 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>  fsck/fsck.h |   3 +
>>>  fsck/main.c |   2 +
>>>  4 files changed, 208 insertions(+)
>>>
>>> diff --git a/fsck/f2fs.h b/fsck/f2fs.h
>>> index 4dc6698..2c1c2b3 100644
>>> --- a/fsck/f2fs.h
>>> +++ b/fsck/f2fs.h
>>> @@ -337,6 +337,11 @@ static inline block_t __end_block_addr(struct f2fs_sb_info *sbi)
>>>  #define GET_BLKOFF_FROM_SEG0(sbi, blk_addr)				\
>>>  	(GET_SEGOFF_FROM_SEG0(sbi, blk_addr) & (sbi->blocks_per_seg - 1))
>>>  
>>> +#define GET_SEC_FROM_SEG(sbi, segno)					\
>>> +	((segno) / (sbi)->segs_per_sec)
>>> +#define GET_SEG_FROM_SEC(sbi, secno)					\
>>> +	((secno) * (sbi)->segs_per_sec)
>>> +
>>>  #define FREE_I_START_SEGNO(sbi)						\
>>>  	GET_SEGNO_FROM_SEG0(sbi, SM_I(sbi)->main_blkaddr)
>>>  #define GET_R2L_SEGNO(sbi, segno)	(segno + FREE_I_START_SEGNO(sbi))
>>> diff --git a/fsck/fsck.c b/fsck/fsck.c
>>> index 8953ca1..a0f6849 100644
>>> --- a/fsck/fsck.c
>>> +++ b/fsck/fsck.c
>>> @@ -2574,6 +2574,190 @@ out:
>>>  	return cnt;
>>>  }
>>>  
>>> +/*
>>> + * Search a free section in main area. Start search from the section specified
>>> + * with segno argument toward main area end. Return first segment of the found
>>> + * section in segno argument.
>>> + */
>>> +static int find_next_free_section(struct f2fs_sb_info *sbi,
>>> +				  unsigned int *segno)
>>> +{
>>> +	unsigned int i, sec, section_valid_blocks;
>>> +	unsigned int end_segno = GET_SEGNO(sbi, SM_I(sbi)->main_blkaddr)
>>> +		+ SM_I(sbi)->main_segments;
>>> +	unsigned int end_sec = GET_SEC_FROM_SEG(sbi, end_segno);
>>> +	struct seg_entry *se;
>>> +	struct curseg_info *cs;
>>> +
>>> +	for (sec = GET_SEC_FROM_SEG(sbi, *segno); sec < end_sec; sec++) {
>>> +		/* find a section without valid blocks */
>>> +		section_valid_blocks = 0;
>>> +		for (i = 0; i < sbi->segs_per_sec; i++) {
>>> +			se = get_seg_entry(sbi, GET_SEG_FROM_SEC(sbi, sec) + i);
>>> +			section_valid_blocks += se->valid_blocks;
>>
>> If we want to find a 'real' free section, we'd better to use
>> se->ckpt_valid_blocks (wrapped with get_seg_vblocks()) in where we has recorded
>> fsynced data count.
> 
> Thanks. When I create next patch series, I will use get_seg_vblocks().
> I will rebase to dev-test branch in which get_seg_vblocks() is available.
> 
>>
>>> +		}
>>> +		if (section_valid_blocks)
>>> +			continue;
>>> +
>>> +		/* check the cursegs do not use the section */
>>> +		for (i = 0; i < NO_CHECK_TYPE; i++) {
>>> +			cs = &SM_I(sbi)->curseg_array[i];
>>> +			if (GET_SEC_FROM_SEG(sbi, cs->segno) == sec)
>>> +				break;
>>> +		}
>>> +		if (i >= NR_CURSEG_TYPE) {
>>> +			*segno = GET_SEG_FROM_SEC(sbi, sec);
>>> +			return 0;
>>> +		}
>>> +	}
>>> +
>>> +	return -1;
>>> +}
>>> +
>>> +struct write_pointer_check_data {
>>> +	struct f2fs_sb_info *sbi;
>>> +	struct device_info *dev;
>>> +};
>>> +
>>> +static int fsck_chk_write_pointer(int i, struct blk_zone *blkz, void *opaque)
>>> +{
>>> +	struct write_pointer_check_data *wpd = opaque;
>>> +	struct f2fs_sb_info *sbi = wpd->sbi;
>>> +	struct device_info *dev = wpd->dev;
>>> +	struct f2fs_fsck *fsck = F2FS_FSCK(sbi);
>>> +	block_t zone_block, wp_block, wp_blkoff, cs_block, b;
>>> +	unsigned int zone_segno, wp_segno, new_segno;
>>> +	struct seg_entry *se;
>>> +	struct curseg_info *cs;
>>> +	int cs_index, ret;
>>> +	int log_sectors_per_block = sbi->log_blocksize - SECTOR_SHIFT;
>>> +	unsigned int segs_per_zone = sbi->segs_per_sec * sbi->secs_per_zone;
>>> +	void *zero_blk;
>>> +
>>> +	if (blk_zone_conv(blkz))
>>> +		return 0;
>>> +
>>> +	zone_block = dev->start_blkaddr
>>> +		+ (blk_zone_sector(blkz) >> log_sectors_per_block);
>>> +	zone_segno = GET_SEGNO(sbi, zone_block);
>>> +	wp_block = dev->start_blkaddr
>>> +		+ (blk_zone_wp_sector(blkz) >> log_sectors_per_block);
>>> +	wp_segno = GET_SEGNO(sbi, wp_block);
>>> +	wp_blkoff = wp_block - START_BLOCK(sbi, wp_segno);
>>> +
>>> +	/* find the curseg which points to the zone */
>>> +	for (cs_index = 0; cs_index < NO_CHECK_TYPE; cs_index++) {
>>> +		cs = &SM_I(sbi)->curseg_array[cs_index];
>>> +		if (zone_segno <= cs->segno &&
>>> +		    cs->segno < zone_segno + segs_per_zone)
>>> +			break;
>>> +	}
>>> +
>>> +	if (cs_index >= NR_CURSEG_TYPE)
>>> +		return 0;
>>> +
>>> +	/* check write pointer consistency with the curseg in the zone */
>>> +	cs_block = START_BLOCK(sbi, cs->segno) + cs->next_blkoff;
>>> +	if (wp_block == cs_block)
>>> +		return 0;
>>> +
>>> +	if (!c.fix_on) {
>>> +		MSG(0, "Inconsistent write pointer: "
>>> +		    "curseg %d[0x%x,0x%x] wp[0x%x,0x%x]\n",
>>> +		    cs_index, cs->segno, cs->next_blkoff, wp_segno, wp_blkoff);
>>> +		fsck->chk.wp_inconsistent_zones++;
>>> +		return 0;
>>> +	}
>>> +
>>> +	/*
>>> +	 * If the curseg is in advance from the write pointer, write zero to
>>> +	 * move the write pointer forward to the same position as the curseg.
>>> +	 */
>>> +	if (wp_block < cs_block) {
>>> +		ret = 0;
>>> +		zero_blk = calloc(BLOCK_SZ, 1);
>>> +		if (!zero_blk)
>>> +			return -EINVAL;
>>> +
>>> +		FIX_MSG("Advance write pointer to match with curseg %d: "
>>> +			"[0x%x,0x%x]->[0x%x,0x%x]",
>>> +			cs_index, wp_segno, wp_blkoff,
>>> +			cs->segno, cs->next_blkoff);
>>> +		for (b = wp_block; b < cs_block && !ret; b++)
>>> +			ret = dev_write_block(zero_blk, b);
>>> +
>>> +		fsck->chk.wp_fixed_zones++;
>>> +		free(zero_blk);
>>> +		return ret;
>>> +	}
>>> +
>>> +	if (wp_segno == zone_segno + segs_per_zone) {
>>> +		/*
>>> +		 * If the write pointer is in advance from the curseg and at
>>> +		 * the zone end (section end), search a new free zone (section)
>>> +		 * between the curseg and main area end.
>>> +		 */
>>> +		new_segno = wp_segno;
>>> +		ret = find_next_free_section(sbi, &new_segno);
>>> +		if (ret) {
>>> +			/* search again from main area start */
>>> +			new_segno = GET_SEGNO(sbi, SM_I(sbi)->main_blkaddr);
>>> +			ret = find_next_free_section(sbi, &new_segno);
>>> +		}
>>
>> If curseg's type is warm node, relocating curseg would ruin warm node chain,
>> result in losing fsynced data for ever as well.
>>
>> So I guess we should migrate all dnode blocks chained with cs->next_blkoff in
>> current section.
>>
>>> +		if (ret) {
>>> +			MSG(0, "Free section not found\n");
>>> +			return ret;
>>> +		}
>>> +		FIX_MSG("New section for curseg %d: [0x%x,0x%x]->[0x%x,0x%x]",
>>> +			cs_index, cs->segno, cs->next_blkoff, new_segno, 0);
>>> +		cs->segno = new_segno;
>>> +		cs->next_blkoff = 0;
>>> +	} else {
>>> +		/*
>>> +		 * If the write pointer is in advance from the curseg within
>>> +		 * the zone, modify the curseg position to be same as the
>>> +		 * write pointer.
>>> +		 */
>>> +		ASSERT(wp_segno < zone_segno + segs_per_zone);
>>> +		FIX_MSG("Advance curseg %d: [0x%x,0x%x]->[0x%x,0x%x]",
>>> +			cs_index, cs->segno, cs->next_blkoff,
>>> +			wp_segno, wp_blkoff);
>>> +		cs->segno = wp_segno;
>>> +		cs->next_blkoff = wp_blkoff;
>>
>> The same data lose problem here, I guess we'd better handle it with the same way
>> as above.
>>
>> Thoughts?
> 
> I created f2fs status with fsync data and warm node chain, and confirmed the v4
> implementation ruins the dnode blocks chain. Hmm.
> 
> My understanding is that when f2fs kernel module recovers the fsync data, it
> sets the warm node curseg position at the start of the fsync data, and the fsync
> data will be overwritten as new nodes created. Is this understanding correct?

Sorry, I'm not sure I've understood you correctly.

Current recovery flow is:
1)   find all valid fsynced dnode in warm node chain
2.a) recover fsynced dnode in memory, and set it dirty
2.b) recover directory entry in memory, and set it dirty
2.c) during regular's dnode recovery, physical blocks indexed by recovered dnode
will be revalidated
Note: we didn't move any cursegs before 3)
3)   relocate all cursegs to new segments
4)   trigger checkpoint to persist all recovered data(fs' meta, file's meta/data)

> 
> If this is the case, I think write pointer inconsistency will happen even if
> fsck does "migrate all dnode blocks" (I assume that the warm node curseg's next

Actually, I notice that scheme's problem is: we've changed zone's write pointer
during dnode blocks migration, however if kernel drops recovery, we will still
face inconsistent status in between .next_blkaddr and .write_pointer, once we
start to write data from .next_blkaddr. So in fsck, after migration, we need to
reset .write_pointer to .next_blkaddr.

I guess we need to consider four cases:

o: support .write_pointer recovery
x: not support .write_pointer recovery

1) kernel: o, fsck: x, trigger recovery in kernel
2) kernel: o, fsck: x, not trigger recovery in kernel
3) kernel: x, fsck: o, trigger recovery in kernel
4) kernel: x, fsck: o, not trigger recovery in kernel

For 1) and 2), we can simply adjust to reset all invalid zone and allocate new
zone for curseg before data/meta writes.

For 3) and 4), I haven't figured out a way handling all cases perfectly.

> blkoff points to the migrated dnode block chain start). After the fsync data fix
> by kernel, warm node curseg will not point to the write pointer position.
> Anyway, kernel code change will be required to fix the inconsistency after fsync
> data fix.
> 
> How about to have fsck leave warm node curseg position untouched if the fsync
> data exists? The kernel side change for write pointer inconsistency will be able
> to cover this case.

If kernel can handle such case, fsck doesn't need to do any thing, IIUC.

Thanks,

> 
> --
> Best Regards,
> Shin'ichiro Kawasaki.
> 


_______________________________________________
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [f2fs-dev] [PATCH v4 2/2] fsck.f2fs: Check write pointer consistency with current segments
  2019-09-05  9:58       ` Chao Yu
@ 2019-09-06  8:31         ` Shinichiro Kawasaki
  2019-09-09  7:14           ` Chao Yu
  0 siblings, 1 reply; 18+ messages in thread
From: Shinichiro Kawasaki @ 2019-09-06  8:31 UTC (permalink / raw)
  To: Chao Yu; +Cc: Jaegeuk Kim, Damien Le Moal, linux-f2fs-devel

On Sep 05, 2019 / 17:58, Chao Yu wrote:
> Hi Shinichiro,
> 
> Sorry for the delay.
> 
> On 2019/9/3 16:37, Shinichiro Kawasaki wrote:
> > On Sep 02, 2019 / 15:02, Chao Yu wrote:
> >> On 2019/8/30 18:19, Shin'ichiro Kawasaki wrote:
> >>> On sudden f2fs shutdown, zoned block device status and f2fs current
> >>> segment positions in meta data can be inconsistent. When f2fs shutdown
> >>> happens before write operations completes, write pointers of zoned block
> >>> devices can go further but f2fs meta data keeps current segments at
> >>> positions before the write operations. After remounting the f2fs, the
> >>> inconsistency causes write operations not at write pointers and
> >>> "Unaligned write command" error is reported. This error was observed when
> >>> xfstests test case generic/388 was run with f2fs on a zoned block device.
> >>>
> >>> To avoid the error, have f2fs.fsck check consistency between each current
> >>> segment's position and the write pointer of the zone the current segment
> >>> points to. If the write pointer goes advance from the current segment,
> >>> fix the current segment position setting at same as the write pointer
> >>> position. If the write pointer goes to the zone end, find a new zone and
> >>> set the current segment position at the new zone start. In case the write
> >>> pointer is behind the current segment, write zero data at the write
> >>> pointer position to make write pointer position at same as the current
> >>> segment.
> >>>
> >>> When inconsistencies are found, turn on c.bug_on flag in fsck_verify() to
> >>> ask users to fix them or not. When inconsistencies get fixed, turn on
> >>> 'force' flag in fsck_verify() to enforce fixes in following checks. This
> >>> position fix is done at the beginning of do_fsck() function so that other
> >>> checks reflect the current segment modification.
> >>>
> >>> Also add GET_SEC_FROM_SEG and GET_SEG_FROM_SEC macros in fsck/fsck.h to
> >>> simplify the code.
> >>>
> >>> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> >>> ---
> >>>  fsck/f2fs.h |   5 ++
> >>>  fsck/fsck.c | 198 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>  fsck/fsck.h |   3 +
> >>>  fsck/main.c |   2 +
> >>>  4 files changed, 208 insertions(+)
> >>>
> >>> diff --git a/fsck/f2fs.h b/fsck/f2fs.h
> >>> index 4dc6698..2c1c2b3 100644
> >>> --- a/fsck/f2fs.h
> >>> +++ b/fsck/f2fs.h
> >>> @@ -337,6 +337,11 @@ static inline block_t __end_block_addr(struct f2fs_sb_info *sbi)
> >>>  #define GET_BLKOFF_FROM_SEG0(sbi, blk_addr)				\
> >>>  	(GET_SEGOFF_FROM_SEG0(sbi, blk_addr) & (sbi->blocks_per_seg - 1))
> >>>  
> >>> +#define GET_SEC_FROM_SEG(sbi, segno)					\
> >>> +	((segno) / (sbi)->segs_per_sec)
> >>> +#define GET_SEG_FROM_SEC(sbi, secno)					\
> >>> +	((secno) * (sbi)->segs_per_sec)
> >>> +
> >>>  #define FREE_I_START_SEGNO(sbi)						\
> >>>  	GET_SEGNO_FROM_SEG0(sbi, SM_I(sbi)->main_blkaddr)
> >>>  #define GET_R2L_SEGNO(sbi, segno)	(segno + FREE_I_START_SEGNO(sbi))
> >>> diff --git a/fsck/fsck.c b/fsck/fsck.c
> >>> index 8953ca1..a0f6849 100644
> >>> --- a/fsck/fsck.c
> >>> +++ b/fsck/fsck.c
> >>> @@ -2574,6 +2574,190 @@ out:
> >>>  	return cnt;
> >>>  }
> >>>  
> >>> +/*
> >>> + * Search a free section in main area. Start search from the section specified
> >>> + * with segno argument toward main area end. Return first segment of the found
> >>> + * section in segno argument.
> >>> + */
> >>> +static int find_next_free_section(struct f2fs_sb_info *sbi,
> >>> +				  unsigned int *segno)
> >>> +{
> >>> +	unsigned int i, sec, section_valid_blocks;
> >>> +	unsigned int end_segno = GET_SEGNO(sbi, SM_I(sbi)->main_blkaddr)
> >>> +		+ SM_I(sbi)->main_segments;
> >>> +	unsigned int end_sec = GET_SEC_FROM_SEG(sbi, end_segno);
> >>> +	struct seg_entry *se;
> >>> +	struct curseg_info *cs;
> >>> +
> >>> +	for (sec = GET_SEC_FROM_SEG(sbi, *segno); sec < end_sec; sec++) {
> >>> +		/* find a section without valid blocks */
> >>> +		section_valid_blocks = 0;
> >>> +		for (i = 0; i < sbi->segs_per_sec; i++) {
> >>> +			se = get_seg_entry(sbi, GET_SEG_FROM_SEC(sbi, sec) + i);
> >>> +			section_valid_blocks += se->valid_blocks;
> >>
> >> If we want to find a 'real' free section, we'd better to use
> >> se->ckpt_valid_blocks (wrapped with get_seg_vblocks()) in where we has recorded
> >> fsynced data count.
> > 
> > Thanks. When I create next patch series, I will use get_seg_vblocks().
> > I will rebase to dev-test branch in which get_seg_vblocks() is available.
> > 
> >>
> >>> +		}
> >>> +		if (section_valid_blocks)
> >>> +			continue;
> >>> +
> >>> +		/* check the cursegs do not use the section */
> >>> +		for (i = 0; i < NO_CHECK_TYPE; i++) {
> >>> +			cs = &SM_I(sbi)->curseg_array[i];
> >>> +			if (GET_SEC_FROM_SEG(sbi, cs->segno) == sec)
> >>> +				break;
> >>> +		}
> >>> +		if (i >= NR_CURSEG_TYPE) {
> >>> +			*segno = GET_SEG_FROM_SEC(sbi, sec);
> >>> +			return 0;
> >>> +		}
> >>> +	}
> >>> +
> >>> +	return -1;
> >>> +}
> >>> +
> >>> +struct write_pointer_check_data {
> >>> +	struct f2fs_sb_info *sbi;
> >>> +	struct device_info *dev;
> >>> +};
> >>> +
> >>> +static int fsck_chk_write_pointer(int i, struct blk_zone *blkz, void *opaque)
> >>> +{
> >>> +	struct write_pointer_check_data *wpd = opaque;
> >>> +	struct f2fs_sb_info *sbi = wpd->sbi;
> >>> +	struct device_info *dev = wpd->dev;
> >>> +	struct f2fs_fsck *fsck = F2FS_FSCK(sbi);
> >>> +	block_t zone_block, wp_block, wp_blkoff, cs_block, b;
> >>> +	unsigned int zone_segno, wp_segno, new_segno;
> >>> +	struct seg_entry *se;
> >>> +	struct curseg_info *cs;
> >>> +	int cs_index, ret;
> >>> +	int log_sectors_per_block = sbi->log_blocksize - SECTOR_SHIFT;
> >>> +	unsigned int segs_per_zone = sbi->segs_per_sec * sbi->secs_per_zone;
> >>> +	void *zero_blk;
> >>> +
> >>> +	if (blk_zone_conv(blkz))
> >>> +		return 0;
> >>> +
> >>> +	zone_block = dev->start_blkaddr
> >>> +		+ (blk_zone_sector(blkz) >> log_sectors_per_block);
> >>> +	zone_segno = GET_SEGNO(sbi, zone_block);
> >>> +	wp_block = dev->start_blkaddr
> >>> +		+ (blk_zone_wp_sector(blkz) >> log_sectors_per_block);
> >>> +	wp_segno = GET_SEGNO(sbi, wp_block);
> >>> +	wp_blkoff = wp_block - START_BLOCK(sbi, wp_segno);
> >>> +
> >>> +	/* find the curseg which points to the zone */
> >>> +	for (cs_index = 0; cs_index < NO_CHECK_TYPE; cs_index++) {
> >>> +		cs = &SM_I(sbi)->curseg_array[cs_index];
> >>> +		if (zone_segno <= cs->segno &&
> >>> +		    cs->segno < zone_segno + segs_per_zone)
> >>> +			break;
> >>> +	}
> >>> +
> >>> +	if (cs_index >= NR_CURSEG_TYPE)
> >>> +		return 0;
> >>> +
> >>> +	/* check write pointer consistency with the curseg in the zone */
> >>> +	cs_block = START_BLOCK(sbi, cs->segno) + cs->next_blkoff;
> >>> +	if (wp_block == cs_block)
> >>> +		return 0;
> >>> +
> >>> +	if (!c.fix_on) {
> >>> +		MSG(0, "Inconsistent write pointer: "
> >>> +		    "curseg %d[0x%x,0x%x] wp[0x%x,0x%x]\n",
> >>> +		    cs_index, cs->segno, cs->next_blkoff, wp_segno, wp_blkoff);
> >>> +		fsck->chk.wp_inconsistent_zones++;
> >>> +		return 0;
> >>> +	}
> >>> +
> >>> +	/*
> >>> +	 * If the curseg is in advance from the write pointer, write zero to
> >>> +	 * move the write pointer forward to the same position as the curseg.
> >>> +	 */
> >>> +	if (wp_block < cs_block) {
> >>> +		ret = 0;
> >>> +		zero_blk = calloc(BLOCK_SZ, 1);
> >>> +		if (!zero_blk)
> >>> +			return -EINVAL;
> >>> +
> >>> +		FIX_MSG("Advance write pointer to match with curseg %d: "
> >>> +			"[0x%x,0x%x]->[0x%x,0x%x]",
> >>> +			cs_index, wp_segno, wp_blkoff,
> >>> +			cs->segno, cs->next_blkoff);
> >>> +		for (b = wp_block; b < cs_block && !ret; b++)
> >>> +			ret = dev_write_block(zero_blk, b);
> >>> +
> >>> +		fsck->chk.wp_fixed_zones++;
> >>> +		free(zero_blk);
> >>> +		return ret;
> >>> +	}
> >>> +
> >>> +	if (wp_segno == zone_segno + segs_per_zone) {
> >>> +		/*
> >>> +		 * If the write pointer is in advance from the curseg and at
> >>> +		 * the zone end (section end), search a new free zone (section)
> >>> +		 * between the curseg and main area end.
> >>> +		 */
> >>> +		new_segno = wp_segno;
> >>> +		ret = find_next_free_section(sbi, &new_segno);
> >>> +		if (ret) {
> >>> +			/* search again from main area start */
> >>> +			new_segno = GET_SEGNO(sbi, SM_I(sbi)->main_blkaddr);
> >>> +			ret = find_next_free_section(sbi, &new_segno);
> >>> +		}
> >>
> >> If curseg's type is warm node, relocating curseg would ruin warm node chain,
> >> result in losing fsynced data for ever as well.
> >>
> >> So I guess we should migrate all dnode blocks chained with cs->next_blkoff in
> >> current section.
> >>
> >>> +		if (ret) {
> >>> +			MSG(0, "Free section not found\n");
> >>> +			return ret;
> >>> +		}
> >>> +		FIX_MSG("New section for curseg %d: [0x%x,0x%x]->[0x%x,0x%x]",
> >>> +			cs_index, cs->segno, cs->next_blkoff, new_segno, 0);
> >>> +		cs->segno = new_segno;
> >>> +		cs->next_blkoff = 0;
> >>> +	} else {
> >>> +		/*
> >>> +		 * If the write pointer is in advance from the curseg within
> >>> +		 * the zone, modify the curseg position to be same as the
> >>> +		 * write pointer.
> >>> +		 */
> >>> +		ASSERT(wp_segno < zone_segno + segs_per_zone);
> >>> +		FIX_MSG("Advance curseg %d: [0x%x,0x%x]->[0x%x,0x%x]",
> >>> +			cs_index, cs->segno, cs->next_blkoff,
> >>> +			wp_segno, wp_blkoff);
> >>> +		cs->segno = wp_segno;
> >>> +		cs->next_blkoff = wp_blkoff;
> >>
> >> The same data lose problem here, I guess we'd better handle it with the same way
> >> as above.
> >>
> >> Thoughts?
> > 
> > I created f2fs status with fsync data and warm node chain, and confirmed the v4
> > implementation ruins the dnode blocks chain. Hmm.
> > 
> > My understanding is that when f2fs kernel module recovers the fsync data, it
> > sets the warm node curseg position at the start of the fsync data, and the fsync
> > data will be overwritten as new nodes created. Is this understanding correct?
> 
> Sorry, I'm not sure I've understood you correctly.

Apology is mine, my question was not clear enough.
And thanks for the explanation below. It helps me to understand better.

> Current recovery flow is:
> 1)   find all valid fsynced dnode in warm node chain
> 2.a) recover fsynced dnode in memory, and set it dirty
> 2.b) recover directory entry in memory, and set it dirty
> 2.c) during regular's dnode recovery, physical blocks indexed by recovered dnode
> will be revalidated
> Note: we didn't move any cursegs before 3)
> 3)   relocate all cursegs to new segments
> 4)   trigger checkpoint to persist all recovered data(fs' meta, file's meta/data)

Question, does the step 3) correspond to f2fs_allocate_new_segments(sbi) call
at the end of recover_data()? The f2fs_allocate_new_segments() function
relocates new segments only for DATA cursegs, and it keeps NODE cursegs same as
before the fsync data recovery. Then I thought WARM NODE curseg would not change
by recovery (and still keeps pointing to the fsync recovery data).

> > 
> > If this is the case, I think write pointer inconsistency will happen even if
> > fsck does "migrate all dnode blocks" (I assume that the warm node curseg's next
> 
> Actually, I notice that scheme's problem is: we've changed zone's write pointer
> during dnode blocks migration, however if kernel drops recovery, we will still
> face inconsistent status in between .next_blkaddr and .write_pointer, once we
> start to write data from .next_blkaddr. So in fsck, after migration, we need to
> reset .write_pointer to .next_blkaddr.
> 
> I guess we need to consider four cases:
> 
> o: support .write_pointer recovery
> x: not support .write_pointer recovery
> 
> 1) kernel: o, fsck: x, trigger recovery in kernel
> 2) kernel: o, fsck: x, not trigger recovery in kernel
> 3) kernel: x, fsck: o, trigger recovery in kernel
> 4) kernel: x, fsck: o, not trigger recovery in kernel
> 
> For 1) and 2), we can simply adjust to reset all invalid zone and allocate new
> zone for curseg before data/meta writes.

Thanks for the clarification. This approach for case 1) and 2) is simple. Let me
try to create a patch for it.

> 
> For 3) and 4), I haven't figured out a way handling all cases perfectly.

For 3), I suppose fsck cannot fix write pointer inconsistency without fsync data
loss, since recovery is judged and done by kernel. The write pointer consistency
fix after recovery can be done only by kernel.

It is not a good one but one idea is to confirm fsck users to enforce fsync data
loss or not. This could be better than nothing.

For 4), my understanding is that fsync data is not left in the file system. I
think fsck can check fsync data existence and fix write pointer consistency, as
my patch series shows.

> 
> > blkoff points to the migrated dnode block chain start). After the fsync data fix
> > by kernel, warm node curseg will not point to the write pointer position.
> > Anyway, kernel code change will be required to fix the inconsistency after fsync
> > data fix.
> > 
> > How about to have fsck leave warm node curseg position untouched if the fsync
> > data exists? The kernel side change for write pointer inconsistency will be able
> > to cover this case.
> 
> If kernel can handle such case, fsck doesn't need to do any thing, IIUC.

You are right. I mixed cases 1)2) and cases 3)4).

Thanks again for your comments.

--
Best Regards,
Shin'ichiro Kawasaki

_______________________________________________
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [f2fs-dev] [PATCH v4 2/2] fsck.f2fs: Check write pointer consistency with current segments
  2019-09-06  8:31         ` Shinichiro Kawasaki
@ 2019-09-09  7:14           ` Chao Yu
  2019-09-10  8:10             ` Shinichiro Kawasaki
  0 siblings, 1 reply; 18+ messages in thread
From: Chao Yu @ 2019-09-09  7:14 UTC (permalink / raw)
  To: Shinichiro Kawasaki; +Cc: Jaegeuk Kim, Damien Le Moal, linux-f2fs-devel

On 2019/9/6 16:31, Shinichiro Kawasaki wrote:
> On Sep 05, 2019 / 17:58, Chao Yu wrote:
>> Hi Shinichiro,
>>
>> Sorry for the delay.
>>
>> On 2019/9/3 16:37, Shinichiro Kawasaki wrote:
>>> On Sep 02, 2019 / 15:02, Chao Yu wrote:
>>>> On 2019/8/30 18:19, Shin'ichiro Kawasaki wrote:
>>>>> On sudden f2fs shutdown, zoned block device status and f2fs current
>>>>> segment positions in meta data can be inconsistent. When f2fs shutdown
>>>>> happens before write operations completes, write pointers of zoned block
>>>>> devices can go further but f2fs meta data keeps current segments at
>>>>> positions before the write operations. After remounting the f2fs, the
>>>>> inconsistency causes write operations not at write pointers and
>>>>> "Unaligned write command" error is reported. This error was observed when
>>>>> xfstests test case generic/388 was run with f2fs on a zoned block device.
>>>>>
>>>>> To avoid the error, have f2fs.fsck check consistency between each current
>>>>> segment's position and the write pointer of the zone the current segment
>>>>> points to. If the write pointer goes advance from the current segment,
>>>>> fix the current segment position setting at same as the write pointer
>>>>> position. If the write pointer goes to the zone end, find a new zone and
>>>>> set the current segment position at the new zone start. In case the write
>>>>> pointer is behind the current segment, write zero data at the write
>>>>> pointer position to make write pointer position at same as the current
>>>>> segment.
>>>>>
>>>>> When inconsistencies are found, turn on c.bug_on flag in fsck_verify() to
>>>>> ask users to fix them or not. When inconsistencies get fixed, turn on
>>>>> 'force' flag in fsck_verify() to enforce fixes in following checks. This
>>>>> position fix is done at the beginning of do_fsck() function so that other
>>>>> checks reflect the current segment modification.
>>>>>
>>>>> Also add GET_SEC_FROM_SEG and GET_SEG_FROM_SEC macros in fsck/fsck.h to
>>>>> simplify the code.
>>>>>
>>>>> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
>>>>> ---
>>>>>  fsck/f2fs.h |   5 ++
>>>>>  fsck/fsck.c | 198 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>  fsck/fsck.h |   3 +
>>>>>  fsck/main.c |   2 +
>>>>>  4 files changed, 208 insertions(+)
>>>>>
>>>>> diff --git a/fsck/f2fs.h b/fsck/f2fs.h
>>>>> index 4dc6698..2c1c2b3 100644
>>>>> --- a/fsck/f2fs.h
>>>>> +++ b/fsck/f2fs.h
>>>>> @@ -337,6 +337,11 @@ static inline block_t __end_block_addr(struct f2fs_sb_info *sbi)
>>>>>  #define GET_BLKOFF_FROM_SEG0(sbi, blk_addr)				\
>>>>>  	(GET_SEGOFF_FROM_SEG0(sbi, blk_addr) & (sbi->blocks_per_seg - 1))
>>>>>  
>>>>> +#define GET_SEC_FROM_SEG(sbi, segno)					\
>>>>> +	((segno) / (sbi)->segs_per_sec)
>>>>> +#define GET_SEG_FROM_SEC(sbi, secno)					\
>>>>> +	((secno) * (sbi)->segs_per_sec)
>>>>> +
>>>>>  #define FREE_I_START_SEGNO(sbi)						\
>>>>>  	GET_SEGNO_FROM_SEG0(sbi, SM_I(sbi)->main_blkaddr)
>>>>>  #define GET_R2L_SEGNO(sbi, segno)	(segno + FREE_I_START_SEGNO(sbi))
>>>>> diff --git a/fsck/fsck.c b/fsck/fsck.c
>>>>> index 8953ca1..a0f6849 100644
>>>>> --- a/fsck/fsck.c
>>>>> +++ b/fsck/fsck.c
>>>>> @@ -2574,6 +2574,190 @@ out:
>>>>>  	return cnt;
>>>>>  }
>>>>>  
>>>>> +/*
>>>>> + * Search a free section in main area. Start search from the section specified
>>>>> + * with segno argument toward main area end. Return first segment of the found
>>>>> + * section in segno argument.
>>>>> + */
>>>>> +static int find_next_free_section(struct f2fs_sb_info *sbi,
>>>>> +				  unsigned int *segno)
>>>>> +{
>>>>> +	unsigned int i, sec, section_valid_blocks;
>>>>> +	unsigned int end_segno = GET_SEGNO(sbi, SM_I(sbi)->main_blkaddr)
>>>>> +		+ SM_I(sbi)->main_segments;
>>>>> +	unsigned int end_sec = GET_SEC_FROM_SEG(sbi, end_segno);
>>>>> +	struct seg_entry *se;
>>>>> +	struct curseg_info *cs;
>>>>> +
>>>>> +	for (sec = GET_SEC_FROM_SEG(sbi, *segno); sec < end_sec; sec++) {
>>>>> +		/* find a section without valid blocks */
>>>>> +		section_valid_blocks = 0;
>>>>> +		for (i = 0; i < sbi->segs_per_sec; i++) {
>>>>> +			se = get_seg_entry(sbi, GET_SEG_FROM_SEC(sbi, sec) + i);
>>>>> +			section_valid_blocks += se->valid_blocks;
>>>>
>>>> If we want to find a 'real' free section, we'd better to use
>>>> se->ckpt_valid_blocks (wrapped with get_seg_vblocks()) in where we has recorded
>>>> fsynced data count.
>>>
>>> Thanks. When I create next patch series, I will use get_seg_vblocks().
>>> I will rebase to dev-test branch in which get_seg_vblocks() is available.
>>>
>>>>
>>>>> +		}
>>>>> +		if (section_valid_blocks)
>>>>> +			continue;
>>>>> +
>>>>> +		/* check the cursegs do not use the section */
>>>>> +		for (i = 0; i < NO_CHECK_TYPE; i++) {
>>>>> +			cs = &SM_I(sbi)->curseg_array[i];
>>>>> +			if (GET_SEC_FROM_SEG(sbi, cs->segno) == sec)
>>>>> +				break;
>>>>> +		}
>>>>> +		if (i >= NR_CURSEG_TYPE) {
>>>>> +			*segno = GET_SEG_FROM_SEC(sbi, sec);
>>>>> +			return 0;
>>>>> +		}
>>>>> +	}
>>>>> +
>>>>> +	return -1;
>>>>> +}
>>>>> +
>>>>> +struct write_pointer_check_data {
>>>>> +	struct f2fs_sb_info *sbi;
>>>>> +	struct device_info *dev;
>>>>> +};
>>>>> +
>>>>> +static int fsck_chk_write_pointer(int i, struct blk_zone *blkz, void *opaque)
>>>>> +{
>>>>> +	struct write_pointer_check_data *wpd = opaque;
>>>>> +	struct f2fs_sb_info *sbi = wpd->sbi;
>>>>> +	struct device_info *dev = wpd->dev;
>>>>> +	struct f2fs_fsck *fsck = F2FS_FSCK(sbi);
>>>>> +	block_t zone_block, wp_block, wp_blkoff, cs_block, b;
>>>>> +	unsigned int zone_segno, wp_segno, new_segno;
>>>>> +	struct seg_entry *se;
>>>>> +	struct curseg_info *cs;
>>>>> +	int cs_index, ret;
>>>>> +	int log_sectors_per_block = sbi->log_blocksize - SECTOR_SHIFT;
>>>>> +	unsigned int segs_per_zone = sbi->segs_per_sec * sbi->secs_per_zone;
>>>>> +	void *zero_blk;
>>>>> +
>>>>> +	if (blk_zone_conv(blkz))
>>>>> +		return 0;
>>>>> +
>>>>> +	zone_block = dev->start_blkaddr
>>>>> +		+ (blk_zone_sector(blkz) >> log_sectors_per_block);
>>>>> +	zone_segno = GET_SEGNO(sbi, zone_block);
>>>>> +	wp_block = dev->start_blkaddr
>>>>> +		+ (blk_zone_wp_sector(blkz) >> log_sectors_per_block);
>>>>> +	wp_segno = GET_SEGNO(sbi, wp_block);
>>>>> +	wp_blkoff = wp_block - START_BLOCK(sbi, wp_segno);
>>>>> +
>>>>> +	/* find the curseg which points to the zone */
>>>>> +	for (cs_index = 0; cs_index < NO_CHECK_TYPE; cs_index++) {
>>>>> +		cs = &SM_I(sbi)->curseg_array[cs_index];
>>>>> +		if (zone_segno <= cs->segno &&
>>>>> +		    cs->segno < zone_segno + segs_per_zone)
>>>>> +			break;
>>>>> +	}
>>>>> +
>>>>> +	if (cs_index >= NR_CURSEG_TYPE)
>>>>> +		return 0;
>>>>> +
>>>>> +	/* check write pointer consistency with the curseg in the zone */
>>>>> +	cs_block = START_BLOCK(sbi, cs->segno) + cs->next_blkoff;
>>>>> +	if (wp_block == cs_block)
>>>>> +		return 0;
>>>>> +
>>>>> +	if (!c.fix_on) {
>>>>> +		MSG(0, "Inconsistent write pointer: "
>>>>> +		    "curseg %d[0x%x,0x%x] wp[0x%x,0x%x]\n",
>>>>> +		    cs_index, cs->segno, cs->next_blkoff, wp_segno, wp_blkoff);
>>>>> +		fsck->chk.wp_inconsistent_zones++;
>>>>> +		return 0;
>>>>> +	}
>>>>> +
>>>>> +	/*
>>>>> +	 * If the curseg is in advance from the write pointer, write zero to
>>>>> +	 * move the write pointer forward to the same position as the curseg.
>>>>> +	 */
>>>>> +	if (wp_block < cs_block) {
>>>>> +		ret = 0;
>>>>> +		zero_blk = calloc(BLOCK_SZ, 1);
>>>>> +		if (!zero_blk)
>>>>> +			return -EINVAL;
>>>>> +
>>>>> +		FIX_MSG("Advance write pointer to match with curseg %d: "
>>>>> +			"[0x%x,0x%x]->[0x%x,0x%x]",
>>>>> +			cs_index, wp_segno, wp_blkoff,
>>>>> +			cs->segno, cs->next_blkoff);
>>>>> +		for (b = wp_block; b < cs_block && !ret; b++)
>>>>> +			ret = dev_write_block(zero_blk, b);
>>>>> +
>>>>> +		fsck->chk.wp_fixed_zones++;
>>>>> +		free(zero_blk);
>>>>> +		return ret;
>>>>> +	}
>>>>> +
>>>>> +	if (wp_segno == zone_segno + segs_per_zone) {
>>>>> +		/*
>>>>> +		 * If the write pointer is in advance from the curseg and at
>>>>> +		 * the zone end (section end), search a new free zone (section)
>>>>> +		 * between the curseg and main area end.
>>>>> +		 */
>>>>> +		new_segno = wp_segno;
>>>>> +		ret = find_next_free_section(sbi, &new_segno);
>>>>> +		if (ret) {
>>>>> +			/* search again from main area start */
>>>>> +			new_segno = GET_SEGNO(sbi, SM_I(sbi)->main_blkaddr);
>>>>> +			ret = find_next_free_section(sbi, &new_segno);
>>>>> +		}
>>>>
>>>> If curseg's type is warm node, relocating curseg would ruin warm node chain,
>>>> result in losing fsynced data for ever as well.
>>>>
>>>> So I guess we should migrate all dnode blocks chained with cs->next_blkoff in
>>>> current section.
>>>>
>>>>> +		if (ret) {
>>>>> +			MSG(0, "Free section not found\n");
>>>>> +			return ret;
>>>>> +		}
>>>>> +		FIX_MSG("New section for curseg %d: [0x%x,0x%x]->[0x%x,0x%x]",
>>>>> +			cs_index, cs->segno, cs->next_blkoff, new_segno, 0);
>>>>> +		cs->segno = new_segno;
>>>>> +		cs->next_blkoff = 0;
>>>>> +	} else {
>>>>> +		/*
>>>>> +		 * If the write pointer is in advance from the curseg within
>>>>> +		 * the zone, modify the curseg position to be same as the
>>>>> +		 * write pointer.
>>>>> +		 */
>>>>> +		ASSERT(wp_segno < zone_segno + segs_per_zone);
>>>>> +		FIX_MSG("Advance curseg %d: [0x%x,0x%x]->[0x%x,0x%x]",
>>>>> +			cs_index, cs->segno, cs->next_blkoff,
>>>>> +			wp_segno, wp_blkoff);
>>>>> +		cs->segno = wp_segno;
>>>>> +		cs->next_blkoff = wp_blkoff;
>>>>
>>>> The same data lose problem here, I guess we'd better handle it with the same way
>>>> as above.
>>>>
>>>> Thoughts?
>>>
>>> I created f2fs status with fsync data and warm node chain, and confirmed the v4
>>> implementation ruins the dnode blocks chain. Hmm.
>>>
>>> My understanding is that when f2fs kernel module recovers the fsync data, it
>>> sets the warm node curseg position at the start of the fsync data, and the fsync
>>> data will be overwritten as new nodes created. Is this understanding correct?
>>
>> Sorry, I'm not sure I've understood you correctly.
> 
> Apology is mine, my question was not clear enough.
> And thanks for the explanation below. It helps me to understand better.
> 
>> Current recovery flow is:
>> 1)   find all valid fsynced dnode in warm node chain
>> 2.a) recover fsynced dnode in memory, and set it dirty
>> 2.b) recover directory entry in memory, and set it dirty
>> 2.c) during regular's dnode recovery, physical blocks indexed by recovered dnode
>> will be revalidated
>> Note: we didn't move any cursegs before 3)
>> 3)   relocate all cursegs to new segments
>> 4)   trigger checkpoint to persist all recovered data(fs' meta, file's meta/data)
> 
> Question, does the step 3) correspond to f2fs_allocate_new_segments(sbi) call
> at the end of recover_data()? The f2fs_allocate_new_segments() function

Yeah, I meant that function.

> relocates new segments only for DATA cursegs, and it keeps NODE cursegs same as
> before the fsync data recovery. Then I thought WARM NODE curseg would not change
> by recovery (and still keeps pointing to the fsync recovery data).

Yes, that's correct. WARM NODE curseg won't change until step 4).
> 
>>>
>>> If this is the case, I think write pointer inconsistency will happen even if
>>> fsck does "migrate all dnode blocks" (I assume that the warm node curseg's next
>>
>> Actually, I notice that scheme's problem is: we've changed zone's write pointer
>> during dnode blocks migration, however if kernel drops recovery, we will still
>> face inconsistent status in between .next_blkaddr and .write_pointer, once we
>> start to write data from .next_blkaddr. So in fsck, after migration, we need to
>> reset .write_pointer to .next_blkaddr.
>>
>> I guess we need to consider four cases:
>>
>> o: support .write_pointer recovery
>> x: not support .write_pointer recovery
>>
>> 1) kernel: o, fsck: x, trigger recovery in kernel
>> 2) kernel: o, fsck: x, not trigger recovery in kernel
>> 3) kernel: x, fsck: o, trigger recovery in kernel
>> 4) kernel: x, fsck: o, not trigger recovery in kernel
>>
>> For 1) and 2), we can simply adjust to reset all invalid zone and allocate new
>> zone for curseg before data/meta writes.
> 
> Thanks for the clarification. This approach for case 1) and 2) is simple. Let me
> try to create a patch for it.
> 
>>
>> For 3) and 4), I haven't figured out a way handling all cases perfectly.
> 
> For 3), I suppose fsck cannot fix write pointer inconsistency without fsync data
> loss, since recovery is judged and done by kernel. The write pointer consistency
> fix after recovery can be done only by kernel.
> 
> It is not a good one but one idea is to confirm fsck users to enforce fsync data
> loss or not. This could be better than nothing.
> 
> For 4), my understanding is that fsync data is not left in the file system. I
> think fsck can check fsync data existence and fix write pointer consistency, as
> my patch series shows.

Yeah.

Let's think about that whether there is a way to cover all cases.

1) For non-opened zones, we need to adjust all such zones' write pointer to
zero. I assume that once write pointer is zero, we still can access valid block
in zone. (recovery may need to revalidate blocks in free segments)

2) For opened zones (cursegs point to)
2.a) .write_pointer == .next_blkaddr, no need to handle
2.b) .write_pointer > .next_blkaddr, that should be the common state in sudden
power-off case. It needs to reset .write_pointer to .next_blkaddr.
2.c) .write_pointer < .next_blkaddr, should be a bug of f2fs, we should have
lost datas in the range of [.write_pointer, .next_blkaddr] at least, if we're
sure about that there is no valid data existed out side of .write_pointer? can we?

3) f2fs_allocate_new_segments() may reset data type cursegs' position, I'd like
to know what's the value of -z option we configured, if we configured as one
zone contains one section, no matter kernel triggers recovery or not (via set
disable_roll_forward option), the write pointer consistency should has been kept
in all cases. Otherwise (one zone contains multiple sections), if kernel doesn't
support .write_pointer recovery, f2fs_allocate_new_segments() can make
.next_blkaddr larger than .write_pointer in the same zone.

Let me know, if I'm missing sth.

Thanks,

> 
>>
>>> blkoff points to the migrated dnode block chain start). After the fsync data fix
>>> by kernel, warm node curseg will not point to the write pointer position.
>>> Anyway, kernel code change will be required to fix the inconsistency after fsync
>>> data fix.
>>>
>>> How about to have fsck leave warm node curseg position untouched if the fsync
>>> data exists? The kernel side change for write pointer inconsistency will be able
>>> to cover this case.
>>
>> If kernel can handle such case, fsck doesn't need to do any thing, IIUC.
> 
> You are right. I mixed cases 1)2) and cases 3)4).
> 
> Thanks again for your comments.
> 
> --
> Best Regards,
> Shin'ichiro Kawasaki.
> 


_______________________________________________
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [f2fs-dev] [PATCH v4 2/2] fsck.f2fs: Check write pointer consistency with current segments
  2019-09-09  7:14           ` Chao Yu
@ 2019-09-10  8:10             ` Shinichiro Kawasaki
  2019-09-10  9:12               ` Chao Yu
  0 siblings, 1 reply; 18+ messages in thread
From: Shinichiro Kawasaki @ 2019-09-10  8:10 UTC (permalink / raw)
  To: Chao Yu; +Cc: Jaegeuk Kim, Damien Le Moal, linux-f2fs-devel

On Sep 09, 2019 / 15:14, Chao Yu wrote:
> On 2019/9/6 16:31, Shinichiro Kawasaki wrote:
> > On Sep 05, 2019 / 17:58, Chao Yu wrote:
> >> Hi Shinichiro,
> >>
> >> Sorry for the delay.
> >>
> >> On 2019/9/3 16:37, Shinichiro Kawasaki wrote:
> >>> On Sep 02, 2019 / 15:02, Chao Yu wrote:
> >>>> On 2019/8/30 18:19, Shin'ichiro Kawasaki wrote:
> >>>>> On sudden f2fs shutdown, zoned block device status and f2fs current
> >>>>> segment positions in meta data can be inconsistent. When f2fs shutdown
> >>>>> happens before write operations completes, write pointers of zoned block
> >>>>> devices can go further but f2fs meta data keeps current segments at
> >>>>> positions before the write operations. After remounting the f2fs, the
> >>>>> inconsistency causes write operations not at write pointers and
> >>>>> "Unaligned write command" error is reported. This error was observed when
> >>>>> xfstests test case generic/388 was run with f2fs on a zoned block device.
> >>>>>
> >>>>> To avoid the error, have f2fs.fsck check consistency between each current
> >>>>> segment's position and the write pointer of the zone the current segment
> >>>>> points to. If the write pointer goes advance from the current segment,
> >>>>> fix the current segment position setting at same as the write pointer
> >>>>> position. If the write pointer goes to the zone end, find a new zone and
> >>>>> set the current segment position at the new zone start. In case the write
> >>>>> pointer is behind the current segment, write zero data at the write
> >>>>> pointer position to make write pointer position at same as the current
> >>>>> segment.
> >>>>>
> >>>>> When inconsistencies are found, turn on c.bug_on flag in fsck_verify() to
> >>>>> ask users to fix them or not. When inconsistencies get fixed, turn on
> >>>>> 'force' flag in fsck_verify() to enforce fixes in following checks. This
> >>>>> position fix is done at the beginning of do_fsck() function so that other
> >>>>> checks reflect the current segment modification.
> >>>>>
> >>>>> Also add GET_SEC_FROM_SEG and GET_SEG_FROM_SEC macros in fsck/fsck.h to
> >>>>> simplify the code.
> >>>>>
> >>>>> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> >>>>> ---
> >>>>>  fsck/f2fs.h |   5 ++
> >>>>>  fsck/fsck.c | 198 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>>>  fsck/fsck.h |   3 +
> >>>>>  fsck/main.c |   2 +
> >>>>>  4 files changed, 208 insertions(+)
> >>>>>
> >>>>> diff --git a/fsck/f2fs.h b/fsck/f2fs.h
> >>>>> index 4dc6698..2c1c2b3 100644
> >>>>> --- a/fsck/f2fs.h
> >>>>> +++ b/fsck/f2fs.h
> >>>>> @@ -337,6 +337,11 @@ static inline block_t __end_block_addr(struct f2fs_sb_info *sbi)
> >>>>>  #define GET_BLKOFF_FROM_SEG0(sbi, blk_addr)				\
> >>>>>  	(GET_SEGOFF_FROM_SEG0(sbi, blk_addr) & (sbi->blocks_per_seg - 1))
> >>>>>  
> >>>>> +#define GET_SEC_FROM_SEG(sbi, segno)					\
> >>>>> +	((segno) / (sbi)->segs_per_sec)
> >>>>> +#define GET_SEG_FROM_SEC(sbi, secno)					\
> >>>>> +	((secno) * (sbi)->segs_per_sec)
> >>>>> +
> >>>>>  #define FREE_I_START_SEGNO(sbi)						\
> >>>>>  	GET_SEGNO_FROM_SEG0(sbi, SM_I(sbi)->main_blkaddr)
> >>>>>  #define GET_R2L_SEGNO(sbi, segno)	(segno + FREE_I_START_SEGNO(sbi))
> >>>>> diff --git a/fsck/fsck.c b/fsck/fsck.c
> >>>>> index 8953ca1..a0f6849 100644
> >>>>> --- a/fsck/fsck.c
> >>>>> +++ b/fsck/fsck.c
> >>>>> @@ -2574,6 +2574,190 @@ out:
> >>>>>  	return cnt;
> >>>>>  }
> >>>>>  
> >>>>> +/*
> >>>>> + * Search a free section in main area. Start search from the section specified
> >>>>> + * with segno argument toward main area end. Return first segment of the found
> >>>>> + * section in segno argument.
> >>>>> + */
> >>>>> +static int find_next_free_section(struct f2fs_sb_info *sbi,
> >>>>> +				  unsigned int *segno)
> >>>>> +{
> >>>>> +	unsigned int i, sec, section_valid_blocks;
> >>>>> +	unsigned int end_segno = GET_SEGNO(sbi, SM_I(sbi)->main_blkaddr)
> >>>>> +		+ SM_I(sbi)->main_segments;
> >>>>> +	unsigned int end_sec = GET_SEC_FROM_SEG(sbi, end_segno);
> >>>>> +	struct seg_entry *se;
> >>>>> +	struct curseg_info *cs;
> >>>>> +
> >>>>> +	for (sec = GET_SEC_FROM_SEG(sbi, *segno); sec < end_sec; sec++) {
> >>>>> +		/* find a section without valid blocks */
> >>>>> +		section_valid_blocks = 0;
> >>>>> +		for (i = 0; i < sbi->segs_per_sec; i++) {
> >>>>> +			se = get_seg_entry(sbi, GET_SEG_FROM_SEC(sbi, sec) + i);
> >>>>> +			section_valid_blocks += se->valid_blocks;
> >>>>
> >>>> If we want to find a 'real' free section, we'd better to use
> >>>> se->ckpt_valid_blocks (wrapped with get_seg_vblocks()) in where we has recorded
> >>>> fsynced data count.
> >>>
> >>> Thanks. When I create next patch series, I will use get_seg_vblocks().
> >>> I will rebase to dev-test branch in which get_seg_vblocks() is available.
> >>>
> >>>>
> >>>>> +		}
> >>>>> +		if (section_valid_blocks)
> >>>>> +			continue;
> >>>>> +
> >>>>> +		/* check the cursegs do not use the section */
> >>>>> +		for (i = 0; i < NO_CHECK_TYPE; i++) {
> >>>>> +			cs = &SM_I(sbi)->curseg_array[i];
> >>>>> +			if (GET_SEC_FROM_SEG(sbi, cs->segno) == sec)
> >>>>> +				break;
> >>>>> +		}
> >>>>> +		if (i >= NR_CURSEG_TYPE) {
> >>>>> +			*segno = GET_SEG_FROM_SEC(sbi, sec);
> >>>>> +			return 0;
> >>>>> +		}
> >>>>> +	}
> >>>>> +
> >>>>> +	return -1;
> >>>>> +}
> >>>>> +
> >>>>> +struct write_pointer_check_data {
> >>>>> +	struct f2fs_sb_info *sbi;
> >>>>> +	struct device_info *dev;
> >>>>> +};
> >>>>> +
> >>>>> +static int fsck_chk_write_pointer(int i, struct blk_zone *blkz, void *opaque)
> >>>>> +{
> >>>>> +	struct write_pointer_check_data *wpd = opaque;
> >>>>> +	struct f2fs_sb_info *sbi = wpd->sbi;
> >>>>> +	struct device_info *dev = wpd->dev;
> >>>>> +	struct f2fs_fsck *fsck = F2FS_FSCK(sbi);
> >>>>> +	block_t zone_block, wp_block, wp_blkoff, cs_block, b;
> >>>>> +	unsigned int zone_segno, wp_segno, new_segno;
> >>>>> +	struct seg_entry *se;
> >>>>> +	struct curseg_info *cs;
> >>>>> +	int cs_index, ret;
> >>>>> +	int log_sectors_per_block = sbi->log_blocksize - SECTOR_SHIFT;
> >>>>> +	unsigned int segs_per_zone = sbi->segs_per_sec * sbi->secs_per_zone;
> >>>>> +	void *zero_blk;
> >>>>> +
> >>>>> +	if (blk_zone_conv(blkz))
> >>>>> +		return 0;
> >>>>> +
> >>>>> +	zone_block = dev->start_blkaddr
> >>>>> +		+ (blk_zone_sector(blkz) >> log_sectors_per_block);
> >>>>> +	zone_segno = GET_SEGNO(sbi, zone_block);
> >>>>> +	wp_block = dev->start_blkaddr
> >>>>> +		+ (blk_zone_wp_sector(blkz) >> log_sectors_per_block);
> >>>>> +	wp_segno = GET_SEGNO(sbi, wp_block);
> >>>>> +	wp_blkoff = wp_block - START_BLOCK(sbi, wp_segno);
> >>>>> +
> >>>>> +	/* find the curseg which points to the zone */
> >>>>> +	for (cs_index = 0; cs_index < NO_CHECK_TYPE; cs_index++) {
> >>>>> +		cs = &SM_I(sbi)->curseg_array[cs_index];
> >>>>> +		if (zone_segno <= cs->segno &&
> >>>>> +		    cs->segno < zone_segno + segs_per_zone)
> >>>>> +			break;
> >>>>> +	}
> >>>>> +
> >>>>> +	if (cs_index >= NR_CURSEG_TYPE)
> >>>>> +		return 0;
> >>>>> +
> >>>>> +	/* check write pointer consistency with the curseg in the zone */
> >>>>> +	cs_block = START_BLOCK(sbi, cs->segno) + cs->next_blkoff;
> >>>>> +	if (wp_block == cs_block)
> >>>>> +		return 0;
> >>>>> +
> >>>>> +	if (!c.fix_on) {
> >>>>> +		MSG(0, "Inconsistent write pointer: "
> >>>>> +		    "curseg %d[0x%x,0x%x] wp[0x%x,0x%x]\n",
> >>>>> +		    cs_index, cs->segno, cs->next_blkoff, wp_segno, wp_blkoff);
> >>>>> +		fsck->chk.wp_inconsistent_zones++;
> >>>>> +		return 0;
> >>>>> +	}
> >>>>> +
> >>>>> +	/*
> >>>>> +	 * If the curseg is in advance from the write pointer, write zero to
> >>>>> +	 * move the write pointer forward to the same position as the curseg.
> >>>>> +	 */
> >>>>> +	if (wp_block < cs_block) {
> >>>>> +		ret = 0;
> >>>>> +		zero_blk = calloc(BLOCK_SZ, 1);
> >>>>> +		if (!zero_blk)
> >>>>> +			return -EINVAL;
> >>>>> +
> >>>>> +		FIX_MSG("Advance write pointer to match with curseg %d: "
> >>>>> +			"[0x%x,0x%x]->[0x%x,0x%x]",
> >>>>> +			cs_index, wp_segno, wp_blkoff,
> >>>>> +			cs->segno, cs->next_blkoff);
> >>>>> +		for (b = wp_block; b < cs_block && !ret; b++)
> >>>>> +			ret = dev_write_block(zero_blk, b);
> >>>>> +
> >>>>> +		fsck->chk.wp_fixed_zones++;
> >>>>> +		free(zero_blk);
> >>>>> +		return ret;
> >>>>> +	}
> >>>>> +
> >>>>> +	if (wp_segno == zone_segno + segs_per_zone) {
> >>>>> +		/*
> >>>>> +		 * If the write pointer is in advance from the curseg and at
> >>>>> +		 * the zone end (section end), search a new free zone (section)
> >>>>> +		 * between the curseg and main area end.
> >>>>> +		 */
> >>>>> +		new_segno = wp_segno;
> >>>>> +		ret = find_next_free_section(sbi, &new_segno);
> >>>>> +		if (ret) {
> >>>>> +			/* search again from main area start */
> >>>>> +			new_segno = GET_SEGNO(sbi, SM_I(sbi)->main_blkaddr);
> >>>>> +			ret = find_next_free_section(sbi, &new_segno);
> >>>>> +		}
> >>>>
> >>>> If curseg's type is warm node, relocating curseg would ruin warm node chain,
> >>>> result in losing fsynced data for ever as well.
> >>>>
> >>>> So I guess we should migrate all dnode blocks chained with cs->next_blkoff in
> >>>> current section.
> >>>>
> >>>>> +		if (ret) {
> >>>>> +			MSG(0, "Free section not found\n");
> >>>>> +			return ret;
> >>>>> +		}
> >>>>> +		FIX_MSG("New section for curseg %d: [0x%x,0x%x]->[0x%x,0x%x]",
> >>>>> +			cs_index, cs->segno, cs->next_blkoff, new_segno, 0);
> >>>>> +		cs->segno = new_segno;
> >>>>> +		cs->next_blkoff = 0;
> >>>>> +	} else {
> >>>>> +		/*
> >>>>> +		 * If the write pointer is in advance from the curseg within
> >>>>> +		 * the zone, modify the curseg position to be same as the
> >>>>> +		 * write pointer.
> >>>>> +		 */
> >>>>> +		ASSERT(wp_segno < zone_segno + segs_per_zone);
> >>>>> +		FIX_MSG("Advance curseg %d: [0x%x,0x%x]->[0x%x,0x%x]",
> >>>>> +			cs_index, cs->segno, cs->next_blkoff,
> >>>>> +			wp_segno, wp_blkoff);
> >>>>> +		cs->segno = wp_segno;
> >>>>> +		cs->next_blkoff = wp_blkoff;
> >>>>
> >>>> The same data lose problem here, I guess we'd better handle it with the same way
> >>>> as above.
> >>>>
> >>>> Thoughts?
> >>>
> >>> I created f2fs status with fsync data and warm node chain, and confirmed the v4
> >>> implementation ruins the dnode blocks chain. Hmm.
> >>>
> >>> My understanding is that when f2fs kernel module recovers the fsync data, it
> >>> sets the warm node curseg position at the start of the fsync data, and the fsync
> >>> data will be overwritten as new nodes created. Is this understanding correct?
> >>
> >> Sorry, I'm not sure I've understood you correctly.
> > 
> > Apology is mine, my question was not clear enough.
> > And thanks for the explanation below. It helps me to understand better.
> > 
> >> Current recovery flow is:
> >> 1)   find all valid fsynced dnode in warm node chain
> >> 2.a) recover fsynced dnode in memory, and set it dirty
> >> 2.b) recover directory entry in memory, and set it dirty
> >> 2.c) during regular's dnode recovery, physical blocks indexed by recovered dnode
> >> will be revalidated
> >> Note: we didn't move any cursegs before 3)
> >> 3)   relocate all cursegs to new segments
> >> 4)   trigger checkpoint to persist all recovered data(fs' meta, file's meta/data)
> > 
> > Question, does the step 3) correspond to f2fs_allocate_new_segments(sbi) call
> > at the end of recover_data()? The f2fs_allocate_new_segments() function
> 
> Yeah, I meant that function.
> 
> > relocates new segments only for DATA cursegs, and it keeps NODE cursegs same as
> > before the fsync data recovery. Then I thought WARM NODE curseg would not change
> > by recovery (and still keeps pointing to the fsync recovery data).
> 
> Yes, that's correct. WARM NODE curseg won't change until step 4).

Thanks. Following your idea "we can simply adjust to reset all invalid zone and
allocate new zone for curseg before data/meta writes" for fix by kernel, I think
kernel code change is required to allocate new zones for NODE cursegs also at
step 3). WARM NODE curseg should be kept untouched by step 2 completion to refer
fsynced dnodes at WARM NODE curseg's next_blkaddr. And at step 4, the fsyced
dnodes recovered and set dirty will be written back with one of NODE cursegs
(HOT NODE curseg?). At that time, we need to make sure the NODE curseg points to
the position consistent with its zone's write pointer.

> > 
> >>>
> >>> If this is the case, I think write pointer inconsistency will happen even if
> >>> fsck does "migrate all dnode blocks" (I assume that the warm node curseg's next
> >>
> >> Actually, I notice that scheme's problem is: we've changed zone's write pointer
> >> during dnode blocks migration, however if kernel drops recovery, we will still
> >> face inconsistent status in between .next_blkaddr and .write_pointer, once we
> >> start to write data from .next_blkaddr. So in fsck, after migration, we need to
> >> reset .write_pointer to .next_blkaddr.
> >>
> >> I guess we need to consider four cases:
> >>
> >> o: support .write_pointer recovery
> >> x: not support .write_pointer recovery
> >>
> >> 1) kernel: o, fsck: x, trigger recovery in kernel
> >> 2) kernel: o, fsck: x, not trigger recovery in kernel
> >> 3) kernel: x, fsck: o, trigger recovery in kernel
> >> 4) kernel: x, fsck: o, not trigger recovery in kernel
> >>
> >> For 1) and 2), we can simply adjust to reset all invalid zone and allocate new
> >> zone for curseg before data/meta writes.
> > 
> > Thanks for the clarification. This approach for case 1) and 2) is simple. Let me
> > try to create a patch for it.
> > 
> >>
> >> For 3) and 4), I haven't figured out a way handling all cases perfectly.
> > 
> > For 3), I suppose fsck cannot fix write pointer inconsistency without fsync data
> > loss, since recovery is judged and done by kernel. The write pointer consistency
> > fix after recovery can be done only by kernel.
> > 
> > It is not a good one but one idea is to confirm fsck users to enforce fsync data
> > loss or not. This could be better than nothing.
> > 
> > For 4), my understanding is that fsync data is not left in the file system. I
> > think fsck can check fsync data existence and fix write pointer consistency, as
> > my patch series shows.
> 
> Yeah.
> 
> Let's think about that whether there is a way to cover all cases.
> 
> 1) For non-opened zones, we need to adjust all such zones' write pointer to
> zero. I assume that once write pointer is zero, we still can access valid block
> in zone. (recovery may need to revalidate blocks in free segments)

When write pointer is reset to zero, all data in the zone gets lost. When we
read data in the zone beyond the write pointer, just zero data is read. When any
valid block or fsync data is left in a non-opened zone, I think the zone's write
pointer should be left as is. Otherwise, if the zone do not have valid block and
fsync data, the zone should be reset to avoid unaligned write error.

One additional check I can think of is to check the last valid block in the zone
and write pointer position of the zone. If .write_pointer >= .last_valid_block,
, it is ok. If .write_pointer < .last_valid_block, this is a bug of f2fs. In
this case, the data in the valid blocks beyond write pointer is just lost, and
there is no way to recover this. I think this zone will not be selected for
cursegs for further data write in the zone until the zone get discarded. No
need to fix write pointer position to avoid unaligned write errors. I wonder
if fsck or kernel should detect and report this case, because users still use
the f2fs partition without any action. May I ask your opinion?

> 
> 2) For opened zones (cursegs point to)
> 2.a) .write_pointer == .next_blkaddr, no need to handle
> 2.b) .write_pointer > .next_blkaddr, that should be the common state in sudden
> power-off case. It needs to reset .write_pointer to .next_blkaddr.

Agreed. To be more precise, if fsync data is recorded after .next_blkaddr, we
need to allocate a new zone to the curseg after fsync data recovery.

> 2.c) .write_pointer < .next_blkaddr, should be a bug of f2fs, we should have
> lost datas in the range of [.write_pointer, .next_blkaddr] at least, if we're
> sure about that there is no valid data existed out side of .write_pointer? can we?

As you note, this is a bug and we lost the data in the range of [.write_pointer,
.next_blkaddr]. We should leave the write pointer as is to keep the data in the
zone, and allocate a new zone to the curseg to avoid unaligned write errors.

> 
> 3) f2fs_allocate_new_segments() may reset data type cursegs' position, I'd like
> to know what's the value of -z option we configured, if we configured as one
> zone contains one section, no matter kernel triggers recovery or not (via set
> disable_roll_forward option), the write pointer consistency should has been kept
> in all cases. Otherwise (one zone contains multiple sections), if kernel doesn't
> support .write_pointer recovery, f2fs_allocate_new_segments() can make
> .next_blkaddr larger than .write_pointer in the same zone.

"Single section per zone" is a key design of the f2fs zoned block device
support. For zoned block devices, mkfs.f2fs requires -m option. When -m
options is set, mkfs sets '1' to c.secs_per_zone regardless of the -z option
(refer f2fs_get_device_info() in lib/libf2fs.c). With this prerequisite,
I think new zones are allocated for DATA cursegs in
f2fs_allocate_new_segment() and its sub-function call path.

> 
> Let me know, if I'm missing sth.

Assuming fsync data is the only one exceptional data which is read beyond the
cursegs' next_blkoff, this discussion looks comprehensive for me:

 - with and without kernel write pointer fix
 - with and without fsync data recovery
 - open/non-open zones (zones cursegs point to, zones cursegs do not point to)

All fs meta data in front of main segments are stored in conventional zones
which do not have write pointers (refer f2fs_prepare_super_block() in mkfs/
f2fs_format.c). Just we need to care about write pointer consistency for the
main data area.

Thanks!

--
Best Regards,
Shin'ichiro Kawasaki

_______________________________________________
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [f2fs-dev] [PATCH v4 2/2] fsck.f2fs: Check write pointer consistency with current segments
  2019-09-10  8:10             ` Shinichiro Kawasaki
@ 2019-09-10  9:12               ` Chao Yu
  2019-09-12  8:16                 ` Shinichiro Kawasaki
  0 siblings, 1 reply; 18+ messages in thread
From: Chao Yu @ 2019-09-10  9:12 UTC (permalink / raw)
  To: Shinichiro Kawasaki; +Cc: Jaegeuk Kim, Damien Le Moal, linux-f2fs-devel

On 2019/9/10 16:10, Shinichiro Kawasaki wrote:
> On Sep 09, 2019 / 15:14, Chao Yu wrote:
>> On 2019/9/6 16:31, Shinichiro Kawasaki wrote:
>>> On Sep 05, 2019 / 17:58, Chao Yu wrote:
>>>> Hi Shinichiro,
>>>>
>>>> Sorry for the delay.
>>>>
>>>> On 2019/9/3 16:37, Shinichiro Kawasaki wrote:
>>>>> On Sep 02, 2019 / 15:02, Chao Yu wrote:
>>>>>> On 2019/8/30 18:19, Shin'ichiro Kawasaki wrote:
>>>>>>> On sudden f2fs shutdown, zoned block device status and f2fs current
>>>>>>> segment positions in meta data can be inconsistent. When f2fs shutdown
>>>>>>> happens before write operations completes, write pointers of zoned block
>>>>>>> devices can go further but f2fs meta data keeps current segments at
>>>>>>> positions before the write operations. After remounting the f2fs, the
>>>>>>> inconsistency causes write operations not at write pointers and
>>>>>>> "Unaligned write command" error is reported. This error was observed when
>>>>>>> xfstests test case generic/388 was run with f2fs on a zoned block device.
>>>>>>>
>>>>>>> To avoid the error, have f2fs.fsck check consistency between each current
>>>>>>> segment's position and the write pointer of the zone the current segment
>>>>>>> points to. If the write pointer goes advance from the current segment,
>>>>>>> fix the current segment position setting at same as the write pointer
>>>>>>> position. If the write pointer goes to the zone end, find a new zone and
>>>>>>> set the current segment position at the new zone start. In case the write
>>>>>>> pointer is behind the current segment, write zero data at the write
>>>>>>> pointer position to make write pointer position at same as the current
>>>>>>> segment.
>>>>>>>
>>>>>>> When inconsistencies are found, turn on c.bug_on flag in fsck_verify() to
>>>>>>> ask users to fix them or not. When inconsistencies get fixed, turn on
>>>>>>> 'force' flag in fsck_verify() to enforce fixes in following checks. This
>>>>>>> position fix is done at the beginning of do_fsck() function so that other
>>>>>>> checks reflect the current segment modification.
>>>>>>>
>>>>>>> Also add GET_SEC_FROM_SEG and GET_SEG_FROM_SEC macros in fsck/fsck.h to
>>>>>>> simplify the code.
>>>>>>>
>>>>>>> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
>>>>>>> ---
>>>>>>>  fsck/f2fs.h |   5 ++
>>>>>>>  fsck/fsck.c | 198 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>  fsck/fsck.h |   3 +
>>>>>>>  fsck/main.c |   2 +
>>>>>>>  4 files changed, 208 insertions(+)
>>>>>>>
>>>>>>> diff --git a/fsck/f2fs.h b/fsck/f2fs.h
>>>>>>> index 4dc6698..2c1c2b3 100644
>>>>>>> --- a/fsck/f2fs.h
>>>>>>> +++ b/fsck/f2fs.h
>>>>>>> @@ -337,6 +337,11 @@ static inline block_t __end_block_addr(struct f2fs_sb_info *sbi)
>>>>>>>  #define GET_BLKOFF_FROM_SEG0(sbi, blk_addr)				\
>>>>>>>  	(GET_SEGOFF_FROM_SEG0(sbi, blk_addr) & (sbi->blocks_per_seg - 1))
>>>>>>>  
>>>>>>> +#define GET_SEC_FROM_SEG(sbi, segno)					\
>>>>>>> +	((segno) / (sbi)->segs_per_sec)
>>>>>>> +#define GET_SEG_FROM_SEC(sbi, secno)					\
>>>>>>> +	((secno) * (sbi)->segs_per_sec)
>>>>>>> +
>>>>>>>  #define FREE_I_START_SEGNO(sbi)						\
>>>>>>>  	GET_SEGNO_FROM_SEG0(sbi, SM_I(sbi)->main_blkaddr)
>>>>>>>  #define GET_R2L_SEGNO(sbi, segno)	(segno + FREE_I_START_SEGNO(sbi))
>>>>>>> diff --git a/fsck/fsck.c b/fsck/fsck.c
>>>>>>> index 8953ca1..a0f6849 100644
>>>>>>> --- a/fsck/fsck.c
>>>>>>> +++ b/fsck/fsck.c
>>>>>>> @@ -2574,6 +2574,190 @@ out:
>>>>>>>  	return cnt;
>>>>>>>  }
>>>>>>>  
>>>>>>> +/*
>>>>>>> + * Search a free section in main area. Start search from the section specified
>>>>>>> + * with segno argument toward main area end. Return first segment of the found
>>>>>>> + * section in segno argument.
>>>>>>> + */
>>>>>>> +static int find_next_free_section(struct f2fs_sb_info *sbi,
>>>>>>> +				  unsigned int *segno)
>>>>>>> +{
>>>>>>> +	unsigned int i, sec, section_valid_blocks;
>>>>>>> +	unsigned int end_segno = GET_SEGNO(sbi, SM_I(sbi)->main_blkaddr)
>>>>>>> +		+ SM_I(sbi)->main_segments;
>>>>>>> +	unsigned int end_sec = GET_SEC_FROM_SEG(sbi, end_segno);
>>>>>>> +	struct seg_entry *se;
>>>>>>> +	struct curseg_info *cs;
>>>>>>> +
>>>>>>> +	for (sec = GET_SEC_FROM_SEG(sbi, *segno); sec < end_sec; sec++) {
>>>>>>> +		/* find a section without valid blocks */
>>>>>>> +		section_valid_blocks = 0;
>>>>>>> +		for (i = 0; i < sbi->segs_per_sec; i++) {
>>>>>>> +			se = get_seg_entry(sbi, GET_SEG_FROM_SEC(sbi, sec) + i);
>>>>>>> +			section_valid_blocks += se->valid_blocks;
>>>>>>
>>>>>> If we want to find a 'real' free section, we'd better to use
>>>>>> se->ckpt_valid_blocks (wrapped with get_seg_vblocks()) in where we has recorded
>>>>>> fsynced data count.
>>>>>
>>>>> Thanks. When I create next patch series, I will use get_seg_vblocks().
>>>>> I will rebase to dev-test branch in which get_seg_vblocks() is available.
>>>>>
>>>>>>
>>>>>>> +		}
>>>>>>> +		if (section_valid_blocks)
>>>>>>> +			continue;
>>>>>>> +
>>>>>>> +		/* check the cursegs do not use the section */
>>>>>>> +		for (i = 0; i < NO_CHECK_TYPE; i++) {
>>>>>>> +			cs = &SM_I(sbi)->curseg_array[i];
>>>>>>> +			if (GET_SEC_FROM_SEG(sbi, cs->segno) == sec)
>>>>>>> +				break;
>>>>>>> +		}
>>>>>>> +		if (i >= NR_CURSEG_TYPE) {
>>>>>>> +			*segno = GET_SEG_FROM_SEC(sbi, sec);
>>>>>>> +			return 0;
>>>>>>> +		}
>>>>>>> +	}
>>>>>>> +
>>>>>>> +	return -1;
>>>>>>> +}
>>>>>>> +
>>>>>>> +struct write_pointer_check_data {
>>>>>>> +	struct f2fs_sb_info *sbi;
>>>>>>> +	struct device_info *dev;
>>>>>>> +};
>>>>>>> +
>>>>>>> +static int fsck_chk_write_pointer(int i, struct blk_zone *blkz, void *opaque)
>>>>>>> +{
>>>>>>> +	struct write_pointer_check_data *wpd = opaque;
>>>>>>> +	struct f2fs_sb_info *sbi = wpd->sbi;
>>>>>>> +	struct device_info *dev = wpd->dev;
>>>>>>> +	struct f2fs_fsck *fsck = F2FS_FSCK(sbi);
>>>>>>> +	block_t zone_block, wp_block, wp_blkoff, cs_block, b;
>>>>>>> +	unsigned int zone_segno, wp_segno, new_segno;
>>>>>>> +	struct seg_entry *se;
>>>>>>> +	struct curseg_info *cs;
>>>>>>> +	int cs_index, ret;
>>>>>>> +	int log_sectors_per_block = sbi->log_blocksize - SECTOR_SHIFT;
>>>>>>> +	unsigned int segs_per_zone = sbi->segs_per_sec * sbi->secs_per_zone;
>>>>>>> +	void *zero_blk;
>>>>>>> +
>>>>>>> +	if (blk_zone_conv(blkz))
>>>>>>> +		return 0;
>>>>>>> +
>>>>>>> +	zone_block = dev->start_blkaddr
>>>>>>> +		+ (blk_zone_sector(blkz) >> log_sectors_per_block);
>>>>>>> +	zone_segno = GET_SEGNO(sbi, zone_block);
>>>>>>> +	wp_block = dev->start_blkaddr
>>>>>>> +		+ (blk_zone_wp_sector(blkz) >> log_sectors_per_block);
>>>>>>> +	wp_segno = GET_SEGNO(sbi, wp_block);
>>>>>>> +	wp_blkoff = wp_block - START_BLOCK(sbi, wp_segno);
>>>>>>> +
>>>>>>> +	/* find the curseg which points to the zone */
>>>>>>> +	for (cs_index = 0; cs_index < NO_CHECK_TYPE; cs_index++) {
>>>>>>> +		cs = &SM_I(sbi)->curseg_array[cs_index];
>>>>>>> +		if (zone_segno <= cs->segno &&
>>>>>>> +		    cs->segno < zone_segno + segs_per_zone)
>>>>>>> +			break;
>>>>>>> +	}
>>>>>>> +
>>>>>>> +	if (cs_index >= NR_CURSEG_TYPE)
>>>>>>> +		return 0;
>>>>>>> +
>>>>>>> +	/* check write pointer consistency with the curseg in the zone */
>>>>>>> +	cs_block = START_BLOCK(sbi, cs->segno) + cs->next_blkoff;
>>>>>>> +	if (wp_block == cs_block)
>>>>>>> +		return 0;
>>>>>>> +
>>>>>>> +	if (!c.fix_on) {
>>>>>>> +		MSG(0, "Inconsistent write pointer: "
>>>>>>> +		    "curseg %d[0x%x,0x%x] wp[0x%x,0x%x]\n",
>>>>>>> +		    cs_index, cs->segno, cs->next_blkoff, wp_segno, wp_blkoff);
>>>>>>> +		fsck->chk.wp_inconsistent_zones++;
>>>>>>> +		return 0;
>>>>>>> +	}
>>>>>>> +
>>>>>>> +	/*
>>>>>>> +	 * If the curseg is in advance from the write pointer, write zero to
>>>>>>> +	 * move the write pointer forward to the same position as the curseg.
>>>>>>> +	 */
>>>>>>> +	if (wp_block < cs_block) {
>>>>>>> +		ret = 0;
>>>>>>> +		zero_blk = calloc(BLOCK_SZ, 1);
>>>>>>> +		if (!zero_blk)
>>>>>>> +			return -EINVAL;
>>>>>>> +
>>>>>>> +		FIX_MSG("Advance write pointer to match with curseg %d: "
>>>>>>> +			"[0x%x,0x%x]->[0x%x,0x%x]",
>>>>>>> +			cs_index, wp_segno, wp_blkoff,
>>>>>>> +			cs->segno, cs->next_blkoff);
>>>>>>> +		for (b = wp_block; b < cs_block && !ret; b++)
>>>>>>> +			ret = dev_write_block(zero_blk, b);
>>>>>>> +
>>>>>>> +		fsck->chk.wp_fixed_zones++;
>>>>>>> +		free(zero_blk);
>>>>>>> +		return ret;
>>>>>>> +	}
>>>>>>> +
>>>>>>> +	if (wp_segno == zone_segno + segs_per_zone) {
>>>>>>> +		/*
>>>>>>> +		 * If the write pointer is in advance from the curseg and at
>>>>>>> +		 * the zone end (section end), search a new free zone (section)
>>>>>>> +		 * between the curseg and main area end.
>>>>>>> +		 */
>>>>>>> +		new_segno = wp_segno;
>>>>>>> +		ret = find_next_free_section(sbi, &new_segno);
>>>>>>> +		if (ret) {
>>>>>>> +			/* search again from main area start */
>>>>>>> +			new_segno = GET_SEGNO(sbi, SM_I(sbi)->main_blkaddr);
>>>>>>> +			ret = find_next_free_section(sbi, &new_segno);
>>>>>>> +		}
>>>>>>
>>>>>> If curseg's type is warm node, relocating curseg would ruin warm node chain,
>>>>>> result in losing fsynced data for ever as well.
>>>>>>
>>>>>> So I guess we should migrate all dnode blocks chained with cs->next_blkoff in
>>>>>> current section.
>>>>>>
>>>>>>> +		if (ret) {
>>>>>>> +			MSG(0, "Free section not found\n");
>>>>>>> +			return ret;
>>>>>>> +		}
>>>>>>> +		FIX_MSG("New section for curseg %d: [0x%x,0x%x]->[0x%x,0x%x]",
>>>>>>> +			cs_index, cs->segno, cs->next_blkoff, new_segno, 0);
>>>>>>> +		cs->segno = new_segno;
>>>>>>> +		cs->next_blkoff = 0;
>>>>>>> +	} else {
>>>>>>> +		/*
>>>>>>> +		 * If the write pointer is in advance from the curseg within
>>>>>>> +		 * the zone, modify the curseg position to be same as the
>>>>>>> +		 * write pointer.
>>>>>>> +		 */
>>>>>>> +		ASSERT(wp_segno < zone_segno + segs_per_zone);
>>>>>>> +		FIX_MSG("Advance curseg %d: [0x%x,0x%x]->[0x%x,0x%x]",
>>>>>>> +			cs_index, cs->segno, cs->next_blkoff,
>>>>>>> +			wp_segno, wp_blkoff);
>>>>>>> +		cs->segno = wp_segno;
>>>>>>> +		cs->next_blkoff = wp_blkoff;
>>>>>>
>>>>>> The same data lose problem here, I guess we'd better handle it with the same way
>>>>>> as above.
>>>>>>
>>>>>> Thoughts?
>>>>>
>>>>> I created f2fs status with fsync data and warm node chain, and confirmed the v4
>>>>> implementation ruins the dnode blocks chain. Hmm.
>>>>>
>>>>> My understanding is that when f2fs kernel module recovers the fsync data, it
>>>>> sets the warm node curseg position at the start of the fsync data, and the fsync
>>>>> data will be overwritten as new nodes created. Is this understanding correct?
>>>>
>>>> Sorry, I'm not sure I've understood you correctly.
>>>
>>> Apology is mine, my question was not clear enough.
>>> And thanks for the explanation below. It helps me to understand better.
>>>
>>>> Current recovery flow is:
>>>> 1)   find all valid fsynced dnode in warm node chain
>>>> 2.a) recover fsynced dnode in memory, and set it dirty
>>>> 2.b) recover directory entry in memory, and set it dirty
>>>> 2.c) during regular's dnode recovery, physical blocks indexed by recovered dnode
>>>> will be revalidated
>>>> Note: we didn't move any cursegs before 3)
>>>> 3)   relocate all cursegs to new segments
>>>> 4)   trigger checkpoint to persist all recovered data(fs' meta, file's meta/data)
>>>
>>> Question, does the step 3) correspond to f2fs_allocate_new_segments(sbi) call
>>> at the end of recover_data()? The f2fs_allocate_new_segments() function
>>
>> Yeah, I meant that function.
>>
>>> relocates new segments only for DATA cursegs, and it keeps NODE cursegs same as
>>> before the fsync data recovery. Then I thought WARM NODE curseg would not change
>>> by recovery (and still keeps pointing to the fsync recovery data).
>>
>> Yes, that's correct. WARM NODE curseg won't change until step 4).
> 
> Thanks. Following your idea "we can simply adjust to reset all invalid zone and
> allocate new zone for curseg before data/meta writes" for fix by kernel, I think
> kernel code change is required to allocate new zones for NODE cursegs also at
> step 3). WARM NODE curseg should be kept untouched by step 2 completion to refer
> fsynced dnodes at WARM NODE curseg's next_blkaddr. And at step 4, the fsyced
> dnodes recovered and set dirty will be written back with one of NODE cursegs
> (HOT NODE curseg?). At that time, we need to make sure the NODE curseg points to

Directory's dnode goes to hot node area, other file's dnode goes to warm node
area, the left node goes to cold node area.

> the position consistent with its zone's write pointer.

Yes, before step 4), we should keep f2fs and zoned block device's write pointer
being consistent.

> 
>>>
>>>>>
>>>>> If this is the case, I think write pointer inconsistency will happen even if
>>>>> fsck does "migrate all dnode blocks" (I assume that the warm node curseg's next
>>>>
>>>> Actually, I notice that scheme's problem is: we've changed zone's write pointer
>>>> during dnode blocks migration, however if kernel drops recovery, we will still
>>>> face inconsistent status in between .next_blkaddr and .write_pointer, once we
>>>> start to write data from .next_blkaddr. So in fsck, after migration, we need to
>>>> reset .write_pointer to .next_blkaddr.
>>>>
>>>> I guess we need to consider four cases:
>>>>
>>>> o: support .write_pointer recovery
>>>> x: not support .write_pointer recovery
>>>>
>>>> 1) kernel: o, fsck: x, trigger recovery in kernel
>>>> 2) kernel: o, fsck: x, not trigger recovery in kernel
>>>> 3) kernel: x, fsck: o, trigger recovery in kernel
>>>> 4) kernel: x, fsck: o, not trigger recovery in kernel
>>>>
>>>> For 1) and 2), we can simply adjust to reset all invalid zone and allocate new
>>>> zone for curseg before data/meta writes.
>>>
>>> Thanks for the clarification. This approach for case 1) and 2) is simple. Let me
>>> try to create a patch for it.
>>>
>>>>
>>>> For 3) and 4), I haven't figured out a way handling all cases perfectly.
>>>
>>> For 3), I suppose fsck cannot fix write pointer inconsistency without fsync data
>>> loss, since recovery is judged and done by kernel. The write pointer consistency
>>> fix after recovery can be done only by kernel.
>>>
>>> It is not a good one but one idea is to confirm fsck users to enforce fsync data
>>> loss or not. This could be better than nothing.
>>>
>>> For 4), my understanding is that fsync data is not left in the file system. I
>>> think fsck can check fsync data existence and fix write pointer consistency, as
>>> my patch series shows.
>>
>> Yeah.
>>
>> Let's think about that whether there is a way to cover all cases.
>>
>> 1) For non-opened zones, we need to adjust all such zones' write pointer to
>> zero. I assume that once write pointer is zero, we still can access valid block
>> in zone. (recovery may need to revalidate blocks in free segments)
> 
> When write pointer is reset to zero, all data in the zone gets lost. When we
> read data in the zone beyond the write pointer, just zero data is read. When any
> valid block or fsync data is left in a non-opened zone, I think the zone's write
> pointer should be left as is. Otherwise, if the zone do not have valid block and
> fsync data, the zone should be reset to avoid unaligned write error.

Okay, if data beyond write pointer is invalid, we should keep write pointer as
it is if there are fsynced data in that zone.

> 
> One additional check I can think of is to check the last valid block in the zone
> and write pointer position of the zone. If .write_pointer >= .last_valid_block,
> , it is ok. If .write_pointer < .last_valid_block, this is a bug of f2fs. In

Sounds reasonable, how can we find last valid block, as you said, content of
block beyond write pointer is all zero... or you mean curseg's next_blkaddr?
like the condition 2.c) described?

> this case, the data in the valid blocks beyond write pointer is just lost, and
> there is no way to recover this. I think this zone will not be selected for
> cursegs for further data write in the zone until the zone get discarded. No
> need to fix write pointer position to avoid unaligned write errors. I wonder

Yes,

> if fsck or kernel should detect and report this case, because users still use
> the f2fs partition without any action. May I ask your opinion?

If we can detect that, I think it should be reported.

> 
>>
>> 2) For opened zones (cursegs point to)
>> 2.a) .write_pointer == .next_blkaddr, no need to handle
>> 2.b) .write_pointer > .next_blkaddr, that should be the common state in sudden
>> power-off case. It needs to reset .write_pointer to .next_blkaddr.
> 
> Agreed. To be more precise, if fsync data is recorded after .next_blkaddr, we
> need to allocate a new zone to the curseg after fsync data recovery.
> 
>> 2.c) .write_pointer < .next_blkaddr, should be a bug of f2fs, we should have
>> lost datas in the range of [.write_pointer, .next_blkaddr] at least, if we're
>> sure about that there is no valid data existed out side of .write_pointer? can we?
> 
> As you note, this is a bug and we lost the data in the range of [.write_pointer,
> .next_blkaddr]. We should leave the write pointer as is to keep the data in the
> zone, and allocate a new zone to the curseg to avoid unaligned write errors.

Okay.

> 
>>
>> 3) f2fs_allocate_new_segments() may reset data type cursegs' position, I'd like
>> to know what's the value of -z option we configured, if we configured as one
>> zone contains one section, no matter kernel triggers recovery or not (via set
>> disable_roll_forward option), the write pointer consistency should has been kept
>> in all cases. Otherwise (one zone contains multiple sections), if kernel doesn't
>> support .write_pointer recovery, f2fs_allocate_new_segments() can make
>> .next_blkaddr larger than .write_pointer in the same zone.
> 
> "Single section per zone" is a key design of the f2fs zoned block device
> support. For zoned block devices, mkfs.f2fs requires -m option. When -m
> options is set, mkfs sets '1' to c.secs_per_zone regardless of the -z option
> (refer f2fs_get_device_info() in lib/libf2fs.c). With this prerequisite,

Oh, correct, thanks for noticing that. I did miss that we only use -m option for
zoned block device.

> I think new zones are allocated for DATA cursegs in
> f2fs_allocate_new_segment() and its sub-function call path.
> 
>>
>> Let me know, if I'm missing sth.
> 
> Assuming fsync data is the only one exceptional data which is read beyond the
> cursegs' next_blkoff, this discussion looks comprehensive for me:
> 
>  - with and without kernel write pointer fix
>  - with and without fsync data recovery
>  - open/non-open zones (zones cursegs point to, zones cursegs do not point to)
> 
> All fs meta data in front of main segments are stored in conventional zones
> which do not have write pointers (refer f2fs_prepare_super_block() in mkfs/
> f2fs_format.c). Just we need to care about write pointer consistency for the
> main data area.

Agreed,

Thanks,

> 
> Thanks!
> 
> --
> Best Regards,
> Shin'ichiro Kawasaki.
> 


_______________________________________________
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [f2fs-dev] [PATCH v4 2/2] fsck.f2fs: Check write pointer consistency with current segments
  2019-09-10  9:12               ` Chao Yu
@ 2019-09-12  8:16                 ` Shinichiro Kawasaki
  2019-09-16  1:37                   ` Chao Yu
  0 siblings, 1 reply; 18+ messages in thread
From: Shinichiro Kawasaki @ 2019-09-12  8:16 UTC (permalink / raw)
  To: Chao Yu; +Cc: Jaegeuk Kim, Damien Le Moal, linux-f2fs-devel

On Sep 10, 2019 / 17:12, Chao Yu wrote:
> On 2019/9/10 16:10, Shinichiro Kawasaki wrote:
> > On Sep 09, 2019 / 15:14, Chao Yu wrote:
> >> On 2019/9/6 16:31, Shinichiro Kawasaki wrote:
> >>> On Sep 05, 2019 / 17:58, Chao Yu wrote:
> >>>> Hi Shinichiro,
> >>>>
> >>>> Sorry for the delay.
> >>>>
> >>>> On 2019/9/3 16:37, Shinichiro Kawasaki wrote:
> >>>>> On Sep 02, 2019 / 15:02, Chao Yu wrote:
> >>>>>> On 2019/8/30 18:19, Shin'ichiro Kawasaki wrote:
> >>>>>>> On sudden f2fs shutdown, zoned block device status and f2fs current
> >>>>>>> segment positions in meta data can be inconsistent. When f2fs shutdown
> >>>>>>> happens before write operations completes, write pointers of zoned block
> >>>>>>> devices can go further but f2fs meta data keeps current segments at
> >>>>>>> positions before the write operations. After remounting the f2fs, the
> >>>>>>> inconsistency causes write operations not at write pointers and
> >>>>>>> "Unaligned write command" error is reported. This error was observed when
> >>>>>>> xfstests test case generic/388 was run with f2fs on a zoned block device.
> >>>>>>>
> >>>>>>> To avoid the error, have f2fs.fsck check consistency between each current
> >>>>>>> segment's position and the write pointer of the zone the current segment
> >>>>>>> points to. If the write pointer goes advance from the current segment,
> >>>>>>> fix the current segment position setting at same as the write pointer
> >>>>>>> position. If the write pointer goes to the zone end, find a new zone and
> >>>>>>> set the current segment position at the new zone start. In case the write
> >>>>>>> pointer is behind the current segment, write zero data at the write
> >>>>>>> pointer position to make write pointer position at same as the current
> >>>>>>> segment.
> >>>>>>>
> >>>>>>> When inconsistencies are found, turn on c.bug_on flag in fsck_verify() to
> >>>>>>> ask users to fix them or not. When inconsistencies get fixed, turn on
> >>>>>>> 'force' flag in fsck_verify() to enforce fixes in following checks. This
> >>>>>>> position fix is done at the beginning of do_fsck() function so that other
> >>>>>>> checks reflect the current segment modification.
> >>>>>>>
> >>>>>>> Also add GET_SEC_FROM_SEG and GET_SEG_FROM_SEC macros in fsck/fsck.h to
> >>>>>>> simplify the code.
> >>>>>>>
> >>>>>>> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> >>>>>>> ---
> >>>>>>>  fsck/f2fs.h |   5 ++
> >>>>>>>  fsck/fsck.c | 198 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>>>>>  fsck/fsck.h |   3 +
> >>>>>>>  fsck/main.c |   2 +
> >>>>>>>  4 files changed, 208 insertions(+)
> >>>>>>>
> >>>>>>> diff --git a/fsck/f2fs.h b/fsck/f2fs.h
> >>>>>>> index 4dc6698..2c1c2b3 100644
> >>>>>>> --- a/fsck/f2fs.h
> >>>>>>> +++ b/fsck/f2fs.h
> >>>>>>> @@ -337,6 +337,11 @@ static inline block_t __end_block_addr(struct f2fs_sb_info *sbi)
> >>>>>>>  #define GET_BLKOFF_FROM_SEG0(sbi, blk_addr)				\
> >>>>>>>  	(GET_SEGOFF_FROM_SEG0(sbi, blk_addr) & (sbi->blocks_per_seg - 1))
> >>>>>>>  
> >>>>>>> +#define GET_SEC_FROM_SEG(sbi, segno)					\
> >>>>>>> +	((segno) / (sbi)->segs_per_sec)
> >>>>>>> +#define GET_SEG_FROM_SEC(sbi, secno)					\
> >>>>>>> +	((secno) * (sbi)->segs_per_sec)
> >>>>>>> +
> >>>>>>>  #define FREE_I_START_SEGNO(sbi)						\
> >>>>>>>  	GET_SEGNO_FROM_SEG0(sbi, SM_I(sbi)->main_blkaddr)
> >>>>>>>  #define GET_R2L_SEGNO(sbi, segno)	(segno + FREE_I_START_SEGNO(sbi))
> >>>>>>> diff --git a/fsck/fsck.c b/fsck/fsck.c
> >>>>>>> index 8953ca1..a0f6849 100644
> >>>>>>> --- a/fsck/fsck.c
> >>>>>>> +++ b/fsck/fsck.c
> >>>>>>> @@ -2574,6 +2574,190 @@ out:
> >>>>>>>  	return cnt;
> >>>>>>>  }
> >>>>>>>  
> >>>>>>> +/*
> >>>>>>> + * Search a free section in main area. Start search from the section specified
> >>>>>>> + * with segno argument toward main area end. Return first segment of the found
> >>>>>>> + * section in segno argument.
> >>>>>>> + */
> >>>>>>> +static int find_next_free_section(struct f2fs_sb_info *sbi,
> >>>>>>> +				  unsigned int *segno)
> >>>>>>> +{
> >>>>>>> +	unsigned int i, sec, section_valid_blocks;
> >>>>>>> +	unsigned int end_segno = GET_SEGNO(sbi, SM_I(sbi)->main_blkaddr)
> >>>>>>> +		+ SM_I(sbi)->main_segments;
> >>>>>>> +	unsigned int end_sec = GET_SEC_FROM_SEG(sbi, end_segno);
> >>>>>>> +	struct seg_entry *se;
> >>>>>>> +	struct curseg_info *cs;
> >>>>>>> +
> >>>>>>> +	for (sec = GET_SEC_FROM_SEG(sbi, *segno); sec < end_sec; sec++) {
> >>>>>>> +		/* find a section without valid blocks */
> >>>>>>> +		section_valid_blocks = 0;
> >>>>>>> +		for (i = 0; i < sbi->segs_per_sec; i++) {
> >>>>>>> +			se = get_seg_entry(sbi, GET_SEG_FROM_SEC(sbi, sec) + i);
> >>>>>>> +			section_valid_blocks += se->valid_blocks;
> >>>>>>
> >>>>>> If we want to find a 'real' free section, we'd better to use
> >>>>>> se->ckpt_valid_blocks (wrapped with get_seg_vblocks()) in where we has recorded
> >>>>>> fsynced data count.
> >>>>>
> >>>>> Thanks. When I create next patch series, I will use get_seg_vblocks().
> >>>>> I will rebase to dev-test branch in which get_seg_vblocks() is available.
> >>>>>
> >>>>>>
> >>>>>>> +		}
> >>>>>>> +		if (section_valid_blocks)
> >>>>>>> +			continue;
> >>>>>>> +
> >>>>>>> +		/* check the cursegs do not use the section */
> >>>>>>> +		for (i = 0; i < NO_CHECK_TYPE; i++) {
> >>>>>>> +			cs = &SM_I(sbi)->curseg_array[i];
> >>>>>>> +			if (GET_SEC_FROM_SEG(sbi, cs->segno) == sec)
> >>>>>>> +				break;
> >>>>>>> +		}
> >>>>>>> +		if (i >= NR_CURSEG_TYPE) {
> >>>>>>> +			*segno = GET_SEG_FROM_SEC(sbi, sec);
> >>>>>>> +			return 0;
> >>>>>>> +		}
> >>>>>>> +	}
> >>>>>>> +
> >>>>>>> +	return -1;
> >>>>>>> +}
> >>>>>>> +
> >>>>>>> +struct write_pointer_check_data {
> >>>>>>> +	struct f2fs_sb_info *sbi;
> >>>>>>> +	struct device_info *dev;
> >>>>>>> +};
> >>>>>>> +
> >>>>>>> +static int fsck_chk_write_pointer(int i, struct blk_zone *blkz, void *opaque)
> >>>>>>> +{
> >>>>>>> +	struct write_pointer_check_data *wpd = opaque;
> >>>>>>> +	struct f2fs_sb_info *sbi = wpd->sbi;
> >>>>>>> +	struct device_info *dev = wpd->dev;
> >>>>>>> +	struct f2fs_fsck *fsck = F2FS_FSCK(sbi);
> >>>>>>> +	block_t zone_block, wp_block, wp_blkoff, cs_block, b;
> >>>>>>> +	unsigned int zone_segno, wp_segno, new_segno;
> >>>>>>> +	struct seg_entry *se;
> >>>>>>> +	struct curseg_info *cs;
> >>>>>>> +	int cs_index, ret;
> >>>>>>> +	int log_sectors_per_block = sbi->log_blocksize - SECTOR_SHIFT;
> >>>>>>> +	unsigned int segs_per_zone = sbi->segs_per_sec * sbi->secs_per_zone;
> >>>>>>> +	void *zero_blk;
> >>>>>>> +
> >>>>>>> +	if (blk_zone_conv(blkz))
> >>>>>>> +		return 0;
> >>>>>>> +
> >>>>>>> +	zone_block = dev->start_blkaddr
> >>>>>>> +		+ (blk_zone_sector(blkz) >> log_sectors_per_block);
> >>>>>>> +	zone_segno = GET_SEGNO(sbi, zone_block);
> >>>>>>> +	wp_block = dev->start_blkaddr
> >>>>>>> +		+ (blk_zone_wp_sector(blkz) >> log_sectors_per_block);
> >>>>>>> +	wp_segno = GET_SEGNO(sbi, wp_block);
> >>>>>>> +	wp_blkoff = wp_block - START_BLOCK(sbi, wp_segno);
> >>>>>>> +
> >>>>>>> +	/* find the curseg which points to the zone */
> >>>>>>> +	for (cs_index = 0; cs_index < NO_CHECK_TYPE; cs_index++) {
> >>>>>>> +		cs = &SM_I(sbi)->curseg_array[cs_index];
> >>>>>>> +		if (zone_segno <= cs->segno &&
> >>>>>>> +		    cs->segno < zone_segno + segs_per_zone)
> >>>>>>> +			break;
> >>>>>>> +	}
> >>>>>>> +
> >>>>>>> +	if (cs_index >= NR_CURSEG_TYPE)
> >>>>>>> +		return 0;
> >>>>>>> +
> >>>>>>> +	/* check write pointer consistency with the curseg in the zone */
> >>>>>>> +	cs_block = START_BLOCK(sbi, cs->segno) + cs->next_blkoff;
> >>>>>>> +	if (wp_block == cs_block)
> >>>>>>> +		return 0;
> >>>>>>> +
> >>>>>>> +	if (!c.fix_on) {
> >>>>>>> +		MSG(0, "Inconsistent write pointer: "
> >>>>>>> +		    "curseg %d[0x%x,0x%x] wp[0x%x,0x%x]\n",
> >>>>>>> +		    cs_index, cs->segno, cs->next_blkoff, wp_segno, wp_blkoff);
> >>>>>>> +		fsck->chk.wp_inconsistent_zones++;
> >>>>>>> +		return 0;
> >>>>>>> +	}
> >>>>>>> +
> >>>>>>> +	/*
> >>>>>>> +	 * If the curseg is in advance from the write pointer, write zero to
> >>>>>>> +	 * move the write pointer forward to the same position as the curseg.
> >>>>>>> +	 */
> >>>>>>> +	if (wp_block < cs_block) {
> >>>>>>> +		ret = 0;
> >>>>>>> +		zero_blk = calloc(BLOCK_SZ, 1);
> >>>>>>> +		if (!zero_blk)
> >>>>>>> +			return -EINVAL;
> >>>>>>> +
> >>>>>>> +		FIX_MSG("Advance write pointer to match with curseg %d: "
> >>>>>>> +			"[0x%x,0x%x]->[0x%x,0x%x]",
> >>>>>>> +			cs_index, wp_segno, wp_blkoff,
> >>>>>>> +			cs->segno, cs->next_blkoff);
> >>>>>>> +		for (b = wp_block; b < cs_block && !ret; b++)
> >>>>>>> +			ret = dev_write_block(zero_blk, b);
> >>>>>>> +
> >>>>>>> +		fsck->chk.wp_fixed_zones++;
> >>>>>>> +		free(zero_blk);
> >>>>>>> +		return ret;
> >>>>>>> +	}
> >>>>>>> +
> >>>>>>> +	if (wp_segno == zone_segno + segs_per_zone) {
> >>>>>>> +		/*
> >>>>>>> +		 * If the write pointer is in advance from the curseg and at
> >>>>>>> +		 * the zone end (section end), search a new free zone (section)
> >>>>>>> +		 * between the curseg and main area end.
> >>>>>>> +		 */
> >>>>>>> +		new_segno = wp_segno;
> >>>>>>> +		ret = find_next_free_section(sbi, &new_segno);
> >>>>>>> +		if (ret) {
> >>>>>>> +			/* search again from main area start */
> >>>>>>> +			new_segno = GET_SEGNO(sbi, SM_I(sbi)->main_blkaddr);
> >>>>>>> +			ret = find_next_free_section(sbi, &new_segno);
> >>>>>>> +		}
> >>>>>>
> >>>>>> If curseg's type is warm node, relocating curseg would ruin warm node chain,
> >>>>>> result in losing fsynced data for ever as well.
> >>>>>>
> >>>>>> So I guess we should migrate all dnode blocks chained with cs->next_blkoff in
> >>>>>> current section.
> >>>>>>
> >>>>>>> +		if (ret) {
> >>>>>>> +			MSG(0, "Free section not found\n");
> >>>>>>> +			return ret;
> >>>>>>> +		}
> >>>>>>> +		FIX_MSG("New section for curseg %d: [0x%x,0x%x]->[0x%x,0x%x]",
> >>>>>>> +			cs_index, cs->segno, cs->next_blkoff, new_segno, 0);
> >>>>>>> +		cs->segno = new_segno;
> >>>>>>> +		cs->next_blkoff = 0;
> >>>>>>> +	} else {
> >>>>>>> +		/*
> >>>>>>> +		 * If the write pointer is in advance from the curseg within
> >>>>>>> +		 * the zone, modify the curseg position to be same as the
> >>>>>>> +		 * write pointer.
> >>>>>>> +		 */
> >>>>>>> +		ASSERT(wp_segno < zone_segno + segs_per_zone);
> >>>>>>> +		FIX_MSG("Advance curseg %d: [0x%x,0x%x]->[0x%x,0x%x]",
> >>>>>>> +			cs_index, cs->segno, cs->next_blkoff,
> >>>>>>> +			wp_segno, wp_blkoff);
> >>>>>>> +		cs->segno = wp_segno;
> >>>>>>> +		cs->next_blkoff = wp_blkoff;
> >>>>>>
> >>>>>> The same data lose problem here, I guess we'd better handle it with the same way
> >>>>>> as above.
> >>>>>>
> >>>>>> Thoughts?
> >>>>>
> >>>>> I created f2fs status with fsync data and warm node chain, and confirmed the v4
> >>>>> implementation ruins the dnode blocks chain. Hmm.
> >>>>>
> >>>>> My understanding is that when f2fs kernel module recovers the fsync data, it
> >>>>> sets the warm node curseg position at the start of the fsync data, and the fsync
> >>>>> data will be overwritten as new nodes created. Is this understanding correct?
> >>>>
> >>>> Sorry, I'm not sure I've understood you correctly.
> >>>
> >>> Apology is mine, my question was not clear enough.
> >>> And thanks for the explanation below. It helps me to understand better.
> >>>
> >>>> Current recovery flow is:
> >>>> 1)   find all valid fsynced dnode in warm node chain
> >>>> 2.a) recover fsynced dnode in memory, and set it dirty
> >>>> 2.b) recover directory entry in memory, and set it dirty
> >>>> 2.c) during regular's dnode recovery, physical blocks indexed by recovered dnode
> >>>> will be revalidated
> >>>> Note: we didn't move any cursegs before 3)
> >>>> 3)   relocate all cursegs to new segments
> >>>> 4)   trigger checkpoint to persist all recovered data(fs' meta, file's meta/data)
> >>>
> >>> Question, does the step 3) correspond to f2fs_allocate_new_segments(sbi) call
> >>> at the end of recover_data()? The f2fs_allocate_new_segments() function
> >>
> >> Yeah, I meant that function.
> >>
> >>> relocates new segments only for DATA cursegs, and it keeps NODE cursegs same as
> >>> before the fsync data recovery. Then I thought WARM NODE curseg would not change
> >>> by recovery (and still keeps pointing to the fsync recovery data).
> >>
> >> Yes, that's correct. WARM NODE curseg won't change until step 4).
> > 
> > Thanks. Following your idea "we can simply adjust to reset all invalid zone and
> > allocate new zone for curseg before data/meta writes" for fix by kernel, I think
> > kernel code change is required to allocate new zones for NODE cursegs also at
> > step 3). WARM NODE curseg should be kept untouched by step 2 completion to refer
> > fsynced dnodes at WARM NODE curseg's next_blkaddr. And at step 4, the fsyced
> > dnodes recovered and set dirty will be written back with one of NODE cursegs
> > (HOT NODE curseg?). At that time, we need to make sure the NODE curseg points to
> 
> Directory's dnode goes to hot node area, other file's dnode goes to warm node
> area, the left node goes to cold node area.
> 
> > the position consistent with its zone's write pointer.
> 
> Yes, before step 4), we should keep f2fs and zoned block device's write pointer
> being consistent.

Ok, thanks.

> 
> > 
> >>>
> >>>>>
> >>>>> If this is the case, I think write pointer inconsistency will happen even if
> >>>>> fsck does "migrate all dnode blocks" (I assume that the warm node curseg's next
> >>>>
> >>>> Actually, I notice that scheme's problem is: we've changed zone's write pointer
> >>>> during dnode blocks migration, however if kernel drops recovery, we will still
> >>>> face inconsistent status in between .next_blkaddr and .write_pointer, once we
> >>>> start to write data from .next_blkaddr. So in fsck, after migration, we need to
> >>>> reset .write_pointer to .next_blkaddr.
> >>>>
> >>>> I guess we need to consider four cases:
> >>>>
> >>>> o: support .write_pointer recovery
> >>>> x: not support .write_pointer recovery
> >>>>
> >>>> 1) kernel: o, fsck: x, trigger recovery in kernel
> >>>> 2) kernel: o, fsck: x, not trigger recovery in kernel
> >>>> 3) kernel: x, fsck: o, trigger recovery in kernel
> >>>> 4) kernel: x, fsck: o, not trigger recovery in kernel
> >>>>
> >>>> For 1) and 2), we can simply adjust to reset all invalid zone and allocate new
> >>>> zone for curseg before data/meta writes.
> >>>
> >>> Thanks for the clarification. This approach for case 1) and 2) is simple. Let me
> >>> try to create a patch for it.
> >>>
> >>>>
> >>>> For 3) and 4), I haven't figured out a way handling all cases perfectly.
> >>>
> >>> For 3), I suppose fsck cannot fix write pointer inconsistency without fsync data
> >>> loss, since recovery is judged and done by kernel. The write pointer consistency
> >>> fix after recovery can be done only by kernel.
> >>>
> >>> It is not a good one but one idea is to confirm fsck users to enforce fsync data
> >>> loss or not. This could be better than nothing.
> >>>
> >>> For 4), my understanding is that fsync data is not left in the file system. I
> >>> think fsck can check fsync data existence and fix write pointer consistency, as
> >>> my patch series shows.
> >>
> >> Yeah.
> >>
> >> Let's think about that whether there is a way to cover all cases.
> >>
> >> 1) For non-opened zones, we need to adjust all such zones' write pointer to
> >> zero. I assume that once write pointer is zero, we still can access valid block
> >> in zone. (recovery may need to revalidate blocks in free segments)
> > 
> > When write pointer is reset to zero, all data in the zone gets lost. When we
> > read data in the zone beyond the write pointer, just zero data is read. When any
> > valid block or fsync data is left in a non-opened zone, I think the zone's write
> > pointer should be left as is. Otherwise, if the zone do not have valid block and
> > fsync data, the zone should be reset to avoid unaligned write error.
> 
> Okay, if data beyond write pointer is invalid, we should keep write pointer as
> it is if there are fsynced data in that zone.
> 
> > 
> > One additional check I can think of is to check the last valid block in the zone
> > and write pointer position of the zone. If .write_pointer >= .last_valid_block,
> > , it is ok. If .write_pointer < .last_valid_block, this is a bug of f2fs. In
> 
> Sounds reasonable, how can we find last valid block, as you said, content of
> block beyond write pointer is all zero... or you mean curseg's next_blkaddr?
> like the condition 2.c) described?

I think we can get each zone's last valid block referring each segment's valid
block bitmap in SIT. In other words, this is a consistency check between write
pointer and SIT. Is this feasible approach?

> 
> > this case, the data in the valid blocks beyond write pointer is just lost, and
> > there is no way to recover this. I think this zone will not be selected for
> > cursegs for further data write in the zone until the zone get discarded. No
> > need to fix write pointer position to avoid unaligned write errors. I wonder
> 
> Yes,
> 
> > if fsck or kernel should detect and report this case, because users still use
> > the f2fs partition without any action. May I ask your opinion?
> 
> If we can detect that, I think it should be reported.

I see, thanks for the comment.

> 
> > 
> >>
> >> 2) For opened zones (cursegs point to)
> >> 2.a) .write_pointer == .next_blkaddr, no need to handle
> >> 2.b) .write_pointer > .next_blkaddr, that should be the common state in sudden
> >> power-off case. It needs to reset .write_pointer to .next_blkaddr.

Sorry but let me amend. I have just noticed that the fix above is not possible.
We cannot set .write_pointer at .next_blkaddr, because write pointers cannot be
reset to desired position. It only can be reset to zero (at the zone start).
Instead of resetting .write_pointer, how about to fix by allocating a new zone to
the curseg for 2.b) in same manner as 2.c)?

> > 
> > Agreed. To be more precise, if fsync data is recorded after .next_blkaddr, we
> > need to allocate a new zone to the curseg after fsync data recovery.
> > 
> >> 2.c) .write_pointer < .next_blkaddr, should be a bug of f2fs, we should have
> >> lost datas in the range of [.write_pointer, .next_blkaddr] at least, if we're
> >> sure about that there is no valid data existed out side of .write_pointer? can we?
> > 
> > As you note, this is a bug and we lost the data in the range of [.write_pointer,
> > .next_blkaddr]. We should leave the write pointer as is to keep the data in the
> > zone, and allocate a new zone to the curseg to avoid unaligned write errors.
> 
> Okay.
> 
> > 
> >>
> >> 3) f2fs_allocate_new_segments() may reset data type cursegs' position, I'd like
> >> to know what's the value of -z option we configured, if we configured as one
> >> zone contains one section, no matter kernel triggers recovery or not (via set
> >> disable_roll_forward option), the write pointer consistency should has been kept
> >> in all cases. Otherwise (one zone contains multiple sections), if kernel doesn't
> >> support .write_pointer recovery, f2fs_allocate_new_segments() can make
> >> .next_blkaddr larger than .write_pointer in the same zone.
> > 
> > "Single section per zone" is a key design of the f2fs zoned block device
> > support. For zoned block devices, mkfs.f2fs requires -m option. When -m
> > options is set, mkfs sets '1' to c.secs_per_zone regardless of the -z option
> > (refer f2fs_get_device_info() in lib/libf2fs.c). With this prerequisite,
> 
> Oh, correct, thanks for noticing that. I did miss that we only use -m option for
> zoned block device.
> 
> > I think new zones are allocated for DATA cursegs in
> > f2fs_allocate_new_segment() and its sub-function call path.
> > 
> >>
> >> Let me know, if I'm missing sth.
> > 
> > Assuming fsync data is the only one exceptional data which is read beyond the
> > cursegs' next_blkoff, this discussion looks comprehensive for me:
> > 
> >  - with and without kernel write pointer fix
> >  - with and without fsync data recovery
> >  - open/non-open zones (zones cursegs point to, zones cursegs do not point to)
> > 
> > All fs meta data in front of main segments are stored in conventional zones
> > which do not have write pointers (refer f2fs_prepare_super_block() in mkfs/
> > f2fs_format.c). Just we need to care about write pointer consistency for the
> > main data area.
> 
> Agreed,

--
Best Regards,
Shin'ichiro Kawasaki

_______________________________________________
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [f2fs-dev] [PATCH v4 2/2] fsck.f2fs: Check write pointer consistency with current segments
  2019-09-12  8:16                 ` Shinichiro Kawasaki
@ 2019-09-16  1:37                   ` Chao Yu
  2019-09-18  3:07                     ` Shinichiro Kawasaki
  0 siblings, 1 reply; 18+ messages in thread
From: Chao Yu @ 2019-09-16  1:37 UTC (permalink / raw)
  To: Shinichiro Kawasaki; +Cc: Jaegeuk Kim, Damien Le Moal, linux-f2fs-devel

On 2019/9/12 16:16, Shinichiro Kawasaki wrote:
> On Sep 10, 2019 / 17:12, Chao Yu wrote:
>> On 2019/9/10 16:10, Shinichiro Kawasaki wrote:
>>> On Sep 09, 2019 / 15:14, Chao Yu wrote:
>>>> On 2019/9/6 16:31, Shinichiro Kawasaki wrote:
>>>>> On Sep 05, 2019 / 17:58, Chao Yu wrote:
>>>>>> Hi Shinichiro,
>>>>>>
>>>>>> Sorry for the delay.
>>>>>>
>>>>>> On 2019/9/3 16:37, Shinichiro Kawasaki wrote:
>>>>>>> On Sep 02, 2019 / 15:02, Chao Yu wrote:
>>>>>>>> On 2019/8/30 18:19, Shin'ichiro Kawasaki wrote:
>>>>>>>>> On sudden f2fs shutdown, zoned block device status and f2fs current
>>>>>>>>> segment positions in meta data can be inconsistent. When f2fs shutdown
>>>>>>>>> happens before write operations completes, write pointers of zoned block
>>>>>>>>> devices can go further but f2fs meta data keeps current segments at
>>>>>>>>> positions before the write operations. After remounting the f2fs, the
>>>>>>>>> inconsistency causes write operations not at write pointers and
>>>>>>>>> "Unaligned write command" error is reported. This error was observed when
>>>>>>>>> xfstests test case generic/388 was run with f2fs on a zoned block device.
>>>>>>>>>
>>>>>>>>> To avoid the error, have f2fs.fsck check consistency between each current
>>>>>>>>> segment's position and the write pointer of the zone the current segment
>>>>>>>>> points to. If the write pointer goes advance from the current segment,
>>>>>>>>> fix the current segment position setting at same as the write pointer
>>>>>>>>> position. If the write pointer goes to the zone end, find a new zone and
>>>>>>>>> set the current segment position at the new zone start. In case the write
>>>>>>>>> pointer is behind the current segment, write zero data at the write
>>>>>>>>> pointer position to make write pointer position at same as the current
>>>>>>>>> segment.
>>>>>>>>>
>>>>>>>>> When inconsistencies are found, turn on c.bug_on flag in fsck_verify() to
>>>>>>>>> ask users to fix them or not. When inconsistencies get fixed, turn on
>>>>>>>>> 'force' flag in fsck_verify() to enforce fixes in following checks. This
>>>>>>>>> position fix is done at the beginning of do_fsck() function so that other
>>>>>>>>> checks reflect the current segment modification.
>>>>>>>>>
>>>>>>>>> Also add GET_SEC_FROM_SEG and GET_SEG_FROM_SEC macros in fsck/fsck.h to
>>>>>>>>> simplify the code.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
>>>>>>>>> ---
>>>>>>>>>  fsck/f2fs.h |   5 ++
>>>>>>>>>  fsck/fsck.c | 198 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>>  fsck/fsck.h |   3 +
>>>>>>>>>  fsck/main.c |   2 +
>>>>>>>>>  4 files changed, 208 insertions(+)
>>>>>>>>>
>>>>>>>>> diff --git a/fsck/f2fs.h b/fsck/f2fs.h
>>>>>>>>> index 4dc6698..2c1c2b3 100644
>>>>>>>>> --- a/fsck/f2fs.h
>>>>>>>>> +++ b/fsck/f2fs.h
>>>>>>>>> @@ -337,6 +337,11 @@ static inline block_t __end_block_addr(struct f2fs_sb_info *sbi)
>>>>>>>>>  #define GET_BLKOFF_FROM_SEG0(sbi, blk_addr)				\
>>>>>>>>>  	(GET_SEGOFF_FROM_SEG0(sbi, blk_addr) & (sbi->blocks_per_seg - 1))
>>>>>>>>>  
>>>>>>>>> +#define GET_SEC_FROM_SEG(sbi, segno)					\
>>>>>>>>> +	((segno) / (sbi)->segs_per_sec)
>>>>>>>>> +#define GET_SEG_FROM_SEC(sbi, secno)					\
>>>>>>>>> +	((secno) * (sbi)->segs_per_sec)
>>>>>>>>> +
>>>>>>>>>  #define FREE_I_START_SEGNO(sbi)						\
>>>>>>>>>  	GET_SEGNO_FROM_SEG0(sbi, SM_I(sbi)->main_blkaddr)
>>>>>>>>>  #define GET_R2L_SEGNO(sbi, segno)	(segno + FREE_I_START_SEGNO(sbi))
>>>>>>>>> diff --git a/fsck/fsck.c b/fsck/fsck.c
>>>>>>>>> index 8953ca1..a0f6849 100644
>>>>>>>>> --- a/fsck/fsck.c
>>>>>>>>> +++ b/fsck/fsck.c
>>>>>>>>> @@ -2574,6 +2574,190 @@ out:
>>>>>>>>>  	return cnt;
>>>>>>>>>  }
>>>>>>>>>  
>>>>>>>>> +/*
>>>>>>>>> + * Search a free section in main area. Start search from the section specified
>>>>>>>>> + * with segno argument toward main area end. Return first segment of the found
>>>>>>>>> + * section in segno argument.
>>>>>>>>> + */
>>>>>>>>> +static int find_next_free_section(struct f2fs_sb_info *sbi,
>>>>>>>>> +				  unsigned int *segno)
>>>>>>>>> +{
>>>>>>>>> +	unsigned int i, sec, section_valid_blocks;
>>>>>>>>> +	unsigned int end_segno = GET_SEGNO(sbi, SM_I(sbi)->main_blkaddr)
>>>>>>>>> +		+ SM_I(sbi)->main_segments;
>>>>>>>>> +	unsigned int end_sec = GET_SEC_FROM_SEG(sbi, end_segno);
>>>>>>>>> +	struct seg_entry *se;
>>>>>>>>> +	struct curseg_info *cs;
>>>>>>>>> +
>>>>>>>>> +	for (sec = GET_SEC_FROM_SEG(sbi, *segno); sec < end_sec; sec++) {
>>>>>>>>> +		/* find a section without valid blocks */
>>>>>>>>> +		section_valid_blocks = 0;
>>>>>>>>> +		for (i = 0; i < sbi->segs_per_sec; i++) {
>>>>>>>>> +			se = get_seg_entry(sbi, GET_SEG_FROM_SEC(sbi, sec) + i);
>>>>>>>>> +			section_valid_blocks += se->valid_blocks;
>>>>>>>>
>>>>>>>> If we want to find a 'real' free section, we'd better to use
>>>>>>>> se->ckpt_valid_blocks (wrapped with get_seg_vblocks()) in where we has recorded
>>>>>>>> fsynced data count.
>>>>>>>
>>>>>>> Thanks. When I create next patch series, I will use get_seg_vblocks().
>>>>>>> I will rebase to dev-test branch in which get_seg_vblocks() is available.
>>>>>>>
>>>>>>>>
>>>>>>>>> +		}
>>>>>>>>> +		if (section_valid_blocks)
>>>>>>>>> +			continue;
>>>>>>>>> +
>>>>>>>>> +		/* check the cursegs do not use the section */
>>>>>>>>> +		for (i = 0; i < NO_CHECK_TYPE; i++) {
>>>>>>>>> +			cs = &SM_I(sbi)->curseg_array[i];
>>>>>>>>> +			if (GET_SEC_FROM_SEG(sbi, cs->segno) == sec)
>>>>>>>>> +				break;
>>>>>>>>> +		}
>>>>>>>>> +		if (i >= NR_CURSEG_TYPE) {
>>>>>>>>> +			*segno = GET_SEG_FROM_SEC(sbi, sec);
>>>>>>>>> +			return 0;
>>>>>>>>> +		}
>>>>>>>>> +	}
>>>>>>>>> +
>>>>>>>>> +	return -1;
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> +struct write_pointer_check_data {
>>>>>>>>> +	struct f2fs_sb_info *sbi;
>>>>>>>>> +	struct device_info *dev;
>>>>>>>>> +};
>>>>>>>>> +
>>>>>>>>> +static int fsck_chk_write_pointer(int i, struct blk_zone *blkz, void *opaque)
>>>>>>>>> +{
>>>>>>>>> +	struct write_pointer_check_data *wpd = opaque;
>>>>>>>>> +	struct f2fs_sb_info *sbi = wpd->sbi;
>>>>>>>>> +	struct device_info *dev = wpd->dev;
>>>>>>>>> +	struct f2fs_fsck *fsck = F2FS_FSCK(sbi);
>>>>>>>>> +	block_t zone_block, wp_block, wp_blkoff, cs_block, b;
>>>>>>>>> +	unsigned int zone_segno, wp_segno, new_segno;
>>>>>>>>> +	struct seg_entry *se;
>>>>>>>>> +	struct curseg_info *cs;
>>>>>>>>> +	int cs_index, ret;
>>>>>>>>> +	int log_sectors_per_block = sbi->log_blocksize - SECTOR_SHIFT;
>>>>>>>>> +	unsigned int segs_per_zone = sbi->segs_per_sec * sbi->secs_per_zone;
>>>>>>>>> +	void *zero_blk;
>>>>>>>>> +
>>>>>>>>> +	if (blk_zone_conv(blkz))
>>>>>>>>> +		return 0;
>>>>>>>>> +
>>>>>>>>> +	zone_block = dev->start_blkaddr
>>>>>>>>> +		+ (blk_zone_sector(blkz) >> log_sectors_per_block);
>>>>>>>>> +	zone_segno = GET_SEGNO(sbi, zone_block);
>>>>>>>>> +	wp_block = dev->start_blkaddr
>>>>>>>>> +		+ (blk_zone_wp_sector(blkz) >> log_sectors_per_block);
>>>>>>>>> +	wp_segno = GET_SEGNO(sbi, wp_block);
>>>>>>>>> +	wp_blkoff = wp_block - START_BLOCK(sbi, wp_segno);
>>>>>>>>> +
>>>>>>>>> +	/* find the curseg which points to the zone */
>>>>>>>>> +	for (cs_index = 0; cs_index < NO_CHECK_TYPE; cs_index++) {
>>>>>>>>> +		cs = &SM_I(sbi)->curseg_array[cs_index];
>>>>>>>>> +		if (zone_segno <= cs->segno &&
>>>>>>>>> +		    cs->segno < zone_segno + segs_per_zone)
>>>>>>>>> +			break;
>>>>>>>>> +	}
>>>>>>>>> +
>>>>>>>>> +	if (cs_index >= NR_CURSEG_TYPE)
>>>>>>>>> +		return 0;
>>>>>>>>> +
>>>>>>>>> +	/* check write pointer consistency with the curseg in the zone */
>>>>>>>>> +	cs_block = START_BLOCK(sbi, cs->segno) + cs->next_blkoff;
>>>>>>>>> +	if (wp_block == cs_block)
>>>>>>>>> +		return 0;
>>>>>>>>> +
>>>>>>>>> +	if (!c.fix_on) {
>>>>>>>>> +		MSG(0, "Inconsistent write pointer: "
>>>>>>>>> +		    "curseg %d[0x%x,0x%x] wp[0x%x,0x%x]\n",
>>>>>>>>> +		    cs_index, cs->segno, cs->next_blkoff, wp_segno, wp_blkoff);
>>>>>>>>> +		fsck->chk.wp_inconsistent_zones++;
>>>>>>>>> +		return 0;
>>>>>>>>> +	}
>>>>>>>>> +
>>>>>>>>> +	/*
>>>>>>>>> +	 * If the curseg is in advance from the write pointer, write zero to
>>>>>>>>> +	 * move the write pointer forward to the same position as the curseg.
>>>>>>>>> +	 */
>>>>>>>>> +	if (wp_block < cs_block) {
>>>>>>>>> +		ret = 0;
>>>>>>>>> +		zero_blk = calloc(BLOCK_SZ, 1);
>>>>>>>>> +		if (!zero_blk)
>>>>>>>>> +			return -EINVAL;
>>>>>>>>> +
>>>>>>>>> +		FIX_MSG("Advance write pointer to match with curseg %d: "
>>>>>>>>> +			"[0x%x,0x%x]->[0x%x,0x%x]",
>>>>>>>>> +			cs_index, wp_segno, wp_blkoff,
>>>>>>>>> +			cs->segno, cs->next_blkoff);
>>>>>>>>> +		for (b = wp_block; b < cs_block && !ret; b++)
>>>>>>>>> +			ret = dev_write_block(zero_blk, b);
>>>>>>>>> +
>>>>>>>>> +		fsck->chk.wp_fixed_zones++;
>>>>>>>>> +		free(zero_blk);
>>>>>>>>> +		return ret;
>>>>>>>>> +	}
>>>>>>>>> +
>>>>>>>>> +	if (wp_segno == zone_segno + segs_per_zone) {
>>>>>>>>> +		/*
>>>>>>>>> +		 * If the write pointer is in advance from the curseg and at
>>>>>>>>> +		 * the zone end (section end), search a new free zone (section)
>>>>>>>>> +		 * between the curseg and main area end.
>>>>>>>>> +		 */
>>>>>>>>> +		new_segno = wp_segno;
>>>>>>>>> +		ret = find_next_free_section(sbi, &new_segno);
>>>>>>>>> +		if (ret) {
>>>>>>>>> +			/* search again from main area start */
>>>>>>>>> +			new_segno = GET_SEGNO(sbi, SM_I(sbi)->main_blkaddr);
>>>>>>>>> +			ret = find_next_free_section(sbi, &new_segno);
>>>>>>>>> +		}
>>>>>>>>
>>>>>>>> If curseg's type is warm node, relocating curseg would ruin warm node chain,
>>>>>>>> result in losing fsynced data for ever as well.
>>>>>>>>
>>>>>>>> So I guess we should migrate all dnode blocks chained with cs->next_blkoff in
>>>>>>>> current section.
>>>>>>>>
>>>>>>>>> +		if (ret) {
>>>>>>>>> +			MSG(0, "Free section not found\n");
>>>>>>>>> +			return ret;
>>>>>>>>> +		}
>>>>>>>>> +		FIX_MSG("New section for curseg %d: [0x%x,0x%x]->[0x%x,0x%x]",
>>>>>>>>> +			cs_index, cs->segno, cs->next_blkoff, new_segno, 0);
>>>>>>>>> +		cs->segno = new_segno;
>>>>>>>>> +		cs->next_blkoff = 0;
>>>>>>>>> +	} else {
>>>>>>>>> +		/*
>>>>>>>>> +		 * If the write pointer is in advance from the curseg within
>>>>>>>>> +		 * the zone, modify the curseg position to be same as the
>>>>>>>>> +		 * write pointer.
>>>>>>>>> +		 */
>>>>>>>>> +		ASSERT(wp_segno < zone_segno + segs_per_zone);
>>>>>>>>> +		FIX_MSG("Advance curseg %d: [0x%x,0x%x]->[0x%x,0x%x]",
>>>>>>>>> +			cs_index, cs->segno, cs->next_blkoff,
>>>>>>>>> +			wp_segno, wp_blkoff);
>>>>>>>>> +		cs->segno = wp_segno;
>>>>>>>>> +		cs->next_blkoff = wp_blkoff;
>>>>>>>>
>>>>>>>> The same data lose problem here, I guess we'd better handle it with the same way
>>>>>>>> as above.
>>>>>>>>
>>>>>>>> Thoughts?
>>>>>>>
>>>>>>> I created f2fs status with fsync data and warm node chain, and confirmed the v4
>>>>>>> implementation ruins the dnode blocks chain. Hmm.
>>>>>>>
>>>>>>> My understanding is that when f2fs kernel module recovers the fsync data, it
>>>>>>> sets the warm node curseg position at the start of the fsync data, and the fsync
>>>>>>> data will be overwritten as new nodes created. Is this understanding correct?
>>>>>>
>>>>>> Sorry, I'm not sure I've understood you correctly.
>>>>>
>>>>> Apology is mine, my question was not clear enough.
>>>>> And thanks for the explanation below. It helps me to understand better.
>>>>>
>>>>>> Current recovery flow is:
>>>>>> 1)   find all valid fsynced dnode in warm node chain
>>>>>> 2.a) recover fsynced dnode in memory, and set it dirty
>>>>>> 2.b) recover directory entry in memory, and set it dirty
>>>>>> 2.c) during regular's dnode recovery, physical blocks indexed by recovered dnode
>>>>>> will be revalidated
>>>>>> Note: we didn't move any cursegs before 3)
>>>>>> 3)   relocate all cursegs to new segments
>>>>>> 4)   trigger checkpoint to persist all recovered data(fs' meta, file's meta/data)
>>>>>
>>>>> Question, does the step 3) correspond to f2fs_allocate_new_segments(sbi) call
>>>>> at the end of recover_data()? The f2fs_allocate_new_segments() function
>>>>
>>>> Yeah, I meant that function.
>>>>
>>>>> relocates new segments only for DATA cursegs, and it keeps NODE cursegs same as
>>>>> before the fsync data recovery. Then I thought WARM NODE curseg would not change
>>>>> by recovery (and still keeps pointing to the fsync recovery data).
>>>>
>>>> Yes, that's correct. WARM NODE curseg won't change until step 4).
>>>
>>> Thanks. Following your idea "we can simply adjust to reset all invalid zone and
>>> allocate new zone for curseg before data/meta writes" for fix by kernel, I think
>>> kernel code change is required to allocate new zones for NODE cursegs also at
>>> step 3). WARM NODE curseg should be kept untouched by step 2 completion to refer
>>> fsynced dnodes at WARM NODE curseg's next_blkaddr. And at step 4, the fsyced
>>> dnodes recovered and set dirty will be written back with one of NODE cursegs
>>> (HOT NODE curseg?). At that time, we need to make sure the NODE curseg points to
>>
>> Directory's dnode goes to hot node area, other file's dnode goes to warm node
>> area, the left node goes to cold node area.
>>
>>> the position consistent with its zone's write pointer.
>>
>> Yes, before step 4), we should keep f2fs and zoned block device's write pointer
>> being consistent.
> 
> Ok, thanks.
> 
>>
>>>
>>>>>
>>>>>>>
>>>>>>> If this is the case, I think write pointer inconsistency will happen even if
>>>>>>> fsck does "migrate all dnode blocks" (I assume that the warm node curseg's next
>>>>>>
>>>>>> Actually, I notice that scheme's problem is: we've changed zone's write pointer
>>>>>> during dnode blocks migration, however if kernel drops recovery, we will still
>>>>>> face inconsistent status in between .next_blkaddr and .write_pointer, once we
>>>>>> start to write data from .next_blkaddr. So in fsck, after migration, we need to
>>>>>> reset .write_pointer to .next_blkaddr.
>>>>>>
>>>>>> I guess we need to consider four cases:
>>>>>>
>>>>>> o: support .write_pointer recovery
>>>>>> x: not support .write_pointer recovery
>>>>>>
>>>>>> 1) kernel: o, fsck: x, trigger recovery in kernel
>>>>>> 2) kernel: o, fsck: x, not trigger recovery in kernel
>>>>>> 3) kernel: x, fsck: o, trigger recovery in kernel
>>>>>> 4) kernel: x, fsck: o, not trigger recovery in kernel
>>>>>>
>>>>>> For 1) and 2), we can simply adjust to reset all invalid zone and allocate new
>>>>>> zone for curseg before data/meta writes.
>>>>>
>>>>> Thanks for the clarification. This approach for case 1) and 2) is simple. Let me
>>>>> try to create a patch for it.
>>>>>
>>>>>>
>>>>>> For 3) and 4), I haven't figured out a way handling all cases perfectly.
>>>>>
>>>>> For 3), I suppose fsck cannot fix write pointer inconsistency without fsync data
>>>>> loss, since recovery is judged and done by kernel. The write pointer consistency
>>>>> fix after recovery can be done only by kernel.
>>>>>
>>>>> It is not a good one but one idea is to confirm fsck users to enforce fsync data
>>>>> loss or not. This could be better than nothing.
>>>>>
>>>>> For 4), my understanding is that fsync data is not left in the file system. I
>>>>> think fsck can check fsync data existence and fix write pointer consistency, as
>>>>> my patch series shows.
>>>>
>>>> Yeah.
>>>>
>>>> Let's think about that whether there is a way to cover all cases.
>>>>
>>>> 1) For non-opened zones, we need to adjust all such zones' write pointer to
>>>> zero. I assume that once write pointer is zero, we still can access valid block
>>>> in zone. (recovery may need to revalidate blocks in free segments)
>>>
>>> When write pointer is reset to zero, all data in the zone gets lost. When we
>>> read data in the zone beyond the write pointer, just zero data is read. When any
>>> valid block or fsync data is left in a non-opened zone, I think the zone's write
>>> pointer should be left as is. Otherwise, if the zone do not have valid block and
>>> fsync data, the zone should be reset to avoid unaligned write error.
>>
>> Okay, if data beyond write pointer is invalid, we should keep write pointer as
>> it is if there are fsynced data in that zone.
>>
>>>
>>> One additional check I can think of is to check the last valid block in the zone
>>> and write pointer position of the zone. If .write_pointer >= .last_valid_block,
>>> , it is ok. If .write_pointer < .last_valid_block, this is a bug of f2fs. In
>>
>> Sounds reasonable, how can we find last valid block, as you said, content of
>> block beyond write pointer is all zero... or you mean curseg's next_blkaddr?
>> like the condition 2.c) described?
> 
> I think we can get each zone's last valid block referring each segment's valid
> block bitmap in SIT. In other words, this is a consistency check between write
> pointer and SIT. Is this feasible approach?

Good point.

I guess
- we should do such sanity check with a image which has consistent metadata (SIT
should not be broken)
- need to consider fsynced block in SIT

> 
>>
>>> this case, the data in the valid blocks beyond write pointer is just lost, and
>>> there is no way to recover this. I think this zone will not be selected for
>>> cursegs for further data write in the zone until the zone get discarded. No
>>> need to fix write pointer position to avoid unaligned write errors. I wonder
>>
>> Yes,
>>
>>> if fsck or kernel should detect and report this case, because users still use
>>> the f2fs partition without any action. May I ask your opinion?
>>
>> If we can detect that, I think it should be reported.
> 
> I see, thanks for the comment.
> 
>>
>>>
>>>>
>>>> 2) For opened zones (cursegs point to)
>>>> 2.a) .write_pointer == .next_blkaddr, no need to handle
>>>> 2.b) .write_pointer > .next_blkaddr, that should be the common state in sudden
>>>> power-off case. It needs to reset .write_pointer to .next_blkaddr.
> 
> Sorry but let me amend. I have just noticed that the fix above is not possible.
> We cannot set .write_pointer at .next_blkaddr, because write pointers cannot be

Alright..

> reset to desired position. It only can be reset to zero (at the zone start).
> Instead of resetting .write_pointer, how about to fix by allocating a new zone to
> the curseg for 2.b) in same manner as 2.c)?

Yeah, it's okay to me.

Thanks,

> 
>>>
>>> Agreed. To be more precise, if fsync data is recorded after .next_blkaddr, we
>>> need to allocate a new zone to the curseg after fsync data recovery.
>>>
>>>> 2.c) .write_pointer < .next_blkaddr, should be a bug of f2fs, we should have
>>>> lost datas in the range of [.write_pointer, .next_blkaddr] at least, if we're
>>>> sure about that there is no valid data existed out side of .write_pointer? can we?
>>>
>>> As you note, this is a bug and we lost the data in the range of [.write_pointer,
>>> .next_blkaddr]. We should leave the write pointer as is to keep the data in the
>>> zone, and allocate a new zone to the curseg to avoid unaligned write errors.
>>
>> Okay.
>>
>>>
>>>>
>>>> 3) f2fs_allocate_new_segments() may reset data type cursegs' position, I'd like
>>>> to know what's the value of -z option we configured, if we configured as one
>>>> zone contains one section, no matter kernel triggers recovery or not (via set
>>>> disable_roll_forward option), the write pointer consistency should has been kept
>>>> in all cases. Otherwise (one zone contains multiple sections), if kernel doesn't
>>>> support .write_pointer recovery, f2fs_allocate_new_segments() can make
>>>> .next_blkaddr larger than .write_pointer in the same zone.
>>>
>>> "Single section per zone" is a key design of the f2fs zoned block device
>>> support. For zoned block devices, mkfs.f2fs requires -m option. When -m
>>> options is set, mkfs sets '1' to c.secs_per_zone regardless of the -z option
>>> (refer f2fs_get_device_info() in lib/libf2fs.c). With this prerequisite,
>>
>> Oh, correct, thanks for noticing that. I did miss that we only use -m option for
>> zoned block device.
>>
>>> I think new zones are allocated for DATA cursegs in
>>> f2fs_allocate_new_segment() and its sub-function call path.
>>>
>>>>
>>>> Let me know, if I'm missing sth.
>>>
>>> Assuming fsync data is the only one exceptional data which is read beyond the
>>> cursegs' next_blkoff, this discussion looks comprehensive for me:
>>>
>>>  - with and without kernel write pointer fix
>>>  - with and without fsync data recovery
>>>  - open/non-open zones (zones cursegs point to, zones cursegs do not point to)
>>>
>>> All fs meta data in front of main segments are stored in conventional zones
>>> which do not have write pointers (refer f2fs_prepare_super_block() in mkfs/
>>> f2fs_format.c). Just we need to care about write pointer consistency for the
>>> main data area.
>>
>> Agreed,
> 
> --
> Best Regards,
> Shin'ichiro Kawasaki.
> 


_______________________________________________
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [f2fs-dev] [PATCH v4 2/2] fsck.f2fs: Check write pointer consistency with current segments
  2019-09-16  1:37                   ` Chao Yu
@ 2019-09-18  3:07                     ` Shinichiro Kawasaki
  2019-09-21  9:42                       ` Chao Yu
  0 siblings, 1 reply; 18+ messages in thread
From: Shinichiro Kawasaki @ 2019-09-18  3:07 UTC (permalink / raw)
  To: Chao Yu; +Cc: Jaegeuk Kim, Damien Le Moal, linux-f2fs-devel

On Sep 16, 2019 / 09:37, Chao Yu wrote:
> On 2019/9/12 16:16, Shinichiro Kawasaki wrote:
> > On Sep 10, 2019 / 17:12, Chao Yu wrote:
> >> On 2019/9/10 16:10, Shinichiro Kawasaki wrote:
> >>> On Sep 09, 2019 / 15:14, Chao Yu wrote:
> >>>> On 2019/9/6 16:31, Shinichiro Kawasaki wrote:
> >>>>> On Sep 05, 2019 / 17:58, Chao Yu wrote:
> >>>>>> Hi Shinichiro,
> >>>>>>
> >>>>>> Sorry for the delay.
> >>>>>>
> >>>>>> On 2019/9/3 16:37, Shinichiro Kawasaki wrote:
> >>>>>>> On Sep 02, 2019 / 15:02, Chao Yu wrote:
> >>>>>>>> On 2019/8/30 18:19, Shin'ichiro Kawasaki wrote:
> >>>>>>>>> On sudden f2fs shutdown, zoned block device status and f2fs current
> >>>>>>>>> segment positions in meta data can be inconsistent. When f2fs shutdown
> >>>>>>>>> happens before write operations completes, write pointers of zoned block
> >>>>>>>>> devices can go further but f2fs meta data keeps current segments at
> >>>>>>>>> positions before the write operations. After remounting the f2fs, the
> >>>>>>>>> inconsistency causes write operations not at write pointers and
> >>>>>>>>> "Unaligned write command" error is reported. This error was observed when
> >>>>>>>>> xfstests test case generic/388 was run with f2fs on a zoned block device.
> >>>>>>>>>
> >>>>>>>>> To avoid the error, have f2fs.fsck check consistency between each current
> >>>>>>>>> segment's position and the write pointer of the zone the current segment
> >>>>>>>>> points to. If the write pointer goes advance from the current segment,
> >>>>>>>>> fix the current segment position setting at same as the write pointer
> >>>>>>>>> position. If the write pointer goes to the zone end, find a new zone and
> >>>>>>>>> set the current segment position at the new zone start. In case the write
> >>>>>>>>> pointer is behind the current segment, write zero data at the write
> >>>>>>>>> pointer position to make write pointer position at same as the current
> >>>>>>>>> segment.
> >>>>>>>>>
> >>>>>>>>> When inconsistencies are found, turn on c.bug_on flag in fsck_verify() to
> >>>>>>>>> ask users to fix them or not. When inconsistencies get fixed, turn on
> >>>>>>>>> 'force' flag in fsck_verify() to enforce fixes in following checks. This
> >>>>>>>>> position fix is done at the beginning of do_fsck() function so that other
> >>>>>>>>> checks reflect the current segment modification.
> >>>>>>>>>
> >>>>>>>>> Also add GET_SEC_FROM_SEG and GET_SEG_FROM_SEC macros in fsck/fsck.h to
> >>>>>>>>> simplify the code.
> >>>>>>>>>
> >>>>>>>>> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> >>>>>>>>> ---
> >>>>>>>>>  fsck/f2fs.h |   5 ++
> >>>>>>>>>  fsck/fsck.c | 198 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>>>>>>>  fsck/fsck.h |   3 +
> >>>>>>>>>  fsck/main.c |   2 +
> >>>>>>>>>  4 files changed, 208 insertions(+)
> >>>>>>>>>
> >>>>>>>>> diff --git a/fsck/f2fs.h b/fsck/f2fs.h
> >>>>>>>>> index 4dc6698..2c1c2b3 100644
> >>>>>>>>> --- a/fsck/f2fs.h
> >>>>>>>>> +++ b/fsck/f2fs.h
> >>>>>>>>> @@ -337,6 +337,11 @@ static inline block_t __end_block_addr(struct f2fs_sb_info *sbi)
> >>>>>>>>>  #define GET_BLKOFF_FROM_SEG0(sbi, blk_addr)				\
> >>>>>>>>>  	(GET_SEGOFF_FROM_SEG0(sbi, blk_addr) & (sbi->blocks_per_seg - 1))
> >>>>>>>>>  
> >>>>>>>>> +#define GET_SEC_FROM_SEG(sbi, segno)					\
> >>>>>>>>> +	((segno) / (sbi)->segs_per_sec)
> >>>>>>>>> +#define GET_SEG_FROM_SEC(sbi, secno)					\
> >>>>>>>>> +	((secno) * (sbi)->segs_per_sec)
> >>>>>>>>> +
> >>>>>>>>>  #define FREE_I_START_SEGNO(sbi)						\
> >>>>>>>>>  	GET_SEGNO_FROM_SEG0(sbi, SM_I(sbi)->main_blkaddr)
> >>>>>>>>>  #define GET_R2L_SEGNO(sbi, segno)	(segno + FREE_I_START_SEGNO(sbi))
> >>>>>>>>> diff --git a/fsck/fsck.c b/fsck/fsck.c
> >>>>>>>>> index 8953ca1..a0f6849 100644
> >>>>>>>>> --- a/fsck/fsck.c
> >>>>>>>>> +++ b/fsck/fsck.c
> >>>>>>>>> @@ -2574,6 +2574,190 @@ out:
> >>>>>>>>>  	return cnt;
> >>>>>>>>>  }
> >>>>>>>>>  
> >>>>>>>>> +/*
> >>>>>>>>> + * Search a free section in main area. Start search from the section specified
> >>>>>>>>> + * with segno argument toward main area end. Return first segment of the found
> >>>>>>>>> + * section in segno argument.
> >>>>>>>>> + */
> >>>>>>>>> +static int find_next_free_section(struct f2fs_sb_info *sbi,
> >>>>>>>>> +				  unsigned int *segno)
> >>>>>>>>> +{
> >>>>>>>>> +	unsigned int i, sec, section_valid_blocks;
> >>>>>>>>> +	unsigned int end_segno = GET_SEGNO(sbi, SM_I(sbi)->main_blkaddr)
> >>>>>>>>> +		+ SM_I(sbi)->main_segments;
> >>>>>>>>> +	unsigned int end_sec = GET_SEC_FROM_SEG(sbi, end_segno);
> >>>>>>>>> +	struct seg_entry *se;
> >>>>>>>>> +	struct curseg_info *cs;
> >>>>>>>>> +
> >>>>>>>>> +	for (sec = GET_SEC_FROM_SEG(sbi, *segno); sec < end_sec; sec++) {
> >>>>>>>>> +		/* find a section without valid blocks */
> >>>>>>>>> +		section_valid_blocks = 0;
> >>>>>>>>> +		for (i = 0; i < sbi->segs_per_sec; i++) {
> >>>>>>>>> +			se = get_seg_entry(sbi, GET_SEG_FROM_SEC(sbi, sec) + i);
> >>>>>>>>> +			section_valid_blocks += se->valid_blocks;
> >>>>>>>>
> >>>>>>>> If we want to find a 'real' free section, we'd better to use
> >>>>>>>> se->ckpt_valid_blocks (wrapped with get_seg_vblocks()) in where we has recorded
> >>>>>>>> fsynced data count.
> >>>>>>>
> >>>>>>> Thanks. When I create next patch series, I will use get_seg_vblocks().
> >>>>>>> I will rebase to dev-test branch in which get_seg_vblocks() is available.
> >>>>>>>
> >>>>>>>>
> >>>>>>>>> +		}
> >>>>>>>>> +		if (section_valid_blocks)
> >>>>>>>>> +			continue;
> >>>>>>>>> +
> >>>>>>>>> +		/* check the cursegs do not use the section */
> >>>>>>>>> +		for (i = 0; i < NO_CHECK_TYPE; i++) {
> >>>>>>>>> +			cs = &SM_I(sbi)->curseg_array[i];
> >>>>>>>>> +			if (GET_SEC_FROM_SEG(sbi, cs->segno) == sec)
> >>>>>>>>> +				break;
> >>>>>>>>> +		}
> >>>>>>>>> +		if (i >= NR_CURSEG_TYPE) {
> >>>>>>>>> +			*segno = GET_SEG_FROM_SEC(sbi, sec);
> >>>>>>>>> +			return 0;
> >>>>>>>>> +		}
> >>>>>>>>> +	}
> >>>>>>>>> +
> >>>>>>>>> +	return -1;
> >>>>>>>>> +}
> >>>>>>>>> +
> >>>>>>>>> +struct write_pointer_check_data {
> >>>>>>>>> +	struct f2fs_sb_info *sbi;
> >>>>>>>>> +	struct device_info *dev;
> >>>>>>>>> +};
> >>>>>>>>> +
> >>>>>>>>> +static int fsck_chk_write_pointer(int i, struct blk_zone *blkz, void *opaque)
> >>>>>>>>> +{
> >>>>>>>>> +	struct write_pointer_check_data *wpd = opaque;
> >>>>>>>>> +	struct f2fs_sb_info *sbi = wpd->sbi;
> >>>>>>>>> +	struct device_info *dev = wpd->dev;
> >>>>>>>>> +	struct f2fs_fsck *fsck = F2FS_FSCK(sbi);
> >>>>>>>>> +	block_t zone_block, wp_block, wp_blkoff, cs_block, b;
> >>>>>>>>> +	unsigned int zone_segno, wp_segno, new_segno;
> >>>>>>>>> +	struct seg_entry *se;
> >>>>>>>>> +	struct curseg_info *cs;
> >>>>>>>>> +	int cs_index, ret;
> >>>>>>>>> +	int log_sectors_per_block = sbi->log_blocksize - SECTOR_SHIFT;
> >>>>>>>>> +	unsigned int segs_per_zone = sbi->segs_per_sec * sbi->secs_per_zone;
> >>>>>>>>> +	void *zero_blk;
> >>>>>>>>> +
> >>>>>>>>> +	if (blk_zone_conv(blkz))
> >>>>>>>>> +		return 0;
> >>>>>>>>> +
> >>>>>>>>> +	zone_block = dev->start_blkaddr
> >>>>>>>>> +		+ (blk_zone_sector(blkz) >> log_sectors_per_block);
> >>>>>>>>> +	zone_segno = GET_SEGNO(sbi, zone_block);
> >>>>>>>>> +	wp_block = dev->start_blkaddr
> >>>>>>>>> +		+ (blk_zone_wp_sector(blkz) >> log_sectors_per_block);
> >>>>>>>>> +	wp_segno = GET_SEGNO(sbi, wp_block);
> >>>>>>>>> +	wp_blkoff = wp_block - START_BLOCK(sbi, wp_segno);
> >>>>>>>>> +
> >>>>>>>>> +	/* find the curseg which points to the zone */
> >>>>>>>>> +	for (cs_index = 0; cs_index < NO_CHECK_TYPE; cs_index++) {
> >>>>>>>>> +		cs = &SM_I(sbi)->curseg_array[cs_index];
> >>>>>>>>> +		if (zone_segno <= cs->segno &&
> >>>>>>>>> +		    cs->segno < zone_segno + segs_per_zone)
> >>>>>>>>> +			break;
> >>>>>>>>> +	}
> >>>>>>>>> +
> >>>>>>>>> +	if (cs_index >= NR_CURSEG_TYPE)
> >>>>>>>>> +		return 0;
> >>>>>>>>> +
> >>>>>>>>> +	/* check write pointer consistency with the curseg in the zone */
> >>>>>>>>> +	cs_block = START_BLOCK(sbi, cs->segno) + cs->next_blkoff;
> >>>>>>>>> +	if (wp_block == cs_block)
> >>>>>>>>> +		return 0;
> >>>>>>>>> +
> >>>>>>>>> +	if (!c.fix_on) {
> >>>>>>>>> +		MSG(0, "Inconsistent write pointer: "
> >>>>>>>>> +		    "curseg %d[0x%x,0x%x] wp[0x%x,0x%x]\n",
> >>>>>>>>> +		    cs_index, cs->segno, cs->next_blkoff, wp_segno, wp_blkoff);
> >>>>>>>>> +		fsck->chk.wp_inconsistent_zones++;
> >>>>>>>>> +		return 0;
> >>>>>>>>> +	}
> >>>>>>>>> +
> >>>>>>>>> +	/*
> >>>>>>>>> +	 * If the curseg is in advance from the write pointer, write zero to
> >>>>>>>>> +	 * move the write pointer forward to the same position as the curseg.
> >>>>>>>>> +	 */
> >>>>>>>>> +	if (wp_block < cs_block) {
> >>>>>>>>> +		ret = 0;
> >>>>>>>>> +		zero_blk = calloc(BLOCK_SZ, 1);
> >>>>>>>>> +		if (!zero_blk)
> >>>>>>>>> +			return -EINVAL;
> >>>>>>>>> +
> >>>>>>>>> +		FIX_MSG("Advance write pointer to match with curseg %d: "
> >>>>>>>>> +			"[0x%x,0x%x]->[0x%x,0x%x]",
> >>>>>>>>> +			cs_index, wp_segno, wp_blkoff,
> >>>>>>>>> +			cs->segno, cs->next_blkoff);
> >>>>>>>>> +		for (b = wp_block; b < cs_block && !ret; b++)
> >>>>>>>>> +			ret = dev_write_block(zero_blk, b);
> >>>>>>>>> +
> >>>>>>>>> +		fsck->chk.wp_fixed_zones++;
> >>>>>>>>> +		free(zero_blk);
> >>>>>>>>> +		return ret;
> >>>>>>>>> +	}
> >>>>>>>>> +
> >>>>>>>>> +	if (wp_segno == zone_segno + segs_per_zone) {
> >>>>>>>>> +		/*
> >>>>>>>>> +		 * If the write pointer is in advance from the curseg and at
> >>>>>>>>> +		 * the zone end (section end), search a new free zone (section)
> >>>>>>>>> +		 * between the curseg and main area end.
> >>>>>>>>> +		 */
> >>>>>>>>> +		new_segno = wp_segno;
> >>>>>>>>> +		ret = find_next_free_section(sbi, &new_segno);
> >>>>>>>>> +		if (ret) {
> >>>>>>>>> +			/* search again from main area start */
> >>>>>>>>> +			new_segno = GET_SEGNO(sbi, SM_I(sbi)->main_blkaddr);
> >>>>>>>>> +			ret = find_next_free_section(sbi, &new_segno);
> >>>>>>>>> +		}
> >>>>>>>>
> >>>>>>>> If curseg's type is warm node, relocating curseg would ruin warm node chain,
> >>>>>>>> result in losing fsynced data for ever as well.
> >>>>>>>>
> >>>>>>>> So I guess we should migrate all dnode blocks chained with cs->next_blkoff in
> >>>>>>>> current section.
> >>>>>>>>
> >>>>>>>>> +		if (ret) {
> >>>>>>>>> +			MSG(0, "Free section not found\n");
> >>>>>>>>> +			return ret;
> >>>>>>>>> +		}
> >>>>>>>>> +		FIX_MSG("New section for curseg %d: [0x%x,0x%x]->[0x%x,0x%x]",
> >>>>>>>>> +			cs_index, cs->segno, cs->next_blkoff, new_segno, 0);
> >>>>>>>>> +		cs->segno = new_segno;
> >>>>>>>>> +		cs->next_blkoff = 0;
> >>>>>>>>> +	} else {
> >>>>>>>>> +		/*
> >>>>>>>>> +		 * If the write pointer is in advance from the curseg within
> >>>>>>>>> +		 * the zone, modify the curseg position to be same as the
> >>>>>>>>> +		 * write pointer.
> >>>>>>>>> +		 */
> >>>>>>>>> +		ASSERT(wp_segno < zone_segno + segs_per_zone);
> >>>>>>>>> +		FIX_MSG("Advance curseg %d: [0x%x,0x%x]->[0x%x,0x%x]",
> >>>>>>>>> +			cs_index, cs->segno, cs->next_blkoff,
> >>>>>>>>> +			wp_segno, wp_blkoff);
> >>>>>>>>> +		cs->segno = wp_segno;
> >>>>>>>>> +		cs->next_blkoff = wp_blkoff;
> >>>>>>>>
> >>>>>>>> The same data lose problem here, I guess we'd better handle it with the same way
> >>>>>>>> as above.
> >>>>>>>>
> >>>>>>>> Thoughts?
> >>>>>>>
> >>>>>>> I created f2fs status with fsync data and warm node chain, and confirmed the v4
> >>>>>>> implementation ruins the dnode blocks chain. Hmm.
> >>>>>>>
> >>>>>>> My understanding is that when f2fs kernel module recovers the fsync data, it
> >>>>>>> sets the warm node curseg position at the start of the fsync data, and the fsync
> >>>>>>> data will be overwritten as new nodes created. Is this understanding correct?
> >>>>>>
> >>>>>> Sorry, I'm not sure I've understood you correctly.
> >>>>>
> >>>>> Apology is mine, my question was not clear enough.
> >>>>> And thanks for the explanation below. It helps me to understand better.
> >>>>>
> >>>>>> Current recovery flow is:
> >>>>>> 1)   find all valid fsynced dnode in warm node chain
> >>>>>> 2.a) recover fsynced dnode in memory, and set it dirty
> >>>>>> 2.b) recover directory entry in memory, and set it dirty
> >>>>>> 2.c) during regular's dnode recovery, physical blocks indexed by recovered dnode
> >>>>>> will be revalidated
> >>>>>> Note: we didn't move any cursegs before 3)
> >>>>>> 3)   relocate all cursegs to new segments
> >>>>>> 4)   trigger checkpoint to persist all recovered data(fs' meta, file's meta/data)
> >>>>>
> >>>>> Question, does the step 3) correspond to f2fs_allocate_new_segments(sbi) call
> >>>>> at the end of recover_data()? The f2fs_allocate_new_segments() function
> >>>>
> >>>> Yeah, I meant that function.
> >>>>
> >>>>> relocates new segments only for DATA cursegs, and it keeps NODE cursegs same as
> >>>>> before the fsync data recovery. Then I thought WARM NODE curseg would not change
> >>>>> by recovery (and still keeps pointing to the fsync recovery data).
> >>>>
> >>>> Yes, that's correct. WARM NODE curseg won't change until step 4).
> >>>
> >>> Thanks. Following your idea "we can simply adjust to reset all invalid zone and
> >>> allocate new zone for curseg before data/meta writes" for fix by kernel, I think
> >>> kernel code change is required to allocate new zones for NODE cursegs also at
> >>> step 3). WARM NODE curseg should be kept untouched by step 2 completion to refer
> >>> fsynced dnodes at WARM NODE curseg's next_blkaddr. And at step 4, the fsyced
> >>> dnodes recovered and set dirty will be written back with one of NODE cursegs
> >>> (HOT NODE curseg?). At that time, we need to make sure the NODE curseg points to
> >>
> >> Directory's dnode goes to hot node area, other file's dnode goes to warm node
> >> area, the left node goes to cold node area.
> >>
> >>> the position consistent with its zone's write pointer.
> >>
> >> Yes, before step 4), we should keep f2fs and zoned block device's write pointer
> >> being consistent.
> > 
> > Ok, thanks.
> > 
> >>
> >>>
> >>>>>
> >>>>>>>
> >>>>>>> If this is the case, I think write pointer inconsistency will happen even if
> >>>>>>> fsck does "migrate all dnode blocks" (I assume that the warm node curseg's next
> >>>>>>
> >>>>>> Actually, I notice that scheme's problem is: we've changed zone's write pointer
> >>>>>> during dnode blocks migration, however if kernel drops recovery, we will still
> >>>>>> face inconsistent status in between .next_blkaddr and .write_pointer, once we
> >>>>>> start to write data from .next_blkaddr. So in fsck, after migration, we need to
> >>>>>> reset .write_pointer to .next_blkaddr.
> >>>>>>
> >>>>>> I guess we need to consider four cases:
> >>>>>>
> >>>>>> o: support .write_pointer recovery
> >>>>>> x: not support .write_pointer recovery
> >>>>>>
> >>>>>> 1) kernel: o, fsck: x, trigger recovery in kernel
> >>>>>> 2) kernel: o, fsck: x, not trigger recovery in kernel
> >>>>>> 3) kernel: x, fsck: o, trigger recovery in kernel
> >>>>>> 4) kernel: x, fsck: o, not trigger recovery in kernel
> >>>>>>
> >>>>>> For 1) and 2), we can simply adjust to reset all invalid zone and allocate new
> >>>>>> zone for curseg before data/meta writes.
> >>>>>
> >>>>> Thanks for the clarification. This approach for case 1) and 2) is simple. Let me
> >>>>> try to create a patch for it.
> >>>>>
> >>>>>>
> >>>>>> For 3) and 4), I haven't figured out a way handling all cases perfectly.
> >>>>>
> >>>>> For 3), I suppose fsck cannot fix write pointer inconsistency without fsync data
> >>>>> loss, since recovery is judged and done by kernel. The write pointer consistency
> >>>>> fix after recovery can be done only by kernel.
> >>>>>
> >>>>> It is not a good one but one idea is to confirm fsck users to enforce fsync data
> >>>>> loss or not. This could be better than nothing.
> >>>>>
> >>>>> For 4), my understanding is that fsync data is not left in the file system. I
> >>>>> think fsck can check fsync data existence and fix write pointer consistency, as
> >>>>> my patch series shows.
> >>>>
> >>>> Yeah.
> >>>>
> >>>> Let's think about that whether there is a way to cover all cases.
> >>>>
> >>>> 1) For non-opened zones, we need to adjust all such zones' write pointer to
> >>>> zero. I assume that once write pointer is zero, we still can access valid block
> >>>> in zone. (recovery may need to revalidate blocks in free segments)
> >>>
> >>> When write pointer is reset to zero, all data in the zone gets lost. When we
> >>> read data in the zone beyond the write pointer, just zero data is read. When any
> >>> valid block or fsync data is left in a non-opened zone, I think the zone's write
> >>> pointer should be left as is. Otherwise, if the zone do not have valid block and
> >>> fsync data, the zone should be reset to avoid unaligned write error.
> >>
> >> Okay, if data beyond write pointer is invalid, we should keep write pointer as
> >> it is if there are fsynced data in that zone.
> >>
> >>>
> >>> One additional check I can think of is to check the last valid block in the zone
> >>> and write pointer position of the zone. If .write_pointer >= .last_valid_block,
> >>> , it is ok. If .write_pointer < .last_valid_block, this is a bug of f2fs. In
> >>
> >> Sounds reasonable, how can we find last valid block, as you said, content of
> >> block beyond write pointer is all zero... or you mean curseg's next_blkaddr?
> >> like the condition 2.c) described?
> > 
> > I think we can get each zone's last valid block referring each segment's valid
> > block bitmap in SIT. In other words, this is a consistency check between write
> > pointer and SIT. Is this feasible approach?
> 
> Good point.
> 
> I guess
> - we should do such sanity check with a image which has consistent metadata (SIT
> should not be broken)

Thanks for the comments. I read f2fs code further, and think still the
SIT vs write pointer check can be implemented and meaningful.

F2fs ensures consistency of SIT using two CP areas, two SIT areas and
sit_bitmap in CP. These metadata are in the conventional zone that not
affected by write pointer control logic. My current scope is to ensure
write pointer control logic correctness for zoned block device. From this
scope and the f2fs SIT consistency feature, I would like to assume that
SIT entries built in kernel after f2fs mount is correct for the write
pointer position check.

Fsck does additional SIT consistency check in fsck_chk_meta(). It would be
good to do the write pointer position check at the end of fsck_chk_meta().

> - need to consider fsynced block in SIT

As far as I read fsync logic, fsync results in do_write_page() call which
does both of SIT entry update and write bio submit. In other words, SIT
update and write pointer move are expected for fsync also. Then I think
the write pointer consistency check with last valid block obtained from
SIT is meaningful, when I take fsynced blocks into account.

> 
> > 
> >>
> >>> this case, the data in the valid blocks beyond write pointer is just lost, and
> >>> there is no way to recover this. I think this zone will not be selected for
> >>> cursegs for further data write in the zone until the zone get discarded. No
> >>> need to fix write pointer position to avoid unaligned write errors. I wonder
> >>
> >> Yes,
> >>
> >>> if fsck or kernel should detect and report this case, because users still use
> >>> the f2fs partition without any action. May I ask your opinion?
> >>
> >> If we can detect that, I think it should be reported.
> > 
> > I see, thanks for the comment.
> > 
> >>
> >>>
> >>>>
> >>>> 2) For opened zones (cursegs point to)
> >>>> 2.a) .write_pointer == .next_blkaddr, no need to handle
> >>>> 2.b) .write_pointer > .next_blkaddr, that should be the common state in sudden
> >>>> power-off case. It needs to reset .write_pointer to .next_blkaddr.
> > 
> > Sorry but let me amend. I have just noticed that the fix above is not possible.
> > We cannot set .write_pointer at .next_blkaddr, because write pointers cannot be
> 
> Alright..
> 
> > reset to desired position. It only can be reset to zero (at the zone start).
> > Instead of resetting .write_pointer, how about to fix by allocating a new zone to
> > the curseg for 2.b) in same manner as 2.c)?
> 
> Yeah, it's okay to me.

Thanks. Will prepare patch with that logic.

--
Best Regards,
Shin'ichiro Kawasaki

_______________________________________________
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [f2fs-dev] [PATCH v4 2/2] fsck.f2fs: Check write pointer consistency with current segments
  2019-09-18  3:07                     ` Shinichiro Kawasaki
@ 2019-09-21  9:42                       ` Chao Yu
  2019-09-25  8:05                         ` Shinichiro Kawasaki
  0 siblings, 1 reply; 18+ messages in thread
From: Chao Yu @ 2019-09-21  9:42 UTC (permalink / raw)
  To: Shinichiro Kawasaki; +Cc: Jaegeuk Kim, Damien Le Moal, linux-f2fs-devel

On 2019/9/18 11:07, Shinichiro Kawasaki wrote:
> Thanks for the comments. I read f2fs code further, and think still the
> SIT vs write pointer check can be implemented and meaningful.
> 
> F2fs ensures consistency of SIT using two CP areas, two SIT areas and
> sit_bitmap in CP. These metadata are in the conventional zone that not
> affected by write pointer control logic. My current scope is to ensure
> write pointer control logic correctness for zoned block device. From this
> scope and the f2fs SIT consistency feature, I would like to assume that
> SIT entries built in kernel after f2fs mount is correct for the write
> pointer position check.

SIT may be broken due to software bug or hardware flaw, we'd better not consider
it as a consistent metadata.

> 
> Fsck does additional SIT consistency check in fsck_chk_meta(). It would be
> good to do the write pointer position check at the end of fsck_chk_meta().

SIT can be changed later? e.g. SIT bitmap says one block address is valid,
however fsck found there is no entry can link to it, then it needs to be
deleted? it may affect write_pointer repair, right?

So we'd better look into all SIT update cases in fsck.

> 
>> - need to consider fsynced block in SIT
> 
> As far as I read fsync logic, fsync results in do_write_page() call which
> does both of SIT entry update and write bio submit. In other words, SIT
> update and write pointer move are expected for fsync also. Then I think
> the write pointer consistency check with last valid block obtained from
> SIT is meaningful, when I take fsynced blocks into account.

Yup, :)

Thanks,



_______________________________________________
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [f2fs-dev] [PATCH v4 2/2] fsck.f2fs: Check write pointer consistency with current segments
  2019-09-21  9:42                       ` Chao Yu
@ 2019-09-25  8:05                         ` Shinichiro Kawasaki
  2019-09-29  2:09                           ` Chao Yu
  0 siblings, 1 reply; 18+ messages in thread
From: Shinichiro Kawasaki @ 2019-09-25  8:05 UTC (permalink / raw)
  To: Chao Yu; +Cc: Jaegeuk Kim, Damien Le Moal, linux-f2fs-devel

On Sep 21, 2019 / 17:42, Chao Yu wrote:
> On 2019/9/18 11:07, Shinichiro Kawasaki wrote:
> > Thanks for the comments. I read f2fs code further, and think still the
> > SIT vs write pointer check can be implemented and meaningful.
> > 
> > F2fs ensures consistency of SIT using two CP areas, two SIT areas and
> > sit_bitmap in CP. These metadata are in the conventional zone that not
> > affected by write pointer control logic. My current scope is to ensure
> > write pointer control logic correctness for zoned block device. From this
> > scope and the f2fs SIT consistency feature, I would like to assume that
> > SIT entries built in kernel after f2fs mount is correct for the write
> > pointer position check.
> 
> SIT may be broken due to software bug or hardware flaw, we'd better not consider
> it as a consistent metadata.

I see. I found that kernel checks SIT valid blocks with check_block_count()
function, and if mismatch happens it requests fsck. I guess SIT consistency is
not ensured during kernel run, but after fsck run. Is this correct? If so, I
think the write pointer consistency check with SIT valid block bitmaps should
not be done by kernel, but by fsck after SIT rewrite.

As for non-open zones (curseg do not point to), the other check is needed: if
the zone does not have valid blocks and fsync data, need to reset its write
pointer. My understanding is that the valid blocks count comes from SIT. Then
this fix also should be done not by kernel but by fsck after SIT rewrite. I
think kernel just should do this check at mount time to avoid unaligned write
error, and if inconsistency found it should request fsck to recheck and fix.

I assume kernel can check and fix write pointer inconsistency with cursegs in
CP. CP (and SB) have checksum guard and their consistency looks ensured during
kernel run.

> 
> > 
> > Fsck does additional SIT consistency check in fsck_chk_meta(). It would be
> > good to do the write pointer position check at the end of fsck_chk_meta().
> 
> SIT can be changed later? e.g. SIT bitmap says one block address is valid,
> however fsck found there is no entry can link to it, then it needs to be
> deleted? it may affect write_pointer repair, right?
> 
> So we'd better look into all SIT update cases in fsck.

Ah, yes, when fsck updates SIT, write pointer consistency check should be done
after that. I found SIT is rewritten at the end of fsck_verify() by
rewrite_sit_area_bitmap() and fix_curseg_info() calls. I guess write pointer
consistency check with SIT (and write pointer reset if required) should be
done after those function calls to reflect all SIT changes by fsck. I'll try
to create a patch with this approach.

Thanks again for your comments!

--
Best Regards,
Shin'ichiro Kawasaki

_______________________________________________
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [f2fs-dev] [PATCH v4 2/2] fsck.f2fs: Check write pointer consistency with current segments
  2019-09-25  8:05                         ` Shinichiro Kawasaki
@ 2019-09-29  2:09                           ` Chao Yu
  2019-10-02  5:30                             ` Shinichiro Kawasaki
  0 siblings, 1 reply; 18+ messages in thread
From: Chao Yu @ 2019-09-29  2:09 UTC (permalink / raw)
  To: Shinichiro Kawasaki; +Cc: Jaegeuk Kim, Damien Le Moal, linux-f2fs-devel

On 2019/9/25 16:05, Shinichiro Kawasaki wrote:
> On Sep 21, 2019 / 17:42, Chao Yu wrote:
>> On 2019/9/18 11:07, Shinichiro Kawasaki wrote:
>>> Thanks for the comments. I read f2fs code further, and think still the
>>> SIT vs write pointer check can be implemented and meaningful.
>>>
>>> F2fs ensures consistency of SIT using two CP areas, two SIT areas and
>>> sit_bitmap in CP. These metadata are in the conventional zone that not
>>> affected by write pointer control logic. My current scope is to ensure
>>> write pointer control logic correctness for zoned block device. From this
>>> scope and the f2fs SIT consistency feature, I would like to assume that
>>> SIT entries built in kernel after f2fs mount is correct for the write
>>> pointer position check.
>>
>> SIT may be broken due to software bug or hardware flaw, we'd better not consider
>> it as a consistent metadata.
> 
> I see. I found that kernel checks SIT valid blocks with check_block_count()
> function, and if mismatch happens it requests fsck. I guess SIT consistency is
> not ensured during kernel run, but after fsck run. Is this correct? If so, I
> think the write pointer consistency check with SIT valid block bitmaps should
> not be done by kernel, but by fsck after SIT rewrite.

IMO, consistency check can be done in both kernel and fsck.

If SIT is corrupted, but kernel doesn't aware of it (check_block_count() doesn't
report any inconsistency), write_pointer consistency check may give some clues
for filesystem corruption.

> 
> As for non-open zones (curseg do not point to), the other check is needed: if
> the zone does not have valid blocks and fsync data, need to reset its write
> pointer. My understanding is that the valid blocks count comes from SIT. Then
> this fix also should be done not by kernel but by fsck after SIT rewrite. I
> think kernel just should do this check at mount time to avoid unaligned write
> error, and if inconsistency found it should request fsck to recheck and fix.

Yeah, I agree with this.

> 
> I assume kernel can check and fix write pointer inconsistency with cursegs in
> CP. CP (and SB) have checksum guard and their consistency looks ensured during
> kernel run.
> 
>>
>>>
>>> Fsck does additional SIT consistency check in fsck_chk_meta(). It would be
>>> good to do the write pointer position check at the end of fsck_chk_meta().
>>
>> SIT can be changed later? e.g. SIT bitmap says one block address is valid,
>> however fsck found there is no entry can link to it, then it needs to be
>> deleted? it may affect write_pointer repair, right?
>>
>> So we'd better look into all SIT update cases in fsck.
> 
> Ah, yes, when fsck updates SIT, write pointer consistency check should be done
> after that. I found SIT is rewritten at the end of fsck_verify() by
> rewrite_sit_area_bitmap() and fix_curseg_info() calls. I guess write pointer
> consistency check with SIT (and write pointer reset if required) should be
> done after those function calls to reflect all SIT changes by fsck. I'll try
> to create a patch with this approach.

Quota file repair may grab and write the block which can change SIT status,
please notice that case as well.

Thanks,

> 
> Thanks again for your comments!
> 
> --
> Best Regards,
> Shin'ichiro Kawasaki.
> 


_______________________________________________
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [f2fs-dev] [PATCH v4 2/2] fsck.f2fs: Check write pointer consistency with current segments
  2019-09-29  2:09                           ` Chao Yu
@ 2019-10-02  5:30                             ` Shinichiro Kawasaki
  2019-10-18  6:55                               ` Shinichiro Kawasaki
  0 siblings, 1 reply; 18+ messages in thread
From: Shinichiro Kawasaki @ 2019-10-02  5:30 UTC (permalink / raw)
  To: Chao Yu; +Cc: Jaegeuk Kim, Damien Le Moal, linux-f2fs-devel

On Sep 29, 2019 / 10:09, Chao Yu wrote:
> On 2019/9/25 16:05, Shinichiro Kawasaki wrote:
> > On Sep 21, 2019 / 17:42, Chao Yu wrote:
> >> On 2019/9/18 11:07, Shinichiro Kawasaki wrote:
> >>> Thanks for the comments. I read f2fs code further, and think still the
> >>> SIT vs write pointer check can be implemented and meaningful.
> >>>
> >>> F2fs ensures consistency of SIT using two CP areas, two SIT areas and
> >>> sit_bitmap in CP. These metadata are in the conventional zone that not
> >>> affected by write pointer control logic. My current scope is to ensure
> >>> write pointer control logic correctness for zoned block device. From this
> >>> scope and the f2fs SIT consistency feature, I would like to assume that
> >>> SIT entries built in kernel after f2fs mount is correct for the write
> >>> pointer position check.
> >>
> >> SIT may be broken due to software bug or hardware flaw, we'd better not consider
> >> it as a consistent metadata.
> > 
> > I see. I found that kernel checks SIT valid blocks with check_block_count()
> > function, and if mismatch happens it requests fsck. I guess SIT consistency is
> > not ensured during kernel run, but after fsck run. Is this correct? If so, I
> > think the write pointer consistency check with SIT valid block bitmaps should
> > not be done by kernel, but by fsck after SIT rewrite.
> 
> IMO, consistency check can be done in both kernel and fsck.
> 
> If SIT is corrupted, but kernel doesn't aware of it (check_block_count() doesn't
> report any inconsistency), write_pointer consistency check may give some clues
> for filesystem corruption.

OK, then let's add the check in kernel also.

> 
> > 
> > As for non-open zones (curseg do not point to), the other check is needed: if
> > the zone does not have valid blocks and fsync data, need to reset its write
> > pointer. My understanding is that the valid blocks count comes from SIT. Then
> > this fix also should be done not by kernel but by fsck after SIT rewrite. I
> > think kernel just should do this check at mount time to avoid unaligned write
> > error, and if inconsistency found it should request fsck to recheck and fix.
> 
> Yeah, I agree with this.
> 
> > 
> > I assume kernel can check and fix write pointer inconsistency with cursegs in
> > CP. CP (and SB) have checksum guard and their consistency looks ensured during
> > kernel run.
> > 
> >>
> >>>
> >>> Fsck does additional SIT consistency check in fsck_chk_meta(). It would be
> >>> good to do the write pointer position check at the end of fsck_chk_meta().
> >>
> >> SIT can be changed later? e.g. SIT bitmap says one block address is valid,
> >> however fsck found there is no entry can link to it, then it needs to be
> >> deleted? it may affect write_pointer repair, right?
> >>
> >> So we'd better look into all SIT update cases in fsck.
> > 
> > Ah, yes, when fsck updates SIT, write pointer consistency check should be done
> > after that. I found SIT is rewritten at the end of fsck_verify() by
> > rewrite_sit_area_bitmap() and fix_curseg_info() calls. I guess write pointer
> > consistency check with SIT (and write pointer reset if required) should be
> > done after those function calls to reflect all SIT changes by fsck. I'll try
> > to create a patch with this approach.
> 
> Quota file repair may grab and write the block which can change SIT status,
> please notice that case as well.

Thanks for this comment. This made me think of the write pointer consistency
fix steps in fsck. Now I think fsck needs to check and fix write pointer
consistency twice. First fix is at beginning of fsck, and second at the end.
When fsck writes data in main segments such as quota file repair, inconsistent
write pointer will cause unaligned error. To avoid this, write pointer positions
should be fixed at the beginning of. After fsck updates for SIT or quota files
repair, recheck for write pointer consistency is required again to ensure write
pointer is consistent with those updates.

Now I'm preparing patches for fsck as well as kernel based on the valuable
and detailed e-mail discussion between you and me. Here I summarize my
take-aways through the discussion.

---- F2FS write pointer consistency check and fix by fsck and kernel ----

A) Check write pointer consistency for open zones and non-open zones

A-1) For open zones (cursegs point to), check consistency between the cursegs
     in CP and the write pointers. If the curseg is inconsistent with the write
     pointer in the zone that curseg points to, keep the write pointer as is
     and set new zone(=section) to the curseg.

A-2) For non-open zones (cursegs do not point to), check consistency between
     SIT valid blocks bitmap and the write pointers.
     A-2-i) if the zone does not have valid blocks and fsync data, reset the
            write pointer.
     A-2-ii) if the last valid block recorded in SIT valid blocks bit map is
             beyond the write pointer, just report the inconsistency to notice
	     kernel bug. No fix is available for this error.

B-1) Implement A-1) check & fix feature in the kernel. Fsync data beyond
     curseg's next_blkoff should be kept until fsync data recovery completion.
     Fix write pointer consistency just before the check point trigger for fsync
     data recovery.

B-2) Implement A-2) check feature in the kernel to avoid unaligned write error
     to the zone. Do not implement fix feature in the kernel to avoid the risk
     of SIT break. Just check and if check fails, ask users to run fsck.

C-1) Implement A-1) check & fix feature in fsck. Do check and fix twice, at the
     beginning of the fsck to avoid write error during fsck and at the end of
     fsck to ensure consistency with updates by fsck.
     In case fsync data is left after the curseg position, do not fix the
     inconsistency by fsck. Leave it so that kernel can recover it later.

C-2) Add check fix feature A-2) to fsck. Do check and fix twice in same manner
     as C-1).

--------------------------------------------------------------------------

Once the patches get ready, your review will be appreciated. Thanks!

--
Best Regards,
Shin'ichiro Kawasaki

_______________________________________________
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [f2fs-dev] [PATCH v4 2/2] fsck.f2fs: Check write pointer consistency with current segments
  2019-10-02  5:30                             ` Shinichiro Kawasaki
@ 2019-10-18  6:55                               ` Shinichiro Kawasaki
  0 siblings, 0 replies; 18+ messages in thread
From: Shinichiro Kawasaki @ 2019-10-18  6:55 UTC (permalink / raw)
  To: Chao Yu; +Cc: Jaegeuk Kim, Damien Le Moal, linux-f2fs-devel

On Oct 02, 2019 / 14:30, Shin'ichiro Kawasaki wrote:
> Now I'm preparing patches for fsck as well as kernel based on the valuable
> and detailed e-mail discussion between you and me. Here I summarize my

I have posted the two patch series to add write pointer consistency check,
based on the discussion on the list: one series for fsck (v5), the other for
kernel f2fs module. Review will be appreciated.

> take-aways through the discussion.
> 
> ---- F2FS write pointer consistency check and fix by fsck and kernel ----
> 
> A) Check write pointer consistency for open zones and non-open zones
> 
> A-1) For open zones (cursegs point to), check consistency between the cursegs
>      in CP and the write pointers. If the curseg is inconsistent with the write
>      pointer in the zone that curseg points to, keep the write pointer as is
>      and set new zone(=section) to the curseg.
> 
> A-2) For non-open zones (cursegs do not point to), check consistency between
>      SIT valid blocks bitmap and the write pointers.
>      A-2-i) if the zone does not have valid blocks and fsync data, reset the
>             write pointer.
>      A-2-ii) if the last valid block recorded in SIT valid blocks bit map is
>              beyond the write pointer, just report the inconsistency to notice
> 	     kernel bug. No fix is available for this error.
> 
> B-1) Implement A-1) check & fix feature in the kernel. Fsync data beyond
>      curseg's next_blkoff should be kept until fsync data recovery completion.
>      Fix write pointer consistency just before the check point trigger for fsync
>      data recovery.
> 
> B-2) Implement A-2) check feature in the kernel to avoid unaligned write error
>      to the zone. Do not implement fix feature in the kernel to avoid the risk
>      of SIT break. Just check and if check fails, ask users to run fsck.
> 
> C-1) Implement A-1) check & fix feature in fsck. Do check and fix twice, at the
>      beginning of the fsck to avoid write error during fsck and at the end of
>      fsck to ensure consistency with updates by fsck.
>      In case fsync data is left after the curseg position, do not fix the
>      inconsistency by fsck. Leave it so that kernel can recover it later.
> 
> C-2) Add check fix feature A-2) to fsck. Do check and fix twice in same manner
>      as C-1).
> 
> --------------------------------------------------------------------------
> 
> Once the patches get ready, your review will be appreciated. Thanks!

--
Best Regards,
Shin'ichiro Kawasaki

_______________________________________________
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2019-10-18  6:56 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-30 10:19 [f2fs-dev] [PATCH v4 0/2] fsck: Check write pointers of zoned block devices Shin'ichiro Kawasaki
2019-08-30 10:19 ` [f2fs-dev] [PATCH v4 1/2] libf2fs_zoned: Introduce f2fs_report_zones() helper function Shin'ichiro Kawasaki
2019-08-30 10:19 ` [f2fs-dev] [PATCH v4 2/2] fsck.f2fs: Check write pointer consistency with current segments Shin'ichiro Kawasaki
2019-09-02  7:02   ` Chao Yu
2019-09-03  8:37     ` Shinichiro Kawasaki
2019-09-05  9:58       ` Chao Yu
2019-09-06  8:31         ` Shinichiro Kawasaki
2019-09-09  7:14           ` Chao Yu
2019-09-10  8:10             ` Shinichiro Kawasaki
2019-09-10  9:12               ` Chao Yu
2019-09-12  8:16                 ` Shinichiro Kawasaki
2019-09-16  1:37                   ` Chao Yu
2019-09-18  3:07                     ` Shinichiro Kawasaki
2019-09-21  9:42                       ` Chao Yu
2019-09-25  8:05                         ` Shinichiro Kawasaki
2019-09-29  2:09                           ` Chao Yu
2019-10-02  5:30                             ` Shinichiro Kawasaki
2019-10-18  6:55                               ` Shinichiro Kawasaki

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).