linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/7] btrfs: a fix for fsync and a few improvements to the full fsync path
@ 2022-02-17 12:12 fdmanana
  2022-02-17 12:12 ` [PATCH 1/7] btrfs: fix lost prealloc extents beyond eof after full fsync fdmanana
                   ` (7 more replies)
  0 siblings, 8 replies; 13+ messages in thread
From: fdmanana @ 2022-02-17 12:12 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

This fixes a bug (first patch) with preallocated extents beyond eof being
lost after a full fsync and a power failure. The rest is mostly some
improvements to the full fsync code path (less IO, use less memory for
logging checksums, etc), and silence smatch about a possible dereference
of an uninitialized pointer. More details in the changelogs.

Filipe Manana (7):
  btrfs: fix lost prealloc extents beyond eof after full fsync
  btrfs: stop copying old file extents when doing a full fsync
  btrfs: hold on to less memory when logging checksums during full fsync
  btrfs: voluntarily relinquish cpu when doing a full fsync
  btrfs: reset last_reflink_trans after fsyncing inode
  btrfs: fix unexpected error path when reflinking an inline extent
  btrfs: deal with unexpected extent type during reflinking

 fs/btrfs/btrfs_inode.h |  30 +++++
 fs/btrfs/file.c        |   7 +-
 fs/btrfs/inode.c       |  12 +-
 fs/btrfs/reflink.c     |  39 +++---
 fs/btrfs/tree-log.c    | 285 +++++++++++++++++++++++++++--------------
 5 files changed, 254 insertions(+), 119 deletions(-)

-- 
2.33.0


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/7] btrfs: fix lost prealloc extents beyond eof after full fsync
  2022-02-17 12:12 [PATCH 0/7] btrfs: a fix for fsync and a few improvements to the full fsync path fdmanana
@ 2022-02-17 12:12 ` fdmanana
  2022-02-21 10:11   ` Wang Yugui
  2022-02-17 12:12 ` [PATCH 2/7] btrfs: stop copying old file extents when doing a " fdmanana
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 13+ messages in thread
From: fdmanana @ 2022-02-17 12:12 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

When doing a full fsync, if we have prealloc extents beyond (or at) eof,
and the leaves that contain them were not modified in the current
transaction, we end up not logging them. This results in losing those
extents when we replay the log after a power failure, since the inode is
truncated to the current value of the logged i_size.

Just like for the fast fsync path, we need to always log all prealloc
extents starting at or beyond i_size. The fast fsync case was fixed in
commit 471d557afed155 ("Btrfs: fix loss of prealloc extents past i_size
after fsync log replay") but it missed the full fsync path. The problem
exists since the very early days, when the log tree was added by
commit e02119d5a7b439 ("Btrfs: Add a write ahead tree log to optimize
synchronous operations").

Example reproducer:

  $ mkfs.btrfs -f /dev/sdc
  $ mount /dev/sdc /mnt

  # Create our test file with many file extent items, so that they span
  # several leaves of metadata, even if the node/page size is 64K. Use
  # direct IO and not fsync/O_SYNC because it's both faster and it avoids
  # clearing the full sync flag from the inode - we want the fsync below
  # to trigger the slow full sync code path.
  $ xfs_io -f -d -c "pwrite -b 4K 0 16M" /mnt/foo

  # Now add two preallocated extents to our file without extending the
  # file's size. One right at i_size, and another further beyond, leaving
  # a gap between the two prealloc extents.
  $ xfs_io -c "falloc -k 16M 1M" /mnt/foo
  $ xfs_io -c "falloc -k 20M 1M" /mnt/foo

  # Make sure everything is durably persisted and the transaction is
  # committed. This makes all created extents to have a generation lower
  # than the generation of the transaction used by the next write and
  # fsync.
  sync

  # Now overwrite only the first extent, which will result in modifying
  # only the first leaf of metadata for our inode. Then fsync it. This
  # fsync will use the slow code path (inode full sync bit is set) because
  # it's the first fsync since the inode was created/loaded.
  $ xfs_io -c "pwrite 0 4K" -c "fsync" /mnt/foo

  # Extent list before power failure.
  $ xfs_io -c "fiemap -v" /mnt/foo
  /mnt/foo:
   EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
     0: [0..7]:          2178048..2178055     8   0x0
     1: [8..16383]:      26632..43007     16376   0x0
     2: [16384..32767]:  2156544..2172927 16384   0x0
     3: [32768..34815]:  2172928..2174975  2048 0x800
     4: [34816..40959]:  hole              6144
     5: [40960..43007]:  2174976..2177023  2048 0x801

  <power fail>

  # Mount fs again, trigger log replay.
  $ mount /dev/sdc /mnt

  # Extent list after power failure and log replay.
  $ xfs_io -c "fiemap -v" /mnt/foo
  /mnt/foo:
   EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
     0: [0..7]:          2178048..2178055     8   0x0
     1: [8..16383]:      26632..43007     16376   0x0
     2: [16384..32767]:  2156544..2172927 16384   0x1

  # The prealloc extents at file offsets 16M and 20M are missing.

So fix this by calling btrfs_log_prealloc_extents() when we are doing a
full fsync, so that we always log all prealloc extents beyond eof.

A test case for fstests will follow soon.

CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/tree-log.c | 43 +++++++++++++++++++++++++++++++------------
 1 file changed, 31 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index a483337e8f41..71a5a961fef7 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -4732,7 +4732,7 @@ static int log_one_extent(struct btrfs_trans_handle *trans,
 
 /*
  * Log all prealloc extents beyond the inode's i_size to make sure we do not
- * lose them after doing a fast fsync and replaying the log. We scan the
+ * lose them after doing a full/fast fsync and replaying the log. We scan the
  * subvolume's root instead of iterating the inode's extent map tree because
  * otherwise we can log incorrect extent items based on extent map conversion.
  * That can happen due to the fact that extent maps are merged when they
@@ -5510,6 +5510,7 @@ static int copy_inode_items_to_log(struct btrfs_trans_handle *trans,
 				   struct btrfs_log_ctx *ctx,
 				   bool *need_log_inode_item)
 {
+	const u64 i_size = i_size_read(&inode->vfs_inode);
 	struct btrfs_root *root = inode->root;
 	int ins_start_slot = 0;
 	int ins_nr = 0;
@@ -5530,13 +5531,21 @@ static int copy_inode_items_to_log(struct btrfs_trans_handle *trans,
 		if (min_key->type > max_key->type)
 			break;
 
-		if (min_key->type == BTRFS_INODE_ITEM_KEY)
+		if (min_key->type == BTRFS_INODE_ITEM_KEY) {
 			*need_log_inode_item = false;
-
-		if ((min_key->type == BTRFS_INODE_REF_KEY ||
-		     min_key->type == BTRFS_INODE_EXTREF_KEY) &&
-		    inode->generation == trans->transid &&
-		    !recursive_logging) {
+		} else if (min_key->type == BTRFS_EXTENT_DATA_KEY &&
+			   min_key->offset >= i_size) {
+			/*
+			 * Extents at and beyond eof are logged with
+			 * btrfs_log_prealloc_extents().
+			 * Only regular files have BTRFS_EXTENT_DATA_KEY keys,
+			 * and no keys greater than that, so bail out.
+			 */
+			break;
+		} else if ((min_key->type == BTRFS_INODE_REF_KEY ||
+			    min_key->type == BTRFS_INODE_EXTREF_KEY) &&
+			   inode->generation == trans->transid &&
+			   !recursive_logging) {
 			u64 other_ino = 0;
 			u64 other_parent = 0;
 
@@ -5567,10 +5576,8 @@ static int copy_inode_items_to_log(struct btrfs_trans_handle *trans,
 				btrfs_release_path(path);
 				goto next_key;
 			}
-		}
-
-		/* Skip xattrs, we log them later with btrfs_log_all_xattrs() */
-		if (min_key->type == BTRFS_XATTR_ITEM_KEY) {
+		} else if (min_key->type == BTRFS_XATTR_ITEM_KEY) {
+			/* Skip xattrs, logged later with btrfs_log_all_xattrs() */
 			if (ins_nr == 0)
 				goto next_slot;
 			ret = copy_items(trans, inode, dst_path, path,
@@ -5623,9 +5630,21 @@ static int copy_inode_items_to_log(struct btrfs_trans_handle *trans,
 			break;
 		}
 	}
-	if (ins_nr)
+	if (ins_nr) {
 		ret = copy_items(trans, inode, dst_path, path, ins_start_slot,
 				 ins_nr, inode_only, logged_isize);
+		if (ret)
+			return ret;
+	}
+
+	if (inode_only == LOG_INODE_ALL && S_ISREG(inode->vfs_inode.i_mode)) {
+		/*
+		 * Release the path because otherwise we might attempt to double
+		 * lock the same leaf with btrfs_log_prealloc_extents() below.
+		 */
+		btrfs_release_path(path);
+		ret = btrfs_log_prealloc_extents(trans, inode, dst_path);
+	}
 
 	return ret;
 }
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 2/7] btrfs: stop copying old file extents when doing a full fsync
  2022-02-17 12:12 [PATCH 0/7] btrfs: a fix for fsync and a few improvements to the full fsync path fdmanana
  2022-02-17 12:12 ` [PATCH 1/7] btrfs: fix lost prealloc extents beyond eof after full fsync fdmanana
@ 2022-02-17 12:12 ` fdmanana
  2022-02-17 12:12 ` [PATCH 3/7] btrfs: hold on to less memory when logging checksums during " fdmanana
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: fdmanana @ 2022-02-17 12:12 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

When logging an inode in full sync mode, we go over every leaf that was
modified in the current transaction and has items associated to our inode,
and then copy all those items into the log tree. This includes copying
file extent items that were created and added to the inode in past
transactions, which is useless and only makes use more leaf space in the
log tree.

It's common to have a file with many file extent items spanning many
leaves where only a few file extent items are new and need to be logged,
and in such case we log all the file extent items we find in the modified
leaves.

So change the full sync behaviour to skip over file extent items that are
not needed. Those are the ones that match the following criteria:

1) Have a generation older than the current transaction and the inode
   was not a target of a reflink operation, as that can copy file extent
   items from a past generation from some other inode into our inode, so
   we have to log them;

2) Start at an offset within i_size - we must log anything at or beyond
   i_size, otherwise we would lose prealloc extents after log replay.

The following script exercises a scenario where this happens, and it's
somehow close enough to what happened often on a SQL Server workload which
I had to debug sometime ago to fix an issue where a pattern of writes to
prealloc extents and fsync resulted in fsync failing with -EIO (that was
commit ea7036de0d36c4 ("btrfs: fix fsync failure and transaction abort
after writes to prealloc extents")). In that particular case, we had large
files that had random writes and were often truncated, which made the
next fsync be a full sync.

  $ cat test.sh
  #!/bin/bash

  DEV=/dev/sdi
  MNT=/mnt/sdi

  MKFS_OPTIONS="-O no-holes -R free-space-tree"
  MOUNT_OPTIONS="-o ssd"

  FILE_SIZE=$((1 * 1024 * 1024 * 1024)) # 1G
  # FILE_SIZE=$((2 * 1024 * 1024 * 1024)) # 2G
  # FILE_SIZE=$((512 * 1024 * 1024)) # 512M

  mkfs.btrfs -f $MKFS_OPTIONS $DEV
  mount $MOUNT_OPTIONS $DEV $MNT

  # Create a file with many extents. Use direct IO to make it faster
  # to create the file - using buffered IO we would have to fsync
  # after each write (terribly slow).
  echo "Creating file with $((FILE_SIZE / 4096)) extents of 4K each..."
  xfs_io -f -d -c "pwrite -b 4K 0 $FILE_SIZE" $MNT/foobar

  # Commit the transaction, so every extent after this is from an
  # old generation.
  sync

  # Now rewrite only a few extents, which are all far spread apart from
  # each other (e.g. 1G / 32M = 32 extents).
  # After this only a few extents have a new generation, while all other
  # ones have an old generation.
  echo "Rewriting $((FILE_SIZE / (32 * 1024 * 1024))) extents..."
  for ((i = 0; i < $FILE_SIZE; i += $((32 * 1024 * 1024)))); do
      xfs_io -c "pwrite $i 4K" $MNT/foobar >/dev/null
  done

  # Fsync, the inode logged in full sync mode since it was never fsynced
  # before.
  echo "Fsyncing file..."
  xfs_io -c "fsync" $MNT/foobar

  umount $MNT

And the following bpftrace program was running when executing the test
script:

  $ cat bpf-script.sh
  #!/usr/bin/bpftrace

  k:btrfs_log_inode
  {
      @start_log_inode[tid] = nsecs;
  }

  kr:btrfs_log_inode
  /@start_log_inode[tid]/
  {
      @log_inode_dur[tid] = (nsecs - @start_log_inode[tid]) / 1000;
      delete(@start_log_inode[tid]);
  }

  k:btrfs_sync_log
  {
      @start_sync_log[tid] = nsecs;
  }

  kr:btrfs_sync_log
  /@start_sync_log[tid]/
  {
      $sync_log_dur = (nsecs - @start_sync_log[tid]) / 1000;
      printf("btrfs_log_inode() took %llu us\n", @log_inode_dur[tid]);
      printf("btrfs_sync_log()  took %llu us\n", $sync_log_dur);
      delete(@start_sync_log[tid]);
      delete(@log_inode_dur[tid]);
      exit();
  }

With 512M test file, before this patch:

  btrfs_log_inode() took 15218 us
  btrfs_sync_log()  took 1328 us

  Log tree has 17 leaves and 1 node, its total size is 294912 bytes.

With 512M test file, after this patch:

  btrfs_log_inode() took 14760 us
  btrfs_sync_log()  took 588 us

  Log tree has a single leaf, its total size is 16K.

With 1G test file, before this patch:

  btrfs_log_inode() took 27301 us
  btrfs_sync_log()  took 1767 us

  Log tree has 33 leaves and 1 node, its total size is 557056 bytes.

With 1G test file, after this patch:

  btrfs_log_inode() took 26166 us
  btrfs_sync_log()  took 593 us

  Log tree has a single leaf, its total size is 16K

With 2G test file, before this patch:

  btrfs_log_inode() took 50892 us
  btrfs_sync_log()  took 3127 us

  Log tree has 65 leaves and 1 node, its total size is 1081344 bytes.

With 2G test file, after this patch:

  btrfs_log_inode() took 50126 us
  btrfs_sync_log()  took 586 us

  Log tree has a single leaf, its total size is 16K.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/reflink.c  |  23 +++--
 fs/btrfs/tree-log.c | 198 ++++++++++++++++++++++++++++++--------------
 2 files changed, 148 insertions(+), 73 deletions(-)

diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c
index a3930da4eb3f..c083ded71ef7 100644
--- a/fs/btrfs/reflink.c
+++ b/fs/btrfs/reflink.c
@@ -518,17 +518,22 @@ static int btrfs_clone(struct inode *src, struct inode *inode,
 		btrfs_release_path(path);
 
 		/*
-		 * If this is a new extent update the last_reflink_trans of both
-		 * inodes. This is used by fsync to make sure it does not log
-		 * multiple checksum items with overlapping ranges. For older
-		 * extents we don't need to do it since inode logging skips the
-		 * checksums for older extents. Also ignore holes and inline
-		 * extents because they don't have checksums in the csum tree.
+		 * Whenever we share an extent we update the last_reflink_trans
+		 * of each inode to the current transaction. This is needed to
+		 * make sure fsync does not log multiple checksum items with
+		 * overlapping ranges (because some extent items might refer
+		 * only to sections of the original extent). For the destination
+		 * inode we do this regardless of the generation of the extents
+		 * or even if they are inline extents or explicit holes, to make
+		 * sure a full fsync does not skip them. For the source inode,
+		 * we only need to update last_reflink_trans in case it's a new
+		 * extent that is not a hole or an inline extent, to deal with
+		 * the checksums problem on fsync.
 		 */
-		if (extent_gen == trans->transid && disko > 0) {
+		if (extent_gen == trans->transid && disko > 0)
 			BTRFS_I(src)->last_reflink_trans = trans->transid;
-			BTRFS_I(inode)->last_reflink_trans = trans->transid;
-		}
+
+		BTRFS_I(inode)->last_reflink_trans = trans->transid;
 
 		last_dest_end = ALIGN(new_key.offset + datal,
 				      fs_info->sectorsize);
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 71a5a961fef7..602aa10c9d88 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -4400,21 +4400,19 @@ static noinline int copy_items(struct btrfs_trans_handle *trans,
 			       int start_slot, int nr, int inode_only,
 			       u64 logged_isize)
 {
-	struct btrfs_fs_info *fs_info = trans->fs_info;
-	unsigned long src_offset;
-	unsigned long dst_offset;
 	struct btrfs_root *log = inode->root->log_root;
 	struct btrfs_file_extent_item *extent;
-	struct btrfs_inode_item *inode_item;
 	struct extent_buffer *src = src_path->nodes[0];
-	int ret;
+	int ret = 0;
 	struct btrfs_key *ins_keys;
 	u32 *ins_sizes;
 	struct btrfs_item_batch batch;
 	char *ins_data;
 	int i;
+	int dst_index;
 	struct list_head ordered_sums;
-	int skip_csum = inode->flags & BTRFS_INODE_NODATASUM;
+	const bool skip_csum = (inode->flags & BTRFS_INODE_NODATASUM);
+	const u64 i_size = i_size_read(&inode->vfs_inode);
 
 	INIT_LIST_HEAD(&ordered_sums);
 
@@ -4428,28 +4426,140 @@ static noinline int copy_items(struct btrfs_trans_handle *trans,
 	batch.keys = ins_keys;
 	batch.data_sizes = ins_sizes;
 	batch.total_data_size = 0;
-	batch.nr = nr;
+	batch.nr = 0;
 
+	dst_index = 0;
 	for (i = 0; i < nr; i++) {
-		ins_sizes[i] = btrfs_item_size(src, i + start_slot);
-		batch.total_data_size += ins_sizes[i];
-		btrfs_item_key_to_cpu(src, ins_keys + i, i + start_slot);
+		const int src_slot = start_slot + i;
+		struct btrfs_root *csum_root;
+		u64 disk_bytenr;
+		u64 disk_num_bytes;
+		u64 extent_offset;
+		u64 extent_num_bytes;
+		bool is_old_extent;
+
+		btrfs_item_key_to_cpu(src, &ins_keys[dst_index], src_slot);
+
+		if (ins_keys[dst_index].type != BTRFS_EXTENT_DATA_KEY)
+			goto add_to_batch;
+
+		extent = btrfs_item_ptr(src, src_slot,
+					struct btrfs_file_extent_item);
+
+		is_old_extent = (btrfs_file_extent_generation(src, extent) <
+				 trans->transid);
+
+		/*
+		 * Don't copy extents from past generations. That would make us
+		 * log a lot more metadata for common cases like doing only a
+		 * few random writes into a file and then fsync it for the first
+		 * time or after the full sync flag is set on the inode. We can
+		 * get leaves full of extent items, most of which are from past
+		 * generations, so we can skip them - as long as the inode has
+		 * not been the target of a reflink operation in this transaction,
+		 * as in that case it might have had file extent items with old
+		 * generations copied into it. We also must always log prealloc
+		 * extents that start at or beyond eof, otherwise we would lose
+		 * them on log replay.
+		 */
+		if (is_old_extent &&
+		    ins_keys[dst_index].offset < i_size &&
+		    inode->last_reflink_trans < trans->transid)
+			continue;
+
+		if (skip_csum)
+			goto add_to_batch;
+
+		/* Only regular extents have checksums. */
+		if (btrfs_file_extent_type(src, extent) != BTRFS_FILE_EXTENT_REG)
+			goto add_to_batch;
+
+		/*
+		 * If it's an extent created in a past transaction, then its
+		 * checksums are already accessible from the committed csum tree,
+		 * no need to log them.
+		 */
+		if (is_old_extent)
+			goto add_to_batch;
+
+		disk_bytenr = btrfs_file_extent_disk_bytenr(src, extent);
+		/* If it's an explicit hole, there are no checksums. */
+		if (disk_bytenr == 0)
+			goto add_to_batch;
+
+		disk_num_bytes = btrfs_file_extent_disk_num_bytes(src, extent);
+
+		if (btrfs_file_extent_compression(src, extent)) {
+			extent_offset = 0;
+			extent_num_bytes = disk_num_bytes;
+		} else {
+			extent_offset = btrfs_file_extent_offset(src, extent);
+			extent_num_bytes = btrfs_file_extent_num_bytes(src, extent);
+		}
+
+		csum_root = btrfs_csum_root(trans->fs_info, disk_bytenr);
+		disk_bytenr += extent_offset;
+		ret = btrfs_lookup_csums_range(csum_root, disk_bytenr,
+					       disk_bytenr + extent_num_bytes - 1,
+					       &ordered_sums, 0);
+		if (ret)
+			goto out;
+
+add_to_batch:
+		ins_sizes[dst_index] = btrfs_item_size(src, src_slot);
+		batch.total_data_size += ins_sizes[dst_index];
+		batch.nr++;
+		dst_index++;
 	}
+
+	/*
+	 * We have a leaf full of old extent items that don't need to be logged,
+	 * so we don't need to do anything.
+	 */
+	if (batch.nr == 0)
+		goto out;
+
 	ret = btrfs_insert_empty_items(trans, log, dst_path, &batch);
-	if (ret) {
-		kfree(ins_data);
-		return ret;
-	}
+	if (ret)
+		goto out;
 
-	for (i = 0; i < nr; i++, dst_path->slots[0]++) {
-		dst_offset = btrfs_item_ptr_offset(dst_path->nodes[0],
-						   dst_path->slots[0]);
+	dst_index = 0;
+	for (i = 0; i < nr; i++) {
+		const int src_slot = start_slot + i;
+		const int dst_slot = dst_path->slots[0] + dst_index;
+		struct btrfs_key key;
+		unsigned long src_offset;
+		unsigned long dst_offset;
 
-		src_offset = btrfs_item_ptr_offset(src, start_slot + i);
+		/*
+		 * We're done, all the remaining items in the source leaf
+		 * correspond to old file extent items.
+		 */
+		if (dst_index >= batch.nr)
+			break;
 
-		if (ins_keys[i].type == BTRFS_INODE_ITEM_KEY) {
-			inode_item = btrfs_item_ptr(dst_path->nodes[0],
-						    dst_path->slots[0],
+		btrfs_item_key_to_cpu(src, &key, src_slot);
+
+		if (key.type != BTRFS_EXTENT_DATA_KEY)
+			goto copy_item;
+
+		extent = btrfs_item_ptr(src, src_slot,
+					struct btrfs_file_extent_item);
+
+		/* See the comment in the previous loop, same logic. */
+		if (btrfs_file_extent_generation(src, extent) < trans->transid &&
+		    key.offset < i_size &&
+		    inode->last_reflink_trans < trans->transid)
+			continue;
+
+copy_item:
+		dst_offset = btrfs_item_ptr_offset(dst_path->nodes[0], dst_slot);
+		src_offset = btrfs_item_ptr_offset(src, src_slot);
+
+		if (key.type == BTRFS_INODE_ITEM_KEY) {
+			struct btrfs_inode_item *inode_item;
+
+			inode_item = btrfs_item_ptr(dst_path->nodes[0], dst_slot,
 						    struct btrfs_inode_item);
 			fill_inode_item(trans, dst_path->nodes[0], inode_item,
 					&inode->vfs_inode,
@@ -4457,55 +4567,15 @@ static noinline int copy_items(struct btrfs_trans_handle *trans,
 					logged_isize);
 		} else {
 			copy_extent_buffer(dst_path->nodes[0], src, dst_offset,
-					   src_offset, ins_sizes[i]);
+					   src_offset, ins_sizes[dst_index]);
 		}
 
-		/* take a reference on file data extents so that truncates
-		 * or deletes of this inode don't have to relog the inode
-		 * again
-		 */
-		if (ins_keys[i].type == BTRFS_EXTENT_DATA_KEY &&
-		    !skip_csum) {
-			int found_type;
-			extent = btrfs_item_ptr(src, start_slot + i,
-						struct btrfs_file_extent_item);
-
-			if (btrfs_file_extent_generation(src, extent) < trans->transid)
-				continue;
-
-			found_type = btrfs_file_extent_type(src, extent);
-			if (found_type == BTRFS_FILE_EXTENT_REG) {
-				struct btrfs_root *csum_root;
-				u64 ds, dl, cs, cl;
-				ds = btrfs_file_extent_disk_bytenr(src,
-								extent);
-				/* ds == 0 is a hole */
-				if (ds == 0)
-					continue;
-
-				dl = btrfs_file_extent_disk_num_bytes(src,
-								extent);
-				cs = btrfs_file_extent_offset(src, extent);
-				cl = btrfs_file_extent_num_bytes(src,
-								extent);
-				if (btrfs_file_extent_compression(src,
-								  extent)) {
-					cs = 0;
-					cl = dl;
-				}
-
-				csum_root = btrfs_csum_root(fs_info, ds);
-				ret = btrfs_lookup_csums_range(csum_root,
-						ds + cs, ds + cs + cl - 1,
-						&ordered_sums, 0);
-				if (ret)
-					break;
-			}
-		}
+		dst_index++;
 	}
 
 	btrfs_mark_buffer_dirty(dst_path->nodes[0]);
 	btrfs_release_path(dst_path);
+out:
 	kfree(ins_data);
 
 	/*
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 3/7] btrfs: hold on to less memory when logging checksums during full fsync
  2022-02-17 12:12 [PATCH 0/7] btrfs: a fix for fsync and a few improvements to the full fsync path fdmanana
  2022-02-17 12:12 ` [PATCH 1/7] btrfs: fix lost prealloc extents beyond eof after full fsync fdmanana
  2022-02-17 12:12 ` [PATCH 2/7] btrfs: stop copying old file extents when doing a " fdmanana
@ 2022-02-17 12:12 ` fdmanana
  2022-02-17 12:12 ` [PATCH 4/7] btrfs: voluntarily relinquish cpu when doing a " fdmanana
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: fdmanana @ 2022-02-17 12:12 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

When doing a full fsync, at copy_items(), we iterate over all extents and
then collect their checksums into a list. After copying all the extents to
the log tree, we then log all the previously collected checksums.

Before the previous patch in the series (subject "btrfs: stop copying old
file extents when doing a full fsync"), we had to do it this way, because
while we were iterating over the items in the leaf of the subvolume tree,
we were holding a write lock on a leaf of the log tree, so logging the
checksums for an extent right after we collected them could result in a
deadlock, in case the checksum items ended up in the same leaf.

However after the previous patch in the series we now do a first iteration
over all the items in the leaf of the subvolume tree before locking a path
in the log tree, so we can now log the checksums right after we have
obtained them. This avoids holding in memory all checksums for all extents
in the leaf while copying items from the source leaf to the log tree. The
amount of memory used to hold all checksums of the extents in a leaf can
be significant. For example if a leaf has 200 file extent items referring
to 1M extents, using the default crc32c checksums, would result in using
over 200K of memory (not accounting for the extra overhead of struct
btrfs_ordered_sum), with smaller or less extents it would be less, but
it could be much more with more extents per leaf and/or much larger
extents.

So change copy_items() to log the checksums for an extent after looking
them up, and then free their memory, as they are no longer necessary.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/tree-log.c | 29 ++++++++++++-----------------
 1 file changed, 12 insertions(+), 17 deletions(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 602aa10c9d88..513baf073510 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -4410,12 +4410,9 @@ static noinline int copy_items(struct btrfs_trans_handle *trans,
 	char *ins_data;
 	int i;
 	int dst_index;
-	struct list_head ordered_sums;
 	const bool skip_csum = (inode->flags & BTRFS_INODE_NODATASUM);
 	const u64 i_size = i_size_read(&inode->vfs_inode);
 
-	INIT_LIST_HEAD(&ordered_sums);
-
 	ins_data = kmalloc(nr * sizeof(struct btrfs_key) +
 			   nr * sizeof(u32), GFP_NOFS);
 	if (!ins_data)
@@ -4432,6 +4429,9 @@ static noinline int copy_items(struct btrfs_trans_handle *trans,
 	for (i = 0; i < nr; i++) {
 		const int src_slot = start_slot + i;
 		struct btrfs_root *csum_root;
+		struct btrfs_ordered_sum *sums;
+		struct btrfs_ordered_sum *sums_next;
+		LIST_HEAD(ordered_sums);
 		u64 disk_bytenr;
 		u64 disk_num_bytes;
 		u64 extent_offset;
@@ -4505,6 +4505,15 @@ static noinline int copy_items(struct btrfs_trans_handle *trans,
 		if (ret)
 			goto out;
 
+		list_for_each_entry_safe(sums, sums_next, &ordered_sums, list) {
+			if (!ret)
+				ret = log_csums(trans, inode, log, sums);
+			list_del(&sums->list);
+			kfree(sums);
+		}
+		if (ret)
+			goto out;
+
 add_to_batch:
 		ins_sizes[dst_index] = btrfs_item_size(src, src_slot);
 		batch.total_data_size += ins_sizes[dst_index];
@@ -4578,20 +4587,6 @@ static noinline int copy_items(struct btrfs_trans_handle *trans,
 out:
 	kfree(ins_data);
 
-	/*
-	 * we have to do this after the loop above to avoid changing the
-	 * log tree while trying to change the log tree.
-	 */
-	while (!list_empty(&ordered_sums)) {
-		struct btrfs_ordered_sum *sums = list_entry(ordered_sums.next,
-						   struct btrfs_ordered_sum,
-						   list);
-		if (!ret)
-			ret = log_csums(trans, inode, log, sums);
-		list_del(&sums->list);
-		kfree(sums);
-	}
-
 	return ret;
 }
 
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 4/7] btrfs: voluntarily relinquish cpu when doing a full fsync
  2022-02-17 12:12 [PATCH 0/7] btrfs: a fix for fsync and a few improvements to the full fsync path fdmanana
                   ` (2 preceding siblings ...)
  2022-02-17 12:12 ` [PATCH 3/7] btrfs: hold on to less memory when logging checksums during " fdmanana
@ 2022-02-17 12:12 ` fdmanana
  2022-02-17 12:12 ` [PATCH 5/7] btrfs: reset last_reflink_trans after fsyncing inode fdmanana
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: fdmanana @ 2022-02-17 12:12 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Doing a full fsync may require processing many leaves of metadata, which
can take some time and result in a task monopolizing a cpu for too long.
So add a cond_resched() after processing a leaf when doing a full fsync,
while not holding any locks on any tree (a subvolume or a log tree).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/tree-log.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 513baf073510..e5b681854117 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -5694,6 +5694,13 @@ static int copy_inode_items_to_log(struct btrfs_trans_handle *trans,
 		} else {
 			break;
 		}
+
+		/*
+		 * We may process many leaves full of items for our inode, so
+		 * avoid monopolizing a cpu for too long by rescheduling while
+		 * not holding locks on any tree.
+		 */
+		cond_resched();
 	}
 	if (ins_nr) {
 		ret = copy_items(trans, inode, dst_path, path, ins_start_slot,
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 5/7] btrfs: reset last_reflink_trans after fsyncing inode
  2022-02-17 12:12 [PATCH 0/7] btrfs: a fix for fsync and a few improvements to the full fsync path fdmanana
                   ` (3 preceding siblings ...)
  2022-02-17 12:12 ` [PATCH 4/7] btrfs: voluntarily relinquish cpu when doing a " fdmanana
@ 2022-02-17 12:12 ` fdmanana
  2022-02-17 12:12 ` [PATCH 6/7] btrfs: fix unexpected error path when reflinking an inline extent fdmanana
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: fdmanana @ 2022-02-17 12:12 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

When an inode has a last_reflink_trans matching the current transaction,
we have to take special care when logging its checksums in order to
avoid getting checksum items with overlapping ranges in a log tree,
which could result in missing checksums after log replay (more on that
in the changelogs of commit 40e046acbd2f36 ("Btrfs: fix missing data
checksums after replaying a log tree") and commit e289f03ea79bbc ("btrfs:
fix corrupt log due to concurrent fsync of inodes with shared extents")).
We also need to make sure a full fsync will copy all old file extent
items it finds in modified leaves, because they might have been copied
from some other inode.

However once we fsync an inode, we don't need to keep paying the price of
that extra special care in future fsyncs done in the same transaction,
unless the inode is used for another reflink operation or the full sync
flag is set on it (truncate, failure to allocate extent maps for holes,
and other exceptional and infrequent cases).

So after we fsync an inode reset its last_unlink_trans to zero. In case
another reflink happens, we continue to update the last_reflink_trans of
the inode, just as before. Also set last_reflink_trans to the generation
of the last transaction that modified the inode whenever we need to set
the full sync flag on the inode, just like when we need to load an inode
from disk after eviction.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/btrfs_inode.h | 30 ++++++++++++++++++++++++++++++
 fs/btrfs/file.c        |  7 +++----
 fs/btrfs/inode.c       | 12 +++++-------
 fs/btrfs/reflink.c     |  5 ++---
 fs/btrfs/tree-log.c    |  8 ++++++++
 5 files changed, 48 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 6d4f42b0fdce..47e72d72f7d0 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -341,6 +341,36 @@ static inline void btrfs_set_inode_last_sub_trans(struct btrfs_inode *inode)
 	spin_unlock(&inode->lock);
 }
 
+/*
+ * Should be called while holding the inode's VFS lock in exclusive mode or in a
+ * context where no one else can access the inode concurrently (during inode
+ * creation or when loading an inode from disk).
+ */
+static inline void btrfs_set_inode_full_sync(struct btrfs_inode *inode)
+{
+	set_bit(BTRFS_INODE_NEEDS_FULL_SYNC, &inode->runtime_flags);
+	/*
+	 * The inode may have been part of a reflink operation in the last
+	 * transaction that modified it, and then a fsync has reset the
+	 * last_reflink_trans to avoid subsequent fsyncs in the same
+	 * transaction to do unnecessary work. So update last_reflink_trans
+	 * to the last_trans value (we have to be pessimistic and assume a
+	 * reflink happened).
+	 *
+	 * The ->last_trans is protected by the inode's spinlock and we can
+	 * have a concurrent ordered extent completion update it. Also set
+	 * last_reflink_trans to ->last_trans only if the former is less than
+	 * the later, because we can be called in a context where
+	 * last_reflink_trans was set to the current transaction generation
+	 * while ->last_trans was not yet updated in the current transaction,
+	 * and therefore has a lower value.
+	 */
+	spin_lock(&inode->lock);
+	if (inode->last_reflink_trans < inode->last_trans)
+		inode->last_reflink_trans = inode->last_trans;
+	spin_unlock(&inode->lock);
+}
+
 static inline bool btrfs_inode_in_log(struct btrfs_inode *inode, u64 generation)
 {
 	bool ret = false;
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 625c25bff126..119f8a28fbe6 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2499,7 +2499,7 @@ static int fill_holes(struct btrfs_trans_handle *trans,
 	hole_em = alloc_extent_map();
 	if (!hole_em) {
 		btrfs_drop_extent_cache(inode, offset, end - 1, 0);
-		set_bit(BTRFS_INODE_NEEDS_FULL_SYNC, &inode->runtime_flags);
+		btrfs_set_inode_full_sync(inode);
 	} else {
 		hole_em->start = offset;
 		hole_em->len = end - offset;
@@ -2520,8 +2520,7 @@ static int fill_holes(struct btrfs_trans_handle *trans,
 		} while (ret == -EEXIST);
 		free_extent_map(hole_em);
 		if (ret)
-			set_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
-					&inode->runtime_flags);
+			btrfs_set_inode_full_sync(inode);
 	}
 
 	return 0;
@@ -2875,7 +2874,7 @@ int btrfs_replace_file_extents(struct btrfs_inode *inode,
 	 * maps for the replacement extents (or holes).
 	 */
 	if (extent_info && !extent_info->is_new_extent)
-		set_bit(BTRFS_INODE_NEEDS_FULL_SYNC, &inode->runtime_flags);
+		btrfs_set_inode_full_sync(inode);
 
 	if (ret)
 		goto out_trans;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 24099fe9e120..fe1597e74791 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -428,7 +428,7 @@ static noinline int cow_file_range_inline(struct btrfs_inode *inode, u64 start,
 		goto out;
 	}
 
-	set_bit(BTRFS_INODE_NEEDS_FULL_SYNC, &inode->runtime_flags);
+	btrfs_set_inode_full_sync(inode);
 out:
 	/*
 	 * Don't forget to free the reserved space, as for inlined extent
@@ -4900,8 +4900,7 @@ int btrfs_cont_expand(struct btrfs_inode *inode, loff_t oldsize, loff_t size)
 						cur_offset + hole_size - 1, 0);
 			hole_em = alloc_extent_map();
 			if (!hole_em) {
-				set_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
-					&inode->runtime_flags);
+				btrfs_set_inode_full_sync(inode);
 				goto next;
 			}
 			hole_em->start = cur_offset;
@@ -6154,7 +6153,7 @@ static struct inode *btrfs_new_inode(struct btrfs_trans_handle *trans,
 	 * sync since it will be a full sync anyway and this will blow away the
 	 * old info in the log.
 	 */
-	set_bit(BTRFS_INODE_NEEDS_FULL_SYNC, &BTRFS_I(inode)->runtime_flags);
+	btrfs_set_inode_full_sync(BTRFS_I(inode));
 
 	key[0].objectid = objectid;
 	key[0].type = BTRFS_INODE_ITEM_KEY;
@@ -8725,7 +8724,7 @@ static int btrfs_truncate(struct inode *inode, bool skip_writeback)
 	 * extents beyond i_size to drop.
 	 */
 	if (control.extents_found > 0)
-		set_bit(BTRFS_INODE_NEEDS_FULL_SYNC, &BTRFS_I(inode)->runtime_flags);
+		btrfs_set_inode_full_sync(BTRFS_I(inode));
 
 	return ret;
 }
@@ -9933,8 +9932,7 @@ static int __btrfs_prealloc_file_range(struct inode *inode, int mode,
 
 		em = alloc_extent_map();
 		if (!em) {
-			set_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
-				&BTRFS_I(inode)->runtime_flags);
+			btrfs_set_inode_full_sync(BTRFS_I(inode));
 			goto next;
 		}
 
diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c
index c083ded71ef7..cdcfeab82e12 100644
--- a/fs/btrfs/reflink.c
+++ b/fs/btrfs/reflink.c
@@ -277,7 +277,7 @@ static int clone_copy_inline_extent(struct inode *dst,
 						  path->slots[0]),
 			    size);
 	btrfs_update_inode_bytes(BTRFS_I(dst), datal, drop_args.bytes_found);
-	set_bit(BTRFS_INODE_NEEDS_FULL_SYNC, &BTRFS_I(dst)->runtime_flags);
+	btrfs_set_inode_full_sync(BTRFS_I(dst));
 	ret = btrfs_inode_set_file_extent_range(BTRFS_I(dst), 0, aligned_end);
 out:
 	if (!ret && !trans) {
@@ -580,8 +580,7 @@ static int btrfs_clone(struct inode *src, struct inode *inode,
 		 * replaced file extent items.
 		 */
 		if (last_dest_end >= i_size_read(inode))
-			set_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
-				&BTRFS_I(inode)->runtime_flags);
+			btrfs_set_inode_full_sync(BTRFS_I(inode));
 
 		ret = btrfs_replace_file_extents(BTRFS_I(inode), path,
 				last_dest_end, destoff + len - 1, NULL, &trans);
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index e5b681854117..73ecc4f67a23 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -5995,6 +5995,14 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans,
 	if (inode_only != LOG_INODE_EXISTS)
 		inode->last_log_commit = inode->last_sub_trans;
 	spin_unlock(&inode->lock);
+
+	/*
+	 * Reset the last_reflink_trans so that the next fsync does not need to
+	 * go through the slower path when logging extents and their checksums.
+	 */
+	if (inode_only == LOG_INODE_ALL)
+		inode->last_reflink_trans = 0;
+
 out_unlock:
 	mutex_unlock(&inode->log_mutex);
 out:
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 6/7] btrfs: fix unexpected error path when reflinking an inline extent
  2022-02-17 12:12 [PATCH 0/7] btrfs: a fix for fsync and a few improvements to the full fsync path fdmanana
                   ` (4 preceding siblings ...)
  2022-02-17 12:12 ` [PATCH 5/7] btrfs: reset last_reflink_trans after fsyncing inode fdmanana
@ 2022-02-17 12:12 ` fdmanana
  2022-02-17 12:12 ` [PATCH 7/7] btrfs: deal with unexpected extent type during reflinking fdmanana
  2022-02-21 16:15 ` [PATCH 0/7] btrfs: a fix for fsync and a few improvements to the full fsync path David Sterba
  7 siblings, 0 replies; 13+ messages in thread
From: fdmanana @ 2022-02-17 12:12 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

When reflinking an inline extent, we assert that its file offset is 0 and
that its uncompressed length is not greater than the sector size. We then
return an error if one of those conditions is not satisfied. However we
use a return statement, which results in returning from btrfs_clone()
without freeing the path and buffer that were allocated before, as well as
not clearing the flag BTRFS_INODE_NO_DELALLOC_FLUSH for the destination
inode.

Fix that by jumping to the 'out' label instead, and also add a WARN_ON()
for each condition so that in case assertions are disabled, we get to
known which of the unexpected conditions triggered the error.

Fixes: a61e1e0df9f321 ("Btrfs: simplify inline extent handling when doing reflinks")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/reflink.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c
index cdcfeab82e12..07e19e069c84 100644
--- a/fs/btrfs/reflink.c
+++ b/fs/btrfs/reflink.c
@@ -505,8 +505,11 @@ static int btrfs_clone(struct inode *src, struct inode *inode,
 			 */
 			ASSERT(key.offset == 0);
 			ASSERT(datal <= fs_info->sectorsize);
-			if (key.offset != 0 || datal > fs_info->sectorsize)
-				return -EUCLEAN;
+			if (WARN_ON(key.offset != 0) ||
+			    WARN_ON(datal > fs_info->sectorsize)) {
+				ret = -EUCLEAN;
+				goto out;
+			}
 
 			ret = clone_copy_inline_extent(inode, path, &new_key,
 						       drop_start, datal, size,
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 7/7] btrfs: deal with unexpected extent type during reflinking
  2022-02-17 12:12 [PATCH 0/7] btrfs: a fix for fsync and a few improvements to the full fsync path fdmanana
                   ` (5 preceding siblings ...)
  2022-02-17 12:12 ` [PATCH 6/7] btrfs: fix unexpected error path when reflinking an inline extent fdmanana
@ 2022-02-17 12:12 ` fdmanana
  2022-02-21 16:15 ` [PATCH 0/7] btrfs: a fix for fsync and a few improvements to the full fsync path David Sterba
  7 siblings, 0 replies; 13+ messages in thread
From: fdmanana @ 2022-02-17 12:12 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Smatch complains about a possible dereference of a pointer that was not
initialized:

    CC [M]  fs/btrfs/reflink.o
    CHECK   fs/btrfs/reflink.c
  fs/btrfs/reflink.c:533 btrfs_clone() error: potentially dereferencing uninitialized 'trans'.

This is because we are not dealing with the case where the type of a file
extent has an unexpected value (not regular, not prealloc and not inline),
in which case the transaction handle pointer is not initialized.

Such unexpected type should be impossible, except in case of some memory
corruption caused either by bad hardware or some software bug causing
something like a buffer overrun.

So ASSERT that if the extent type is neither regular nor prealloc, then
it must be inline. Bail out with -EUCLEAN and a warning in case it is
not. This silences smatch.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/reflink.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c
index 07e19e069c84..03d60db0c5a3 100644
--- a/fs/btrfs/reflink.c
+++ b/fs/btrfs/reflink.c
@@ -494,7 +494,8 @@ static int btrfs_clone(struct inode *src, struct inode *inode,
 					&clone_info, &trans);
 			if (ret)
 				goto out;
-		} else if (type == BTRFS_FILE_EXTENT_INLINE) {
+		} else {
+			ASSERT(type == BTRFS_FILE_EXTENT_INLINE);
 			/*
 			 * Inline extents always have to start at file offset 0
 			 * and can never be bigger then the sector size. We can
@@ -505,7 +506,8 @@ static int btrfs_clone(struct inode *src, struct inode *inode,
 			 */
 			ASSERT(key.offset == 0);
 			ASSERT(datal <= fs_info->sectorsize);
-			if (WARN_ON(key.offset != 0) ||
+			if (WARN_ON(type != BTRFS_FILE_EXTENT_INLINE) ||
+			    WARN_ON(key.offset != 0) ||
 			    WARN_ON(datal > fs_info->sectorsize)) {
 				ret = -EUCLEAN;
 				goto out;
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/7] btrfs: fix lost prealloc extents beyond eof after full fsync
  2022-02-17 12:12 ` [PATCH 1/7] btrfs: fix lost prealloc extents beyond eof after full fsync fdmanana
@ 2022-02-21 10:11   ` Wang Yugui
  2022-02-21 10:41     ` Filipe Manana
  0 siblings, 1 reply; 13+ messages in thread
From: Wang Yugui @ 2022-02-21 10:11 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

Hi,

xfstest btrfs/261 PASSED 40 times without this patch.

so there is some problem in xfstest btrfs/261?

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2022/02/21

> From: Filipe Manana <fdmanana@suse.com>
> 
> When doing a full fsync, if we have prealloc extents beyond (or at) eof,
> and the leaves that contain them were not modified in the current
> transaction, we end up not logging them. This results in losing those
> extents when we replay the log after a power failure, since the inode is
> truncated to the current value of the logged i_size.
> 
> Just like for the fast fsync path, we need to always log all prealloc
> extents starting at or beyond i_size. The fast fsync case was fixed in
> commit 471d557afed155 ("Btrfs: fix loss of prealloc extents past i_size
> after fsync log replay") but it missed the full fsync path. The problem
> exists since the very early days, when the log tree was added by
> commit e02119d5a7b439 ("Btrfs: Add a write ahead tree log to optimize
> synchronous operations").
> 
> Example reproducer:
> 
>   $ mkfs.btrfs -f /dev/sdc
>   $ mount /dev/sdc /mnt
> 
>   # Create our test file with many file extent items, so that they span
>   # several leaves of metadata, even if the node/page size is 64K. Use
>   # direct IO and not fsync/O_SYNC because it's both faster and it avoids
>   # clearing the full sync flag from the inode - we want the fsync below
>   # to trigger the slow full sync code path.
>   $ xfs_io -f -d -c "pwrite -b 4K 0 16M" /mnt/foo
> 
>   # Now add two preallocated extents to our file without extending the
>   # file's size. One right at i_size, and another further beyond, leaving
>   # a gap between the two prealloc extents.
>   $ xfs_io -c "falloc -k 16M 1M" /mnt/foo
>   $ xfs_io -c "falloc -k 20M 1M" /mnt/foo
> 
>   # Make sure everything is durably persisted and the transaction is
>   # committed. This makes all created extents to have a generation lower
>   # than the generation of the transaction used by the next write and
>   # fsync.
>   sync
> 
>   # Now overwrite only the first extent, which will result in modifying
>   # only the first leaf of metadata for our inode. Then fsync it. This
>   # fsync will use the slow code path (inode full sync bit is set) because
>   # it's the first fsync since the inode was created/loaded.
>   $ xfs_io -c "pwrite 0 4K" -c "fsync" /mnt/foo
> 
>   # Extent list before power failure.
>   $ xfs_io -c "fiemap -v" /mnt/foo
>   /mnt/foo:
>    EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
>      0: [0..7]:          2178048..2178055     8   0x0
>      1: [8..16383]:      26632..43007     16376   0x0
>      2: [16384..32767]:  2156544..2172927 16384   0x0
>      3: [32768..34815]:  2172928..2174975  2048 0x800
>      4: [34816..40959]:  hole              6144
>      5: [40960..43007]:  2174976..2177023  2048 0x801
> 
>   <power fail>
> 
>   # Mount fs again, trigger log replay.
>   $ mount /dev/sdc /mnt
> 
>   # Extent list after power failure and log replay.
>   $ xfs_io -c "fiemap -v" /mnt/foo
>   /mnt/foo:
>    EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
>      0: [0..7]:          2178048..2178055     8   0x0
>      1: [8..16383]:      26632..43007     16376   0x0
>      2: [16384..32767]:  2156544..2172927 16384   0x1
> 
>   # The prealloc extents at file offsets 16M and 20M are missing.
> 
> So fix this by calling btrfs_log_prealloc_extents() when we are doing a
> full fsync, so that we always log all prealloc extents beyond eof.
> 
> A test case for fstests will follow soon.
> 
> CC: stable@vger.kernel.org # 4.19+
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>  fs/btrfs/tree-log.c | 43 +++++++++++++++++++++++++++++++------------
>  1 file changed, 31 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index a483337e8f41..71a5a961fef7 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -4732,7 +4732,7 @@ static int log_one_extent(struct btrfs_trans_handle *trans,
>  
>  /*
>   * Log all prealloc extents beyond the inode's i_size to make sure we do not
> - * lose them after doing a fast fsync and replaying the log. We scan the
> + * lose them after doing a full/fast fsync and replaying the log. We scan the
>   * subvolume's root instead of iterating the inode's extent map tree because
>   * otherwise we can log incorrect extent items based on extent map conversion.
>   * That can happen due to the fact that extent maps are merged when they
> @@ -5510,6 +5510,7 @@ static int copy_inode_items_to_log(struct btrfs_trans_handle *trans,
>  				   struct btrfs_log_ctx *ctx,
>  				   bool *need_log_inode_item)
>  {
> +	const u64 i_size = i_size_read(&inode->vfs_inode);
>  	struct btrfs_root *root = inode->root;
>  	int ins_start_slot = 0;
>  	int ins_nr = 0;
> @@ -5530,13 +5531,21 @@ static int copy_inode_items_to_log(struct btrfs_trans_handle *trans,
>  		if (min_key->type > max_key->type)
>  			break;
>  
> -		if (min_key->type == BTRFS_INODE_ITEM_KEY)
> +		if (min_key->type == BTRFS_INODE_ITEM_KEY) {
>  			*need_log_inode_item = false;
> -
> -		if ((min_key->type == BTRFS_INODE_REF_KEY ||
> -		     min_key->type == BTRFS_INODE_EXTREF_KEY) &&
> -		    inode->generation == trans->transid &&
> -		    !recursive_logging) {
> +		} else if (min_key->type == BTRFS_EXTENT_DATA_KEY &&
> +			   min_key->offset >= i_size) {
> +			/*
> +			 * Extents at and beyond eof are logged with
> +			 * btrfs_log_prealloc_extents().
> +			 * Only regular files have BTRFS_EXTENT_DATA_KEY keys,
> +			 * and no keys greater than that, so bail out.
> +			 */
> +			break;
> +		} else if ((min_key->type == BTRFS_INODE_REF_KEY ||
> +			    min_key->type == BTRFS_INODE_EXTREF_KEY) &&
> +			   inode->generation == trans->transid &&
> +			   !recursive_logging) {
>  			u64 other_ino = 0;
>  			u64 other_parent = 0;
>  
> @@ -5567,10 +5576,8 @@ static int copy_inode_items_to_log(struct btrfs_trans_handle *trans,
>  				btrfs_release_path(path);
>  				goto next_key;
>  			}
> -		}
> -
> -		/* Skip xattrs, we log them later with btrfs_log_all_xattrs() */
> -		if (min_key->type == BTRFS_XATTR_ITEM_KEY) {
> +		} else if (min_key->type == BTRFS_XATTR_ITEM_KEY) {
> +			/* Skip xattrs, logged later with btrfs_log_all_xattrs() */
>  			if (ins_nr == 0)
>  				goto next_slot;
>  			ret = copy_items(trans, inode, dst_path, path,
> @@ -5623,9 +5630,21 @@ static int copy_inode_items_to_log(struct btrfs_trans_handle *trans,
>  			break;
>  		}
>  	}
> -	if (ins_nr)
> +	if (ins_nr) {
>  		ret = copy_items(trans, inode, dst_path, path, ins_start_slot,
>  				 ins_nr, inode_only, logged_isize);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	if (inode_only == LOG_INODE_ALL && S_ISREG(inode->vfs_inode.i_mode)) {
> +		/*
> +		 * Release the path because otherwise we might attempt to double
> +		 * lock the same leaf with btrfs_log_prealloc_extents() below.
> +		 */
> +		btrfs_release_path(path);
> +		ret = btrfs_log_prealloc_extents(trans, inode, dst_path);
> +	}
>  
>  	return ret;
>  }
> -- 
> 2.33.0



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/7] btrfs: fix lost prealloc extents beyond eof after full fsync
  2022-02-21 10:11   ` Wang Yugui
@ 2022-02-21 10:41     ` Filipe Manana
  2022-02-21 23:51       ` Wang Yugui
  0 siblings, 1 reply; 13+ messages in thread
From: Filipe Manana @ 2022-02-21 10:41 UTC (permalink / raw)
  To: Wang Yugui; +Cc: linux-btrfs

On Mon, Feb 21, 2022 at 10:27 AM Wang Yugui <wangyugui@e16-tech.com> wrote:
>
> Hi,
>
> xfstest btrfs/261 PASSED 40 times without this patch.
>
> so there is some problem in xfstest btrfs/261?

Not that I can see, it always fails for me on an unpatched kernel:

$ ./check btrfs/261
FSTYP         -- btrfs
PLATFORM      -- Linux/x86_64 debian9 5.17.0-rc4-btrfs-next-112 #1 SMP
PREEMPT Wed Feb 16 11:20:50 WET 2022
MKFS_OPTIONS  -- /dev/sdc
MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1

btrfs/261 8s ... - output mismatch (see
/home/fdmanana/git/hub/xfstests/results//btrfs/261.out.bad)
    --- tests/btrfs/261.out 2022-02-15 11:53:34.273201027 +0000
    +++ /home/fdmanana/git/hub/xfstests/results//btrfs/261.out.bad
2022-02-21 10:34:59.781769394 +0000
    @@ -5,6 +5,3 @@
     XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
     List of extents after power failure:
     0: [0..32767]: data
    -1: [32768..34815]: unwritten
    -2: [34816..40959]: hole
    -3: [40960..43007]: unwritten
    ...
    (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/261.out
/home/fdmanana/git/hub/xfstests/results//btrfs/261.out.bad'  to see
the entire diff)
Ran: btrfs/261
Failures: btrfs/261
Failed 1 of 1 tests

And should fail regardless of the page size, which is the only factor
I think could make a difference.

To figure it out, change the test like this:

diff --git a/tests/btrfs/261 b/tests/btrfs/261
index 8275e6a5..540a2de2 100755
--- a/tests/btrfs/261
+++ b/tests/btrfs/261
@@ -65,7 +65,15 @@ $XFS_IO_PROG -c "pwrite 0 4K" -c "fsync"
$SCRATCH_MNT/foo | _filter_xfs_io

 # Simulate a power failure and then mount again the filesystem to
replay the log
 # tree.
-_flakey_drop_and_remount
+#_flakey_drop_and_remount
+_load_flakey_table $FLAKEY_DROP_WRITES
+_unmount_flakey
+
+$BTRFS_UTIL_PROG inspect-internal dump-tree $SCRATCH_DEV >>$seqres.full 2>&1
+
+_load_flakey_table $FLAKEY_ALLOW_WRITES
+_mount_flakey
+

 # After the power failure we expect that the preallocated extents, beyond the
 # inode's i_size, still exist.

Run it and then provide the contents of the file results/btrfs/261.full

>
> Best Regards
> Wang Yugui (wangyugui@e16-tech.com)
> 2022/02/21
>
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > When doing a full fsync, if we have prealloc extents beyond (or at) eof,
> > and the leaves that contain them were not modified in the current
> > transaction, we end up not logging them. This results in losing those
> > extents when we replay the log after a power failure, since the inode is
> > truncated to the current value of the logged i_size.
> >
> > Just like for the fast fsync path, we need to always log all prealloc
> > extents starting at or beyond i_size. The fast fsync case was fixed in
> > commit 471d557afed155 ("Btrfs: fix loss of prealloc extents past i_size
> > after fsync log replay") but it missed the full fsync path. The problem
> > exists since the very early days, when the log tree was added by
> > commit e02119d5a7b439 ("Btrfs: Add a write ahead tree log to optimize
> > synchronous operations").
> >
> > Example reproducer:
> >
> >   $ mkfs.btrfs -f /dev/sdc
> >   $ mount /dev/sdc /mnt
> >
> >   # Create our test file with many file extent items, so that they span
> >   # several leaves of metadata, even if the node/page size is 64K. Use
> >   # direct IO and not fsync/O_SYNC because it's both faster and it avoids
> >   # clearing the full sync flag from the inode - we want the fsync below
> >   # to trigger the slow full sync code path.
> >   $ xfs_io -f -d -c "pwrite -b 4K 0 16M" /mnt/foo
> >
> >   # Now add two preallocated extents to our file without extending the
> >   # file's size. One right at i_size, and another further beyond, leaving
> >   # a gap between the two prealloc extents.
> >   $ xfs_io -c "falloc -k 16M 1M" /mnt/foo
> >   $ xfs_io -c "falloc -k 20M 1M" /mnt/foo
> >
> >   # Make sure everything is durably persisted and the transaction is
> >   # committed. This makes all created extents to have a generation lower
> >   # than the generation of the transaction used by the next write and
> >   # fsync.
> >   sync
> >
> >   # Now overwrite only the first extent, which will result in modifying
> >   # only the first leaf of metadata for our inode. Then fsync it. This
> >   # fsync will use the slow code path (inode full sync bit is set) because
> >   # it's the first fsync since the inode was created/loaded.
> >   $ xfs_io -c "pwrite 0 4K" -c "fsync" /mnt/foo
> >
> >   # Extent list before power failure.
> >   $ xfs_io -c "fiemap -v" /mnt/foo
> >   /mnt/foo:
> >    EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
> >      0: [0..7]:          2178048..2178055     8   0x0
> >      1: [8..16383]:      26632..43007     16376   0x0
> >      2: [16384..32767]:  2156544..2172927 16384   0x0
> >      3: [32768..34815]:  2172928..2174975  2048 0x800
> >      4: [34816..40959]:  hole              6144
> >      5: [40960..43007]:  2174976..2177023  2048 0x801
> >
> >   <power fail>
> >
> >   # Mount fs again, trigger log replay.
> >   $ mount /dev/sdc /mnt
> >
> >   # Extent list after power failure and log replay.
> >   $ xfs_io -c "fiemap -v" /mnt/foo
> >   /mnt/foo:
> >    EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
> >      0: [0..7]:          2178048..2178055     8   0x0
> >      1: [8..16383]:      26632..43007     16376   0x0
> >      2: [16384..32767]:  2156544..2172927 16384   0x1
> >
> >   # The prealloc extents at file offsets 16M and 20M are missing.
> >
> > So fix this by calling btrfs_log_prealloc_extents() when we are doing a
> > full fsync, so that we always log all prealloc extents beyond eof.
> >
> > A test case for fstests will follow soon.
> >
> > CC: stable@vger.kernel.org # 4.19+
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > ---
> >  fs/btrfs/tree-log.c | 43 +++++++++++++++++++++++++++++++------------
> >  1 file changed, 31 insertions(+), 12 deletions(-)
> >
> > diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> > index a483337e8f41..71a5a961fef7 100644
> > --- a/fs/btrfs/tree-log.c
> > +++ b/fs/btrfs/tree-log.c
> > @@ -4732,7 +4732,7 @@ static int log_one_extent(struct btrfs_trans_handle *trans,
> >
> >  /*
> >   * Log all prealloc extents beyond the inode's i_size to make sure we do not
> > - * lose them after doing a fast fsync and replaying the log. We scan the
> > + * lose them after doing a full/fast fsync and replaying the log. We scan the
> >   * subvolume's root instead of iterating the inode's extent map tree because
> >   * otherwise we can log incorrect extent items based on extent map conversion.
> >   * That can happen due to the fact that extent maps are merged when they
> > @@ -5510,6 +5510,7 @@ static int copy_inode_items_to_log(struct btrfs_trans_handle *trans,
> >                                  struct btrfs_log_ctx *ctx,
> >                                  bool *need_log_inode_item)
> >  {
> > +     const u64 i_size = i_size_read(&inode->vfs_inode);
> >       struct btrfs_root *root = inode->root;
> >       int ins_start_slot = 0;
> >       int ins_nr = 0;
> > @@ -5530,13 +5531,21 @@ static int copy_inode_items_to_log(struct btrfs_trans_handle *trans,
> >               if (min_key->type > max_key->type)
> >                       break;
> >
> > -             if (min_key->type == BTRFS_INODE_ITEM_KEY)
> > +             if (min_key->type == BTRFS_INODE_ITEM_KEY) {
> >                       *need_log_inode_item = false;
> > -
> > -             if ((min_key->type == BTRFS_INODE_REF_KEY ||
> > -                  min_key->type == BTRFS_INODE_EXTREF_KEY) &&
> > -                 inode->generation == trans->transid &&
> > -                 !recursive_logging) {
> > +             } else if (min_key->type == BTRFS_EXTENT_DATA_KEY &&
> > +                        min_key->offset >= i_size) {
> > +                     /*
> > +                      * Extents at and beyond eof are logged with
> > +                      * btrfs_log_prealloc_extents().
> > +                      * Only regular files have BTRFS_EXTENT_DATA_KEY keys,
> > +                      * and no keys greater than that, so bail out.
> > +                      */
> > +                     break;
> > +             } else if ((min_key->type == BTRFS_INODE_REF_KEY ||
> > +                         min_key->type == BTRFS_INODE_EXTREF_KEY) &&
> > +                        inode->generation == trans->transid &&
> > +                        !recursive_logging) {
> >                       u64 other_ino = 0;
> >                       u64 other_parent = 0;
> >
> > @@ -5567,10 +5576,8 @@ static int copy_inode_items_to_log(struct btrfs_trans_handle *trans,
> >                               btrfs_release_path(path);
> >                               goto next_key;
> >                       }
> > -             }
> > -
> > -             /* Skip xattrs, we log them later with btrfs_log_all_xattrs() */
> > -             if (min_key->type == BTRFS_XATTR_ITEM_KEY) {
> > +             } else if (min_key->type == BTRFS_XATTR_ITEM_KEY) {
> > +                     /* Skip xattrs, logged later with btrfs_log_all_xattrs() */
> >                       if (ins_nr == 0)
> >                               goto next_slot;
> >                       ret = copy_items(trans, inode, dst_path, path,
> > @@ -5623,9 +5630,21 @@ static int copy_inode_items_to_log(struct btrfs_trans_handle *trans,
> >                       break;
> >               }
> >       }
> > -     if (ins_nr)
> > +     if (ins_nr) {
> >               ret = copy_items(trans, inode, dst_path, path, ins_start_slot,
> >                                ins_nr, inode_only, logged_isize);
> > +             if (ret)
> > +                     return ret;
> > +     }
> > +
> > +     if (inode_only == LOG_INODE_ALL && S_ISREG(inode->vfs_inode.i_mode)) {
> > +             /*
> > +              * Release the path because otherwise we might attempt to double
> > +              * lock the same leaf with btrfs_log_prealloc_extents() below.
> > +              */
> > +             btrfs_release_path(path);
> > +             ret = btrfs_log_prealloc_extents(trans, inode, dst_path);
> > +     }
> >
> >       return ret;
> >  }
> > --
> > 2.33.0
>
>

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/7] btrfs: a fix for fsync and a few improvements to the full fsync path
  2022-02-17 12:12 [PATCH 0/7] btrfs: a fix for fsync and a few improvements to the full fsync path fdmanana
                   ` (6 preceding siblings ...)
  2022-02-17 12:12 ` [PATCH 7/7] btrfs: deal with unexpected extent type during reflinking fdmanana
@ 2022-02-21 16:15 ` David Sterba
  7 siblings, 0 replies; 13+ messages in thread
From: David Sterba @ 2022-02-21 16:15 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Thu, Feb 17, 2022 at 12:12:01PM +0000, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> This fixes a bug (first patch) with preallocated extents beyond eof being
> lost after a full fsync and a power failure. The rest is mostly some
> improvements to the full fsync code path (less IO, use less memory for
> logging checksums, etc), and silence smatch about a possible dereference
> of an uninitialized pointer. More details in the changelogs.
> 
> Filipe Manana (7):
>   btrfs: fix lost prealloc extents beyond eof after full fsync
>   btrfs: stop copying old file extents when doing a full fsync
>   btrfs: hold on to less memory when logging checksums during full fsync
>   btrfs: voluntarily relinquish cpu when doing a full fsync
>   btrfs: reset last_reflink_trans after fsyncing inode
>   btrfs: fix unexpected error path when reflinking an inline extent
>   btrfs: deal with unexpected extent type during reflinking

Added to misc-next, thanks.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/7] btrfs: fix lost prealloc extents beyond eof after full fsync
  2022-02-21 10:41     ` Filipe Manana
@ 2022-02-21 23:51       ` Wang Yugui
  2022-02-22 10:29         ` Filipe Manana
  0 siblings, 1 reply; 13+ messages in thread
From: Wang Yugui @ 2022-02-21 23:51 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs

Hi,

> >
> > xfstest btrfs/261 PASSED 40 times without this patch.
> >
> > so there is some problem in xfstest btrfs/261?
> 
> Not that I can see, it always fails for me on an unpatched kernel:

xfstest btrfs/261 always fails on unpatched kernel 5.17.0-rc4 here too.

Thanks a lot.

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2022/02/22


> 
> $ ./check btrfs/261
> FSTYP         -- btrfs
> PLATFORM      -- Linux/x86_64 debian9 5.17.0-rc4-btrfs-next-112 #1 SMP
> PREEMPT Wed Feb 16 11:20:50 WET 2022
> MKFS_OPTIONS  -- /dev/sdc
> MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1
> 
> btrfs/261 8s ... - output mismatch (see
> /home/fdmanana/git/hub/xfstests/results//btrfs/261.out.bad)
>     --- tests/btrfs/261.out 2022-02-15 11:53:34.273201027 +0000
>     +++ /home/fdmanana/git/hub/xfstests/results//btrfs/261.out.bad
> 2022-02-21 10:34:59.781769394 +0000
>     @@ -5,6 +5,3 @@
>      XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>      List of extents after power failure:
>      0: [0..32767]: data
>     -1: [32768..34815]: unwritten
>     -2: [34816..40959]: hole
>     -3: [40960..43007]: unwritten
>     ...
>     (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/261.out
> /home/fdmanana/git/hub/xfstests/results//btrfs/261.out.bad'  to see
> the entire diff)
> Ran: btrfs/261
> Failures: btrfs/261
> Failed 1 of 1 tests
> 
> And should fail regardless of the page size, which is the only factor
> I think could make a difference.
> 
> To figure it out, change the test like this:
> 
> diff --git a/tests/btrfs/261 b/tests/btrfs/261
> index 8275e6a5..540a2de2 100755
> --- a/tests/btrfs/261
> +++ b/tests/btrfs/261
> @@ -65,7 +65,15 @@ $XFS_IO_PROG -c "pwrite 0 4K" -c "fsync"
> $SCRATCH_MNT/foo | _filter_xfs_io
> 
>  # Simulate a power failure and then mount again the filesystem to
> replay the log
>  # tree.
> -_flakey_drop_and_remount
> +#_flakey_drop_and_remount
> +_load_flakey_table $FLAKEY_DROP_WRITES
> +_unmount_flakey
> +
> +$BTRFS_UTIL_PROG inspect-internal dump-tree $SCRATCH_DEV >>$seqres.full 2>&1
> +
> +_load_flakey_table $FLAKEY_ALLOW_WRITES
> +_mount_flakey
> +
> 
>  # After the power failure we expect that the preallocated extents, beyond the
>  # inode's i_size, still exist.
> 
> Run it and then provide the contents of the file results/btrfs/261.full
> 
> >
> > Best Regards
> > Wang Yugui (wangyugui@e16-tech.com)
> > 2022/02/21
> >
> > > From: Filipe Manana <fdmanana@suse.com>
> > >
> > > When doing a full fsync, if we have prealloc extents beyond (or at) eof,
> > > and the leaves that contain them were not modified in the current
> > > transaction, we end up not logging them. This results in losing those
> > > extents when we replay the log after a power failure, since the inode is
> > > truncated to the current value of the logged i_size.
> > >
> > > Just like for the fast fsync path, we need to always log all prealloc
> > > extents starting at or beyond i_size. The fast fsync case was fixed in
> > > commit 471d557afed155 ("Btrfs: fix loss of prealloc extents past i_size
> > > after fsync log replay") but it missed the full fsync path. The problem
> > > exists since the very early days, when the log tree was added by
> > > commit e02119d5a7b439 ("Btrfs: Add a write ahead tree log to optimize
> > > synchronous operations").
> > >
> > > Example reproducer:
> > >
> > >   $ mkfs.btrfs -f /dev/sdc
> > >   $ mount /dev/sdc /mnt
> > >
> > >   # Create our test file with many file extent items, so that they span
> > >   # several leaves of metadata, even if the node/page size is 64K. Use
> > >   # direct IO and not fsync/O_SYNC because it's both faster and it avoids
> > >   # clearing the full sync flag from the inode - we want the fsync below
> > >   # to trigger the slow full sync code path.
> > >   $ xfs_io -f -d -c "pwrite -b 4K 0 16M" /mnt/foo
> > >
> > >   # Now add two preallocated extents to our file without extending the
> > >   # file's size. One right at i_size, and another further beyond, leaving
> > >   # a gap between the two prealloc extents.
> > >   $ xfs_io -c "falloc -k 16M 1M" /mnt/foo
> > >   $ xfs_io -c "falloc -k 20M 1M" /mnt/foo
> > >
> > >   # Make sure everything is durably persisted and the transaction is
> > >   # committed. This makes all created extents to have a generation lower
> > >   # than the generation of the transaction used by the next write and
> > >   # fsync.
> > >   sync
> > >
> > >   # Now overwrite only the first extent, which will result in modifying
> > >   # only the first leaf of metadata for our inode. Then fsync it. This
> > >   # fsync will use the slow code path (inode full sync bit is set) because
> > >   # it's the first fsync since the inode was created/loaded.
> > >   $ xfs_io -c "pwrite 0 4K" -c "fsync" /mnt/foo
> > >
> > >   # Extent list before power failure.
> > >   $ xfs_io -c "fiemap -v" /mnt/foo
> > >   /mnt/foo:
> > >    EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
> > >      0: [0..7]:          2178048..2178055     8   0x0
> > >      1: [8..16383]:      26632..43007     16376   0x0
> > >      2: [16384..32767]:  2156544..2172927 16384   0x0
> > >      3: [32768..34815]:  2172928..2174975  2048 0x800
> > >      4: [34816..40959]:  hole              6144
> > >      5: [40960..43007]:  2174976..2177023  2048 0x801
> > >
> > >   <power fail>
> > >
> > >   # Mount fs again, trigger log replay.
> > >   $ mount /dev/sdc /mnt
> > >
> > >   # Extent list after power failure and log replay.
> > >   $ xfs_io -c "fiemap -v" /mnt/foo
> > >   /mnt/foo:
> > >    EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
> > >      0: [0..7]:          2178048..2178055     8   0x0
> > >      1: [8..16383]:      26632..43007     16376   0x0
> > >      2: [16384..32767]:  2156544..2172927 16384   0x1
> > >
> > >   # The prealloc extents at file offsets 16M and 20M are missing.
> > >
> > > So fix this by calling btrfs_log_prealloc_extents() when we are doing a
> > > full fsync, so that we always log all prealloc extents beyond eof.
> > >
> > > A test case for fstests will follow soon.
> > >
> > > CC: stable@vger.kernel.org # 4.19+
> > > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > > ---
> > >  fs/btrfs/tree-log.c | 43 +++++++++++++++++++++++++++++++------------
> > >  1 file changed, 31 insertions(+), 12 deletions(-)
> > >
> > > diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> > > index a483337e8f41..71a5a961fef7 100644
> > > --- a/fs/btrfs/tree-log.c
> > > +++ b/fs/btrfs/tree-log.c
> > > @@ -4732,7 +4732,7 @@ static int log_one_extent(struct btrfs_trans_handle *trans,
> > >
> > >  /*
> > >   * Log all prealloc extents beyond the inode's i_size to make sure we do not
> > > - * lose them after doing a fast fsync and replaying the log. We scan the
> > > + * lose them after doing a full/fast fsync and replaying the log. We scan the
> > >   * subvolume's root instead of iterating the inode's extent map tree because
> > >   * otherwise we can log incorrect extent items based on extent map conversion.
> > >   * That can happen due to the fact that extent maps are merged when they
> > > @@ -5510,6 +5510,7 @@ static int copy_inode_items_to_log(struct btrfs_trans_handle *trans,
> > >                                  struct btrfs_log_ctx *ctx,
> > >                                  bool *need_log_inode_item)
> > >  {
> > > +     const u64 i_size = i_size_read(&inode->vfs_inode);
> > >       struct btrfs_root *root = inode->root;
> > >       int ins_start_slot = 0;
> > >       int ins_nr = 0;
> > > @@ -5530,13 +5531,21 @@ static int copy_inode_items_to_log(struct btrfs_trans_handle *trans,
> > >               if (min_key->type > max_key->type)
> > >                       break;
> > >
> > > -             if (min_key->type == BTRFS_INODE_ITEM_KEY)
> > > +             if (min_key->type == BTRFS_INODE_ITEM_KEY) {
> > >                       *need_log_inode_item = false;
> > > -
> > > -             if ((min_key->type == BTRFS_INODE_REF_KEY ||
> > > -                  min_key->type == BTRFS_INODE_EXTREF_KEY) &&
> > > -                 inode->generation == trans->transid &&
> > > -                 !recursive_logging) {
> > > +             } else if (min_key->type == BTRFS_EXTENT_DATA_KEY &&
> > > +                        min_key->offset >= i_size) {
> > > +                     /*
> > > +                      * Extents at and beyond eof are logged with
> > > +                      * btrfs_log_prealloc_extents().
> > > +                      * Only regular files have BTRFS_EXTENT_DATA_KEY keys,
> > > +                      * and no keys greater than that, so bail out.
> > > +                      */
> > > +                     break;
> > > +             } else if ((min_key->type == BTRFS_INODE_REF_KEY ||
> > > +                         min_key->type == BTRFS_INODE_EXTREF_KEY) &&
> > > +                        inode->generation == trans->transid &&
> > > +                        !recursive_logging) {
> > >                       u64 other_ino = 0;
> > >                       u64 other_parent = 0;
> > >
> > > @@ -5567,10 +5576,8 @@ static int copy_inode_items_to_log(struct btrfs_trans_handle *trans,
> > >                               btrfs_release_path(path);
> > >                               goto next_key;
> > >                       }
> > > -             }
> > > -
> > > -             /* Skip xattrs, we log them later with btrfs_log_all_xattrs() */
> > > -             if (min_key->type == BTRFS_XATTR_ITEM_KEY) {
> > > +             } else if (min_key->type == BTRFS_XATTR_ITEM_KEY) {
> > > +                     /* Skip xattrs, logged later with btrfs_log_all_xattrs() */
> > >                       if (ins_nr == 0)
> > >                               goto next_slot;
> > >                       ret = copy_items(trans, inode, dst_path, path,
> > > @@ -5623,9 +5630,21 @@ static int copy_inode_items_to_log(struct btrfs_trans_handle *trans,
> > >                       break;
> > >               }
> > >       }
> > > -     if (ins_nr)
> > > +     if (ins_nr) {
> > >               ret = copy_items(trans, inode, dst_path, path, ins_start_slot,
> > >                                ins_nr, inode_only, logged_isize);
> > > +             if (ret)
> > > +                     return ret;
> > > +     }
> > > +
> > > +     if (inode_only == LOG_INODE_ALL && S_ISREG(inode->vfs_inode.i_mode)) {
> > > +             /*
> > > +              * Release the path because otherwise we might attempt to double
> > > +              * lock the same leaf with btrfs_log_prealloc_extents() below.
> > > +              */
> > > +             btrfs_release_path(path);
> > > +             ret = btrfs_log_prealloc_extents(trans, inode, dst_path);
> > > +     }
> > >
> > >       return ret;
> > >  }
> > > --
> > > 2.33.0
> >
> >



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/7] btrfs: fix lost prealloc extents beyond eof after full fsync
  2022-02-21 23:51       ` Wang Yugui
@ 2022-02-22 10:29         ` Filipe Manana
  0 siblings, 0 replies; 13+ messages in thread
From: Filipe Manana @ 2022-02-22 10:29 UTC (permalink / raw)
  To: Wang Yugui; +Cc: linux-btrfs

On Mon, Feb 21, 2022 at 11:52 PM Wang Yugui <wangyugui@e16-tech.com> wrote:
>
> Hi,
>
> > >
> > > xfstest btrfs/261 PASSED 40 times without this patch.
> > >
> > > so there is some problem in xfstest btrfs/261?
> >
> > Not that I can see, it always fails for me on an unpatched kernel:
>
> xfstest btrfs/261 always fails on unpatched kernel 5.17.0-rc4 here too.

I don't get it.
So earlier you said the test passed without the patch, but now you are
saying the opposite?
Did I misunderstand something?

>
> Thanks a lot.
>
> Best Regards
> Wang Yugui (wangyugui@e16-tech.com)
> 2022/02/22
>
>
> >
> > $ ./check btrfs/261
> > FSTYP         -- btrfs
> > PLATFORM      -- Linux/x86_64 debian9 5.17.0-rc4-btrfs-next-112 #1 SMP
> > PREEMPT Wed Feb 16 11:20:50 WET 2022
> > MKFS_OPTIONS  -- /dev/sdc
> > MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1
> >
> > btrfs/261 8s ... - output mismatch (see
> > /home/fdmanana/git/hub/xfstests/results//btrfs/261.out.bad)
> >     --- tests/btrfs/261.out 2022-02-15 11:53:34.273201027 +0000
> >     +++ /home/fdmanana/git/hub/xfstests/results//btrfs/261.out.bad
> > 2022-02-21 10:34:59.781769394 +0000
> >     @@ -5,6 +5,3 @@
> >      XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> >      List of extents after power failure:
> >      0: [0..32767]: data
> >     -1: [32768..34815]: unwritten
> >     -2: [34816..40959]: hole
> >     -3: [40960..43007]: unwritten
> >     ...
> >     (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/261.out
> > /home/fdmanana/git/hub/xfstests/results//btrfs/261.out.bad'  to see
> > the entire diff)
> > Ran: btrfs/261
> > Failures: btrfs/261
> > Failed 1 of 1 tests
> >
> > And should fail regardless of the page size, which is the only factor
> > I think could make a difference.
> >
> > To figure it out, change the test like this:
> >
> > diff --git a/tests/btrfs/261 b/tests/btrfs/261
> > index 8275e6a5..540a2de2 100755
> > --- a/tests/btrfs/261
> > +++ b/tests/btrfs/261
> > @@ -65,7 +65,15 @@ $XFS_IO_PROG -c "pwrite 0 4K" -c "fsync"
> > $SCRATCH_MNT/foo | _filter_xfs_io
> >
> >  # Simulate a power failure and then mount again the filesystem to
> > replay the log
> >  # tree.
> > -_flakey_drop_and_remount
> > +#_flakey_drop_and_remount
> > +_load_flakey_table $FLAKEY_DROP_WRITES
> > +_unmount_flakey
> > +
> > +$BTRFS_UTIL_PROG inspect-internal dump-tree $SCRATCH_DEV >>$seqres.full 2>&1
> > +
> > +_load_flakey_table $FLAKEY_ALLOW_WRITES
> > +_mount_flakey
> > +
> >
> >  # After the power failure we expect that the preallocated extents, beyond the
> >  # inode's i_size, still exist.
> >
> > Run it and then provide the contents of the file results/btrfs/261.full
> >
> > >
> > > Best Regards
> > > Wang Yugui (wangyugui@e16-tech.com)
> > > 2022/02/21
> > >
> > > > From: Filipe Manana <fdmanana@suse.com>
> > > >
> > > > When doing a full fsync, if we have prealloc extents beyond (or at) eof,
> > > > and the leaves that contain them were not modified in the current
> > > > transaction, we end up not logging them. This results in losing those
> > > > extents when we replay the log after a power failure, since the inode is
> > > > truncated to the current value of the logged i_size.
> > > >
> > > > Just like for the fast fsync path, we need to always log all prealloc
> > > > extents starting at or beyond i_size. The fast fsync case was fixed in
> > > > commit 471d557afed155 ("Btrfs: fix loss of prealloc extents past i_size
> > > > after fsync log replay") but it missed the full fsync path. The problem
> > > > exists since the very early days, when the log tree was added by
> > > > commit e02119d5a7b439 ("Btrfs: Add a write ahead tree log to optimize
> > > > synchronous operations").
> > > >
> > > > Example reproducer:
> > > >
> > > >   $ mkfs.btrfs -f /dev/sdc
> > > >   $ mount /dev/sdc /mnt
> > > >
> > > >   # Create our test file with many file extent items, so that they span
> > > >   # several leaves of metadata, even if the node/page size is 64K. Use
> > > >   # direct IO and not fsync/O_SYNC because it's both faster and it avoids
> > > >   # clearing the full sync flag from the inode - we want the fsync below
> > > >   # to trigger the slow full sync code path.
> > > >   $ xfs_io -f -d -c "pwrite -b 4K 0 16M" /mnt/foo
> > > >
> > > >   # Now add two preallocated extents to our file without extending the
> > > >   # file's size. One right at i_size, and another further beyond, leaving
> > > >   # a gap between the two prealloc extents.
> > > >   $ xfs_io -c "falloc -k 16M 1M" /mnt/foo
> > > >   $ xfs_io -c "falloc -k 20M 1M" /mnt/foo
> > > >
> > > >   # Make sure everything is durably persisted and the transaction is
> > > >   # committed. This makes all created extents to have a generation lower
> > > >   # than the generation of the transaction used by the next write and
> > > >   # fsync.
> > > >   sync
> > > >
> > > >   # Now overwrite only the first extent, which will result in modifying
> > > >   # only the first leaf of metadata for our inode. Then fsync it. This
> > > >   # fsync will use the slow code path (inode full sync bit is set) because
> > > >   # it's the first fsync since the inode was created/loaded.
> > > >   $ xfs_io -c "pwrite 0 4K" -c "fsync" /mnt/foo
> > > >
> > > >   # Extent list before power failure.
> > > >   $ xfs_io -c "fiemap -v" /mnt/foo
> > > >   /mnt/foo:
> > > >    EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
> > > >      0: [0..7]:          2178048..2178055     8   0x0
> > > >      1: [8..16383]:      26632..43007     16376   0x0
> > > >      2: [16384..32767]:  2156544..2172927 16384   0x0
> > > >      3: [32768..34815]:  2172928..2174975  2048 0x800
> > > >      4: [34816..40959]:  hole              6144
> > > >      5: [40960..43007]:  2174976..2177023  2048 0x801
> > > >
> > > >   <power fail>
> > > >
> > > >   # Mount fs again, trigger log replay.
> > > >   $ mount /dev/sdc /mnt
> > > >
> > > >   # Extent list after power failure and log replay.
> > > >   $ xfs_io -c "fiemap -v" /mnt/foo
> > > >   /mnt/foo:
> > > >    EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
> > > >      0: [0..7]:          2178048..2178055     8   0x0
> > > >      1: [8..16383]:      26632..43007     16376   0x0
> > > >      2: [16384..32767]:  2156544..2172927 16384   0x1
> > > >
> > > >   # The prealloc extents at file offsets 16M and 20M are missing.
> > > >
> > > > So fix this by calling btrfs_log_prealloc_extents() when we are doing a
> > > > full fsync, so that we always log all prealloc extents beyond eof.
> > > >
> > > > A test case for fstests will follow soon.
> > > >
> > > > CC: stable@vger.kernel.org # 4.19+
> > > > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > > > ---
> > > >  fs/btrfs/tree-log.c | 43 +++++++++++++++++++++++++++++++------------
> > > >  1 file changed, 31 insertions(+), 12 deletions(-)
> > > >
> > > > diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> > > > index a483337e8f41..71a5a961fef7 100644
> > > > --- a/fs/btrfs/tree-log.c
> > > > +++ b/fs/btrfs/tree-log.c
> > > > @@ -4732,7 +4732,7 @@ static int log_one_extent(struct btrfs_trans_handle *trans,
> > > >
> > > >  /*
> > > >   * Log all prealloc extents beyond the inode's i_size to make sure we do not
> > > > - * lose them after doing a fast fsync and replaying the log. We scan the
> > > > + * lose them after doing a full/fast fsync and replaying the log. We scan the
> > > >   * subvolume's root instead of iterating the inode's extent map tree because
> > > >   * otherwise we can log incorrect extent items based on extent map conversion.
> > > >   * That can happen due to the fact that extent maps are merged when they
> > > > @@ -5510,6 +5510,7 @@ static int copy_inode_items_to_log(struct btrfs_trans_handle *trans,
> > > >                                  struct btrfs_log_ctx *ctx,
> > > >                                  bool *need_log_inode_item)
> > > >  {
> > > > +     const u64 i_size = i_size_read(&inode->vfs_inode);
> > > >       struct btrfs_root *root = inode->root;
> > > >       int ins_start_slot = 0;
> > > >       int ins_nr = 0;
> > > > @@ -5530,13 +5531,21 @@ static int copy_inode_items_to_log(struct btrfs_trans_handle *trans,
> > > >               if (min_key->type > max_key->type)
> > > >                       break;
> > > >
> > > > -             if (min_key->type == BTRFS_INODE_ITEM_KEY)
> > > > +             if (min_key->type == BTRFS_INODE_ITEM_KEY) {
> > > >                       *need_log_inode_item = false;
> > > > -
> > > > -             if ((min_key->type == BTRFS_INODE_REF_KEY ||
> > > > -                  min_key->type == BTRFS_INODE_EXTREF_KEY) &&
> > > > -                 inode->generation == trans->transid &&
> > > > -                 !recursive_logging) {
> > > > +             } else if (min_key->type == BTRFS_EXTENT_DATA_KEY &&
> > > > +                        min_key->offset >= i_size) {
> > > > +                     /*
> > > > +                      * Extents at and beyond eof are logged with
> > > > +                      * btrfs_log_prealloc_extents().
> > > > +                      * Only regular files have BTRFS_EXTENT_DATA_KEY keys,
> > > > +                      * and no keys greater than that, so bail out.
> > > > +                      */
> > > > +                     break;
> > > > +             } else if ((min_key->type == BTRFS_INODE_REF_KEY ||
> > > > +                         min_key->type == BTRFS_INODE_EXTREF_KEY) &&
> > > > +                        inode->generation == trans->transid &&
> > > > +                        !recursive_logging) {
> > > >                       u64 other_ino = 0;
> > > >                       u64 other_parent = 0;
> > > >
> > > > @@ -5567,10 +5576,8 @@ static int copy_inode_items_to_log(struct btrfs_trans_handle *trans,
> > > >                               btrfs_release_path(path);
> > > >                               goto next_key;
> > > >                       }
> > > > -             }
> > > > -
> > > > -             /* Skip xattrs, we log them later with btrfs_log_all_xattrs() */
> > > > -             if (min_key->type == BTRFS_XATTR_ITEM_KEY) {
> > > > +             } else if (min_key->type == BTRFS_XATTR_ITEM_KEY) {
> > > > +                     /* Skip xattrs, logged later with btrfs_log_all_xattrs() */
> > > >                       if (ins_nr == 0)
> > > >                               goto next_slot;
> > > >                       ret = copy_items(trans, inode, dst_path, path,
> > > > @@ -5623,9 +5630,21 @@ static int copy_inode_items_to_log(struct btrfs_trans_handle *trans,
> > > >                       break;
> > > >               }
> > > >       }
> > > > -     if (ins_nr)
> > > > +     if (ins_nr) {
> > > >               ret = copy_items(trans, inode, dst_path, path, ins_start_slot,
> > > >                                ins_nr, inode_only, logged_isize);
> > > > +             if (ret)
> > > > +                     return ret;
> > > > +     }
> > > > +
> > > > +     if (inode_only == LOG_INODE_ALL && S_ISREG(inode->vfs_inode.i_mode)) {
> > > > +             /*
> > > > +              * Release the path because otherwise we might attempt to double
> > > > +              * lock the same leaf with btrfs_log_prealloc_extents() below.
> > > > +              */
> > > > +             btrfs_release_path(path);
> > > > +             ret = btrfs_log_prealloc_extents(trans, inode, dst_path);
> > > > +     }
> > > >
> > > >       return ret;
> > > >  }
> > > > --
> > > > 2.33.0
> > >
> > >
>
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2022-02-22 10:30 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-17 12:12 [PATCH 0/7] btrfs: a fix for fsync and a few improvements to the full fsync path fdmanana
2022-02-17 12:12 ` [PATCH 1/7] btrfs: fix lost prealloc extents beyond eof after full fsync fdmanana
2022-02-21 10:11   ` Wang Yugui
2022-02-21 10:41     ` Filipe Manana
2022-02-21 23:51       ` Wang Yugui
2022-02-22 10:29         ` Filipe Manana
2022-02-17 12:12 ` [PATCH 2/7] btrfs: stop copying old file extents when doing a " fdmanana
2022-02-17 12:12 ` [PATCH 3/7] btrfs: hold on to less memory when logging checksums during " fdmanana
2022-02-17 12:12 ` [PATCH 4/7] btrfs: voluntarily relinquish cpu when doing a " fdmanana
2022-02-17 12:12 ` [PATCH 5/7] btrfs: reset last_reflink_trans after fsyncing inode fdmanana
2022-02-17 12:12 ` [PATCH 6/7] btrfs: fix unexpected error path when reflinking an inline extent fdmanana
2022-02-17 12:12 ` [PATCH 7/7] btrfs: deal with unexpected extent type during reflinking fdmanana
2022-02-21 16:15 ` [PATCH 0/7] btrfs: a fix for fsync and a few improvements to the full fsync path David Sterba

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).