linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org, Robbie Ko <robbieko@synology.com>,
	Filipe Manana <fdmanana@suse.com>,
	David Sterba <dsterba@suse.com>,
	Anand Jain <anand.jain@oracle.com>
Subject: [PATCH 4.14 43/70] Btrfs: fix unexpected failure of nocow buffered writes after snapshotting when low on space
Date: Mon, 12 Oct 2020 15:26:59 +0200	[thread overview]
Message-ID: <20201012132632.255115359@linuxfoundation.org> (raw)
In-Reply-To: <20201012132630.201442517@linuxfoundation.org>

From: Robbie Ko <robbieko@synology.com>

commit 8ecebf4d767e2307a946c8905278d6358eda35c3 upstream.

Commit e9894fd3e3b3 ("Btrfs: fix snapshot vs nocow writting") forced
nocow writes to fallback to COW, during writeback, when a snapshot is
created. This resulted in writes made before creating the snapshot to
unexpectedly fail with ENOSPC during writeback when success (0) was
returned to user space through the write system call.

The steps leading to this problem are:

1. When it's not possible to allocate data space for a write, the
   buffered write path checks if a NOCOW write is possible.  If it is,
   it will not reserve space and success (0) is returned to user space.

2. Then when a snapshot is created, the root's will_be_snapshotted
   atomic is incremented and writeback is triggered for all inode's that
   belong to the root being snapshotted. Incrementing that atomic forces
   all previous writes to fallback to COW during writeback (running
   delalloc).

3. This results in the writeback for the inodes to fail and therefore
   setting the ENOSPC error in their mappings, so that a subsequent
   fsync on them will report the error to user space. So it's not a
   completely silent data loss (since fsync will report ENOSPC) but it's
   a very unexpected and undesirable behaviour, because if a clean
   shutdown/unmount of the filesystem happens without previous calls to
   fsync, it is expected to have the data present in the files after
   mounting the filesystem again.

So fix this by adding a new atomic named snapshot_force_cow to the
root structure which prevents this behaviour and works the following way:

1. It is incremented when we start to create a snapshot after triggering
   writeback and before waiting for writeback to finish.

2. This new atomic is now what is used by writeback (running delalloc)
   to decide whether we need to fallback to COW or not. Because we
   incremented this new atomic after triggering writeback in the
   snapshot creation ioctl, we ensure that all buffered writes that
   happened before snapshot creation will succeed and not fallback to
   COW (which would make them fail with ENOSPC).

3. The existing atomic, will_be_snapshotted, is kept because it is used
   to force new buffered writes, that start after we started
   snapshotting, to reserve data space even when NOCOW is possible.
   This makes these writes fail early with ENOSPC when there's no
   available space to allocate, preventing the unexpected behaviour of
   writeback later failing with ENOSPC due to a fallback to COW mode.

Fixes: e9894fd3e3b3 ("Btrfs: fix snapshot vs nocow writting")
Signed-off-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 fs/btrfs/ctree.h   |    1 +
 fs/btrfs/disk-io.c |    1 +
 fs/btrfs/inode.c   |   25 ++++---------------------
 fs/btrfs/ioctl.c   |   16 ++++++++++++++++
 4 files changed, 22 insertions(+), 21 deletions(-)

--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1257,6 +1257,7 @@ struct btrfs_root {
 	int send_in_progress;
 	struct btrfs_subvolume_writers *subv_writers;
 	atomic_t will_be_snapshotted;
+	atomic_t snapshot_force_cow;
 
 	/* For qgroup metadata space reserve */
 	atomic64_t qgroup_meta_rsv;
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1200,6 +1200,7 @@ static void __setup_root(struct btrfs_ro
 	refcount_set(&root->refs, 1);
 	atomic_set(&root->will_be_snapshotted, 0);
 	atomic64_set(&root->qgroup_meta_rsv, 0);
+	atomic_set(&root->snapshot_force_cow, 0);
 	root->log_transid = 0;
 	root->log_transid_committed = -1;
 	root->last_log_commit = 0;
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1335,7 +1335,7 @@ static noinline int run_delalloc_nocow(s
 	u64 disk_num_bytes;
 	u64 ram_bytes;
 	int extent_type;
-	int ret, err;
+	int ret;
 	int type;
 	int nocow;
 	int check_prev = 1;
@@ -1460,11 +1460,8 @@ next_slot:
 			 * if there are pending snapshots for this root,
 			 * we fall into common COW way.
 			 */
-			if (!nolock) {
-				err = btrfs_start_write_no_snapshotting(root);
-				if (!err)
-					goto out_check;
-			}
+			if (!nolock && atomic_read(&root->snapshot_force_cow))
+				goto out_check;
 			/*
 			 * force cow if csum exists in the range.
 			 * this ensure that csum for a given extent are
@@ -1473,9 +1470,6 @@ next_slot:
 			ret = csum_exist_in_range(fs_info, disk_bytenr,
 						  num_bytes);
 			if (ret) {
-				if (!nolock)
-					btrfs_end_write_no_snapshotting(root);
-
 				/*
 				 * ret could be -EIO if the above fails to read
 				 * metadata.
@@ -1488,11 +1482,8 @@ next_slot:
 				WARN_ON_ONCE(nolock);
 				goto out_check;
 			}
-			if (!btrfs_inc_nocow_writers(fs_info, disk_bytenr)) {
-				if (!nolock)
-					btrfs_end_write_no_snapshotting(root);
+			if (!btrfs_inc_nocow_writers(fs_info, disk_bytenr))
 				goto out_check;
-			}
 			nocow = 1;
 		} else if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
 			extent_end = found_key.offset +
@@ -1505,8 +1496,6 @@ next_slot:
 out_check:
 		if (extent_end <= start) {
 			path->slots[0]++;
-			if (!nolock && nocow)
-				btrfs_end_write_no_snapshotting(root);
 			if (nocow)
 				btrfs_dec_nocow_writers(fs_info, disk_bytenr);
 			goto next_slot;
@@ -1528,8 +1517,6 @@ out_check:
 					     end, page_started, nr_written, 1,
 					     NULL);
 			if (ret) {
-				if (!nolock && nocow)
-					btrfs_end_write_no_snapshotting(root);
 				if (nocow)
 					btrfs_dec_nocow_writers(fs_info,
 								disk_bytenr);
@@ -1549,8 +1536,6 @@ out_check:
 					  ram_bytes, BTRFS_COMPRESS_NONE,
 					  BTRFS_ORDERED_PREALLOC);
 			if (IS_ERR(em)) {
-				if (!nolock && nocow)
-					btrfs_end_write_no_snapshotting(root);
 				if (nocow)
 					btrfs_dec_nocow_writers(fs_info,
 								disk_bytenr);
@@ -1589,8 +1574,6 @@ out_check:
 					     EXTENT_CLEAR_DATA_RESV,
 					     PAGE_UNLOCK | PAGE_SET_PRIVATE2);
 
-		if (!nolock && nocow)
-			btrfs_end_write_no_snapshotting(root);
 		cur_offset = extent_end;
 
 		/*
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -655,6 +655,7 @@ static int create_snapshot(struct btrfs_
 	struct btrfs_pending_snapshot *pending_snapshot;
 	struct btrfs_trans_handle *trans;
 	int ret;
+	bool snapshot_force_cow = false;
 
 	if (!test_bit(BTRFS_ROOT_REF_COWS, &root->state))
 		return -EINVAL;
@@ -671,6 +672,11 @@ static int create_snapshot(struct btrfs_
 		goto free_pending;
 	}
 
+	/*
+	 * Force new buffered writes to reserve space even when NOCOW is
+	 * possible. This is to avoid later writeback (running dealloc) to
+	 * fallback to COW mode and unexpectedly fail with ENOSPC.
+	 */
 	atomic_inc(&root->will_be_snapshotted);
 	smp_mb__after_atomic();
 	btrfs_wait_for_no_snapshotting_writes(root);
@@ -679,6 +685,14 @@ static int create_snapshot(struct btrfs_
 	if (ret)
 		goto dec_and_free;
 
+	/*
+	 * All previous writes have started writeback in NOCOW mode, so now
+	 * we force future writes to fallback to COW mode during snapshot
+	 * creation.
+	 */
+	atomic_inc(&root->snapshot_force_cow);
+	snapshot_force_cow = true;
+
 	btrfs_wait_ordered_extents(root, U64_MAX, 0, (u64)-1);
 
 	btrfs_init_block_rsv(&pending_snapshot->block_rsv,
@@ -744,6 +758,8 @@ static int create_snapshot(struct btrfs_
 fail:
 	btrfs_subvolume_release_metadata(fs_info, &pending_snapshot->block_rsv);
 dec_and_free:
+	if (snapshot_force_cow)
+		atomic_dec(&root->snapshot_force_cow);
 	if (atomic_dec_and_test(&root->will_be_snapshotted))
 		wake_up_atomic_t(&root->will_be_snapshotted);
 free_pending:



  parent reply	other threads:[~2020-10-12 13:38 UTC|newest]

Thread overview: 74+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-10-12 13:26 [PATCH 4.14 00/70] 4.14.201-rc1 review Greg Kroah-Hartman
2020-10-12 13:26 ` [PATCH 4.14 01/70] vsock/virtio: use RCU to avoid use-after-free on the_virtio_vsock Greg Kroah-Hartman
2020-10-12 13:26 ` [PATCH 4.14 02/70] vsock/virtio: stop workers during the .remove() Greg Kroah-Hartman
2020-10-12 13:26 ` [PATCH 4.14 03/70] vsock/virtio: add transport parameter to the virtio_transport_reset_no_sock() Greg Kroah-Hartman
2020-10-12 13:26 ` [PATCH 4.14 04/70] net: virtio_vsock: Enhance connection semantics Greg Kroah-Hartman
2020-10-12 13:26 ` [PATCH 4.14 05/70] USB: gadget: f_ncm: Fix NDP16 datagram validation Greg Kroah-Hartman
2020-10-12 13:26 ` [PATCH 4.14 06/70] gpio: tc35894: fix up tc35894 interrupt configuration Greg Kroah-Hartman
2020-10-12 13:26 ` [PATCH 4.14 07/70] Input: i8042 - add nopnp quirk for Acer Aspire 5 A515 Greg Kroah-Hartman
2020-10-12 13:26 ` [PATCH 4.14 08/70] drm/amdgpu: restore proper ref count in amdgpu_display_crtc_set_config Greg Kroah-Hartman
2020-10-12 13:26 ` [PATCH 4.14 09/70] drivers/net/wan/hdlc_fr: Add needed_headroom for PVC devices Greg Kroah-Hartman
2020-10-12 13:26 ` [PATCH 4.14 10/70] drm/sun4i: mixer: Extend regmap max_register Greg Kroah-Hartman
2020-10-12 13:26 ` [PATCH 4.14 11/70] net: dec: de2104x: Increase receive ring size for Tulip Greg Kroah-Hartman
2020-10-12 13:26 ` [PATCH 4.14 12/70] rndis_host: increase sleep time in the query-response loop Greg Kroah-Hartman
2020-10-12 13:26 ` [PATCH 4.14 13/70] drivers/net/wan/lapbether: Make skb->protocol consistent with the header Greg Kroah-Hartman
2020-10-12 13:26 ` [PATCH 4.14 14/70] drivers/net/wan/hdlc: Set skb->protocol before transmitting Greg Kroah-Hartman
2020-10-12 13:26 ` [PATCH 4.14 15/70] mac80211: do not allow bigger VHT MPDUs than the hardware supports Greg Kroah-Hartman
2020-10-12 13:26 ` [PATCH 4.14 16/70] spi: fsl-espi: Only process interrupts for expected events Greg Kroah-Hartman
2020-10-12 13:26 ` [PATCH 4.14 17/70] nvme-fc: fail new connections to a deleted host or remote port Greg Kroah-Hartman
2020-10-12 13:26 ` [PATCH 4.14 18/70] pinctrl: mvebu: Fix i2c sda definition for 98DX3236 Greg Kroah-Hartman
2020-10-12 13:26 ` [PATCH 4.14 19/70] nfs: Fix security label length not being reset Greg Kroah-Hartman
2020-10-12 13:26 ` [PATCH 4.14 20/70] clk: samsung: exynos4: mark chipid clock as CLK_IGNORE_UNUSED Greg Kroah-Hartman
2020-10-12 13:26 ` [PATCH 4.14 21/70] iommu/exynos: add missing put_device() call in exynos_iommu_of_xlate() Greg Kroah-Hartman
2020-10-12 13:26 ` [PATCH 4.14 22/70] i2c: cpm: Fix i2c_ram structure Greg Kroah-Hartman
2020-10-12 13:26 ` [PATCH 4.14 23/70] Input: trackpoint - enable Synaptics trackpoints Greg Kroah-Hartman
2020-10-12 13:26 ` [PATCH 4.14 24/70] random32: Restore __latent_entropy attribute on net_rand_state Greg Kroah-Hartman
2020-10-12 13:26 ` [PATCH 4.14 25/70] net/packet: fix overflow in tpacket_rcv Greg Kroah-Hartman
2020-10-12 13:26 ` [PATCH 4.14 26/70] epoll: do not insert into poll queues until all sanity checks are done Greg Kroah-Hartman
2020-10-12 13:26 ` [PATCH 4.14 27/70] epoll: replace ->visited/visited_list with generation count Greg Kroah-Hartman
2020-10-12 13:26 ` [PATCH 4.14 28/70] epoll: EPOLL_CTL_ADD: close the race in decision to take fast path Greg Kroah-Hartman
2020-10-12 13:26 ` [PATCH 4.14 29/70] ep_create_wakeup_source(): dentry name can change under you Greg Kroah-Hartman
2020-10-12 13:26 ` [PATCH 4.14 30/70] netfilter: ctnetlink: add a range check for l3/l4 protonum Greg Kroah-Hartman
2020-10-12 13:26 ` [PATCH 4.14 31/70] drm/syncobj: Fix drm_syncobj_handle_to_fd refcount leak Greg Kroah-Hartman
2020-10-12 13:26 ` [PATCH 4.14 32/70] fbdev, newport_con: Move FONT_EXTRA_WORDS macros into linux/font.h Greg Kroah-Hartman
2020-10-12 13:26 ` [PATCH 4.14 33/70] Fonts: Support FONT_EXTRA_WORDS macros for built-in fonts Greg Kroah-Hartman
2020-10-12 13:26 ` [PATCH 4.14 34/70] Revert "ravb: Fixed to be able to unload modules" Greg Kroah-Hartman
2020-10-12 13:26 ` [PATCH 4.14 35/70] fbcon: Fix global-out-of-bounds read in fbcon_get_font() Greg Kroah-Hartman
2020-10-12 13:26 ` [PATCH 4.14 36/70] net: wireless: nl80211: fix out-of-bounds access in nl80211_del_key() Greg Kroah-Hartman
2020-10-12 13:26 ` [PATCH 4.14 37/70] usermodehelper: reset umask to default before executing user process Greg Kroah-Hartman
2020-10-12 13:26 ` [PATCH 4.14 38/70] platform/x86: thinkpad_acpi: initialize tp_nvram_state variable Greg Kroah-Hartman
2020-10-12 13:26 ` [PATCH 4.14 39/70] platform/x86: thinkpad_acpi: re-initialize ACPI buffer size when reuse Greg Kroah-Hartman
2020-10-12 13:26 ` [PATCH 4.14 40/70] driver core: Fix probe_count imbalance in really_probe() Greg Kroah-Hartman
2020-10-12 13:26 ` [PATCH 4.14 41/70] perf top: Fix stdio interface input handling with glibc 2.28+ Greg Kroah-Hartman
2020-10-12 13:26 ` [PATCH 4.14 42/70] mtd: rawnand: sunxi: Fix the probe error path Greg Kroah-Hartman
2020-10-12 13:26 ` Greg Kroah-Hartman [this message]
2020-10-12 13:27 ` [PATCH 4.14 44/70] ftrace: Move RCU is watching check after recursion check Greg Kroah-Hartman
2020-10-12 13:27 ` [PATCH 4.14 45/70] macsec: avoid use-after-free in macsec_handle_frame() Greg Kroah-Hartman
2020-10-12 13:27 ` [PATCH 4.14 46/70] mm/khugepaged: fix filemap page_to_pgoff(page) != offset Greg Kroah-Hartman
2020-10-12 13:27 ` [PATCH 4.14 47/70] cifs: Fix incomplete memory allocation on setxattr path Greg Kroah-Hartman
2020-10-12 13:27 ` [PATCH 4.14 48/70] i2c: meson: fix clock setting overwrite Greg Kroah-Hartman
2020-10-12 13:27 ` [PATCH 4.14 49/70] sctp: fix sctp_auth_init_hmacs() error path Greg Kroah-Hartman
2020-10-12 13:27 ` [PATCH 4.14 50/70] team: set dev->needed_headroom in team_setup_by_port() Greg Kroah-Hartman
2020-10-12 13:27 ` [PATCH 4.14 51/70] net: team: fix memory leak in __team_options_register Greg Kroah-Hartman
2020-10-12 13:27 ` [PATCH 4.14 52/70] openvswitch: handle DNAT tuple collision Greg Kroah-Hartman
2020-10-12 13:27 ` [PATCH 4.14 53/70] drm/amdgpu: prevent double kfree ttm->sg Greg Kroah-Hartman
2020-10-12 13:27 ` [PATCH 4.14 54/70] xfrm: clone XFRMA_REPLAY_ESN_VAL in xfrm_do_migrate Greg Kroah-Hartman
2020-10-12 13:27 ` [PATCH 4.14 55/70] xfrm: clone XFRMA_SEC_CTX " Greg Kroah-Hartman
2020-10-12 13:27 ` [PATCH 4.14 56/70] xfrm: clone whole liftime_cur structure " Greg Kroah-Hartman
2020-10-12 13:27 ` [PATCH 4.14 57/70] net: stmmac: removed enabling eee in EEE set callback Greg Kroah-Hartman
2020-10-12 13:27 ` [PATCH 4.14 58/70] platform/x86: fix kconfig dependency warning for FUJITSU_LAPTOP Greg Kroah-Hartman
2020-10-12 13:27 ` [PATCH 4.14 59/70] xfrm: Use correct address family in xfrm_state_find Greg Kroah-Hartman
2020-10-12 13:27 ` [PATCH 4.14 60/70] bonding: set dev->needed_headroom in bond_setup_by_slave() Greg Kroah-Hartman
2020-10-12 13:27 ` [PATCH 4.14 61/70] mdio: fix mdio-thunder.c dependency & build error Greg Kroah-Hartman
2020-10-12 13:27 ` [PATCH 4.14 62/70] net: usb: ax88179_178a: fix missing stop entry in driver_info Greg Kroah-Hartman
2020-10-12 13:27 ` [PATCH 4.14 63/70] rxrpc: Fix rxkad token xdr encoding Greg Kroah-Hartman
2020-10-12 13:27 ` [PATCH 4.14 64/70] rxrpc: Downgrade the BUG() for unsupported token type in rxrpc_read() Greg Kroah-Hartman
2020-10-12 13:27 ` [PATCH 4.14 65/70] rxrpc: Fix some missing _bh annotations on locking conn->state_lock Greg Kroah-Hartman
2020-10-12 13:27 ` [PATCH 4.14 66/70] rxrpc: Fix server keyring leak Greg Kroah-Hartman
2020-10-12 13:27 ` [PATCH 4.14 67/70] perf: Fix task_function_call() error handling Greg Kroah-Hartman
2020-10-12 13:27 ` [PATCH 4.14 68/70] mmc: core: dont set limits.discard_granularity as 0 Greg Kroah-Hartman
2020-10-12 13:27 ` [PATCH 4.14 69/70] mm: khugepaged: recalculate min_free_kbytes after memory hotplug as expected by khugepaged Greg Kroah-Hartman
2020-10-12 13:27 ` [PATCH 4.14 70/70] net: usb: rtl8150: set random MAC address when set_ethernet_addr() fails Greg Kroah-Hartman
2020-10-13  6:29 ` [PATCH 4.14 00/70] 4.14.201-rc1 review Naresh Kamboju
2020-10-13 16:39 ` Guenter Roeck
2020-10-14  1:28 ` Shuah Khan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20201012132632.255115359@linuxfoundation.org \
    --to=gregkh@linuxfoundation.org \
    --cc=anand.jain@oracle.com \
    --cc=dsterba@suse.com \
    --cc=fdmanana@suse.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=robbieko@synology.com \
    --cc=stable@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).