linux-kernel.vger.kernel.org archive mirror
From: Sasha Levin <sashal@kernel.org>
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: Filipe Manana <fdmanana@suse.com>,
	David Sterba <dsterba@suse.com>, Sasha Levin <sashal@kernel.org>
Subject: [PATCH AUTOSEL 5.0 09/98] Btrfs: fix file corruption after snapshotting due to mix of buffered/DIO writes
Date: Mon, 22 Apr 2019 15:40:36 -0400
Message-ID: <20190422194205.10404-9-sashal@kernel.org>
In-Reply-To: <20190422194205.10404-1-sashal@kernel.org>

From: Filipe Manana <fdmanana@suse.com>

[ Upstream commit 609e804d771f59dc5d45a93e5ee0053c74bbe2bf ]

When we mix buffered writes with direct IO writes against the same file
while snapshotting is happening concurrently, we can end up with corrupt
file content in the snapshot. Example:

1) Inode/file is empty.

2) Snapshotting starts.

3) Buffered write at offset 0 length 256Kb. This updates the i_size of the
   inode to 256Kb, disk_i_size remains zero. This happens after the task
   doing the snapshot flushes all existing delalloc.

4) DIO write at offset 256Kb length 768Kb. Once the ordered extent
   completes it sets the inode's disk_i_size to 1Mb (256Kb + 768Kb) and
   updates the inode item in the fs tree with a size of 1Mb (which is
   the value of disk_i_size).

5) The delalloc for the range [0, 256Kb[ did not start yet.

6) The transaction used in the DIO ordered extent completion, which updated
   the inode item, is committed by the snapshotting task.

7) Snapshot creation completes.

8) Delalloc for the range [0, 256Kb[ is flushed.

After that when reading the file from the snapshot we always get zeroes for
the range [0, 256Kb[, the file has a size of 1Mb and the data written by
the direct IO write is found. From an application's point of view this is
a corruption, since in the source subvolume it could never read a version
of the file that included the data from the direct IO write without the
data from the buffered write included as well. In the snapshot's tree,
file extent items are missing for the range [0, 256Kb[.
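
The sequence of events above can be replayed with a small userspace toy
model. This is only a sketch for illustration: the toy_* types and
functions are hypothetical and heavily simplified, they are not the btrfs
structures or the real disk_i_size update logic. The model only tracks
which ranges have file extent items and what size the inode item records.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_KIB 1024ULL
#define TOY_MAX_EXTENTS 8

/* Toy stand-ins for illustration only; not the real btrfs structures. */
struct toy_extent {
	uint64_t start;
	uint64_t len;
};

struct toy_inode {
	uint64_t i_size;      /* in-memory size, bumped by every write */
	uint64_t disk_i_size; /* size stored in the inode item */
	struct toy_extent extents[TOY_MAX_EXTENTS]; /* file extent items */
	size_t nr_extents;
	struct toy_extent delalloc; /* dirty page cache range, not on disk */
	bool has_delalloc;
};

/* An ordered extent completes: add the extent item, advance disk_i_size. */
static void toy_complete_ordered(struct toy_inode *ino, uint64_t start,
				 uint64_t len)
{
	assert(ino->nr_extents < TOY_MAX_EXTENTS);
	ino->extents[ino->nr_extents].start = start;
	ino->extents[ino->nr_extents].len = len;
	ino->nr_extents++;
	if (start + len > ino->disk_i_size)
		ino->disk_i_size = start + len;
}

/* Buffered write: only i_size and the page cache change. */
static void toy_buffered_write(struct toy_inode *ino, uint64_t start,
			       uint64_t len)
{
	if (start + len > ino->i_size)
		ino->i_size = start + len;
	ino->delalloc.start = start;
	ino->delalloc.len = len;
	ino->has_delalloc = true;
}

/* Direct IO write: the ordered extent completes right away. */
static void toy_dio_write(struct toy_inode *ino, uint64_t start, uint64_t len)
{
	if (start + len > ino->i_size)
		ino->i_size = start + len;
	toy_complete_ordered(ino, start, len);
}

/* Writeback of the buffered range. */
static void toy_flush_delalloc(struct toy_inode *ino)
{
	if (!ino->has_delalloc)
		return;
	toy_complete_ordered(ino, ino->delalloc.start, ino->delalloc.len);
	ino->has_delalloc = false;
}

/* A snapshot only sees what is in the committed tree. */
static struct toy_inode toy_snapshot(const struct toy_inode *ino)
{
	struct toy_inode snap = *ino;

	snap.i_size = snap.disk_i_size;
	snap.has_delalloc = false;
	return snap;
}

/* True if some file extent item covers the given offset. */
static bool toy_has_extent_at(const struct toy_inode *ino, uint64_t offset)
{
	for (size_t i = 0; i < ino->nr_extents; i++) {
		const struct toy_extent *e = &ino->extents[i];

		if (offset >= e->start && offset - e->start < e->len)
			return true;
	}
	return false;
}

/* Replays the example; returns the snapshot, *src gets the final source. */
static struct toy_inode toy_replay(struct toy_inode *src)
{
	struct toy_inode ino = {0};
	struct toy_inode snap;

	toy_buffered_write(&ino, 0, 256 * TOY_KIB);
	toy_dio_write(&ino, 256 * TOY_KIB, 768 * TOY_KIB);
	snap = toy_snapshot(&ino); /* commit happens before writeback */
	toy_flush_delalloc(&ino);
	*src = ino;
	return snap;
}
```

Calling toy_replay() returns a snapshot whose recorded size covers the
direct IO write while toy_has_extent_at() reports no extent at offset 0,
whereas the source inode ends up with both ranges covered.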

The issue, obviously, does not happen when using the -o flushoncommit
mount option.

Fix this by flushing delalloc for all the roots that are about to be
snapshotted when committing a transaction. This guarantees total ordering
when updating the disk_i_size of an inode since the flush for delalloc is
done when a transaction is in the TRANS_STATE_COMMIT_START state and the
wait is done once no more external writers exist. This is similar to what
we do when using the flushoncommit mount option, but we do it only if the
transaction has snapshots to create and only for the roots of the
subvolumes to be snapshotted. The bulk of the delalloc is flushed in the
snapshot creation ioctl, so the flush work we do inside the transaction
is minimized.
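
The two-phase shape of the fix (start the flush, then wait later) can be
sketched with a toy model. Everything below is hypothetical and exists only
to illustrate which roots are touched: struct toy_root and the toy_*
functions are not kernel code, and byte counters stand in for real
delalloc ranges and ordered extents.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of a subvolume root; not struct btrfs_root. */
struct toy_root {
	uint64_t delalloc_bytes; /* dirty, writeback not started */
	uint64_t ordered_bytes;  /* writeback started, not yet complete */
	uint64_t on_disk_bytes;  /* covered by file extent items */
	bool pending_snapshot;   /* a snapshot of this root is queued */
};

/* With flushoncommit every root is flushed, otherwise only pending ones. */
static bool toy_root_selected(const struct toy_root *root, bool flushoncommit)
{
	return flushoncommit || root->pending_snapshot;
}

/* Start phase: kick off writeback for the selected roots, do not wait. */
static void toy_start_delalloc_flush(struct toy_root *roots, size_t n,
				     bool flushoncommit)
{
	assert(roots != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (!toy_root_selected(&roots[i], flushoncommit))
			continue;
		roots[i].ordered_bytes += roots[i].delalloc_bytes;
		roots[i].delalloc_bytes = 0;
	}
}

/* Wait phase: runs later in the commit, completes what was started. */
static void toy_wait_delalloc_flush(struct toy_root *roots, size_t n,
				    bool flushoncommit)
{
	assert(roots != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (!toy_root_selected(&roots[i], flushoncommit))
			continue;
		roots[i].on_disk_bytes += roots[i].ordered_bytes;
		roots[i].ordered_bytes = 0;
	}
}
```

Running the start phase and then the wait phase with flushoncommit set to
false leaves a root without a pending snapshot untouched, while a root
with a pending snapshot ends up with all of its dirty bytes on disk before
the snapshot is created.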

This issue, involving buffered and direct IO writes with snapshotting, is
often triggered by fstest btrfs/078, and is reported by fsck when not
using the NO_HOLES feature, for example:

  $ cat results/btrfs/078.full
  (...)
  _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
  *** fsck.btrfs output ***
  [1/7] checking root items
  [2/7] checking extents
  [3/7] checking free space cache
  [4/7] checking fs roots
  root 258 inode 264 errors 100, file extent discount
  Found file extent holes:
        start: 524288, len: 65536
  ERROR: errors found in fs roots
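
What a "file extent hole" means here can be shown with a short sketch:
without NO_HOLES, every byte in [0, i_size[ is expected to be covered by a
file extent item. The function below is a hypothetical illustration, not
the btrfs-progs check, and the extent layout in the usage note is made up
so that the gap matches the one reported above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy file extent item: a byte range of the file. */
struct toy_extent {
	uint64_t start;
	uint64_t len;
};

/*
 * Finds the first range in [0, i_size[ that no extent covers.
 * The extents must be sorted by start offset and must not overlap.
 * Returns true and fills *hole if a gap exists, false otherwise.
 */
static bool toy_find_first_hole(const struct toy_extent *ext, size_t n,
				uint64_t i_size, struct toy_extent *hole)
{
	uint64_t expected = 0; /* next offset that must be covered */

	assert(hole != NULL);
	for (size_t i = 0; i < n && expected < i_size; i++) {
		if (ext[i].start > expected) {
			uint64_t end = ext[i].start < i_size ?
				       ext[i].start : i_size;

			hole->start = expected;
			hole->len = end - expected;
			return true;
		}
		expected = ext[i].start + ext[i].len;
	}
	if (expected < i_size) {
		/* The extents stop before the end of the file. */
		hole->start = expected;
		hole->len = i_size - expected;
		return true;
	}
	return false;
}
```

For a 1Mb file with hypothetical extents [0, 524288[ and
[589824, 1048576[, the function reports a hole with start 524288 and
len 65536.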

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin (Microsoft) <sashal@kernel.org>
---
 fs/btrfs/transaction.c | 49 ++++++++++++++++++++++++++++++++++++------
 1 file changed, 43 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 4ec2b660d014..7f3ece91a4d0 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1886,8 +1886,10 @@ static void btrfs_cleanup_pending_block_groups(struct btrfs_trans_handle *trans)
        }
 }
 
-static inline int btrfs_start_delalloc_flush(struct btrfs_fs_info *fs_info)
+static inline int btrfs_start_delalloc_flush(struct btrfs_trans_handle *trans)
 {
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+
 	/*
 	 * We use writeback_inodes_sb here because if we used
 	 * btrfs_start_delalloc_roots we would deadlock with fs freeze.
@@ -1897,15 +1899,50 @@ static inline int btrfs_start_delalloc_flush(struct btrfs_fs_info *fs_info)
 	 * from already being in a transaction and our join_transaction doesn't
 	 * have to re-take the fs freeze lock.
 	 */
-	if (btrfs_test_opt(fs_info, FLUSHONCOMMIT))
+	if (btrfs_test_opt(fs_info, FLUSHONCOMMIT)) {
 		writeback_inodes_sb(fs_info->sb, WB_REASON_SYNC);
+	} else {
+		struct btrfs_pending_snapshot *pending;
+		struct list_head *head = &trans->transaction->pending_snapshots;
+
+		/*
+		 * Flush delalloc for any root that is going to be snapshotted.
+		 * This is done to avoid a corrupted version of files, in the
+		 * snapshots, that had both buffered and direct IO writes (even
+		 * if they were done sequentially) due to an unordered update of
+		 * the inode's size on disk.
+		 */
+		list_for_each_entry(pending, head, list) {
+			int ret;
+
+			ret = btrfs_start_delalloc_snapshot(pending->root);
+			if (ret)
+				return ret;
+		}
+	}
 	return 0;
 }
 
-static inline void btrfs_wait_delalloc_flush(struct btrfs_fs_info *fs_info)
+static inline void btrfs_wait_delalloc_flush(struct btrfs_trans_handle *trans)
 {
-	if (btrfs_test_opt(fs_info, FLUSHONCOMMIT))
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+
+	if (btrfs_test_opt(fs_info, FLUSHONCOMMIT)) {
 		btrfs_wait_ordered_roots(fs_info, U64_MAX, 0, (u64)-1);
+	} else {
+		struct btrfs_pending_snapshot *pending;
+		struct list_head *head = &trans->transaction->pending_snapshots;
+
+		/*
+		 * Wait for any delalloc that we started previously for the roots
+		 * that are going to be snapshotted. This is to avoid a corrupted
+		 * version of files in the snapshots that had both buffered and
+		 * direct IO writes (even if they were done sequentially).
+		 */
+		list_for_each_entry(pending, head, list)
+			btrfs_wait_ordered_extents(pending->root,
+						   U64_MAX, 0, U64_MAX);
+	}
 }
 
 int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
@@ -2024,7 +2061,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 
 	extwriter_counter_dec(cur_trans, trans->type);
 
-	ret = btrfs_start_delalloc_flush(fs_info);
+	ret = btrfs_start_delalloc_flush(trans);
 	if (ret)
 		goto cleanup_transaction;
 
@@ -2040,7 +2077,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 	if (ret)
 		goto cleanup_transaction;
 
-	btrfs_wait_delalloc_flush(fs_info);
+	btrfs_wait_delalloc_flush(trans);
 
 	btrfs_scrub_pause(fs_info);
 	/*
-- 
2.19.1

