All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jiri Slaby <jslaby@suse.cz>
To: stable@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, Filipe Manana <fdmanana@suse.com>,
	Chris Mason <clm@fb.com>, Jiri Slaby <jslaby@suse.cz>
Subject: [PATCH 3.12 75/78] Btrfs: fix fs corruption on transaction abort if device supports discard
Date: Fri,  9 Jan 2015 11:32:24 +0100	[thread overview]
Message-ID: <bc5e18c1dc407bd245a190076d362ecf7975e64a.1420799385.git.jslaby@suse.cz> (raw)
In-Reply-To: <72002f1f248c28d1715d10454190e209d5a20fe1.1420799385.git.jslaby@suse.cz>
In-Reply-To: <cover.1420799385.git.jslaby@suse.cz>

From: Filipe Manana <fdmanana@suse.com>

3.12-stable review patch.  If anyone has any objections, please let me know.

===============

commit 678886bdc6378c1cbd5072da2c5a3035000214e3 upstream.

When we abort a transaction we iterate over all the ranges marked as dirty
in fs_info->freed_extents[0] and fs_info->freed_extents[1], clear them
from those trees, add them back (unpin) to the free space caches and, if
the fs was mounted with "-o discard", perform a discard on those regions.
Also, after adding the regions to the free space caches, a fitrim ioctl call
can see those ranges in a block group's free space cache and perform a discard
on the ranges, so the same issue can happen without "-o discard" as well.

This causes corruption, affecting one or multiple btree nodes (in the worst
case leaving the fs unmountable) because some of those ranges (the ones in
the fs_info->pinned_extents tree) correspond to btree nodes/leafs that are
referred by the last committed super block - breaking the rule that anything
that was committed by a transaction is untouched until the next transaction
commits successfully.

I ran into this while running in a loop (for several hours) the fstest that
I recently submitted:

  [PATCH] fstests: add btrfs test to stress chunk allocation/removal and fstrim

The corruption always happened when a transaction aborted and then fsck complained
like this:

   _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
   *** fsck.btrfs output ***
   Check tree block failed, want=94945280, have=0
   Check tree block failed, want=94945280, have=0
   Check tree block failed, want=94945280, have=0
   Check tree block failed, want=94945280, have=0
   Check tree block failed, want=94945280, have=0
   read block failed check_tree_block
   Couldn't open file system

In this case 94945280 corresponded to the root of a tree.
Using frace what I observed was the following sequence of steps happened:

   1) transaction N started, fs_info->pinned_extents pointed to
      fs_info->freed_extents[0];

   2) node/eb 94945280 is created;

   3) eb is persisted to disk;

   4) transaction N commit starts, fs_info->pinned_extents now points to
      fs_info->freed_extents[1], and transaction N completes;

   5) transaction N + 1 starts;

   6) eb is COWed, and btrfs_free_tree_block() called for this eb;

   7) eb range (94945280 to 94945280 + 16Kb) is added to
      fs_info->pinned_extents (fs_info->freed_extents[1]);

   8) Something goes wrong in transaction N + 1, like hitting ENOSPC
      for example, and the transaction is aborted, turning the fs into
      readonly mode. The stack trace I got for example:

      [112065.253935]  [<ffffffff8140c7b6>] dump_stack+0x4d/0x66
      [112065.254271]  [<ffffffff81042984>] warn_slowpath_common+0x7f/0x98
      [112065.254567]  [<ffffffffa0325990>] ? __btrfs_abort_transaction+0x50/0x10b [btrfs]
      [112065.261674]  [<ffffffff810429e5>] warn_slowpath_fmt+0x48/0x50
      [112065.261922]  [<ffffffffa032949e>] ? btrfs_free_path+0x26/0x29 [btrfs]
      [112065.262211]  [<ffffffffa0325990>] __btrfs_abort_transaction+0x50/0x10b [btrfs]
      [112065.262545]  [<ffffffffa036b1d6>] btrfs_remove_chunk+0x537/0x58b [btrfs]
      [112065.262771]  [<ffffffffa033840f>] btrfs_delete_unused_bgs+0x1de/0x21b [btrfs]
      [112065.263105]  [<ffffffffa0343106>] cleaner_kthread+0x100/0x12f [btrfs]
      (...)
      [112065.264493] ---[ end trace dd7903a975a31a08 ]---
      [112065.264673] BTRFS: error (device sdc) in btrfs_remove_chunk:2625: errno=-28 No space left
      [112065.264997] BTRFS info (device sdc): forced readonly

   9) The clear kthread sees that the BTRFS_FS_STATE_ERROR bit is set in
      fs_info->fs_state and calls btrfs_cleanup_transaction(), which in
      turn calls btrfs_destroy_pinned_extent();

   10) Then btrfs_destroy_pinned_extent() iterates over all the ranges
       marked as dirty in fs_info->freed_extents[], and for each one
       it calls discard, if the fs was mounted with "-o discard", and
       adds the range to the free space cache of the respective block
       group;

   11) btrfs_trim_block_group(), invoked from the fitrim ioctl code path,
       sees the free space entries and performs a discard;

   12) After an umount and mount (or fsck), our eb's location on disk was full
       of zeroes, and it should have been untouched, because it was marked as
       dirty in the fs_info->pinned_extents tree, and therefore used by the
       trees that the last committed superblock points to.

Fix this by not performing a discard and not adding the ranges to the free space
caches - it's useless from this point since the fs is now in readonly mode and
we won't write free space caches to disk anymore (otherwise we would leak space)
nor any new superblock. By not adding the ranges to the free space caches, it
prevents other code paths from allocating that space and write to it as well,
therefore being safer and simpler.

This isn't a new problem, as it's been present since 2011 (git commit
acce952b0263825da32cf10489413dec78053347).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
---
 fs/btrfs/disk-io.c     |  6 ------
 fs/btrfs/extent-tree.c | 10 ++++++----
 2 files changed, 6 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 8964b59fee92..f46ad53626be 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3995,12 +3995,6 @@ again:
 		if (ret)
 			break;
 
-		/* opt_discard */
-		if (btrfs_test_opt(root, DISCARD))
-			ret = btrfs_error_discard_extent(root, start,
-							 end + 1 - start,
-							 NULL);
-
 		clear_extent_dirty(unpin, start, end, GFP_NOFS);
 		btrfs_error_unpin_extent_range(root, start, end);
 		cond_resched();
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 63ee604efa6c..b1c6e490379c 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5476,7 +5476,8 @@ void btrfs_prepare_extent_commit(struct btrfs_trans_handle *trans,
 	update_global_block_rsv(fs_info);
 }
 
-static int unpin_extent_range(struct btrfs_root *root, u64 start, u64 end)
+static int unpin_extent_range(struct btrfs_root *root, u64 start, u64 end,
+			      const bool return_free_space)
 {
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct btrfs_block_group_cache *cache = NULL;
@@ -5500,7 +5501,8 @@ static int unpin_extent_range(struct btrfs_root *root, u64 start, u64 end)
 
 		if (start < cache->last_byte_to_unpin) {
 			len = min(len, cache->last_byte_to_unpin - start);
-			btrfs_add_free_space(cache, start, len);
+			if (return_free_space)
+				btrfs_add_free_space(cache, start, len);
 		}
 
 		start += len;
@@ -5563,7 +5565,7 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans,
 						   end + 1 - start, NULL);
 
 		clear_extent_dirty(unpin, start, end, GFP_NOFS);
-		unpin_extent_range(root, start, end);
+		unpin_extent_range(root, start, end, true);
 		cond_resched();
 	}
 
@@ -8809,7 +8811,7 @@ out:
 
 int btrfs_error_unpin_extent_range(struct btrfs_root *root, u64 start, u64 end)
 {
-	return unpin_extent_range(root, start, end);
+	return unpin_extent_range(root, start, end, false);
 }
 
 int btrfs_error_discard_extent(struct btrfs_root *root, u64 bytenr,
-- 
2.2.1


  parent reply	other threads:[~2015-01-09 10:33 UTC|newest]

Thread overview: 89+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-01-09 10:30 [PATCH 3.12 00/78] 3.12.36-stable review Jiri Slaby
2015-01-09 10:31 ` [PATCH 3.12 01/78] ipv6: gre: fix wrong skb->protocol in WCCP Jiri Slaby
2015-01-09 10:31 ` [PATCH 3.12 02/78] Fix race condition between vxlan_sock_add and vxlan_sock_release Jiri Slaby
2015-01-09 10:31 ` [PATCH 3.12 03/78] tg3: fix ring init when there are more TX than RX channels Jiri Slaby
2015-01-09 10:31 ` [PATCH 3.12 04/78] net/mlx4_core: Limit count field to 24 bits in qp_alloc_res Jiri Slaby
2015-01-09 10:31 ` [PATCH 3.12 05/78] rtnetlink: release net refcnt on error in do_setlink() Jiri Slaby
2015-01-09 10:31 ` [PATCH 3.12 06/78] xen-netfront: Remove BUGs on paged skb data which crosses a page boundary Jiri Slaby
2015-01-09 10:31 ` [PATCH 3.12 07/78] net: mvneta: fix Tx interrupt delay Jiri Slaby
2015-01-09 10:31 ` [PATCH 3.12 08/78] net: mvneta: fix race condition in mvneta_tx() Jiri Slaby
2015-01-09 10:31 ` [PATCH 3.12 09/78] net: sctp: use MAX_HEADER for headroom reserve in output path Jiri Slaby
2015-01-09 10:31 ` [PATCH 3.12 10/78] ceph: fix null pointer dereference in discard_cap_releases() Jiri Slaby
2015-01-09 10:31 ` [PATCH 3.12 11/78] perf/x86/intel: Protect LBR and extra_regs against KVM lying Jiri Slaby
2015-01-10 11:24   ` Dongsu Park
2015-01-10 11:42     ` Jiri Slaby
2015-01-09 10:31 ` [PATCH 3.12 12/78] s390/3215: fix hanging console issue Jiri Slaby
2015-01-09 10:31 ` [PATCH 3.12 13/78] s390/3215: fix tty output containing tabs Jiri Slaby
2015-01-09 10:31 ` [PATCH 3.12 14/78] usb: gadget: at91_udc: move prepare clk into process context Jiri Slaby
2015-01-09 10:31 ` [PATCH 3.12 15/78] tty: Fix pty master poll() after slave closes v2 Jiri Slaby
2015-01-09 10:31 ` [PATCH 3.12 16/78] mm: frontswap: invalidate expired data on a dup-store failure Jiri Slaby
2015-01-09 10:31 ` [PATCH 3.12 17/78] mm/vmpressure.c: fix race in vmpressure_work_fn() Jiri Slaby
2015-01-09 10:31 ` [PATCH 3.12 18/78] mm: fix swapoff hang after page migration and fork Jiri Slaby
2015-01-09 10:31 ` [PATCH 3.12 19/78] mm: fix anon_vma_clone() error treatment Jiri Slaby
2015-01-09 10:31 ` [PATCH 3.12 20/78] i2c: omap: fix NACK and Arbitration Lost irq handling Jiri Slaby
2015-01-09 10:31 ` [PATCH 3.12 21/78] i2c: omap: fix i207 errata handling Jiri Slaby
2015-01-09 10:31 ` [PATCH 3.12 22/78] i2c: davinci: generate STP always when NACK is received Jiri Slaby
2015-01-09 10:31 ` [PATCH 3.12 23/78] drm/radeon: kernel panic in drm_calc_vbltimestamp_from_scanoutpos with 3.18.0-rc6 Jiri Slaby
2015-01-09 10:31 ` [PATCH 3.12 24/78] drm/i915: More cautious with pch fifo underruns Jiri Slaby
2015-01-09 10:31 ` [PATCH 3.12 25/78] drm/i915: Unlock panel even when LVDS is disabled Jiri Slaby
2015-01-09 10:31 ` [PATCH 3.12 26/78] media: smiapp: Only some selection targets are settable Jiri Slaby
2015-01-09 10:31 ` [PATCH 3.12 27/78] USB: xhci: Reset a halted endpoint immediately when we encounter a stall Jiri Slaby
2015-01-09 10:31 ` [PATCH 3.12 28/78] AHCI: Add DeviceIDs for Sunrise Point-LP SATA controller Jiri Slaby
2015-01-09 10:31 ` [PATCH 3.12 29/78] ahci: disable MSI on SAMSUNG 0xa800 SSD Jiri Slaby
2015-01-09 10:31 ` [PATCH 3.12 30/78] sata_fsl: fix error handling of irq_of_parse_and_map Jiri Slaby
2015-01-09 10:31 ` [PATCH 3.12 31/78] igb: bring link up when PHY is powered up Jiri Slaby
2015-01-09 10:31 ` [PATCH 3.12 32/78] powerpc: 32 bit getcpu VDSO function uses 64 bit instructions Jiri Slaby
2015-01-09 10:31 ` [PATCH 3.12 33/78] ALSA: hda - Add EAPD fixup for ASUS Z99He laptop Jiri Slaby
2015-01-09 10:31 ` [PATCH 3.12 34/78] ALSA: hda - Fix built-in mic at resume on Lenovo Ideapad S210 Jiri Slaby
2015-01-09 10:31 ` [PATCH 3.12 35/78] ALSA: usb-audio: Don't resubmit pending URBs at MIDI error recovery Jiri Slaby
2015-01-09 10:31 ` [PATCH 3.12 36/78] isofs: Fix infinite looping over CE entries Jiri Slaby
2015-01-09 10:31 ` [PATCH 3.12 37/78] x86/tls: Validate TLS entries to protect espfix Jiri Slaby
2015-01-09 10:31 ` [PATCH 3.12 38/78] x86/tls: Disallow unusual TLS segments Jiri Slaby
2015-01-09 10:31 ` [PATCH 3.12 39/78] x86, kvm: Clear paravirt_enabled on KVM guests for espfix32's benefit Jiri Slaby
2015-01-09 10:31 ` [PATCH 3.12 40/78] mfd: tc6393xb: Fail ohci suspend if full state restore is required Jiri Slaby
2015-01-09 10:31 ` [PATCH 3.12 41/78] mmc: block: add newline to sysfs display of force_ro Jiri Slaby
2015-01-09 10:31 ` [PATCH 3.12 42/78] megaraid_sas: corrected return of wait_event from abort frame path Jiri Slaby
2015-01-09 10:31 ` [PATCH 3.12 43/78] scsi: correct return values for .eh_abort_handler implementations Jiri Slaby
2015-01-09 10:31 ` [PATCH 3.12 44/78] nfs41: fix nfs4_proc_layoutget error handling Jiri Slaby
2015-01-09 10:31 ` [PATCH 3.12 45/78] dm bufio: fix memleak when using a dm_buffer's inline bio Jiri Slaby
2015-01-09 10:31 ` [PATCH 3.12 46/78] dm space map metadata: fix sm_bootstrap_get_nr_blocks() Jiri Slaby
2015-01-09 10:31 ` [PATCH 3.12 47/78] x86/tls: Don't validate lm in set_thread_area() after all Jiri Slaby
2015-01-09 10:31 ` [PATCH 3.12 48/78] audit: change decimal constant to macro for invalid uid Jiri Slaby
2015-01-09 10:31 ` [PATCH 3.12 49/78] isofs: Fix unchecked printing of ER records Jiri Slaby
2015-01-09 10:31 ` [PATCH 3.12 50/78] KEYS: Fix stale key registration at error path Jiri Slaby
2015-01-09 10:32 ` [PATCH 3.12 51/78] mac80211: fix multicast LED blinking and counter Jiri Slaby
2015-01-09 10:32 ` [PATCH 3.12 52/78] mac80211: free management frame keys when removing station Jiri Slaby
2015-01-09 10:32 ` [PATCH 3.12 53/78] thermal: Fix error path in thermal_init() Jiri Slaby
2015-01-09 10:32 ` [PATCH 3.12 54/78] mnt: Implicitly add MNT_NODEV on remount when it was implicitly added by mount Jiri Slaby
2015-01-09 10:32 ` [PATCH 3.12 55/78] mnt: Update unprivileged remount test Jiri Slaby
2015-01-09 10:32 ` [PATCH 3.12 56/78] umount: Disallow unprivileged mount force Jiri Slaby
2015-01-09 10:32 ` [PATCH 3.12 57/78] groups: Consolidate the setgroups permission checks Jiri Slaby
2015-01-09 10:32 ` [PATCH 3.12 58/78] userns: Document what the invariant required for safe unprivileged mappings Jiri Slaby
2015-01-09 10:32 ` [PATCH 3.12 59/78] userns: Don't allow setgroups until a gid mapping has been setablished Jiri Slaby
2015-01-09 10:32 ` [PATCH 3.12 60/78] userns: Don't allow unprivileged creation of gid mappings Jiri Slaby
2015-01-09 10:32 ` [PATCH 3.12 61/78] userns: Check euid no fsuid when establishing an unprivileged uid mapping Jiri Slaby
2015-01-09 10:32 ` [PATCH 3.12 62/78] userns: Only allow the creator of the userns unprivileged mappings Jiri Slaby
2015-01-09 10:32 ` [PATCH 3.12 63/78] userns: Rename id_map_mutex to userns_state_mutex Jiri Slaby
2015-01-09 10:32 ` [PATCH 3.12 64/78] userns: Add a knob to disable setgroups on a per user namespace basis Jiri Slaby
2015-01-09 10:32 ` [PATCH 3.12 65/78] userns: Allow setting gid_maps without privilege when setgroups is disabled Jiri Slaby
2015-01-09 10:32 ` [PATCH 3.12 66/78] userns: Unbreak the unprivileged remount tests Jiri Slaby
2015-01-09 10:32 ` [PATCH 3.12 67/78] audit: restore AUDIT_LOGINUID unset ABI Jiri Slaby
2015-01-09 10:32 ` [PATCH 3.12 68/78] crypto: af_alg - fix backlog handling Jiri Slaby
2015-01-09 10:32 ` [PATCH 3.12 69/78] ncpfs: return proper error from NCP_IOC_SETROOT ioctl Jiri Slaby
2015-01-09 10:32 ` [PATCH 3.12 70/78] exit: pidns: alloc_pid() leaks pid_namespace if child_reaper is exiting Jiri Slaby
2015-01-09 10:32 ` [PATCH 3.12 71/78] udf: Verify symlink size before loading it Jiri Slaby
2015-01-09 10:32 ` [PATCH 3.12 72/78] eCryptfs: Force RO mount when encrypted view is enabled Jiri Slaby
2015-01-09 10:32 ` [PATCH 3.12 73/78] eCryptfs: Remove buggy and unnecessary write in file name decode routine Jiri Slaby
2015-01-09 10:32 ` [PATCH 3.12 74/78] Btrfs: do not move em to modified list when unpinning Jiri Slaby
2015-01-09 10:32 ` Jiri Slaby [this message]
2015-01-09 10:32 ` [PATCH 3.12 76/78] mfd: stmpe: Fix STMPE24xx GPMR LSB Jiri Slaby
2015-01-09 10:32 ` [PATCH 3.12 77/78] mfd: viperboard: Fix platform-device id collision Jiri Slaby
2015-01-09 10:32 ` [PATCH 3.12 78/78] mm: let mm_find_pmd fix buggy race with THP fault Jiri Slaby
2015-01-10  5:01   ` Hugh Dickins
2015-01-12 10:01     ` Jiri Slaby
2015-01-12 11:13       ` Kirill A. Shutemov
2015-01-12 23:13         ` Hugh Dickins
2015-01-09 17:59 ` [PATCH 3.12 00/78] 3.12.36-stable review Guenter Roeck
2015-01-11  3:40   ` Satoru Takeuchi
2015-01-12 10:35     ` Jiri Slaby
2015-01-12 18:00 ` Shuah Khan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=bc5e18c1dc407bd245a190076d362ecf7975e64a.1420799385.git.jslaby@suse.cz \
    --to=jslaby@suse.cz \
    --cc=clm@fb.com \
    --cc=fdmanana@suse.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=stable@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.