All of lore.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH] Btrfs: fix csum tree corruption, duplicate and outdated checksums
  2014-08-09 20:22 [PATCH] Btrfs: fix csum tree corruption, duplicate and outdated checksums Filipe Manana
@ 2014-08-09 19:40 ` Josef Bacik
  2014-08-09 20:10 ` Marc MERLIN
  2014-08-09 20:25 ` Chris Mason
  2 siblings, 0 replies; 4+ messages in thread
From: Josef Bacik @ 2014-08-09 19:40 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs, stable

I'm getting on a plane right now to kiss you, be prepared.  Thanks,

Josef

Filipe Manana <fdmanana@suse.com> wrote:


Under rare circumstances we can end up leaving 2 versions of a checksum
for the same file extent range.

The reason for this is that after calling btrfs_next_leaf we process
slot 0 of the leaf it returns, instead of processing the slot set in
path->slots[0]. Most of the time (by far) path->slots[0] is 0, but after
btrfs_next_leaf() releases the path and before it searches for the next
leaf, another task might cause a split of the next leaf, which migrates
some of its keys to the leaf we were processing before calling
btrfs_next_leaf(). In this case btrfs_next_leaf() returns again the
same leaf but with path->slots[0] having a slot number corresponding
to the first new key it got, that is, a slot number that didn't exist
before calling btrfs_next_leaf(), as the leaf now has more keys than
it had before. So we must really process the returned leaf starting at
path->slots[0] always, as it isn't always 0, and the key at slot 0 can
have an offset much lower than our search offset/bytenr.

For example, consider the following scenario, where we have:

sums->bytenr: 40157184, sums->len: 16384, sums end: 40173568
four 4kb file data blocks with offsets 40157184, 40161280, 40165376, 40169472

  Leaf N:

    slot = 0                           slot = btrfs_header_nritems() - 1
  |-------------------------------------------------------------------|
  | [(CSUM CSUM 39239680), size 8] ... [(CSUM CSUM 40116224), size 4] |
  |-------------------------------------------------------------------|

  Leaf N + 1:

      slot = 0                          slot = btrfs_header_nritems() - 1
  |--------------------------------------------------------------------|
  | [(CSUM CSUM 40161280), size 32] ... [((CSUM CSUM 40615936), size 8 |
  |--------------------------------------------------------------------|

Because we are at the last slot of leaf N, we call btrfs_next_leaf() to
find the next highest key, which releases the current path and then searches
for that next key. However after releasing the path and before finding that
next key, the item at slot 0 of leaf N + 1 gets moved to leaf N, due to a call
to ctree.c:push_leaf_left() (via ctree.c:split_leaf()), and therefore
btrfs_next_leaf() will returns us a path again with leaf N but with the slot
pointing to its new last key (CSUM CSUM 40161280). This new version of leaf N
is then:

    slot = 0                        slot = btrfs_header_nritems() - 2  slot = btrfs_header_nritems() - 1
  |----------------------------------------------------------------------------------------------------|
  | [(CSUM CSUM 39239680), size 8] ... [(CSUM CSUM 40116224), size 4]  [(CSUM CSUM 40161280), size 32] |
  |----------------------------------------------------------------------------------------------------|

And incorrecly using slot 0, makes us set next_offset to 39239680 and we jump
into the "insert:" label, which will set tmp to:

    tmp = min((sums->len - total_bytes) >> blocksize_bits,
        (next_offset - file_key.offset) >> blocksize_bits) =
    min((16384 - 0) >> 12, (39239680 - 40157184) >> 12) =
    min(4, (u64)-917504 = 18446744073708634112 >> 12) = 4

and

   ins_size = csum_size * tmp = 4 * 4 = 16 bytes.

In other words, we insert a new csum item in the tree with key
(CSUM_OBJECTID CSUM_KEY 40157184 = sums->bytenr) that contains the checksums
for all the data (4 blocks of 4096 bytes each = sums->len). Which is wrong,
because the item with key (CSUM CSUM 40161280) (the one that was moved from
leaf N + 1 to the end of leaf N) contains the old checksums of the last 12288
bytes of our data and won't get those old checksums removed.

So this leaves us 2 different checksums for 3 4kb blocks of data in the tree,
and breaks the logical rule:

   Key_N+1.offset >= Key_N.offset + length_of_data_its_checksums_cover

An obvious bad effect of this is that a subsequent csum tree lookup to get
the checksum of any of the blocks with logical offset of 40161280, 40165376
or 40169472 (the last 3 4kb blocks of file data), will get the old checksums.

Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/file-item.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index a1f97de..7897dcd 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -746,7 +746,7 @@ again:
                                found_next = 1;
                        if (ret != 0)
                                goto insert;
-                       slot = 0;
+                       slot = path->slots[0];
                }
                btrfs_item_key_to_cpu(path->nodes[0], &found_key, slot);
                if (found_key.objectid != BTRFS_EXTENT_CSUM_OBJECTID ||
--
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  https://urldefense.proofpoint.com/v1/url?u=http://vger.kernel.org/majordomo-info.html&k=ZVNjlDMF0FElm4dQtryO4A%3D%3D%0A&r=cKCbChRKsMpTX8ybrSkonQ%3D%3D%0A&m=%2F%2BJGihlBY9P4TZ77yUay5e6r3fawI30oSQXNPGxdVvc%3D%0A&s=32396751f1503bfc1bd4b63ca15fcf656cf42bacc3b09aaadbcc8e394bc65733

^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH] Btrfs: fix csum tree corruption, duplicate and outdated checksums
  2014-08-09 20:22 [PATCH] Btrfs: fix csum tree corruption, duplicate and outdated checksums Filipe Manana
  2014-08-09 19:40 ` Josef Bacik
@ 2014-08-09 20:10 ` Marc MERLIN
  2014-08-09 20:25 ` Chris Mason
  2 siblings, 0 replies; 4+ messages in thread
From: Marc MERLIN @ 2014-08-09 20:10 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs

On Sat, Aug 09, 2014 at 09:22:27PM +0100, Filipe Manana wrote:

(100 lines of detailled explanations snipped)

> -			slot = 0;
> +			slot = path->slots[0];

And this is why, trying to rank kernel contributions by number of
lines or characters is a very poor guide of the actual work accomplished
and owed credit.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                         | PGP 1024R/763BE901

^ permalink raw reply	[flat|nested] 4+ messages in thread

* [PATCH] Btrfs: fix csum tree corruption, duplicate and outdated checksums
@ 2014-08-09 20:22 Filipe Manana
  2014-08-09 19:40 ` Josef Bacik
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Filipe Manana @ 2014-08-09 20:22 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Filipe Manana, stable

Under rare circumstances we can end up leaving 2 versions of a checksum
for the same file extent range.

The reason for this is that after calling btrfs_next_leaf we process
slot 0 of the leaf it returns, instead of processing the slot set in
path->slots[0]. Most of the time (by far) path->slots[0] is 0, but after
btrfs_next_leaf() releases the path and before it searches for the next
leaf, another task might cause a split of the next leaf, which migrates
some of its keys to the leaf we were processing before calling
btrfs_next_leaf(). In this case btrfs_next_leaf() returns again the
same leaf but with path->slots[0] having a slot number corresponding
to the first new key it got, that is, a slot number that didn't exist
before calling btrfs_next_leaf(), as the leaf now has more keys than
it had before. So we must really process the returned leaf starting at
path->slots[0] always, as it isn't always 0, and the key at slot 0 can
have an offset much lower than our search offset/bytenr.

For example, consider the following scenario, where we have:

sums->bytenr: 40157184, sums->len: 16384, sums end: 40173568
four 4kb file data blocks with offsets 40157184, 40161280, 40165376, 40169472

  Leaf N:

    slot = 0                           slot = btrfs_header_nritems() - 1
  |-------------------------------------------------------------------|
  | [(CSUM CSUM 39239680), size 8] ... [(CSUM CSUM 40116224), size 4] |
  |-------------------------------------------------------------------|

  Leaf N + 1:

      slot = 0                          slot = btrfs_header_nritems() - 1
  |--------------------------------------------------------------------|
  | [(CSUM CSUM 40161280), size 32] ... [((CSUM CSUM 40615936), size 8 |
  |--------------------------------------------------------------------|

Because we are at the last slot of leaf N, we call btrfs_next_leaf() to
find the next highest key, which releases the current path and then searches
for that next key. However after releasing the path and before finding that
next key, the item at slot 0 of leaf N + 1 gets moved to leaf N, due to a call
to ctree.c:push_leaf_left() (via ctree.c:split_leaf()), and therefore
btrfs_next_leaf() will returns us a path again with leaf N but with the slot
pointing to its new last key (CSUM CSUM 40161280). This new version of leaf N
is then:

    slot = 0                        slot = btrfs_header_nritems() - 2  slot = btrfs_header_nritems() - 1
  |----------------------------------------------------------------------------------------------------|
  | [(CSUM CSUM 39239680), size 8] ... [(CSUM CSUM 40116224), size 4]  [(CSUM CSUM 40161280), size 32] |
  |----------------------------------------------------------------------------------------------------|

And incorrecly using slot 0, makes us set next_offset to 39239680 and we jump
into the "insert:" label, which will set tmp to:

    tmp = min((sums->len - total_bytes) >> blocksize_bits,
        (next_offset - file_key.offset) >> blocksize_bits) =
    min((16384 - 0) >> 12, (39239680 - 40157184) >> 12) =
    min(4, (u64)-917504 = 18446744073708634112 >> 12) = 4

and

   ins_size = csum_size * tmp = 4 * 4 = 16 bytes.

In other words, we insert a new csum item in the tree with key
(CSUM_OBJECTID CSUM_KEY 40157184 = sums->bytenr) that contains the checksums
for all the data (4 blocks of 4096 bytes each = sums->len). Which is wrong,
because the item with key (CSUM CSUM 40161280) (the one that was moved from
leaf N + 1 to the end of leaf N) contains the old checksums of the last 12288
bytes of our data and won't get those old checksums removed.

So this leaves us 2 different checksums for 3 4kb blocks of data in the tree,
and breaks the logical rule:

   Key_N+1.offset >= Key_N.offset + length_of_data_its_checksums_cover

An obvious bad effect of this is that a subsequent csum tree lookup to get
the checksum of any of the blocks with logical offset of 40161280, 40165376
or 40169472 (the last 3 4kb blocks of file data), will get the old checksums.

Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/file-item.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index a1f97de..7897dcd 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -746,7 +746,7 @@ again:
 				found_next = 1;
 			if (ret != 0)
 				goto insert;
-			slot = 0;
+			slot = path->slots[0];
 		}
 		btrfs_item_key_to_cpu(path->nodes[0], &found_key, slot);
 		if (found_key.objectid != BTRFS_EXTENT_CSUM_OBJECTID ||
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH] Btrfs: fix csum tree corruption, duplicate and outdated checksums
  2014-08-09 20:22 [PATCH] Btrfs: fix csum tree corruption, duplicate and outdated checksums Filipe Manana
  2014-08-09 19:40 ` Josef Bacik
  2014-08-09 20:10 ` Marc MERLIN
@ 2014-08-09 20:25 ` Chris Mason
  2 siblings, 0 replies; 4+ messages in thread
From: Chris Mason @ 2014-08-09 20:25 UTC (permalink / raw)
  To: Filipe Manana, linux-btrfs; +Cc: stable



On 08/09/2014 04:22 PM, Filipe Manana wrote:
> Under rare circumstances we can end up leaving 2 versions of a checksum
> for the same file extent range.
> 
> The reason for this is that after calling btrfs_next_leaf we process
> slot 0 of the leaf it returns, instead of processing the slot set in
> path->slots[0]. Most of the time (by far) path->slots[0] is 0, but after
> btrfs_next_leaf() releases the path and before it searches for the next
> leaf, another task might cause a split of the next leaf, which migrates
> some of its keys to the leaf we were processing before calling
> btrfs_next_leaf(). In this case btrfs_next_leaf() returns again the
> same leaf but with path->slots[0] having a slot number corresponding
> to the first new key it got, that is, a slot number that didn't exist
> before calling btrfs_next_leaf(), as the leaf now has more keys than
> it had before. So we must really process the returned leaf starting at
> path->slots[0] always, as it isn't always 0, and the key at slot 0 can
> have an offset much lower than our search offset/bytenr.

And the bug goes all the way back to 2007.  I'd like to blame Yan Zheng,
but it was in my original code too.

Great find and explanation, I've added this to my merge window pull.
Thanks!

-chris

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2014-08-09 20:25 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-08-09 20:22 [PATCH] Btrfs: fix csum tree corruption, duplicate and outdated checksums Filipe Manana
2014-08-09 19:40 ` Josef Bacik
2014-08-09 20:10 ` Marc MERLIN
2014-08-09 20:25 ` Chris Mason

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.