All of lore.kernel.org
 help / color / mirror / Atom feed
* Newbie questions on some of btrfs code...
@ 2012-05-18 11:21 Alex Lyakas
  2012-05-18 11:50 ` Hugo Mills
  2012-05-21 10:44 ` Jan Schmidt
  0 siblings, 2 replies; 17+ messages in thread
From: Alex Lyakas @ 2012-05-18 11:21 UTC (permalink / raw)
  To: linux-btrfs

Greetings everybody,
I have been studying some of the btrfs code and the developer
documentation on the wiki. My primary interest at this point, is to be
able to search within fs tree of a btrfs subvolume, which was created
as a snapshot of another subvolume. For that I have been using the
debug-tree tool plus the References.png diagram on the wiki. I realize
that my knowledge of btrfs is very rudimentary at this point, so
please bear with me.

# How can I navigate from EXTENT_DATA within the fs tree to
appropriate CHUNK_ITEM in the chunk tree? I am basically trying to
find where the file data resides on disk. For example, I have an
EXTENT_DATA like this:
        item 30 key (265 EXTENT_DATA 0) itemoff 888 itemsize 53
                extent data disk byte 12648448 nr 8192
                extent data offset 0 nr 8192 ram 8192
                extent compression 0
I can navigate from here to EXTENT_ITEM within the extent tree, using
btrfs_file_extent_item::disk_bytenr/disk_num_bytes as key for search:
        item 3 key (12648448 EXTENT_ITEM 8192) itemoff 3870 itemsize 53
                extent refs 1 gen 8 flags 1
                extent data backref root 5 objectid 265 offset 0 count 1
But from there how can I reach the relevant CHUNK_ITEM?

# Once I have reached the CHUNK_ITEM, I assume that
btrfs_file_extent_item::offset/num_bytes fields will provide the exact
location of the data on disk. Is that correct? For now I assume that
btrfs was created on a single device and raid0 is used for data, so I
totally ignore mirroring/striping at this point.

# I have been trying to follow btrfs_fiemap(), which seems to do the
job, but it looks like it returns the disk_bytenr/disk_num_bytes
fields without following to CHUNK_ITEMs. Maybe I am wrong.

Some general questions on the ctree code.

# I saw that slot==0 is special. My understanding is that btrfs
maintains the property that the parent of each node/leaf has a key
pointing to that node/leaf, which must be equal to the key in the
slot==0 of this node/leaf. That's what fixup_low_keys() tries to
maintain. Is this correct?

# If my understanding in the previous bullet is correct: Is that the
reason that in btrfs_prev_leaf() it is assumed that if there is a
lesser key, btrfs_search_slot() will never bring us to the slot==0 of
the current leaf?

# btrfs_search_slot(): how can it happen that b becomes NULL, and we
exit the while loop? (and set ret=1)

# btrfs_insert_empty_items(): if nr>1, then an array of keys is
expected to be passed. But btrfs_search_slot() is searching only for
the first key. What happens if the first key does not exist (as
expected), but the next key in the array exists?

# Do my questions make sense?

Thanks!
Alex.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Newbie questions on some of btrfs code...
  2012-05-18 11:21 Newbie questions on some of btrfs code Alex Lyakas
@ 2012-05-18 11:50 ` Hugo Mills
  2012-05-18 13:32   ` Alex Lyakas
  2012-05-21 10:44 ` Jan Schmidt
  1 sibling, 1 reply; 17+ messages in thread
From: Hugo Mills @ 2012-05-18 11:50 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 4272 bytes --]

On Fri, May 18, 2012 at 02:21:59PM +0300, Alex Lyakas wrote:
> Greetings everybody,
> I have been studying some of the btrfs code and the developer
> documentation on the wiki. My primary interest at this point, is to be
> able to search within fs tree of a btrfs subvolume, which was created
> as a snapshot of another subvolume. For that I have been using the
> debug-tree tool plus the References.png diagram on the wiki. I realize
> that my knowledge of btrfs is very rudimentary at this point, so
> please bear with me.
> 
> # How can I navigate from EXTENT_DATA within the fs tree to
> appropriate CHUNK_ITEM in the chunk tree? I am basically trying to
> find where the file data resides on disk. For example, I have an
> EXTENT_DATA like this:
>         item 30 key (265 EXTENT_DATA 0) itemoff 888 itemsize 53
>                 extent data disk byte 12648448 nr 8192
>                 extent data offset 0 nr 8192 ram 8192
>                 extent compression 0
> I can navigate from here to EXTENT_ITEM within the extent tree, using
> btrfs_file_extent_item::disk_bytenr/disk_num_bytes as key for search:
>         item 3 key (12648448 EXTENT_ITEM 8192) itemoff 3870 itemsize 53
>                 extent refs 1 gen 8 flags 1
>                 extent data backref root 5 objectid 265 offset 0 count 1
> But from there how can I reach the relevant CHUNK_ITEM?

   CHUNK_ITEMs are indexed by the start address of the chunk, so for
the extent at $e, you need to search for the chunk item immediately
before the key (FIRST_CHUNK_TREE, CHUNK_ITEM, $e).

> # Once I have reached the CHUNK_ITEM, I assume that
> btrfs_file_extent_item::offset/num_bytes fields will provide the exact
> location of the data on disk. Is that correct? For now I assume that
> btrfs was created on a single device and raid0 is used for data, so I
> totally ignore mirroring/striping at this point.

   If you want to find the physical position of a given byte in a file
on disk (and repeating some of what you already know):

 - The FS tree holds the directory structure, so you use that to find
   the inode number of the file by name.

 - With the inode number, you can look in the FS tree again to get the
   set of extents which make up the file. These extents are a mapping
   from [byte offset within the file] to [byte offset in virtual
   address space].

 - The extent tree then holds extent info, indexed by virtual address.
   There are two main types of extent: the extents holding file data
   (EXTENT_ITEM), and, overlapping with them, extents representing the
   block groups (BLOCK_GROUP_ITEM), which are the high-level
   allocation units of the FS.

 - For any given file extent (EXTENT_ITEM), you can use the tree
   search API to look in the chunk tree for the chunks holding this
   virtual data extent. (For any non-single RAID level, there will be
   multiple chunks involved). You do this by simply finding CHUNK_ITEM
   items in the tree with a "start" value immediately less than or
   equal to the virtual-address offset of your file extent.

 - With any replicating RAID (-1 or -10) there will be multiple
   entries in the chunk tree for any given virtual address offset,
   representing the multiple mirrors. For any striped RAID level (-0,
   -10), each chunk record in the tree will have several btrfs_stripe
   records in its array.
   Each btrfs_stripe record that you end up with (duplicate copies
   from the RAID-1/-10, and stripes from the RAID-0/-10) will then
   reference the device tree, which gives you the physical location of
   that btrfs_stripe on a specific disk.

   Note that in the btrfs internal terminology, a "stripe" is a
contiguous (256MiB or 1GiB) sequence of bytes on a single disk. RAID
stripes (e.g. RAID-0, -10) are actually called "sub-stripes" in the
btrfs code. There's also no clearly-defined use of the terms "chunk"
and "block group".

   HTH,
   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
    --- How do you become King?  You stand in the marketplace and ---    
          announce you're going to tax everyone. If you get out          
                           alive, you're King.                           

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 190 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Newbie questions on some of btrfs code...
  2012-05-18 11:50 ` Hugo Mills
@ 2012-05-18 13:32   ` Alex Lyakas
  2012-05-18 13:59     ` Hugo Mills
  2012-05-21  1:59     ` Liu Bo
  0 siblings, 2 replies; 17+ messages in thread
From: Alex Lyakas @ 2012-05-18 13:32 UTC (permalink / raw)
  To: Hugo Mills, Alex Lyakas, linux-btrfs

Thank you, Hugo, for the detailed explanation. I am now able to find
the CHUNK_ITEMs and to successfully locate the file data on disk.
Can you maybe address several follow-up questions I have?

# When looking for CHUNK_ITEMs, should I check that their
btrfs_chunk::type==BTRFS_BLOCK_GROUP_DATA (and not SYSTEM/METADATA
etc)? Or file extent should always be mapped to BTRFS_BLOCK_GROUP_DATA
chunk?

# It looks like I don't even need to bother with the extent tree at
this point, because from EXTENT_DATA in fs tree I can navigate
directly to CHUNK_ITEM in chunk tree, correct?

# For replicating RAID levels, you said there will be multiple
CHUNK_ITEMs. How do I find them then? Should I know in advance how
much there should be, and look for them, considering only
btrfs_chunk::type==BTRFS_BLOCK_GROUP_DATA? (I don't bother for
replication at this point, though).

# If I find in the fs tree an EXTENT_DATA of type
BTRFS_FILE_EXTENT_PREALLOC, how should I treat it? What does it mean?
(BTRFS_FILE_EXTENT_INLINE are easy to treat).

# One of my files has two EXTENT_DATAs, like this:
	item 14 key (270 EXTENT_DATA 0) itemoff 1812 itemsize 53
		extent data disk byte 432508928 nr 1474560
		extent data offset 0 nr 1470464 ram 1474560
		extent compression 0
	item 15 key (270 EXTENT_DATA 1470464) itemoff 1759 itemsize 53
		extent data disk byte 432082944 nr 126976
		extent data offset 0 nr 126976 ram 126976
		extent compression 0
Summing btrfs_file_extent_item::num_bytes gives
1470464+126976=1597440. (I know that I should not be summing
btrfs_file_extent_item::disk_num_bytes, but num_bytes).
However, it's INODE_ITEM gives size of 1593360, which is less:
	item 11 key (270 INODE_ITEM 0) itemoff 1970 itemsize 160
		inode generation 26 size 1593360 block group 0 mode 100700 links 1

Is this a valid situation, or I should always consider size in
INODE_ITEM as the correct one?

Thanks again,
Alex.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Newbie questions on some of btrfs code...
  2012-05-18 13:32   ` Alex Lyakas
@ 2012-05-18 13:59     ` Hugo Mills
  2012-05-20  7:40       ` Alex Lyakas
  2012-05-21  1:59     ` Liu Bo
  1 sibling, 1 reply; 17+ messages in thread
From: Hugo Mills @ 2012-05-18 13:59 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 3235 bytes --]

On Fri, May 18, 2012 at 04:32:09PM +0300, Alex Lyakas wrote:
> Thank you, Hugo, for the detailed explanation. I am now able to find
> the CHUNK_ITEMs and to successfully locate the file data on disk.
> Can you maybe address several follow-up questions I have?
> 
> # When looking for CHUNK_ITEMs, should I check that their
> btrfs_chunk::type==BTRFS_BLOCK_GROUP_DATA (and not SYSTEM/METADATA
> etc)? Or file extent should always be mapped to BTRFS_BLOCK_GROUP_DATA
> chunk?

   File extents will either be mapped to a data chunk, _or_ the file
data will live inline in the metadata area, following the
btrfs_extent_item. This is probably the trickiest piece of the on-disk
data format to figure out, and I fear that I didn't document it well
enough. Basically, it's non-obvious where inline extents are
calculated, because there's all sorts of awkward-looking type casting
to get to the data.

> # It looks like I don't even need to bother with the extent tree at
> this point, because from EXTENT_DATA in fs tree I can navigate
> directly to CHUNK_ITEM in chunk tree, correct?

   Mmm... possibly. Again, I'm not sure how this interacts with inline
extents.

> # For replicating RAID levels, you said there will be multiple
> CHUNK_ITEMs. How do I find them then? Should I know in advance how
> much there should be, and look for them, considering only
> btrfs_chunk::type==BTRFS_BLOCK_GROUP_DATA? (I don't bother for
> replication at this point, though).

   Actually, thinking about it, there's a single CHUNK_ITEM, and the
stripe[] array holds all of the per-disk allocations that correspond
to that block group. So, for RAID-1, you'll have precisely two
elements in the stripe[] array. Sorry for getting it wrong earlier.

> # If I find in the fs tree an EXTENT_DATA of type
> BTRFS_FILE_EXTENT_PREALLOC, how should I treat it? What does it mean?
> (BTRFS_FILE_EXTENT_INLINE are easy to treat).

   I don't know, sorry.

> # One of my files has two EXTENT_DATAs, like this:
> 	item 14 key (270 EXTENT_DATA 0) itemoff 1812 itemsize 53
> 		extent data disk byte 432508928 nr 1474560
> 		extent data offset 0 nr 1470464 ram 1474560
> 		extent compression 0
> 	item 15 key (270 EXTENT_DATA 1470464) itemoff 1759 itemsize 53
> 		extent data disk byte 432082944 nr 126976
> 		extent data offset 0 nr 126976 ram 126976
> 		extent compression 0
> Summing btrfs_file_extent_item::num_bytes gives
> 1470464+126976=1597440. (I know that I should not be summing
> btrfs_file_extent_item::disk_num_bytes, but num_bytes).
> However, it's INODE_ITEM gives size of 1593360, which is less:
> 	item 11 key (270 INODE_ITEM 0) itemoff 1970 itemsize 160
> 		inode generation 26 size 1593360 block group 0 mode 100700 links 1
> 
> Is this a valid situation, or I should always consider size in
> INODE_ITEM as the correct one?

   Again, I don't know off the top of my head. It's been some time
since I dug into these kinds of details, sorry.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
      --- In my day, we didn't have fancy high numbers.  We had ---      
               "nothing", "one", "twain" and "multitudes".               

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 190 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Newbie questions on some of btrfs code...
  2012-05-18 13:59     ` Hugo Mills
@ 2012-05-20  7:40       ` Alex Lyakas
  0 siblings, 0 replies; 17+ messages in thread
From: Alex Lyakas @ 2012-05-20  7:40 UTC (permalink / raw)
  To: Hugo Mills, Alex Lyakas, linux-btrfs

Hugo,
thanks for helping out!
Hopefully, somebody else will address the rest of my questions.

Alex.


On Fri, May 18, 2012 at 4:59 PM, Hugo Mills <hugo@carfax.org.uk> wrote:
> On Fri, May 18, 2012 at 04:32:09PM +0300, Alex Lyakas wrote:
>> Thank you, Hugo, for the detailed explanation. I am now able to find
>> the CHUNK_ITEMs and to successfully locate the file data on disk.
>> Can you maybe address several follow-up questions I have?
>>
>> # When looking for CHUNK_ITEMs, should I check that their
>> btrfs_chunk::type==BTRFS_BLOCK_GROUP_DATA (and not SYSTEM/METADATA
>> etc)? Or file extent should always be mapped to BTRFS_BLOCK_GROUP_DATA
>> chunk?
>
>   File extents will either be mapped to a data chunk, _or_ the file
> data will live inline in the metadata area, following the
> btrfs_extent_item. This is probably the trickiest piece of the on-disk
> data format to figure out, and I fear that I didn't document it well
> enough. Basically, it's non-obvious where inline extents are
> calculated, because there's all sorts of awkward-looking type casting
> to get to the data.
>
>> # It looks like I don't even need to bother with the extent tree at
>> this point, because from EXTENT_DATA in fs tree I can navigate
>> directly to CHUNK_ITEM in chunk tree, correct?
>
>   Mmm... possibly. Again, I'm not sure how this interacts with inline
> extents.
>
>> # For replicating RAID levels, you said there will be multiple
>> CHUNK_ITEMs. How do I find them then? Should I know in advance how
>> much there should be, and look for them, considering only
>> btrfs_chunk::type==BTRFS_BLOCK_GROUP_DATA? (I don't bother for
>> replication at this point, though).
>
>   Actually, thinking about it, there's a single CHUNK_ITEM, and the
> stripe[] array holds all of the per-disk allocations that correspond
> to that block group. So, for RAID-1, you'll have precisely two
> elements in the stripe[] array. Sorry for getting it wrong earlier.
>
>> # If I find in the fs tree an EXTENT_DATA of type
>> BTRFS_FILE_EXTENT_PREALLOC, how should I treat it? What does it mean?
>> (BTRFS_FILE_EXTENT_INLINE are easy to treat).
>
>   I don't know, sorry.
>
>> # One of my files has two EXTENT_DATAs, like this:
>>       item 14 key (270 EXTENT_DATA 0) itemoff 1812 itemsize 53
>>               extent data disk byte 432508928 nr 1474560
>>               extent data offset 0 nr 1470464 ram 1474560
>>               extent compression 0
>>       item 15 key (270 EXTENT_DATA 1470464) itemoff 1759 itemsize 53
>>               extent data disk byte 432082944 nr 126976
>>               extent data offset 0 nr 126976 ram 126976
>>               extent compression 0
>> Summing btrfs_file_extent_item::num_bytes gives
>> 1470464+126976=1597440. (I know that I should not be summing
>> btrfs_file_extent_item::disk_num_bytes, but num_bytes).
>> However, it's INODE_ITEM gives size of 1593360, which is less:
>>       item 11 key (270 INODE_ITEM 0) itemoff 1970 itemsize 160
>>               inode generation 26 size 1593360 block group 0 mode 100700 links 1
>>
>> Is this a valid situation, or I should always consider size in
>> INODE_ITEM as the correct one?
>
>   Again, I don't know off the top of my head. It's been some time
> since I dug into these kinds of details, sorry.
>
>   Hugo.
>
> --
> === Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
>  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
>      --- In my day, we didn't have fancy high numbers.  We had ---
>               "nothing", "one", "twain" and "multitudes".

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Newbie questions on some of btrfs code...
  2012-05-18 13:32   ` Alex Lyakas
  2012-05-18 13:59     ` Hugo Mills
@ 2012-05-21  1:59     ` Liu Bo
  2012-05-21  8:20       ` Alex Lyakas
  1 sibling, 1 reply; 17+ messages in thread
From: Liu Bo @ 2012-05-21  1:59 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: Hugo Mills, linux-btrfs

On 05/18/2012 09:32 PM, Alex Lyakas wrote:

> Thank you, Hugo, for the detailed explanation. I am now able to find
> the CHUNK_ITEMs and to successfully locate the file data on disk.
> Can you maybe address several follow-up questions I have?
> 
> # When looking for CHUNK_ITEMs, should I check that their
> btrfs_chunk::type==BTRFS_BLOCK_GROUP_DATA (and not SYSTEM/METADATA
> etc)? Or file extent should always be mapped to BTRFS_BLOCK_GROUP_DATA
> chunk?
> 
> # It looks like I don't even need to bother with the extent tree at
> this point, because from EXTENT_DATA in fs tree I can navigate
> directly to CHUNK_ITEM in chunk tree, correct?
> 
> # For replicating RAID levels, you said there will be multiple
> CHUNK_ITEMs. How do I find them then? Should I know in advance how
> much there should be, and look for them, considering only
> btrfs_chunk::type==BTRFS_BLOCK_GROUP_DATA? (I don't bother for
> replication at this point, though).
> 
> # If I find in the fs tree an EXTENT_DATA of type
> BTRFS_FILE_EXTENT_PREALLOC, how should I treat it? What does it mean?
> (BTRFS_FILE_EXTENT_INLINE are easy to treat).
> 
> # One of my files has two EXTENT_DATAs, like this:
> 	item 14 key (270 EXTENT_DATA 0) itemoff 1812 itemsize 53
> 		extent data disk byte 432508928 nr 1474560
> 		extent data offset 0 nr 1470464 ram 1474560
> 		extent compression 0
> 	item 15 key (270 EXTENT_DATA 1470464) itemoff 1759 itemsize 53
> 		extent data disk byte 432082944 nr 126976
> 		extent data offset 0 nr 126976 ram 126976
> 		extent compression 0
> Summing btrfs_file_extent_item::num_bytes gives
> 1470464+126976=1597440. (I know that I should not be summing
> btrfs_file_extent_item::disk_num_bytes, but num_bytes).
> However, it's INODE_ITEM gives size of 1593360, which is less:
> 	item 11 key (270 INODE_ITEM 0) itemoff 1970 itemsize 160
> 		inode generation 26 size 1593360 block group 0 mode 100700 links 1
> 
> Is this a valid situation, or I should always consider size in
> INODE_ITEM as the correct one?
> 


Hi Alex,

Have you tried btrfsck on this 'inode size mismatch' box?

And I'm interest in if it can be reproduced and how?


thanks,
liubo

> Thanks again,
> Alex.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Newbie questions on some of btrfs code...
  2012-05-21  1:59     ` Liu Bo
@ 2012-05-21  8:20       ` Alex Lyakas
  2012-05-21  9:33         ` Liu Bo
  0 siblings, 1 reply; 17+ messages in thread
From: Alex Lyakas @ 2012-05-21  8:20 UTC (permalink / raw)
  To: Liu Bo; +Cc: linux-btrfs

Hi Liu,
do you think that this should not happen? I see this all the time, and
I am not doing any stress tests. Just creating a file and writing some
data at different offsets, to create "holes" in the file offset space.
btrfsck does not produce any errors.
I am using kernel 3.3.6 and btrfs-progrs compiled from
git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-progs.git,
as advised by wiki.

For example, I have now the following file:
	item 20 key (266 INODE_ITEM 0) itemoff 2369 itemsize 160
		inode generation 64 size 200005 block group 0 mode 100644 links 1
	item 21 key (266 INODE_REF 256) itemoff 2348 itemsize 21
		inode ref index 10 namelen 11 name: sparse_file
	item 22 key (266 EXTENT_DATA 0) itemoff 2322 itemsize 26
		inline extent data size 5 ram 5 compress 0
	item 23 key (266 EXTENT_DATA 4096) itemoff 2269 itemsize 53
		extent data disk byte 0 nr 0
		extent data offset 0 nr 4096 ram 8192
		extent compression 0
	item 24 key (266 EXTENT_DATA 8192) itemoff 2216 itemsize 53
		extent data disk byte 432013312 nr 4096
		extent data offset 0 nr 4096 ram 4096
		extent compression 0
	item 25 key (266 EXTENT_DATA 12288) itemoff 2163 itemsize 53
		extent data disk byte 0 nr 0
		extent data offset 0 nr 86016 ram 90112
		extent compression 0
	item 26 key (266 EXTENT_DATA 98304) itemoff 2110 itemsize 53
		extent data disk byte 432017408 nr 4096
		extent data offset 0 nr 4096 ram 4096
		extent compression 0
	item 27 key (266 EXTENT_DATA 102400) itemoff 2057 itemsize 53
		extent data disk byte 0 nr 0
		extent data offset 0 nr 94208 ram 98304
		extent compression 0
	item 28 key (266 EXTENT_DATA 196608) itemoff 2004 itemsize 53
		extent data disk byte 432021504 nr 4096
		extent data offset 0 nr 4096 ram 4096
		extent compression 0

Some observations for it:
# There is a real "hole" between first two extents, because the length
of first extent is 5 bytes, but second extent starts at offset 4096.
Is this expected? I see this all the time.
# There are several extents with
btrfs_file_extent_item::disk_bytenr==0. According to some hints within
the kernel btrfs code, I presume that these are zero-extents. So when
I see disk_bytenr==0, I should not try looking up this extent in
extent tree or in chunk tree, I should assume that this extent should
be filled by zeros. Is my understanding correct?
# The last extent has offset=196608 and size=4096. Adding them up
gives 200704. However, the file size within INODE_ITEM is 200005. So
this is the issue you asked about.

I have some more pesky questions, which hopefully you or some other
devs can help with. Or at least point me at a relevant code to look
at.

# What is BTRFS_FILE_EXTENT_PREALLOC? How should I treat
btrfs_file_extent_item of such type?

# Why btrfs_previous_item() in btrfs-progs in different from kernel
code? In kernel code, there are additional checks like this:
		nritems = btrfs_header_nritems(leaf);
		if (nritems == 0)
			return 1;
		if (path->slots[0] == nritems)
			path->slots[0]--;

# What is the btrfs_dir_item::data_len value is used for? I saw it
appearing in XATTR_ITEM, but not in DIR_INDEX/DIR_ITEM

Thanks!
Alex.





On Mon, May 21, 2012 at 4:59 AM, Liu Bo <liubo2009@cn.fujitsu.com> wrote:
> On 05/18/2012 09:32 PM, Alex Lyakas wrote:
>
>> Thank you, Hugo, for the detailed explanation. I am now able to find
>> the CHUNK_ITEMs and to successfully locate the file data on disk.
>> Can you maybe address several follow-up questions I have?
>>
>> # When looking for CHUNK_ITEMs, should I check that their
>> btrfs_chunk::type==BTRFS_BLOCK_GROUP_DATA (and not SYSTEM/METADATA
>> etc)? Or file extent should always be mapped to BTRFS_BLOCK_GROUP_DATA
>> chunk?
>>
>> # It looks like I don't even need to bother with the extent tree at
>> this point, because from EXTENT_DATA in fs tree I can navigate
>> directly to CHUNK_ITEM in chunk tree, correct?
>>
>> # For replicating RAID levels, you said there will be multiple
>> CHUNK_ITEMs. How do I find them then? Should I know in advance how
>> much there should be, and look for them, considering only
>> btrfs_chunk::type==BTRFS_BLOCK_GROUP_DATA? (I don't bother for
>> replication at this point, though).
>>
>> # If I find in the fs tree an EXTENT_DATA of type
>> BTRFS_FILE_EXTENT_PREALLOC, how should I treat it? What does it mean?
>> (BTRFS_FILE_EXTENT_INLINE are easy to treat).
>>
>> # One of my files has two EXTENT_DATAs, like this:
>>       item 14 key (270 EXTENT_DATA 0) itemoff 1812 itemsize 53
>>               extent data disk byte 432508928 nr 1474560
>>               extent data offset 0 nr 1470464 ram 1474560
>>               extent compression 0
>>       item 15 key (270 EXTENT_DATA 1470464) itemoff 1759 itemsize 53
>>               extent data disk byte 432082944 nr 126976
>>               extent data offset 0 nr 126976 ram 126976
>>               extent compression 0
>> Summing btrfs_file_extent_item::num_bytes gives
>> 1470464+126976=1597440. (I know that I should not be summing
>> btrfs_file_extent_item::disk_num_bytes, but num_bytes).
>> However, it's INODE_ITEM gives size of 1593360, which is less:
>>       item 11 key (270 INODE_ITEM 0) itemoff 1970 itemsize 160
>>               inode generation 26 size 1593360 block group 0 mode 100700 links 1
>>
>> Is this a valid situation, or I should always consider size in
>> INODE_ITEM as the correct one?
>>
>
>
> Hi Alex,
>
> Have you tried btrfsck on this 'inode size mismatch' box?
>
> And I'm interest in if it can be reproduced and how?
>
>
> thanks,
> liubo
>
>> Thanks again,
>> Alex.
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Newbie questions on some of btrfs code...
  2012-05-21  8:20       ` Alex Lyakas
@ 2012-05-21  9:33         ` Liu Bo
  2012-05-21 10:05           ` Alex Lyakas
  0 siblings, 1 reply; 17+ messages in thread
From: Liu Bo @ 2012-05-21  9:33 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: linux-btrfs

On 05/21/2012 04:20 PM, Alex Lyakas wrote:

> Hi Liu,
> do you think that this should not happen? I see this all the time, and
> I am not doing any stress tests. Just creating a file and writing some
> data at different offsets, to create "holes" in the file offset space.
> btrfsck does not produce any errors.


I happen to know how it works :)

This comes from our COW feature, when we rewrite a file extent from its middle part,
we will find another space for the new data and leave the original extent alone:

So for the following situation:
> 	item 23 key (266 EXTENT_DATA 4096) itemoff 2269 itemsize 53
> 		extent data disk byte 0 nr 0
> 		extent data offset 0 nr 4096 ram 8192
> 		extent compression 0

As your case, after the first 'size 5' inline extent is written,
"nr 4096 < ram 8192" could come from:
1) dd if=/dev/zero of=/mnt/btrfs/foobar bs=1k seek=12 count=4 conv=notrunc;sync
2) dd if=/dev/zero of=/mnt/btrfs/foobar bs=1k seek=8 count=4 conv=notrunc;sync

1) makes
> 	item 23 key (266 EXTENT_DATA 4096) itemoff 2269 itemsize 53
> 		extent data disk byte 0 nr 0
> 		extent data offset 0 nr 8192 ram 8192
> 		extent compression 0

2) makes
> 	item 23 key (266 EXTENT_DATA 4096) itemoff 2269 itemsize 53
> 		extent data disk byte 0 nr 0
> 		extent data offset 0 nr 4096 ram 8192
> 		extent compression 0

> I am using kernel 3.3.6 and btrfs-progrs compiled from
> git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-progs.git,
> as advised by wiki.
> 
> For example, I have now the following file:
> 	item 20 key (266 INODE_ITEM 0) itemoff 2369 itemsize 160
> 		inode generation 64 size 200005 block group 0 mode 100644 links 1
> 	item 21 key (266 INODE_REF 256) itemoff 2348 itemsize 21
> 		inode ref index 10 namelen 11 name: sparse_file
> 	item 22 key (266 EXTENT_DATA 0) itemoff 2322 itemsize 26
> 		inline extent data size 5 ram 5 compress 0
> 	item 23 key (266 EXTENT_DATA 4096) itemoff 2269 itemsize 53
> 		extent data disk byte 0 nr 0
> 		extent data offset 0 nr 4096 ram 8192
> 		extent compression 0
> 	item 24 key (266 EXTENT_DATA 8192) itemoff 2216 itemsize 53
> 		extent data disk byte 432013312 nr 4096
> 		extent data offset 0 nr 4096 ram 4096
> 		extent compression 0
> 	item 25 key (266 EXTENT_DATA 12288) itemoff 2163 itemsize 53
> 		extent data disk byte 0 nr 0
> 		extent data offset 0 nr 86016 ram 90112
> 		extent compression 0
> 	item 26 key (266 EXTENT_DATA 98304) itemoff 2110 itemsize 53
> 		extent data disk byte 432017408 nr 4096
> 		extent data offset 0 nr 4096 ram 4096
> 		extent compression 0
> 	item 27 key (266 EXTENT_DATA 102400) itemoff 2057 itemsize 53
> 		extent data disk byte 0 nr 0
> 		extent data offset 0 nr 94208 ram 98304
> 		extent compression 0
> 	item 28 key (266 EXTENT_DATA 196608) itemoff 2004 itemsize 53
> 		extent data disk byte 432021504 nr 4096
> 		extent data offset 0 nr 4096 ram 4096
> 		extent compression 0
> 
> Some observations for it:
> # There is a real "hole" between first two extents, because the length
> of first extent is 5 bytes, but second extent starts at offset 4096.
> Is this expected? I see this all the time.


Yup, our extents are sectorsize aligned, say 4096.


> # There are several extents with
> btrfs_file_extent_item::disk_bytenr==0. According to some hints within
> the kernel btrfs code, I presume that these are zero-extents. So when
> I see disk_bytenr==0, I should not try looking up this extent in
> extent tree or in chunk tree, I should assume that this extent should
> be filled by zeros. Is my understanding correct?


'disk_bytenr == 0' means dummy extents, which has no data.


> # The last extent has offset=196608 and size=4096. Adding them up
> gives 200704. However, the file size within INODE_ITEM is 200005. So
> this is the issue you asked about.
> 


Given the sectorsize aligned stuff, the file size of INODE_ITEM is correct, 200005 here.


> I have some more pesky questions, which hopefully you or some other
> devs can help with. Or at least point me at a relevant code to look
> at.
> 
> # What is BTRFS_FILE_EXTENT_PREALLOC? How should I treat
> btrfs_file_extent_item of such type?
> 


IIRC, PREALLOC comes from fallocate or something like that, which means we allocate the
space in advance, and will use it in the future.


> # Why btrfs_previous_item() in btrfs-progs in different from kernel
> code? In kernel code, there are additional checks like this:
> 		nritems = btrfs_header_nritems(leaf);
> 		if (nritems == 0)
> 			return 1;
> 		if (path->slots[0] == nritems)
> 			path->slots[0]--;
> 


The kernel side is more careful, it's ok.


> # What is the btrfs_dir_item::data_len value is used for? I saw it
> appearing in XATTR_ITEM, but not in DIR_INDEX/DIR_ITEM
> 


data_len is xattr relative, plz check the source code: btrfs_set_acl()


thanks,
liubo

> Thanks!
> Alex.
> 
> 
> 
> 
> 
> On Mon, May 21, 2012 at 4:59 AM, Liu Bo <liubo2009@cn.fujitsu.com> wrote:
>> On 05/18/2012 09:32 PM, Alex Lyakas wrote:
>>
>>> Thank you, Hugo, for the detailed explanation. I am now able to find
>>> the CHUNK_ITEMs and to successfully locate the file data on disk.
>>> Can you maybe address several follow-up questions I have?
>>>
>>> # When looking for CHUNK_ITEMs, should I check that their
>>> btrfs_chunk::type==BTRFS_BLOCK_GROUP_DATA (and not SYSTEM/METADATA
>>> etc)? Or file extent should always be mapped to BTRFS_BLOCK_GROUP_DATA
>>> chunk?
>>>
>>> # It looks like I don't even need to bother with the extent tree at
>>> this point, because from EXTENT_DATA in fs tree I can navigate
>>> directly to CHUNK_ITEM in chunk tree, correct?
>>>
>>> # For replicating RAID levels, you said there will be multiple
>>> CHUNK_ITEMs. How do I find them then? Should I know in advance how
>>> much there should be, and look for them, considering only
>>> btrfs_chunk::type==BTRFS_BLOCK_GROUP_DATA? (I don't bother for
>>> replication at this point, though).
>>>
>>> # If I find in the fs tree an EXTENT_DATA of type
>>> BTRFS_FILE_EXTENT_PREALLOC, how should I treat it? What does it mean?
>>> (BTRFS_FILE_EXTENT_INLINE are easy to treat).
>>>
>>> # One of my files has two EXTENT_DATAs, like this:
>>>       item 14 key (270 EXTENT_DATA 0) itemoff 1812 itemsize 53
>>>               extent data disk byte 432508928 nr 1474560
>>>               extent data offset 0 nr 1470464 ram 1474560
>>>               extent compression 0
>>>       item 15 key (270 EXTENT_DATA 1470464) itemoff 1759 itemsize 53
>>>               extent data disk byte 432082944 nr 126976
>>>               extent data offset 0 nr 126976 ram 126976
>>>               extent compression 0
>>> Summing btrfs_file_extent_item::num_bytes gives
>>> 1470464+126976=1597440. (I know that I should not be summing
>>> btrfs_file_extent_item::disk_num_bytes, but num_bytes).
>>> However, it's INODE_ITEM gives size of 1593360, which is less:
>>>       item 11 key (270 INODE_ITEM 0) itemoff 1970 itemsize 160
>>>               inode generation 26 size 1593360 block group 0 mode 100700 links 1
>>>
>>> Is this a valid situation, or I should always consider size in
>>> INODE_ITEM as the correct one?
>>>
>>
>> Hi Alex,
>>
>> Have you tried btrfsck on this 'inode size mismatch' box?
>>
>> And I'm interest in if it can be reproduced and how?
>>
>>
>> thanks,
>> liubo
>>
>>> Thanks again,
>>> Alex.
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Newbie questions on some of btrfs code...
  2012-05-21  9:33         ` Liu Bo
@ 2012-05-21 10:05           ` Alex Lyakas
  2012-05-22  1:42             ` Liu Bo
  0 siblings, 1 reply; 17+ messages in thread
From: Alex Lyakas @ 2012-05-21 10:05 UTC (permalink / raw)
  To: Liu Bo; +Cc: linux-btrfs

Hi Liu,
thanks for the clarifications.

I did not understand the dd example of yours, though.

> So for the following situation:
>>       item 23 key (266 EXTENT_DATA 4096) itemoff 2269 itemsize 53
>>               extent data disk byte 0 nr 0
>>               extent data offset 0 nr 4096 ram 8192
>>               extent compression 0
>
> As your case, after the first 'size 5' inline extent is written,
> "nr 4096 < ram 8192" could come from:
> 1) dd if=/dev/zero of=/mnt/btrfs/foobar bs=1k seek=12 count=4 conv=notrunc;sync
> 2) dd if=/dev/zero of=/mnt/btrfs/foobar bs=1k seek=8 count=4 conv=notrunc;sync
>
> 1) makes
>>       item 23 key (266 EXTENT_DATA 4096) itemoff 2269 itemsize 53
>>               extent data disk byte 0 nr 0
>>               extent data offset 0 nr 8192 ram 8192
>>               extent compression 0
>
> 2) makes
>>       item 23 key (266 EXTENT_DATA 4096) itemoff 2269 itemsize 53
>>               extent data disk byte 0 nr 0
>>               extent data offset 0 nr 4096 ram 8192
>>               extent compression 0

You talk about the "ram_bytes" field. But do I need to look at it, if
I don't use compression or another encoding? Shouldn't I always look
at btrfs_file_extent_item::offset/num_bytes for the real data, and at
btrfs_file_extent_item::disk_bytenr/disk_num_bytes for finding
CHUNK_ITEM? Any reason I should be aware of "ram_bytes" field?

The first dd created a 4k extent at offset 12k. How did we end up with
"nr 8192 ram 8192" and offset 4k?
The second dd added a 4k extent at 8k offset. But still EXTENT_DATA
has 4k offset.
So now we should have have twp 4k extents or one 8k extent. What am I missing?

Alex.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Newbie questions on some of btrfs code...
  2012-05-18 11:21 Newbie questions on some of btrfs code Alex Lyakas
  2012-05-18 11:50 ` Hugo Mills
@ 2012-05-21 10:44 ` Jan Schmidt
  2012-05-22  8:07   ` Alex Lyakas
  1 sibling, 1 reply; 17+ messages in thread
From: Jan Schmidt @ 2012-05-21 10:44 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: linux-btrfs

Hi Alex,

On Fri, May 18, 2012 at 13:21 (+0200), Alex Lyakas wrote:
> Some general questions on the ctree code.
> 
> # I saw that slot==0 is special. My understanding is that btrfs
> maintains the property that the parent of each node/leaf has a key
> pointing to that node/leaf, which must be equal to the key in the
> slot==0 of this node/leaf. That's what fixup_low_keys() tries to
> maintain. Is this correct?

Yes. I'm not 100% sure if the key in the parent node must match exactly
the first key of the child node. It is probably allowed that it's less
or equal than the first key. It is guaranteed to be larger than the
largest of the previous (left) node, though.

And yes, that's what fixup_low_keys is correcting.

> # If my understanding in the previous bullet is correct: Is that the
> reason that in btrfs_prev_leaf() it is assumed that if there is a
> lesser key, btrfs_search_slot() will never bring us to the slot==0 of
> the current leaf?

It's quite straight: We look for a key smaller than the first (slot 0)
of the current leaf. If we find the current leaf again (because
btrfs_search_slot returns the place where such a key would have be
inserted), then there's no previous leaf. No preconditions or
assumptions on nodes in levels needed.

> # btrfs_search_slot(): how can it happen that b becomes NULL, and we
> exit the while loop? (and set ret=1)

Good question, probably a leftover. b can't be NULL in the beginning,
because there's always a root node and a commit_root. b
setup_nodes_for_search only sets b NULL when it returns EAGAIN. And the
regular loop exit is in the "if (level == 0)" path.

> # btrfs_insert_empty_items(): if nr>1, then an array of keys is
> expected to be passed. But btrfs_search_slot() is searching only for
> the first key. What happens if the first key does not exist (as
> expected), but the next key in the array exists?

The reason here is that key ordering == insertion order. That function
only has two callers: once from log replay (where either all the items
from the log will exist in the leaf or none) and once from inode.c,
inserting INODE_ITEM and INODE_REF, which never come without the other.
But you are right, there seems to be no safety net in this function, a
comment might be useful.

-Jan

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Newbie questions on some of btrfs code...
  2012-05-21 10:05           ` Alex Lyakas
@ 2012-05-22  1:42             ` Liu Bo
  2012-05-22  7:48               ` Alex Lyakas
  0 siblings, 1 reply; 17+ messages in thread
From: Liu Bo @ 2012-05-22  1:42 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: linux-btrfs

On 05/21/2012 06:05 PM, Alex Lyakas wrote:

> Hi Liu,
> thanks for the clarifications.
> 
> I did not understand the dd example of yours, though.
> 
>> So for the following situation:
>>>       item 23 key (266 EXTENT_DATA 4096) itemoff 2269 itemsize 53
>>>               extent data disk byte 0 nr 0
>>>               extent data offset 0 nr 4096 ram 8192
>>>               extent compression 0
>> As your case, after the first 'size 5' inline extent is written,
>> "nr 4096 < ram 8192" could come from:
>> 1) dd if=/dev/zero of=/mnt/btrfs/foobar bs=1k seek=12 count=4 conv=notrunc;sync
>> 2) dd if=/dev/zero of=/mnt/btrfs/foobar bs=1k seek=8 count=4 conv=notrunc;sync
>>
>> 1) makes
>>>       item 23 key (266 EXTENT_DATA 4096) itemoff 2269 itemsize 53
>>>               extent data disk byte 0 nr 0
>>>               extent data offset 0 nr 8192 ram 8192
>>>               extent compression 0
>> 2) makes
>>>       item 23 key (266 EXTENT_DATA 4096) itemoff 2269 itemsize 53
>>>               extent data disk byte 0 nr 0
>>>               extent data offset 0 nr 4096 ram 8192
>>>               extent compression 0
> 
> You talk about the "ram_bytes" field. But do I need to look at it, if
> I don't use compression or another encoding? Shouldn't I always look
> at btrfs_file_extent_item::offset/num_bytes for the real data, and at
> btrfs_file_extent_item::disk_bytenr/disk_num_bytes for finding
> CHUNK_ITEM? Any reason I should be aware of "ram_bytes" field?
> 

> The first dd created a 4k extent at offset 12k. How did we end up with
> "nr 8192 ram 8192" and offset 4k?
> The second dd added a 4k extent at 8k offset. But still EXTENT_DATA
> has 4k offset.
> So now we should have have twp 4k extents or one 8k extent. What am I missing?
> 
> Alex.
> 


As I mentioned, disk_bytenr == 0 means dummy extents, which we have not yet allocate
a range of space for it.

After your first 'size=5' inline extent, we'll start allocating extents from _4096_, cause
it is _4k aligned_.

>> 1) dd if=/dev/zero of=/mnt/btrfs/foobar bs=1k seek=12 count=4 conv=notrunc;sync
: we need a dummy extent for [4k, 12k], which starts from 4096, and nr is 8192

>> 2) dd if=/dev/zero of=/mnt/btrfs/foobar bs=1k seek=8 count=4 conv=notrunc;sync
: we break [4k, 12k] into a dummy one [4k, 8k] and a real one [8k, 12k].

More details, plz refer to btrfs_drop_extents();

thanks,
liubo

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Newbie questions on some of btrfs code...
  2012-05-22  1:42             ` Liu Bo
@ 2012-05-22  7:48               ` Alex Lyakas
  0 siblings, 0 replies; 17+ messages in thread
From: Alex Lyakas @ 2012-05-22  7:48 UTC (permalink / raw)
  To: Liu Bo; +Cc: linux-btrfs

Thanks, Liu, that clarifies.

Alex.

On Tue, May 22, 2012 at 4:42 AM, Liu Bo <liubo2009@cn.fujitsu.com> wrote:
> On 05/21/2012 06:05 PM, Alex Lyakas wrote:
>
>> Hi Liu,
>> thanks for the clarifications.
>>
>> I did not understand the dd example of yours, though.
>>
>>> So for the following situation:
>>>>       item 23 key (266 EXTENT_DATA 4096) itemoff 2269 itemsize 53
>>>>               extent data disk byte 0 nr 0
>>>>               extent data offset 0 nr 4096 ram 8192
>>>>               extent compression 0
>>> As your case, after the first 'size 5' inline extent is written,
>>> "nr 4096 < ram 8192" could come from:
>>> 1) dd if=/dev/zero of=/mnt/btrfs/foobar bs=1k seek=12 count=4 conv=notrunc;sync
>>> 2) dd if=/dev/zero of=/mnt/btrfs/foobar bs=1k seek=8 count=4 conv=notrunc;sync
>>>
>>> 1) makes
>>>>       item 23 key (266 EXTENT_DATA 4096) itemoff 2269 itemsize 53
>>>>               extent data disk byte 0 nr 0
>>>>               extent data offset 0 nr 8192 ram 8192
>>>>               extent compression 0
>>> 2) makes
>>>>       item 23 key (266 EXTENT_DATA 4096) itemoff 2269 itemsize 53
>>>>               extent data disk byte 0 nr 0
>>>>               extent data offset 0 nr 4096 ram 8192
>>>>               extent compression 0
>>
>> You talk about the "ram_bytes" field. But do I need to look at it, if
>> I don't use compression or another encoding? Shouldn't I always look
>> at btrfs_file_extent_item::offset/num_bytes for the real data, and at
>> btrfs_file_extent_item::disk_bytenr/disk_num_bytes for finding
>> CHUNK_ITEM? Any reason I should be aware of "ram_bytes" field?
>>
>
>> The first dd created a 4k extent at offset 12k. How did we end up with
>> "nr 8192 ram 8192" and offset 4k?
>> The second dd added a 4k extent at 8k offset. But still EXTENT_DATA
>> has 4k offset.
>> So now we should have have twp 4k extents or one 8k extent. What am I missing?
>>
>> Alex.
>>
>
>
> As I mentioned, disk_bytenr == 0 means dummy extents, which we have not yet allocate
> a range of space for it.
>
> After your first 'size=5' inline extent, we'll start allocating extents from _4096_, cause
> it is _4k aligned_.
>
>>> 1) dd if=/dev/zero of=/mnt/btrfs/foobar bs=1k seek=12 count=4 conv=notrunc;sync
> : we need a dummy extent for [4k, 12k], which starts from 4096, and nr is 8192
>
>>> 2) dd if=/dev/zero of=/mnt/btrfs/foobar bs=1k seek=8 count=4 conv=notrunc;sync
> : we break [4k, 12k] into a dummy one [4k, 8k] and a real one [8k, 12k].
>
> More details, plz refer to btrfs_drop_extents();
>
> thanks,
> liubo

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Newbie questions on some of btrfs code...
  2012-05-21 10:44 ` Jan Schmidt
@ 2012-05-22  8:07   ` Alex Lyakas
  2012-05-22 22:08     ` Jan Schmidt
  0 siblings, 1 reply; 17+ messages in thread
From: Alex Lyakas @ 2012-05-22  8:07 UTC (permalink / raw)
  To: Jan Schmidt; +Cc: linux-btrfs

Hi Jan,

>> # I saw that slot==0 is special. My understanding is that btrfs
>> maintains the property that the parent of each node/leaf has a key
>> pointing to that node/leaf, which must be equal to the key in the
>> slot==0 of this node/leaf. That's what fixup_low_keys() tries to
>> maintain. Is this correct?
>
> Yes. I'm not 100% sure if the key in the parent node must match exactly
> the first key of the child node. It is probably allowed that it's less
> or equal than the first key. It is guaranteed to be larger than the
> largest of the previous (left) node, though.
>
> And yes, that's what fixup_low_keys is correcting.
>
>> # If my understanding in the previous bullet is correct: Is that the
>> reason that in btrfs_prev_leaf() it is assumed that if there is a
>> lesser key, btrfs_search_slot() will never bring us to the slot==0 of
>> the current leaf?
>
> It's quite straight: We look for a key smaller than the first (slot 0)
> of the current leaf. If we find the current leaf again (because
> btrfs_search_slot returns the place where such a key would have be
> inserted), then there's no previous leaf. No preconditions or
> assumptions on nodes in levels needed.

Let's say that slot[0] of the current leaf (A) has key=10. And let's
say that its parent node (N) has key=5 (and not 10). Let's say we have
a previous leaf (B), whose last slot has key=2.
If such tree is valid, then: btrfs_prev_leaf() will search for key==9.
Then btrfs_search_slot() would bring us node N and leaf A again,
wouldn't it? Because key(N)<=9. So we will receive leaf A back, and
will think that there is no previous leaf, while there is.
What am I missing here?

Thanks for your help,
Alex.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Newbie questions on some of btrfs code...
  2012-05-22  8:07   ` Alex Lyakas
@ 2012-05-22 22:08     ` Jan Schmidt
  2012-05-28 18:45       ` Alex Lyakas
  0 siblings, 1 reply; 17+ messages in thread
From: Jan Schmidt @ 2012-05-22 22:08 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: linux-btrfs

On 22.05.2012 10:07, Alex Lyakas wrote:
>>> # If my understanding in the previous bullet is correct: Is that the
>>> reason that in btrfs_prev_leaf() it is assumed that if there is a
>>> lesser key, btrfs_search_slot() will never bring us to the slot==0 of
>>> the current leaf?
>>
>> It's quite straight: We look for a key smaller than the first (slot 0)
>> of the current leaf. If we find the current leaf again (because
>> btrfs_search_slot returns the place where such a key would have be
>> inserted), then there's no previous leaf. No preconditions or
>> assumptions on nodes in levels needed.
> 
> Let's say that slot[0] of the current leaf (A) has key=10. And let's
> say that its parent node (N) has key=5 (and not 10). Let's say we have
> a previous leaf (B), whose last slot has key=2.
> If such tree is valid, then: btrfs_prev_leaf() will search for key==9.
> Then btrfs_search_slot() would bring us node N and leaf A again,
> wouldn't it? Because key(N)<=9. So we will receive leaf A back, and
> will think that there is no previous leaf, while there is.
> What am I missing here?

It wouldn't. btrfs_search_slot always sets up the path such that it
points to the position where such an key would be inserted. And we never
insert at the beginning of a leaf. So in your example, this would be at
the end of leaf B: your path object will have nodes[1] = N, nodes[0] = B
and slots[0] = number_of_slots_used_in_B + 1.

Your example sounds like a good explanation why the key in the parent
node should really be an exact match. It sounds reasonable that it's not
allowed to be <= than the first key of its child. If it was, extra
lookups would be required to setup the path correctly for your example
above (which I haven't seen so far).

-Jan

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Newbie questions on some of btrfs code...
  2012-05-22 22:08     ` Jan Schmidt
@ 2012-05-28 18:45       ` Alex Lyakas
  2012-05-29  9:13         ` Jan Schmidt
  0 siblings, 1 reply; 17+ messages in thread
From: Alex Lyakas @ 2012-05-28 18:45 UTC (permalink / raw)
  To: Jan Schmidt; +Cc: linux-btrfs

Hi Jan,

>> Let's say that slot[0] of the current leaf (A) has key=10. And let's
>> say that its parent node (N) has key=5 (and not 10). Let's say we have
>> a previous leaf (B), whose last slot has key=2.
>> If such tree is valid, then: btrfs_prev_leaf() will search for key==9.
>> Then btrfs_search_slot() would bring us node N and leaf A again,
>> wouldn't it? Because key(N)<=9. So we will receive leaf A back, and
>> will think that there is no previous leaf, while there is.
>> What am I missing here?
>
> It wouldn't. btrfs_search_slot always sets up the path such that it
> points to the position where such an key would be inserted. And we never
> insert at the beginning of a leaf. So in your example, this would be at
> the end of leaf B: your path object will have nodes[1] = N, nodes[0] = B
> and slots[0] = number_of_slots_used_in_B + 1.

I have re-looked at btrfs_search_slot, and don't see how it would end
up in leaf B. The bin_search() function will clearly return the slot
*after* the slot of N that has key==5 (which is the parent slot of A).
So then the following code:
		if (level != 0) {
			int dec = 0;
			if (ret && slot > 0) {
				dec = 1;
				slot -= 1;
			}
will bring us back into the slot of N with key=5. And we will go to
leaf A. While if key(N) of that slot was 10, we would never have ended
up in that slot, unless there is no lesser key in the tree. Actually,
it looks like "no lesser key" is the only case when we can get ret==1
and slot==0. Except perhaps an empty leaf, which I am not sure can
happen.
But I may be way off here, as my understanding is still pretty limited.

Thanks!
Alex.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Newbie questions on some of btrfs code...
  2012-05-28 18:45       ` Alex Lyakas
@ 2012-05-29  9:13         ` Jan Schmidt
  2012-05-29 11:27           ` Alex Lyakas
  0 siblings, 1 reply; 17+ messages in thread
From: Jan Schmidt @ 2012-05-29  9:13 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: linux-btrfs

On Mon, May 28, 2012 at 20:45 (+0200), Alex Lyakas wrote:
> I have re-looked at btrfs_search_slot, and don't see how it would end
> up in leaf B. The bin_search() function will clearly return the slot
> *after* the slot of N that has key==5 (which is the parent slot of A).
> So then the following code:
> 		if (level != 0) {
> 			int dec = 0;
> 			if (ret && slot > 0) {
> 				dec = 1;
> 				slot -= 1;
> 			}
> will bring us back into the slot of N with key=5. And we will go to
> leaf A. While if key(N) of that slot was 10, we would never have ended
> up in that slot, unless there is no lesser key in the tree.

Yes, that's right. As already said in my previous mail (in the paragraph
you didn't quote), the key in the leaf must be an exact match. The key
in N pointing to A will be 10.

> Actually, it looks like "no lesser key" is the only case when we can
> get ret==1 and slot==0.

Correct.

> Except perhaps an empty leaf, which I am not sure can happen.

It can't.

-Jan

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Newbie questions on some of btrfs code...
  2012-05-29  9:13         ` Jan Schmidt
@ 2012-05-29 11:27           ` Alex Lyakas
  0 siblings, 0 replies; 17+ messages in thread
From: Alex Lyakas @ 2012-05-29 11:27 UTC (permalink / raw)
  To: Jan Schmidt; +Cc: linux-btrfs

Thank you Jan, Hugo & Lio,
for taking time answering my questions.

Alex.
P.S.: I have dug in some more, so probably more questions will arrive:)


On Tue, May 29, 2012 at 12:13 PM, Jan Schmidt <list.btrfs@jan-o-sch.net> wrote:
> On Mon, May 28, 2012 at 20:45 (+0200), Alex Lyakas wrote:
>> I have re-looked at btrfs_search_slot, and don't see how it would end
>> up in leaf B. The bin_search() function will clearly return the slot
>> *after* the slot of N that has key==5 (which is the parent slot of A).
>> So then the following code:
>>               if (level != 0) {
>>                       int dec = 0;
>>                       if (ret && slot > 0) {
>>                               dec = 1;
>>                               slot -= 1;
>>                       }
>> will bring us back into the slot of N with key=5. And we will go to
>> leaf A. While if key(N) of that slot was 10, we would never have ended
>> up in that slot, unless there is no lesser key in the tree.
>
> Yes, that's right. As already said in my previous mail (in the paragraph
> you didn't quote), the key in the leaf must be an exact match. The key
> in N pointing to A will be 10.
>
>> Actually, it looks like "no lesser key" is the only case when we can
>> get ret==1 and slot==0.
>
> Correct.
>
>> Except perhaps an empty leaf, which I am not sure can happen.
>
> It can't.
>
> -Jan

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2012-05-29 11:27 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-05-18 11:21 Newbie questions on some of btrfs code Alex Lyakas
2012-05-18 11:50 ` Hugo Mills
2012-05-18 13:32   ` Alex Lyakas
2012-05-18 13:59     ` Hugo Mills
2012-05-20  7:40       ` Alex Lyakas
2012-05-21  1:59     ` Liu Bo
2012-05-21  8:20       ` Alex Lyakas
2012-05-21  9:33         ` Liu Bo
2012-05-21 10:05           ` Alex Lyakas
2012-05-22  1:42             ` Liu Bo
2012-05-22  7:48               ` Alex Lyakas
2012-05-21 10:44 ` Jan Schmidt
2012-05-22  8:07   ` Alex Lyakas
2012-05-22 22:08     ` Jan Schmidt
2012-05-28 18:45       ` Alex Lyakas
2012-05-29  9:13         ` Jan Schmidt
2012-05-29 11:27           ` Alex Lyakas

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.