* Regarding handling of file renames in Btrfs
@ 2017-09-09 23:50 Rohan Kadekodi
  2017-09-10  1:32 ` Duncan
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Rohan Kadekodi @ 2017-09-09 23:50 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Vijaychidambaram Velayudhan Pillai, Jayashree Mohan

Hello,

I was trying to understand how file renames are handled in Btrfs. I
read the code documentation, but had a problem understanding a few
things.

During a file rename, btrfs_commit_transaction() is called because
Btrfs has to commit the whole FS before storing the information
related to the newly renamed file. It has to commit the FS because a
rename first does an unlink, which is not recorded in the
btrfs_rename() transaction and so is not logged in the log tree. Is my
understanding correct? If yes, my questions are as follows:

1. What does committing the whole FS mean? Blktrace shows that there
are two 256KB writes, which are essentially writes to the data of
the root directory of the file system (which I found out through
btrfs-debug-tree). Is this equivalent to doing a shell sync, as the
same block groups are written during a shell sync too? Also, does it
imply that all the metadata held by the log tree is now checkpointed
to the respective trees?

2. Why are there 2 complete writes to the data held by the root
directory and not just 1? These writes are 256KB each, which is the
size of the extent allocated to the root directory.

3. Why are the writes being done to the root directory of the file
system / subvolume and not just the parent directory where the unlink
happened?

It would be great if I could get the answers to these questions.

Thanks,
Rohan


* Re: Regarding handling of file renames in Btrfs
  2017-09-09 23:50 Regarding handling of file renames in Btrfs Rohan Kadekodi
@ 2017-09-10  1:32 ` Duncan
  2017-09-10  6:41 ` Qu Wenruo
  2017-09-16 12:27 ` Hans van Kranenburg
  2 siblings, 0 replies; 11+ messages in thread
From: Duncan @ 2017-09-10  1:32 UTC (permalink / raw)
  To: linux-btrfs

Rohan Kadekodi posted on Sat, 09 Sep 2017 18:50:09 -0500 as excerpted:

> Hello,
> 
> I was trying to understand how file renames are handled in Btrfs. I read
> the code documentation, but had a problem understanding a few things.
> 
> During a file rename, btrfs_commit_transaction() is called which is
> because Btrfs has to commit the whole FS before storing the information
> related to the new renamed file. It has to commit the FS because a
> rename first does an unlink, which is not recorded in the btrfs_rename()
> transaction and so is not logged in the log tree. Is my understanding
> correct? If yes, my questions are as follows:

I'm not a dev, but am a btrfs user and list regular, and can try my hand 
at answering... and if I'm wrong, a dev's reply can correct my 
misconceptions as well. =:^)

> 1. What does committing the whole FS mean? Blktrace shows that there are
> 2  256KB writes, which are essentially writes to the data of the
> root directory of the file system (which I found out through
> btrfs-debug-tree). Is this equivalent to doing a shell sync, as the same
> block groups are written during a shell sync too? Also, does it imply
> that all the metadata held by the log tree is now checkpointed to the
> respective trees?

A btrfs commit is the equivalent of a *single* filesystem sync, yes.  The 
difference compared to the sync(1) command is that sync applies to all 
filesystems of all types, not just a single btrfs filesystem.  See also 
the btrfs filesystem sync command (btrfs-filesystem(8) manpage), which 
applies to a single btrfs, but also triggers deleted subvolume cleanup.

But these are not writes to the /data/ of the root directory.  In btrfs, 
data and metadata are separated, and these are writes to the /metadata/ 
of the filesystem, including writing a new filesystem top-level (aka 
root) block and the superblock and its backups.

Yes, the log is synced too.

But regarding the log, in btrfs, because btrfs is atomic cow-based (copy-
on-write), at each commit the filesystem is designed to be entirely self-
consistent, with the result being that most actions don't need to be and 
are not logged.  At a crash and later remount, the filesystem as of the 
last atomically-written root-block state will be mounted, and anything 
being written at the time of the crash will either have been entirely 
written and committed (the top-level root tree block will have been 
updated to reflect it), or that update will not have happened yet, so the 
state of the filesystem will be that of the last root tree block commit, 
with newer in-process actions lost.

The btrfs log is an exception, a compromise in the interest of fsync 
speed.  The only things it logs are fsyncs (filesyncs, as opposed to whole
filesystem syncs) that would otherwise not return until the next commit 
(with commits on a 30-second timer by default), since the filesystem 
would otherwise be unable to guarantee that the fsync had been entirely 
written to permanent media and thus should survive a crash.  The log 
ensures the fsynced file's new data (if any) is written to its new 
location on the media (cow so new block location), updates the metadata 
(also cow so written to a new location), then logs the metadata update so 
it can be committed at log replay if necessary, and returns.  If a crash 
happens before the next full filesystem atomic commit, the fsync can be 
replayed from the log, thus satisfying the fsync guarantee without 
forcing a wait for a full atomic commit.  But once that full filesystem 
atomic commit happens (again, with a 30-second default timeout), all 
updates are now reflected in the new filesystem state as registered in 
the new root tree block, and the previous log is now dead/unreferenced on 
the media (because the new root block doesn't refer to it any longer, 
referring instead to a new log).
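
From userspace, the guarantee described above is what an application
asks for when it calls fsync.  The following is a minimal sketch, not
taken from the thread: the helper names and file names are made up for
illustration, and whether the kernel satisfies each fsync through the
log or through a full commit is its own decision and is not visible
here.

```python
import os
import tempfile


def write_and_fsync(path, data):
    """Write data and return only once this one file has been fsynced."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # push Python's buffer to the kernel
        os.fsync(f.fileno())  # ask the kernel to make the file durable


def fsync_dir(dirpath):
    """fsync a directory so that a rename inside it is made durable."""
    fd = os.open(dirpath, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


with tempfile.TemporaryDirectory() as d:
    old = os.path.join(d, "old.txt")
    new = os.path.join(d, "new.txt")
    write_and_fsync(old, b"hello")
    os.rename(old, new)
    fsync_dir(d)
    # os.sync() would instead flush every mounted filesystem.
    renamed_ok = os.path.exists(new) and not os.path.exists(old)
```

Without the fsync calls, the same changes would simply wait for the
next periodic commit.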

> 2. Why are there 2 complete writes to the data held by the root
> directory and not just 1? These writes are 256KB each, which is the size
> of the extent allocated to the root directory

I'm not sure on this one, hopefully a btrfs dev can clarify, but at a
guess, you may be seeing writes to the superblock and its backup -- on a
large enough filesystem there are two backups, but your filesystem may be
small enough to have just one backup.

It's also possible you're seeing the new copy of the metadata tree being 
written out, then the root block and superblocks (and backups) being 
updated.

> 3. Why are the writes being done to the root directory of the file
> system / subvolume and not just the parent directory where the unlink
> happened?

Remember, everything's in trees, and updates are cowed, with updates at 
lower levels of the tree not reflected in the atomic state of the 
filesystem until they've recursed up the tree and a new root tree block 
is written, pointing at the new trees instead of the old ones, with the 
superblock and backups then updated to point at the new root tree block.

So nothing's local-only.  First, the old (meta)data along with any 
updates to it is written to a new location, then higher tree entries must 
be updated and written to new locations, all the way to the top.  Until 
that top entry is updated, the state of the filesystem reflects the old 
state, without any in-process changes -- it's as if your rename hasn't 
happened yet because the atomic filesystem state doesn't point to the 
newly written location yet.  Once the updates reach the top, the new 
state is reflected.
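
The "all the way to the top" behaviour can be pictured with a toy
path-copying tree.  This is only a sketch under simplifying
assumptions: the Node class and cow_update() are hypothetical names,
and real btrfs tree blocks, keys and block pointers look nothing like
Python dicts.

```python
class Node:
    """Toy tree block: holds child blocks and/or leaf items."""

    def __init__(self, children=None, items=None):
        self.children = dict(children or {})  # name -> Node
        self.items = dict(items or {})        # key -> value


def cow_update(node, path, key, value):
    """Return a new root with `key` set in the block reached via `path`.

    Every block on the path is copied (written to a "new location");
    blocks off the path are shared with the old tree, not copied.
    """
    new = Node(node.children, node.items)  # copy of this block
    if not path:
        new.items[key] = value
    else:
        head, rest = path[0], path[1:]
        new.children[head] = cow_update(node.children[head], rest, key, value)
    return new


old_root = Node(children={
    "a": Node(children={"leaf1": Node(items={"name": "old.txt"})}),
    "b": Node(items={"name": "other.txt"}),
})
new_root = cow_update(old_root, ["a", "leaf1"], "name", "new.txt")

# Until something points at new_root, readers of old_root see the old state.
print(old_root.children["a"].children["leaf1"].items["name"])
print(new_root.children["a"].children["leaf1"].items["name"])
```

Switching from old_root to new_root plays the role of the superblock
update: it is the single step that makes the change visible.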

Of course with the fsync log being an exception, as mentioned above, but 
it too is renewed by the full filesystem commit, with the old log freed 
to be garbage-collected, and a new initially empty log pointed at by the 
newly written root block, which is in turn pointed at by the newly 
rewritten superblock and its backups.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: Regarding handling of file renames in Btrfs
  2017-09-09 23:50 Regarding handling of file renames in Btrfs Rohan Kadekodi
  2017-09-10  1:32 ` Duncan
@ 2017-09-10  6:41 ` Qu Wenruo
  2017-09-10  6:45   ` Qu Wenruo
  2017-09-16 12:27 ` Hans van Kranenburg
  2 siblings, 1 reply; 11+ messages in thread
From: Qu Wenruo @ 2017-09-10  6:41 UTC (permalink / raw)
  To: Rohan Kadekodi, linux-btrfs
  Cc: Vijaychidambaram Velayudhan Pillai, Jayashree Mohan



On 2017-09-10 07:50, Rohan Kadekodi wrote:
> Hello,
> 
> I was trying to understand how file renames are handled in Btrfs. I
> read the code documentation, but had a problem understanding a few
> things.
> 
> During a file rename, btrfs_commit_transaction() is called which is
> because Btrfs has to commit the whole FS before storing the
> information related to the new renamed file. It has to commit the FS
> because a rename first does an unlink, which is not recorded in the
> btrfs_rename() transaction and so is not logged in the log tree. Is my
> understanding correct? If yes, my questions are as follows:

I'm not familiar with the rename kernel code, so I can't be of much help
with the rename operation.

> 
> 1. What does committing the whole FS mean?

Committing the whole fs means a lot of things, but generally speaking, 
it makes that the on-disk data is inconsistent with each other.

For the obvious part, it writes the modified fs/subvolume trees to disk
(handling tree operations so that there are no half-modified trees).

It also writes other trees, like the extent tree (very hot, since every CoW
updates it, and the most complicated one) and the csum tree, if modified.

After a transaction is committed, the on-disk btrfs will represent the
state at the time commit trans was called, and all trees should match each
other.

Besides this, after a transaction is committed, the generation of the fs
is increased and the modified tree blocks will have the same generation
number.

> Blktrace shows that there
> are 2       256KB writes, which are essentially writes to the data of
> the root directory of the file system (which I found out through
> btrfs-debug-tree).

I'd say you didn't check the btrfs-debug-tree output carefully enough.
I strongly recommend using vimdiff to see which trees are modified.

At least the following trees are modified:

1) fs/subvolume tree
    Rename modifies at least the DIR_INDEX/DIR_ITEM/INODE_REF items, and
    updates the inode time.
    So the fs/subvolume tree must be CoWed.

2) extent tree
    CoW of the above metadata will definitely cause extent allocation
    and freeing, so the extent tree will also get updated.

3) root tree
    Both the extent tree and the fs/subvolume tree are modified, so their
    root bytenrs need to be updated, and the root tree must be updated.

And finally superblocks.

I just verified the behavior with an empty btrfs created on a 1G file, with
only one file to rename.

In that case (with 4K sectorsize and 16K nodesize), the total IO should
be (3 * 16K) * 2 + 4K * 2 = 104K.

"3" = number of tree blocks that get modified
"16K" = nodesize
1st "*2" = DUP profile for metadata
"4K" = superblock size
2nd "*2" = 2 superblocks for 1G fs.
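
The same arithmetic as a small sketch, so the numbers can be varied for
other setups.  The function and parameter names are made up for
illustration; the defaults are the values from the breakdown above.

```python
KIB = 1024


def commit_io_bytes(tree_blocks=3, nodesize=16 * KIB, metadata_copies=2,
                    superblock_size=4 * KIB, superblocks=2):
    """Estimated bytes written by a minimal commit.

    tree_blocks     -- number of tree blocks that get modified
    nodesize        -- size of one tree block
    metadata_copies -- 2 for the DUP metadata profile
    superblock_size -- size of one superblock
    superblocks     -- number of superblock copies on the device
    """
    metadata = tree_blocks * nodesize * metadata_copies
    supers = superblock_size * superblocks
    return metadata + supers


total = commit_io_bytes()
print(total // KIB, "KiB")  # 104 KiB for the 1G example above
```

With three superblocks instead of two, the same formula gives 108 KiB.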

If your extent/root/fs trees have a higher level, then more tree blocks
need to be updated.
And if your fs is very large, you may have 3 superblocks.

> Is this equivalent to doing a shell sync, as the
> same block groups are written during a shell sync too?

For shell "sync" the difference is that "sync" will write all dirty
data pages to disk, and then commit the transaction.
Calling only btrfs_commit_transaction(), on the other hand, doesn't
trigger dirty page writeback.

So there is a difference.

Furthermore, if nothing has been modified at all, sync will just
skip the fs, so btrfs_commit_transaction() is not guaranteed to run if
you call "sync".

> Also, does it
> imply that all the metadata held by the log tree is now checkpointed
> to the respective trees?

The log tree part is a little tricky, as the log tree is not really a
journal for btrfs.
Btrfs uses CoW for metadata, so in theory (and in fact) btrfs doesn't
need any journal.

The log tree is mainly used to improve btrfs fsync performance.
You can totally disable the log tree with the notreelog mount option and
btrfs will behave just fine.

Furthermore, I'm not very familiar with the log tree, and I need to check
the code to see if the log tree is used in rename, so I can't say much
right now.

But to keep things simple, I strongly recommend ignoring the log tree for now.

> 
> 2. Why are there 2 complete writes to the data held by the root
> directory and not just 1? These writes are 256KB each, which is the
> size of the extent allocated to the root directory

Check my first calculation and verify the debug-tree output before and 
after rename.

I think there are some extra factors affecting the number, from the tree
height to your fs tree organization.

> 
> 3. Why are the writes being done to the root directory of the file
> system / subvolume and not just the parent directory where the unlink
> happened?

That's why I strongly recommend understanding the btrfs on-disk format first.
A lot of things can be answered after understanding the on-disk layout,
without asking anyone else.

The short answer is that btrfs puts all the child dir/inode info of one
subvolume into one tree.
(And the term "root directory" here is a little confusing: are you
talking about the fs tree root or the root tree?)

This is not the common one-tree-per-inode layout.

So if you rename one file in a subvolume, the subvolume tree gets CoWed,
which means everything from the leaf containing the key/item you want to
modify up to the tree root will be CoWed.
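
A toy picture of the "one tree per subvolume" layout, as a sketch only.
Items of every inode sit in one key-sorted space, keyed by
(objectid, type, offset), and a rename touches the DIR_ITEM, DIR_INDEX,
INODE_REF and inode items mentioned earlier.  Everything else here is
an assumption made for illustration: the helper names, the item
contents, reusing the old dir index, and zlib.crc32 standing in for the
real name hash.

```python
import zlib


def name_hash(name):
    # Placeholder only: btrfs hashes names with crc32c, which the
    # Python standard library does not provide.
    return zlib.crc32(name.encode())


def make_subvolume_tree():
    """Toy model: one item space shared by the directory and the file."""
    return {
        (256, "INODE_ITEM", 0): {"kind": "dir"},
        (256, "DIR_ITEM", name_hash("old.txt")): {"name": "old.txt", "ino": 257},
        (256, "DIR_INDEX", 2): {"name": "old.txt", "ino": 257},
        (257, "INODE_ITEM", 0): {"kind": "file", "ctime": 0},
        (257, "INODE_REF", 256): {"name": "old.txt", "index": 2},
    }


def rename(tree, parent, ino, old, new, now):
    """Rename inside one directory; return the set of keys touched."""
    ref_key = (ino, "INODE_REF", parent)
    index = tree[ref_key]["index"]  # simplification: the index is reused
    old_item = (parent, "DIR_ITEM", name_hash(old))
    new_item = (parent, "DIR_ITEM", name_hash(new))
    index_key = (parent, "DIR_INDEX", index)
    inode_key = (ino, "INODE_ITEM", 0)

    del tree[old_item]
    tree[new_item] = {"name": new, "ino": ino}
    tree[index_key] = {"name": new, "ino": ino}
    tree[ref_key] = {"name": new, "index": index}
    tree[inode_key]["ctime"] = now  # the "updated inode time" part
    return {old_item, new_item, index_key, ref_key, inode_key}


tree = make_subvolume_tree()
touched = rename(tree, parent=256, ino=257, old="old.txt", new="new.txt", now=1)
for key in sorted(tree):
    print(key, tree[key])
```

Because the parent directory's items and the file's items live in the
same tree, every leaf holding a touched key, and every node above it,
is CoWed up to that one tree's root.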

Thanks,
Qu
> 
> It would be great if I could get the answers to these questions.
> 
> Thanks,
> Rohan
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


* Re: Regarding handling of file renames in Btrfs
  2017-09-10  6:41 ` Qu Wenruo
@ 2017-09-10  6:45   ` Qu Wenruo
  2017-09-10 14:32     ` Rohan Kadekodi
  2017-09-10 14:34     ` Martin Raiber
  0 siblings, 2 replies; 11+ messages in thread
From: Qu Wenruo @ 2017-09-10  6:45 UTC (permalink / raw)
  To: Rohan Kadekodi, linux-btrfs
  Cc: Vijaychidambaram Velayudhan Pillai, Jayashree Mohan



On 2017-09-10 14:41, Qu Wenruo wrote:
> 
> 
> On 2017-09-10 07:50, Rohan Kadekodi wrote:
>> Hello,
>>
>> I was trying to understand how file renames are handled in Btrfs. I
>> read the code documentation, but had a problem understanding a few
>> things.
>>
>> During a file rename, btrfs_commit_transaction() is called which is
>> because Btrfs has to commit the whole FS before storing the
>> information related to the new renamed file. It has to commit the FS
>> because a rename first does an unlink, which is not recorded in the
>> btrfs_rename() transaction and so is not logged in the log tree. Is my
>> understanding correct? If yes, my questions are as follows:
> 
> Not familiar with rename kernel code, so not much help for rename 
> opeartion.
> 
>>
>> 1. What does committing the whole FS mean?
> 
> Committing the whole fs means a lot of things, but generally speaking, 
> it makes that the on-disk data is inconsistent with each other.
                                     ^consistent
Sorry for the typo.

Thanks,
Qu


* Re: Regarding handling of file renames in Btrfs
  2017-09-10  6:45   ` Qu Wenruo
@ 2017-09-10 14:32     ` Rohan Kadekodi
  2017-09-11  1:48       ` Qu Wenruo
  2017-09-10 14:34     ` Martin Raiber
  1 sibling, 1 reply; 11+ messages in thread
From: Rohan Kadekodi @ 2017-09-10 14:32 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: linux-btrfs, Vijaychidambaram Velayudhan Pillai, Jayashree Mohan

Thank you for the prompt and elaborate answers! However, I think I was
unclear in my questions, and I apologize for the confusion.

What I meant was that for a file rename, when I check the blktrace
output, there are 2 writes of 256KB each starting from byte number:
13373440

When I check btrfs-debug-tree, I see that the following items are related to it:

1) root tree:
     key (256 EXTENT_DATA 0) itemoff 13649 itemsize 53
     extent data disk byte 13373440 nr 262144
     extent data offset 0 nr 262144 ram 262144
     extent compression 0

2) extent tree:
     key (13373440 EXTENT_ITEM 262144) itemoff 15040 itemsize 53
     extent refs 1 gen 12 flags DATA
     extent data backref root 1 objectid 256 offset 0 count 1

So this means that the extent allocated to the root folder (mount
point) is getting written twice, right? Here I am not talking about any
metadata, but the data in the extent allocated to the root folder,
that is, inode number 256.
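
As a sanity check on the two dump excerpts above, the sketch below
parses them and confirms that they describe the same extent and that
its length is 256KB.  The regular expressions and function names are
assumptions written only for these three lines; they are not a general
parser for btrfs-debug-tree output.

```python
import re

# Lines copied from the btrfs-debug-tree excerpts above.
FILE_EXTENT_LINE = "extent data disk byte 13373440 nr 262144"
EXTENT_KEY_LINE = "key (13373440 EXTENT_ITEM 262144) itemoff 15040 itemsize 53"
BACKREF_LINE = "extent data backref root 1 objectid 256 offset 0 count 1"


def parse_file_extent(line):
    """Return (disk_bytenr, length) from a file extent line."""
    m = re.search(r"extent data disk byte (\d+) nr (\d+)", line)
    if m is None:
        raise ValueError("not a file extent line: " + line)
    return int(m.group(1)), int(m.group(2))


def parse_extent_key(line):
    """Return (bytenr, length) from an EXTENT_ITEM key line."""
    m = re.search(r"key \((\d+) EXTENT_ITEM (\d+)\)", line)
    if m is None:
        raise ValueError("not an EXTENT_ITEM key line: " + line)
    return int(m.group(1)), int(m.group(2))


def parse_backref(line):
    """Return (root, objectid) from an extent data backref line."""
    m = re.search(r"backref root (\d+) objectid (\d+)", line)
    if m is None:
        raise ValueError("not a backref line: " + line)
    return int(m.group(1)), int(m.group(2))


bytenr, length = parse_file_extent(FILE_EXTENT_LINE)
same_extent = (bytenr, length) == parse_extent_key(EXTENT_KEY_LINE)
owner_root, owner_objectid = parse_backref(BACKREF_LINE)
print(bytenr, length // 1024, same_extent, owner_root, owner_objectid)
```

The backref line names root 1 and objectid 256 as the owner, which
matches the item listed under "root tree" above.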

When I was analyzing the code, I saw that these writes happened from
btrfs_start_dirty_block_groups() which is in
btrfs_commit_transaction(). This is the same thing that is getting
written on a filesystem commit.

So my questions were:
1) Why are there 2 256KB writes happening during a filesystem commit
to the same location instead of just 1? Also, what exactly is written
in the root folder of the file system? Again, I am talking about the
data held in the extent allocated to inode 256 and not about any metadata
or any tree.

2) I understand from the on-disk format that all the child dir/inode
info in one subvolume is in the same tree, but the writes that I am
talking about are not to any tree; they are to the data held in inode 256,
which happens to be the mount point. So by root directory, I mean the
mount point or inode 256 (not any tree). And even though metadata-wise
there is no hierarchy as such in the file system, each folder's
data will only contain the data belonging to its children, right? Hence
my question: why does the data in the extent allocated to
inode 256 need to be rewritten, instead of just the parent folder's, for a
rename?

Thanks,
Rohan



* Re: Regarding handling of file renames in Btrfs
  2017-09-10  6:45   ` Qu Wenruo
  2017-09-10 14:32     ` Rohan Kadekodi
@ 2017-09-10 14:34     ` Martin Raiber
  2017-09-11  5:22       ` Qu Wenruo
  1 sibling, 1 reply; 11+ messages in thread
From: Martin Raiber @ 2017-09-10 14:34 UTC (permalink / raw)
  To: Qu Wenruo, Rohan Kadekodi, linux-btrfs
  Cc: Vijaychidambaram Velayudhan Pillai, Jayashree Mohan

Hi,

On 10.09.2017 08:45 Qu Wenruo wrote:
>
>
> On 2017-09-10 14:41, Qu Wenruo wrote:
>>
>>
>> On 2017-09-10 07:50, Rohan Kadekodi wrote:
>>> Hello,
>>>
>>> I was trying to understand how file renames are handled in Btrfs. I
>>> read the code documentation, but had a problem understanding a few
>>> things.
>>>
>>> During a file rename, btrfs_commit_transaction() is called which is
>>> because Btrfs has to commit the whole FS before storing the
>>> information related to the new renamed file. It has to commit the FS
>>> because a rename first does an unlink, which is not recorded in the
>>> btrfs_rename() transaction and so is not logged in the log tree. Is my
>>> understanding correct? If yes, my questions are as follows:
>>
>> Not familiar with rename kernel code, so not much help for rename
>> opeartion.
>>
>>>
>>> 1. What does committing the whole FS mean?
>>
>> Committing the whole fs means a lot of things, but generally
>> speaking, it makes that the on-disk data is inconsistent with each
>> other.
>
>> For obvious part, it writes modified fs/subvolume trees to disk (with
>> handling of tree operations so no half modified trees).
>>
>> Also other trees like extent tree (very hot since every CoW will
>> update it, and the most complicated one), csum tree if modified.
>>
>> After transaction is committed, the on-disk btrfs will represent the
>> states when commit trans is called, and every tree should match each
>> other.
>>
>> Despite of this, after a transaction is committed, generation of the
>> fs get increased and modified tree blocks will have the same
>> generation number.
>>
>>> Blktrace shows that there
>>> are 2       256KB writes, which are essentially writes to the data of
>>> the root directory of the file system (which I found out through
>>> btrfs-debug-tree).
>>
>> I'd say you didn't check btrfs-debug-tree output carefully enough.
>> I strongly recommend to do vimdiff to get what tree is modified.
>>
>> At least the following trees are modified:
>>
>> 1) fs/subvolume tree
>>     Rename modified the DIR_INDEX/DIR_ITEM/INODE_REF at least, and
>>     updated inode time.
>>     So fs/subvolume tree must be CoWed.
>>
>> 2) extent tree
>>     CoW of above metadata operation will definitely cause extent
>>     allocation and freeing, extent tree will also get updated.
>>
>> 3) root tree
>>     Both extent tree and fs/subvolume tree modified, their root bytenr
>>     needs to be updated and root tree must be updated.
>>
>> And finally superblocks.
>>
>> I just verified the behavior with empty btrfs created on a 1G file,
>> only one file to do the rename.
>>
>> In that case (with 4K sectorsize and 16K nodesize), the total IO
>> should be (3 * 16K) * 2 + 4K * 2 = 104K.
>>
>> "3" = number of tree blocks get modified
>> "16K" = nodesize
>> 1st "*2" = DUP profile for metadata
>> "4K" = superblock size
>> 2nd "*2" = 2 superblocks for 1G fs.
>>
>> If your extent/root/fs trees have higher level, then more tree blocks
>> needs to be updated.
>> And if your fs is very large, you may have 3 superblocks.
>>
>>> Is this equivalent to doing a shell sync, as the
>>> same block groups are written during a shell sync too?
>>
>> For shell "sync" the difference is that, "sync" will write all dirty
>> data pages to disk, and then commit transaction.
>> While only calling btrfs_commit_transaction() doesn't trigger dirty
>> page writeback.
>>
>> So there is a difference.

This conversation made me realize why btrfs has sub-optimal metadata
performance. CoW B-trees are not the best data structure for such small
changes. In my application I have multiple operations (e.g. renames)
which can be bundled up, and (mostly) one writer.
I guess using BTRFS_IOC_TRANS_START and BTRFS_IOC_TRANS_END would be one
way to reduce the CoW overhead, but those are dangerous with respect to
ENOSPC and there have been discussions about removing them.
Best would be if there were delayed metadata, where metadata is handled
the same as delayed allocations and data changes, i.e. committed on
fsync, at the commit interval or on syncfs. I assumed this was already
the case...

Please correct me if I got this wrong.

Regards,
Martin Raiber


* Re: Regarding handling of file renames in Btrfs
  2017-09-10 14:32     ` Rohan Kadekodi
@ 2017-09-11  1:48       ` Qu Wenruo
  0 siblings, 0 replies; 11+ messages in thread
From: Qu Wenruo @ 2017-09-11  1:48 UTC (permalink / raw)
  To: Rohan Kadekodi
  Cc: linux-btrfs, Vijaychidambaram Velayudhan Pillai, Jayashree Mohan



On 2017-09-10 22:32, Rohan Kadekodi wrote:
> Thank you for the prompt and elaborate answers! However, I think I was
> unclear in my questions, and I apologize for the confusion.
> 
> What I meant was that for a file rename, when I check the blktrace
> output, there are 2 writes of 256KB each starting from byte number:
> 13373440
> 
> When I check btrfs-debug-tree, I see that the following items are related to it:
> 
> 1) root tree:
>       key (256 EXTENT_DATA 0) itemoff 13649 itemsize 53
>       extent data disk byte 13373440 nr 262144
>       extent data offset 0 nr 262144 ram 262144
>       extent compression 0
> 
> 2) extent tree:
>       key (13373440 EXTENT_ITEM 262144) itemoff 15040 itemsize 53
>       extent refs 1 gen 12 flags DATA
>       extent data backref root 1 objectid 256 offset 0 count 1
> 
> So this means that the extent allocated to the root folder (mount
> point) is getting written twice right? Here I am not talking about any
> metadata, but the data in the extent allocated to the root folder,
> that is inode number 256.

Such extent data is used by the free space cache.

If you mount with the nospace_cache or space_cache=v2 option, there will
be no such extent.

The free space cache records the free and used space of each chunk (or
block group, which is mostly the same thing).
Since CoW happens in a metadata chunk, its used/free space mapping gets
modified, and so its free space cache is updated as well.

BTW, a difference in terminology confuses me a little.
We call root 1 the "tree root" or "root tree", not the root directory,
because that tree doesn't contain any real file or directory.
> 
> When I was analyzing the code, I saw that these writes happened from
> btrfs_start_dirty_block_groups() which is in
> btrfs_commit_transaction(). This is the same thing that is getting
> written on a filesystem commit.
> 
> So my questions were:
> 1) Why are there 2 256KB writes happening during a filesystem commit
> to the same location instead of just 1? Also, what exactly is written
> in the root folder of the file system? Again, I am talking about the
> data held in the extent allocated inode 256 and not about any metadata
> or any tree.

As stated above, the EXTENT_DATA in the root tree belongs to the space
cache (v1), which uses a NoCOW file extent as a file to record free space.

There is one such space cache per block group.

Furthermore, since it's EXTENT_DATA, it counts as DATA, so it follows
your data profile (which defaults to single for a single device and RAID0
for multiple devices).

If you are not using DUP as the data profile, then you have 2 block
groups that got modified.

> 
> 2) I understand by the on-disk format that all the child dir/inode
> info in one subvolume are in the same tree, but these writes that I am
> talking about are not to any tree, they to the data held in inode 256,
> which happens to be the mount point. So by root directory, I mean the
> mount point or the inode 256 (not any tree).

As mentioned before, it's better to call it the "root tree", as it
doesn't really represent a directory.

> And even though metadata
> wise there is no hierarchy as such in the file system, each folder
> data will only contain the data belonging to its children right?

That sentence confuses me.
By "folder" do you mean a normal directory? And how do you define "data
belonging to its children"?

As stated before, there is no real boundary for an inode (whether a
normal file or a directory).
All inode data (including EXTENT_DATA for a regular file and
DIR_INDEX/DIR_ITEM for a directory inode) are just sequential keys (with
their data) in a subvolume.

So without your definition of "belonging to" I can't get the point.

> Hence
> my question was that why does the data in the extent allocated to
> inode 256 need to be rewritten instead of just the parent folder for a
> rename?

My first paragraph explained this.

BTW, the EXTENT_DATA in root 1 (root tree) that you are concerned about
is reached through the following sequence (BTRFS_ prefix omitted, all
keys are in root 1):

(FREE_SPACE_OBJECTID, 0, <Block group bytenr>)
Its structure, btrfs_free_space_header, contains a key referring to an
inode, which is a regular file inode.
The inode key will be (<ino>, INODE_ITEM, 0).

Then, still in the tree root (rootid 1), btrfs searches for the
(<ino>, INODE_ITEM, 0) key to locate the free space cache inode.

Finally, btrfs just reads the data stored for this inode, using its
(<ino>, EXTENT_DATA, <offset>) keys to locate the real data on disk and
read it out.
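
To make the lookup sequence above a bit more concrete, here is a minimal
sketch in Python. It is only a toy model, not real on-disk parsing: the
root tree is a plain dict keyed by (objectid, type, offset), the
FREE_SPACE_OBJECTID value is a symbolic stand-in, and the block group
bytenr is a made-up number. Only inode 256 and the extent values are
taken from the dump quoted earlier in this mail.

```python
# Toy model of root 1 (the root tree): a dict keyed by (objectid, type, offset).
# The BTRFS_ prefix is omitted, as in the text above.

FREE_SPACE_OBJECTID = "FREE_SPACE"  # symbolic stand-in, not the real objectid
BLOCK_GROUP_BYTENR = 29360128       # hypothetical block group start

root_tree = {
    # Free space header: stores the key of the cache inode.
    (FREE_SPACE_OBJECTID, 0, BLOCK_GROUP_BYTENR): {
        "location": (256, "INODE_ITEM", 0),
    },
    # The free space cache inode, a regular file inode.
    (256, "INODE_ITEM", 0): {"flags": {"NODATACOW"}},
    # Its file extent; values taken from the quoted btrfs-debug-tree dump.
    (256, "EXTENT_DATA", 0): {"disk_bytenr": 13373440, "num_bytes": 262144},
}


def find_cache_extents(tree, block_group_bytenr):
    """Follow header -> inode key -> EXTENT_DATA items for one block group."""
    header = tree[(FREE_SPACE_OBJECTID, 0, block_group_bytenr)]
    inode_key = header["location"]
    if inode_key not in tree:
        raise KeyError("free space cache inode not found")
    ino = inode_key[0]
    # Collect (file offset, disk bytenr, length) for every extent of that inode.
    return sorted(
        (key[2], item["disk_bytenr"], item["num_bytes"])
        for key, item in tree.items()
        if key[0] == ino and key[1] == "EXTENT_DATA"
    )


extents = find_cache_extents(root_tree, BLOCK_GROUP_BYTENR)
print(extents)
```

In this model the result is the single 256KB extent at 13373440, which is
the location that shows up in the blktrace output.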

For details such as what the space cache looks like, you need to check
the free space cache code.
(In short, it's a mess, so we have space_cache=v2, which uses a normal
btrfs B-tree to store this info, and btrfs-debug-tree can show it
easily.)

And of course, on transaction commit each dirty block group needs to
update its free space cache. The free space cache file has the
NODATACOW flag, so the cache carries its own checksum mechanism, and
normally the whole free space cache file is rewritten.

Thanks,
Qu

> 
> Thanks,
> Rohan
> 
> On 10 September 2017 at 01:45, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>>
>> On 2017-09-10 14:41, Qu Wenruo wrote:
>>>
>>>
>>>
>>> On 2017-09-10 07:50, Rohan Kadekodi wrote:
>>>>
>>>> Hello,
>>>>
>>>> I was trying to understand how file renames are handled in Btrfs. I
>>>> read the code documentation, but had a problem understanding a few
>>>> things.
>>>>
>>>> During a file rename, btrfs_commit_transaction() is called which is
>>>> because Btrfs has to commit the whole FS before storing the
>>>> information related to the new renamed file. It has to commit the FS
>>>> because a rename first does an unlink, which is not recorded in the
>>>> btrfs_rename() transaction and so is not logged in the log tree. Is my
>>>> understanding correct? If yes, my questions are as follows:
>>>
>>>
>>> Not familiar with rename kernel code, so not much help for rename
>>> operation.
>>>
>>>>
>>>> 1. What does committing the whole FS mean?
>>>
>>>
>>> Committing the whole fs means a lot of things, but generally speaking, it
>>> makes that the on-disk data is inconsistent with each other.
>>
>>                                      ^consistent
>> Sorry for the typo.
>>
>> Thanks,
>> Qu
>>
>>>
>>> For obvious part, it writes modified fs/subvolume trees to disk (with
>>> handling of tree operations so no half modified trees).
>>>
>>> Also other trees like extent tree (very hot since every CoW will update
>>> it, and the most complicated one), csum tree if modified.
>>>
>>> After transaction is committed, the on-disk btrfs will represent the
>>> states when commit trans is called, and every tree should match each other.
>>>
>>> Despite of this, after a transaction is committed, generation of the fs
>>> get increased and modified tree blocks will have the same generation number.
>>>
>>>> Blktrace shows that there
>>>> are 2 256KB writes, which are essentially writes to the data of
>>>> the root directory of the file system (which I found out through
>>>> btrfs-debug-tree).
>>>
>>>
>>> I'd say you didn't check btrfs-debug-tree output carefully enough.
>>> I strongly recommend to do vimdiff to get what tree is modified.
>>>
>>> At least the following trees are modified:
>>>
>>> 1) fs/subvolume tree
>>>      Rename modified the DIR_INDEX/DIR_ITEM/INODE_REF at least, and
>>>      updated inode time.
>>>      So fs/subvolume tree must be CoWed.
>>>
>>> 2) extent tree
>>>      CoW of above metadata operation will definitely cause extent
>>>      allocation and freeing, extent tree will also get updated.
>>>
>>> 3) root tree
>>>      Both extent tree and fs/subvolume tree modified, their root bytenr
>>>      needs to be updated and root tree must be updated.
>>>
>>> And finally superblocks.
>>>
>>> I just verified the behavior with empty btrfs created on a 1G file, only
>>> one file to do the rename.
>>>
>>> In that case (with 4K sectorsize and 16K nodesize), the total IO should be
>>> (3 * 16K) * 2 + 4K * 2 = 104K.
>>>
>>> "3" = number of tree blocks get modified
>>> "16K" = nodesize
>>> 1st "*2" = DUP profile for metadata
>>> "4K" = superblock size
>>> 2nd "*2" = 2 superblocks for 1G fs.
>>>
>>> If your extent/root/fs trees have higher level, then more tree blocks
>>> needs to be updated.
>>> And if your fs is very large, you may have 3 superblocks.
>>>
>>>> Is this equivalent to doing a shell sync, as the
>>>> same block groups are written during a shell sync too?
>>>
>>>
>>> For shell "sync" the difference is that, "sync" will write all dirty data
>>> pages to disk, and then commit transaction.
>>> While only calling btrfs_commit_transaction() doesn't trigger dirty page
>>> writeback.
>>>
>>> So there is a difference.
>>>
>>> And furthermore, if there is nothing to modified at all, sync will just
>>> skip the fs, so btrfs_commit_transaction() is not ensured if you call
>>> "sync".
>>>
>>>> Also, does it
>>>> imply that all the metadata held by the log tree is now checkpointed
>>>> to the respective trees?
>>>
>>>
>>> Log tree part is a little tricky, as the log tree is not really a journal
>>> for btrfs.
>>> Btrfs uses CoW for metadata so in theory (and in fact) btrfs doesn't need
>>> any journal.
>>>
>>> Log tree is mainly used for enhancing btrfs fsync performance.
>>> You can totally disable log tree by notreelog mount option and btrfs will
>>> behave just fine.
>>>
>>> And furthermore, I'm not very familiar with log tree, I need to verify the
>>> code to see if log tree is used in rename, so I can't say much right now.
>>>
>>> But to make things easy, I strongly recommend to ignore log tree for now.
>>>
>>>>
>>>> 2. Why are there 2 complete writes to the data held by the root
>>>> directory and not just 1? These writes are 256KB each, which is the
>>>> size of the extent allocated to the root directory
>>>
>>>
>>> Check my first calculation and verify the debug-tree output before and
>>> after rename.
>>>
>>> I think there is some extra factors affecting the number, from the tree
>>> height to your fs tree organization.
>>>
>>>>
>>>> 3. Why are the writes being done to the root directory of the file
>>>> system / subvolume and not just the parent directory where the unlink
>>>> happened?
>>>
>>>
>>> That's why I strongly recommend to understand btrfs on-disk format first.
>>> A lot of things can be answered after understanding the on-disk layout,
>>> without asking any other guys.
>>>
>>> The short answer is, btrfs puts all its child dir/inode info into one tree
>>> for one subvolume.
>>> (And the term "root directory" here is a little confusing, are you talking
>>> about the fs tree root or the root tree?)
>>>
>>> Not the common one tree for one inode layout.
>>>
>>> So if you rename one file in a subvolume, the subvolume tree get CoWed,
>>> which means from the leaf containing the key/item you want to modify, to the
>>> tree root will be CoWed.
>>>
>>> Thanks,
>>> Qu
>>>>
>>>>
>>>> It would be great if I could get the answers to these questions.
>>>>
>>>> Thanks,
>>>> Rohan
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>


* Re: Regarding handling of file renames in Btrfs
  2017-09-10 14:34     ` Martin Raiber
@ 2017-09-11  5:22       ` Qu Wenruo
  0 siblings, 0 replies; 11+ messages in thread
From: Qu Wenruo @ 2017-09-11  5:22 UTC (permalink / raw)
  To: Martin Raiber, Rohan Kadekodi, linux-btrfs
  Cc: Vijaychidambaram Velayudhan Pillai, Jayashree Mohan



On 2017-09-10 22:34, Martin Raiber wrote:
> Hi,
> 
> On 10.09.2017 08:45 Qu Wenruo wrote:
>>
>>
>> On 2017-09-10 14:41, Qu Wenruo wrote:
>>>
>>>
>>> On 2017-09-10 07:50, Rohan Kadekodi wrote:
>>>> Hello,
>>>>
>>>> I was trying to understand how file renames are handled in Btrfs. I
>>>> read the code documentation, but had a problem understanding a few
>>>> things.
>>>>
>>>> During a file rename, btrfs_commit_transaction() is called which is
>>>> because Btrfs has to commit the whole FS before storing the
>>>> information related to the new renamed file. It has to commit the FS
>>>> because a rename first does an unlink, which is not recorded in the
>>>> btrfs_rename() transaction and so is not logged in the log tree. Is my
>>>> understanding correct? If yes, my questions are as follows:
>>>
>>> Not familiar with rename kernel code, so not much help for rename
>>> operation.
>>>
>>>>
>>>> 1. What does committing the whole FS mean?
>>>
>>> Committing the whole fs means a lot of things, but generally
>>> speaking, it ensures that the on-disk data is consistent with each
>>> other.
>>
>>> For obvious part, it writes modified fs/subvolume trees to disk (with
>>> handling of tree operations so no half modified trees).
>>>
>>> Also other trees like extent tree (very hot since every CoW will
>>> update it, and the most complicated one), csum tree if modified.
>>>
>>> After transaction is committed, the on-disk btrfs will represent the
>>> states when commit trans is called, and every tree should match each
>>> other.
>>>
>>> Despite of this, after a transaction is committed, generation of the
>>> fs get increased and modified tree blocks will have the same
>>> generation number.
>>>
>>>> Blktrace shows that there
>>>> are 2 256KB writes, which are essentially writes to the data of
>>>> the root directory of the file system (which I found out through
>>>> btrfs-debug-tree).
>>>
>>> I'd say you didn't check btrfs-debug-tree output carefully enough.
>>> I strongly recommend to do vimdiff to get what tree is modified.
>>>
>>> At least the following trees are modified:
>>>
>>> 1) fs/subvolume tree
>>>      Rename modified the DIR_INDEX/DIR_ITEM/INODE_REF at least, and
>>>      updated inode time.
>>>      So fs/subvolume tree must be CoWed.
>>>
>>> 2) extent tree
>>>      CoW of above metadata operation will definitely cause extent
>>>      allocation and freeing, extent tree will also get updated.
>>>
>>> 3) root tree
>>>      Both extent tree and fs/subvolume tree modified, their root bytenr
>>>      needs to be updated and root tree must be updated.
>>>
>>> And finally superblocks.
>>>
>>> I just verified the behavior with empty btrfs created on a 1G file,
>>> only one file to do the rename.
>>>
>>> In that case (with 4K sectorsize and 16K nodesize), the total IO
>>> should be (3 * 16K) * 2 + 4K * 2 = 104K.
>>>
>>> "3" = number of tree blocks get modified
>>> "16K" = nodesize
>>> 1st "*2" = DUP profile for metadata
>>> "4K" = superblock size
>>> 2nd "*2" = 2 superblocks for 1G fs.
>>>
>>> If your extent/root/fs trees have higher level, then more tree blocks
>>> needs to be updated.
>>> And if your fs is very large, you may have 3 superblocks.
>>>
>>>> Is this equivalent to doing a shell sync, as the
>>>> same block groups are written during a shell sync too?
>>>
>>> For shell "sync" the difference is that, "sync" will write all dirty
>>> data pages to disk, and then commit transaction.
>>> While only calling btrfs_commit_transaction() doesn't trigger dirty
>>> page writeback.
>>>
>>> So there is a difference.
> 
> this conversation made me realize why btrfs has sub-optimal meta-data
> performance. Cow b-trees are not the best data structure for such small
> changes. In my application I have multiple operations (e.g. renames)
> which can be bundled up and (mostly) one writer.

Things are in fact more complicated.

For example, even if you are only renaming/moving one file, you are
going to modify at least 6 items:

1) Removing DIR_INDEX of original parent dir inode
    Assume the original parent dir inode number is 300.
    We are removing (300 DIR_INDEX <seq>).

2) Removing DIR_ITEM of original parent dir inode
    We are removing (300 DIR_ITEM <crc32 of the old filename>)

3) Removing INODE_REF of the renamed inode
    Assume the renamed inode number is 400
    We are removing (400 INODE_REF 300).

4) Adding new DIR_INDEX to new parent dir inode
    Assume the new parent dir inode number is 500.
    We are adding (500 DIR_INDEX <seq>)

5) Adding new DIR_ITEM to new parent dir inode
    We are adding (500 DIR_ITEM <crc32 of the new filename>)

6) Adding new INODE_REF to renamed inode
    We are adding (400 INODE_REF 500)

As you can see, there are 6 key modifications, and we can't ensure they
are all in one leaf.
In the worst case, we need to CoW the tree 6 times, once for each
different leaf.
(Although a CoWed tree block won't be CoWed again until it is written to
disk, which reduces the overhead.)

What's more, if you modify a tree, you must also modify the ROOT_ITEM
pointing to that tree, which leads to a root tree CoW.
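
The six key changes listed above can be written down as a small Python
sketch. This is only an illustration: the function names, the file names
and the DIR_INDEX sequence numbers are made up, and zlib.crc32 merely
stands in for the name hash (the exact hash and seed the kernel uses are
not modeled here). The inode numbers 300, 400 and 500 are the ones
assumed in the list.

```python
import zlib


def name_hash(name):
    # Stand-in for the DIR_ITEM name hash; not the kernel's exact function.
    return zlib.crc32(name.encode())


def rename_key_changes(ino, old_dir, old_name, old_seq,
                       new_dir, new_name, new_seq):
    """Return the (removed, added) keys for moving `ino` between directories."""
    removed = [
        (old_dir, "DIR_INDEX", old_seq),
        (old_dir, "DIR_ITEM", name_hash(old_name)),
        (ino, "INODE_REF", old_dir),
    ]
    added = [
        (new_dir, "DIR_INDEX", new_seq),
        (new_dir, "DIR_ITEM", name_hash(new_name)),
        (ino, "INODE_REF", new_dir),
    ]
    return removed, added


# Hypothetical names and sequence numbers, inode numbers as in the list above.
removed, added = rename_key_changes(400, 300, "old.txt", 7,
                                    500, "new.txt", 3)

# Keys are ordered by (objectid, type, offset), so the six keys are spread
# over the key ranges of inodes 300, 400 and 500 and may land in
# different leaves.
for key in sorted(removed + added):
    print(key)
```

Sorting the keys shows why a single leaf is not guaranteed: the changes
fall into three separate objectid ranges of the subvolume tree.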


I have a crazy idea: double buffering tree blocks.
That is to say, one tree block would actually consist of 2 real tree
blocks.

When CoW happens, we just switch to the other tree block.
That way we don't really need to update its parent pointer, so we can
limit the range affected by CoW to a minimum.

But it trades space for IO (although metadata space is relatively
small), and it would definitely require a LARGE on-disk format change.

> I guess using BTRFS_IOC_TRANS_START and BTRFS_IOC_TRANS_END would be one
> way to reduce the cow overhead, but those are dangerous wrt. to ENOSPC
> and there have been discussions about removing them.

Nope. With the current btrfs behavior, only a longer transaction can
reduce the overhead, since a tree block that is already CoWed and not
yet written will not be CoWed again, but just modified in memory.

So you should avoid those ioctls and let btrfs handle transactions by
itself.
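
A toy Python model may help to show why a longer transaction reduces the
overhead. It makes strong simplifying assumptions that are not taken from
the kernel code: the tree has only two levels (one root node, leaves
directly below it), and the class and leaf names are invented for the
example. The only rule it implements is the one stated above: a block
that is already CoWed and not yet written is not CoWed again.

```python
class ToyTree:
    """Two-level toy tree: every leaf hangs directly under one root node."""

    def __init__(self):
        self.cowed = set()       # blocks CoWed in the running transaction
        self.blocks_written = 0  # total tree blocks written by commits

    def modify(self, leaf):
        # CoW the path root -> leaf. Blocks already in the set were CoWed
        # earlier in this transaction and are only modified in memory.
        for block in ("root", leaf):
            self.cowed.add(block)

    def commit(self):
        # Write every CoWed block once, then start a new transaction.
        self.blocks_written += len(self.cowed)
        self.cowed.clear()


# Six item modifications that touch three different leaves.
leaves = ["leaf_a", "leaf_b", "leaf_a", "leaf_c", "leaf_b", "leaf_a"]

one_txn = ToyTree()
for leaf in leaves:
    one_txn.modify(leaf)
one_txn.commit()

many_txn = ToyTree()
for leaf in leaves:
    many_txn.modify(leaf)
    many_txn.commit()

print(one_txn.blocks_written)   # 4: root + 3 distinct leaves
print(many_txn.blocks_written)  # 12: root + 1 leaf, six times
```

In this model the same six modifications cost 4 block writes in one
transaction and 12 when every modification is committed separately.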

> Best would be if there were delayed metadata, where metadata is handled
> the same as delayed allocations and data changes, i.e. commit on fsync,
> commit interval or syncfs. I assumed this was already the case...

It is already delayed, as a tree block that is CoWed but not yet written
will not be CoWed again.

And we even have a second level of delay for extent tree updates to
improve performance.

But don't forget that such an *optimization* itself trades robustness
for performance.
(More code always means more bugs, and delayed refs for the extent tree
are already bug-prone.)

Thanks,
Qu

> 
> Please correct me if I got this wrong.
> 
> Regards,
> Martin Raiber


* Re: Regarding handling of file renames in Btrfs
  2017-09-09 23:50 Regarding handling of file renames in Btrfs Rohan Kadekodi
  2017-09-10  1:32 ` Duncan
  2017-09-10  6:41 ` Qu Wenruo
@ 2017-09-16 12:27 ` Hans van Kranenburg
  2017-09-16 12:40   ` Martin Raiber
  2 siblings, 1 reply; 11+ messages in thread
From: Hans van Kranenburg @ 2017-09-16 12:27 UTC (permalink / raw)
  To: Rohan Kadekodi, linux-btrfs
  Cc: Vijaychidambaram Velayudhan Pillai, Jayashree Mohan

Hi,

On 09/10/2017 01:50 AM, Rohan Kadekodi wrote:
> 
> I was trying to understand how file renames are handled in Btrfs. I
> read the code documentation, but had a problem understanding a few
> things.
> 
> During a file rename, btrfs_commit_transaction() is called which is
> because Btrfs has to commit the whole FS before storing the
> information related to the new renamed file.

Can you point to which lines of code you're looking at?

> It has to commit the FS
> because a rename first does an unlink, which is not recorded in the
> btrfs_rename() transaction and so is not logged in the log tree. Is my
> understanding correct? [...]

Can you also point to where exactly you see this happening? I'd also
like to understand more about this.

The whole mail thread following this message continues about what a
transaction commit is and does etc, but the above question is never
answered I think.

And I think it's an interesting question. Is a rename a "heavier"
operation relative to other file operations?

-- 
Hans van Kranenburg


* Re: Regarding handling of file renames in Btrfs
  2017-09-16 12:27 ` Hans van Kranenburg
@ 2017-09-16 12:40   ` Martin Raiber
  2017-09-16 12:45     ` Hans van Kranenburg
  0 siblings, 1 reply; 11+ messages in thread
From: Martin Raiber @ 2017-09-16 12:40 UTC (permalink / raw)
  To: Hans van Kranenburg, Rohan Kadekodi, linux-btrfs
  Cc: Vijaychidambaram Velayudhan Pillai, Jayashree Mohan

Hi,

On 16.09.2017 14:27 Hans van Kranenburg wrote:
> On 09/10/2017 01:50 AM, Rohan Kadekodi wrote:
>> I was trying to understand how file renames are handled in Btrfs. I
>> read the code documentation, but had a problem understanding a few
>> things.
>>
>> During a file rename, btrfs_commit_transaction() is called which is
>> because Btrfs has to commit the whole FS before storing the
>> information related to the new renamed file.
> Can you point to which lines of code you're looking at?
>
>> It has to commit the FS
>> because a rename first does an unlink, which is not recorded in the
>> btrfs_rename() transaction and so is not logged in the log tree. Is my
>> understanding correct? [...]
> Can you also point to where exactly you see this happening? I'd also
> like to understand more about this.
>
> The whole mail thread following this message continues about what a
> transaction commit is and does etc, but the above question is never
> answered I think.
>
> And I think it's an interesting question. Is a rename a "heavier"
> operation relative to other file operations?
>
As far as I can see, rename only uses the log tree in some cases where
the log tree was already used for the file or the parent directory. The
cases are documented here:
https://github.com/torvalds/linux/blob/master/fs/btrfs/tree-log.c#L45
So rename isn't much heavier than unlink+create.

Regards,
Martin Raiber



* Re: Regarding handling of file renames in Btrfs
  2017-09-16 12:40   ` Martin Raiber
@ 2017-09-16 12:45     ` Hans van Kranenburg
  0 siblings, 0 replies; 11+ messages in thread
From: Hans van Kranenburg @ 2017-09-16 12:45 UTC (permalink / raw)
  To: Martin Raiber, Rohan Kadekodi, linux-btrfs
  Cc: Vijaychidambaram Velayudhan Pillai, Jayashree Mohan

On 09/16/2017 02:40 PM, Martin Raiber wrote:
> Hi,
> 
> On 16.09.2017 14:27 Hans van Kranenburg wrote:
>> On 09/10/2017 01:50 AM, Rohan Kadekodi wrote:
>>> I was trying to understand how file renames are handled in Btrfs. I
>>> read the code documentation, but had a problem understanding a few
>>> things.
>>>
>>> During a file rename, btrfs_commit_transaction() is called which is
>>> because Btrfs has to commit the whole FS before storing the
>>> information related to the new renamed file.
>> Can you point to which lines of code you're looking at?
>>
>>> It has to commit the FS
>>> because a rename first does an unlink, which is not recorded in the
>>> btrfs_rename() transaction and so is not logged in the log tree. Is my
>>> understanding correct? [...]
>> Can you also point to where exactly you see this happening? I'd also
>> like to understand more about this.
>>
>> The whole mail thread following this message continues about what a
>> transaction commit is and does etc, but the above question is never
>> answered I think.
>>
>> And I think it's an interesting question. Is a rename a "heavier"
>> operation relative to other file operations?
>>
> as far as I can see it only uses the log tree in some cases where the
> log tree was already used for the file or the parent directory. The
> cases are documented here
> https://github.com/torvalds/linux/blob/master/fs/btrfs/tree-log.c#L45 .
> So rename isn't much heavier than unlink+create.

Ah. I also see that the difficult situations are about moving a file to
another directory.

So, if I just rename a file in the same directory, that's even simpler.

-- 
Hans van Kranenburg


end of thread, other threads:[~2017-09-16 12:45 UTC | newest]

Thread overview: 11+ messages
2017-09-09 23:50 Regarding handling of file renames in Btrfs Rohan Kadekodi
2017-09-10  1:32 ` Duncan
2017-09-10  6:41 ` Qu Wenruo
2017-09-10  6:45   ` Qu Wenruo
2017-09-10 14:32     ` Rohan Kadekodi
2017-09-11  1:48       ` Qu Wenruo
2017-09-10 14:34     ` Martin Raiber
2017-09-11  5:22       ` Qu Wenruo
2017-09-16 12:27 ` Hans van Kranenburg
2017-09-16 12:40   ` Martin Raiber
2017-09-16 12:45     ` Hans van Kranenburg
