On 2019/1/8 2:59 AM, David Sterba wrote:
> On Thu, Dec 20, 2018 at 04:01:37PM +0800, Qu Wenruo wrote:
>> Btrfs needs to read out all block group (bg) items to fill its bg
>> caches.
>>
>> However such bg caches are only needed for read-write mount, and makes
>> no sense for RO mount.
>>
>> So this patch introduces a new mount option, skip_bg, to skip the block
>> group items scan.
>>
>> This new 'skip_bg' mount option can only be used with TRUE read-only
>> mount, which needs the following dependency:
>> - RO mount
>>   Obviously.
>>
>> - No log tree or notreelog mount option
>>
>> - No way to remount RW
>>   Similar to notreelog mount option.
>>
>> - No chunk <-> bg <-> dev extents restrict check
>>
>> This option should only be used as kernel equivalent of btrfs-restore.
>>
>> With this patch, we can even mount a btrfs whose extent root is
>> completely corrupted.
> 
> So it's a last-resort rescue option, I'd suggest to make that more
> explicit. Something like rescue=skip-bg. We can add all sorts of other
> values that would relax some checks. Adding a separate mount option
> would be quite impractical.

Nice suggestion, I'm also not satisfied with the current mount option name.
I'll add a new rescue mount option, and convert some existing options to it.

> 
> This would also align with the constraints you mention above, eg. no way
> to remount RW. This is fine for the corrupted extent root. I wonder what
> kind of metadata damage support would still make sense.

E.g. one corrupted leaf that happens to contain a block group item.
Since we read all block group items at mount time, such corruption
makes the mount fail immediately, no matter which mount options we're
using.

> a 'completely
> corrupted extent root' means you never know what you get from the
> filesystem.

Not exactly.
A corrupted extent root node alone can make the mount fail, while the
fs tree could be completely fine.

Normally we would fall back to a backup root and hope to get a good old
extent root.
But with this option, we should be able to access the fs tree without
problems.

> 
> The in-kernel checks and interconnection of the structures would have to
> be ready for missing metadata or more sanity checks would need to be
> added.

In fact, as mentioned, the extent tree only affects write operations.

For fs tree read operations, the current code is more or less good enough
to handle corruption, at least much more robust than it is against extent
tree corruption.

> 
> I think that all the restore and rescue functionality is better suited
> for userspace where the unpredictable corruptions that cannot be parsed
> do not lead to kernel crashes or silent memory overwrites.

That's true.
However, btrfs-restore still can't provide everything, such as the
snapshot/subvolume structure, so such a rescue option may still make sense.

> 
>> But can also be an option to test if btrfs_read_block_groups() is the
>> major cause for slow btrfs mount.
> 
> We have a debugging/testing-only mount option 'fragment', so we may
> consider adding more.

For this part, there is in fact a better way to verify the cause, without
any modification to the kernel.

We could just use ftrace to get the execution time of the non-inline
functions, like:
# perf ftrace -t function_graph -T open_ctree \
	-T btrfs_read_block_groups \
	-T check_chunk_block_group_mappings \
	-T btrfs_read_chunk_tree \
	-T btrfs_verify_dev_extents \
	mount /dev/test/test /mnt
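
To compare the numbers, the trace output could be summed per function
with something like the sketch below. This is only a rough helper, not
part of the patch: summarize_graph is a made-up name, and it assumes the
usual function_graph layout (a "|" column separator, durations given as
"<number> us", non-leaf calls closed by a bare "}").

```shell
# Hypothetical helper: sum function_graph durations (in us) per function.
# Assumes the common function_graph line layout; adjust if yours differs.
summarize_graph() {
    awk -F'|' '
    NF < 2 || /^#/ { next }            # skip headers and non-trace lines
    {
        call = $2
        sub(/^[ \t]+/, "", call)

        # Left column: "<cpu>)" and, when present, "<duration> us".
        cpu = ""; dur = ""
        n = split($1, f, /[ \t]+/)
        for (i = 1; i <= n; i++) {
            if (cpu == "" && f[i] ~ /^[0-9]+[)]$/) cpu = f[i]
            if (f[i] == "us" && i > 1) dur = f[i - 1]
        }

        if (call ~ /[{][ \t]*$/) {
            # Entry of a non-leaf call: remember the name per CPU.
            name = call; sub(/[ (].*/, "", name)
            depth[cpu]++
            stack[cpu, depth[cpu]] = name
        } else if (call ~ /^[}]/) {
            # Exit line: the duration belongs to the innermost open call.
            if (depth[cpu] > 0) {
                if (dur != "") total[stack[cpu, depth[cpu]]] += dur
                depth[cpu]--
            }
        } else if (call ~ /[(][)];/ && dur != "") {
            # Leaf call with its duration on the same line.
            name = call; sub(/[ (].*/, "", name)
            total[name] += dur
        }
    }
    END { for (name in total) printf "%s %.3f\n", name, total[name] }
    ' | LC_ALL=C sort
}
```

It would be fed from the command above, e.g. by appending
"| summarize_graph" to it. Note that the time reported for a parent such
as open_ctree includes the time of the traced functions it calls.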

Thanks,
Qu