* So, does btrfs check lowmem take days? weeks?
@ 2018-06-29 4:27 Marc MERLIN
2018-06-29 5:07 ` Qu Wenruo
0 siblings, 1 reply; 65+ messages in thread
From: Marc MERLIN @ 2018-06-29 4:27 UTC (permalink / raw)
To: linux-btrfs
Regular btrfs check --repair has a nice progress option. It wasn't
perfect, but it showed something.
But then it also takes all your memory faster than the Linux kernel can
defend itself, and reliably kills my 32GB server before it can OOM
anything.
lowmem repair seems to be going still, but it's been days and -p seems
to do absolutely nothing.
My filesystem is "only" 10TB or so, albeit with a lot of files.
2 things that come to mind
1) can lowmem have some progress working so that I know if I'm looking
at days, weeks, or even months before it will be done?
2) non lowmem is more efficient obviously when it doesn't completely
crash your machine, but could lowmem be given an amount of memory to use
for caching, or maybe use some heuristics based on RAM free so that it's
not so excruciatingly slow?
Thanks,
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 7F55D5F27AAF9D08
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: So, does btrfs check lowmem take days? weeks?
2018-06-29 4:27 So, does btrfs check lowmem take days? weeks? Marc MERLIN
@ 2018-06-29 5:07 ` Qu Wenruo
2018-06-29 5:28 ` Marc MERLIN
2018-06-29 5:35 ` Su Yue
0 siblings, 2 replies; 65+ messages in thread
From: Qu Wenruo @ 2018-06-29 5:07 UTC (permalink / raw)
To: Marc MERLIN, linux-btrfs
[-- Attachment #1.1: Type: text/plain, Size: 1790 bytes --]
On 2018-06-29 12:27, Marc MERLIN wrote:
> Regular btrfs check --repair has a nice progress option. It wasn't
> perfect, but it showed something.
>
> But then it also takes all your memory faster than the Linux kernel can
> defend itself, and reliably kills my 32GB server before it can OOM
> anything.
>
> lowmem repair seems to be going still, but it's been days and -p seems
> to do absolutely nothing.
I'm afraid you hit a bug in the lowmem repair code.
By all means, --repair shouldn't really be used unless you're pretty
sure the problem is something btrfs check can handle.
That's also why --repair is still marked as dangerous.
Especially when it's combined with experimental lowmem mode.
>
> My filesystem is "only" 10TB or so, albeit with a lot of files.
Unless you have tons of snapshots and reflinked (deduped) files, it
shouldn't take so long.
>
> 2 things that come to mind
> 1) can lowmem have some progress working so that I know if I'm looking
> at days, weeks, or even months before it will be done?
It's hard to estimate, especially when every cross check involves a lot
of disk IO.
But at least, we could add such indicator to show we're doing something.
>
> 2) non lowmem is more efficient obviously when it doesn't completely
> crash your machine, but could lowmem be given an amount of memory to use
> for caching, or maybe use some heuristics based on RAM free so that it's
> not so excruciatingly slow?
IIRC a recent commit has added that ability.
a5ce5d219822 ("btrfs-progs: extent-cache: actually cache extent buffers")
That's already included in btrfs-progs v4.13.2.
So it should be a dead loop which lowmem repair code can't handle.
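For reference, the idea behind that caching commit could be sketched like this (a toy illustration with made-up names, not the actual btrfs-progs code): keep recently read extent buffers in memory, keyed by bytenr, so repeated cross checks don't have to hit the disk again.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy sketch of the idea behind a5ce5d219822 ("btrfs-progs:
 * extent-cache: actually cache extent buffers"): keep recently read
 * extent buffers in memory, keyed by bytenr, so repeated cross checks
 * don't re-read them from disk.  Illustrative only; the names and
 * structure are not the real btrfs-progs implementation. */
#define CACHE_SLOTS 64

struct extent_buffer {
	unsigned long long bytenr;
	char data[4096];	/* tree block payload would live here */
};

static struct extent_buffer *slots[CACHE_SLOTS];

/* Return the cached buffer for bytenr, or NULL on a miss. */
static struct extent_buffer *cache_lookup(unsigned long long bytenr)
{
	struct extent_buffer *eb = slots[bytenr % CACHE_SLOTS];

	if (eb && eb->bytenr == bytenr)
		return eb;	/* hit: no disk IO needed */
	return NULL;
}

/* Insert a freshly read buffer, evicting whatever hashed to the slot. */
static struct extent_buffer *cache_insert(unsigned long long bytenr)
{
	unsigned long long idx = bytenr % CACHE_SLOTS;

	free(slots[idx]);
	slots[idx] = calloc(1, sizeof(struct extent_buffer));
	slots[idx]->bytenr = bytenr;
	return slots[idx];
}
```

A real cache would also track LRU order and dirty state; this only shows why repeated referencer checks stop costing a disk read each time.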
Thanks,
Qu
>
> Thanks,
> Marc
>
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
* Re: So, does btrfs check lowmem take days? weeks?
2018-06-29 5:07 ` Qu Wenruo
@ 2018-06-29 5:28 ` Marc MERLIN
2018-06-29 5:48 ` Qu Wenruo
2018-06-29 6:02 ` Su Yue
2018-06-29 5:35 ` Su Yue
1 sibling, 2 replies; 65+ messages in thread
From: Marc MERLIN @ 2018-06-29 5:28 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
On Fri, Jun 29, 2018 at 01:07:20PM +0800, Qu Wenruo wrote:
> > lowmem repair seems to be going still, but it's been days and -p seems
> > to do absolutely nothing.
>
> I'm afraid you hit a bug in the lowmem repair code.
> By all means, --repair shouldn't really be used unless you're pretty
> sure the problem is something btrfs check can handle.
>
> That's also why --repair is still marked as dangerous.
> Especially when it's combined with experimental lowmem mode.
Understood, but btrfs got corrupted (by itself or not, I don't know)
I cannot mount the filesystem read/write
I cannot btrfs check --repair it since that code will kill my machine
What do I have left?
> > My filesystem is "only" 10TB or so, albeit with a lot of files.
>
> Unless you have tons of snapshots and reflinked (deduped) files, it
> shouldn't take so long.
I may have a fair amount.
gargamel:~# btrfs check --mode=lowmem --repair -p /dev/mapper/dshelf2
enabling repair mode
WARNING: low-memory mode repair support is only partial
Checking filesystem on /dev/mapper/dshelf2
UUID: 0f1a0c9f-4e54-4fa7-8736-fd50818ff73d
Fixed 0 roots.
ERROR: extent[84302495744, 69632] referencer count mismatch (root: 21872, owner: 374857, offset: 3407872) wanted: 3, have: 4
Created new chunk [18457780224000 1073741824]
Delete backref in extent [84302495744 69632]
ERROR: extent[84302495744, 69632] referencer count mismatch (root: 22911, owner: 374857, offset: 3407872) wanted: 3, have: 4
Delete backref in extent [84302495744 69632]
ERROR: extent[125712527360, 12214272] referencer count mismatch (root: 21872, owner: 374857, offset: 114540544) wanted: 181, have: 240
Delete backref in extent [125712527360 12214272]
ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 21872, owner: 374857, offset: 126754816) wanted: 68, have: 115
Delete backref in extent [125730848768 5111808]
ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 22911, owner: 374857, offset: 126754816) wanted: 68, have: 115
Delete backref in extent [125730848768 5111808]
ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 21872, owner: 374857, offset: 131866624) wanted: 115, have: 143
Delete backref in extent [125736914944 6037504]
ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 22911, owner: 374857, offset: 131866624) wanted: 115, have: 143
Delete backref in extent [125736914944 6037504]
ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 21872, owner: 374857, offset: 148234240) wanted: 302, have: 431
Delete backref in extent [129952120832 20242432]
ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 22911, owner: 374857, offset: 148234240) wanted: 356, have: 433
Delete backref in extent [129952120832 20242432]
ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 21872, owner: 374857, offset: 180371456) wanted: 161, have: 240
Delete backref in extent [134925357056 11829248]
ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 22911, owner: 374857, offset: 180371456) wanted: 162, have: 240
Delete backref in extent [134925357056 11829248]
ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 21872, owner: 374857, offset: 192200704) wanted: 170, have: 249
Delete backref in extent [147895111680 12345344]
ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 22911, owner: 374857, offset: 192200704) wanted: 172, have: 251
Delete backref in extent [147895111680 12345344]
ERROR: extent[150850146304, 17522688] referencer count mismatch (root: 21872, owner: 374857, offset: 217653248) wanted: 348, have: 418
Delete backref in extent [150850146304 17522688]
ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 22911, owner: 374857, offset: 235175936) wanted: 555, have: 1449
Deleted root 2 item[156909494272, 178, 5476627808561673095]
ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 21872, owner: 374857, offset: 235175936) wanted: 556, have: 1452
Deleted root 2 item[156909494272, 178, 7338474132555182983]
ERROR: file extent[374857 235184128] root 21872 owner 21872 backref lost
Add one extent data backref [156909494272 55320576]
ERROR: file extent[374857 235184128] root 22911 owner 22911 backref lost
Add one extent data backref [156909494272 55320576]
The last two ERROR lines took over a day to get generated, so I'm not sure if it's still working, but just slowly.
For what it's worth non lowmem check used to take 12 to 24H on that filesystem back when it still worked.
> > 2 things that come to mind
> > 1) can lowmem have some progress working so that I know if I'm looking
> > at days, weeks, or even months before it will be done?
>
> It's hard to estimate, especially when every cross check involves a lot
> of disk IO.
> But at least, we could add such indicator to show we're doing something.
Yes, anything to show that I should still wait is still good :)
> > 2) non lowmem is more efficient obviously when it doesn't completely
> > crash your machine, but could lowmem be given an amount of memory to use
> > for caching, or maybe use some heuristics based on RAM free so that it's
> > not so excruciatingly slow?
>
> IIRC a recent commit has added that ability.
> a5ce5d219822 ("btrfs-progs: extent-cache: actually cache extent buffers")
Oh, good.
> That's already included in btrfs-progs v4.13.2.
> So it should be a dead loop which lowmem repair code can't handle.
I see. Is there any reasonably easy way to check on this running process?
Both top and iotop show that it's working, but of course I can't tell if
it's looping, or not.
Then again, maybe it already fixed enough that I can mount my filesystem again.
But back to the main point, it's sad that after so many years, the
repair situation is still so suboptimal, especially when it's apparently
pretty easy for btrfs to get damaged (through its own fault or not, hard
to say).
Thanks,
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 7F55D5F27AAF9D08
* Re: So, does btrfs check lowmem take days? weeks?
2018-06-29 5:07 ` Qu Wenruo
2018-06-29 5:28 ` Marc MERLIN
@ 2018-06-29 5:35 ` Su Yue
2018-06-29 5:46 ` Marc MERLIN
1 sibling, 1 reply; 65+ messages in thread
From: Su Yue @ 2018-06-29 5:35 UTC (permalink / raw)
To: Qu Wenruo, Marc MERLIN, linux-btrfs
On 06/29/2018 01:07 PM, Qu Wenruo wrote:
>
>
> On 2018-06-29 12:27, Marc MERLIN wrote:
>> Regular btrfs check --repair has a nice progress option. It wasn't
>> perfect, but it showed something.
>>
>> But then it also takes all your memory faster than the Linux kernel can
>> defend itself, and reliably kills my 32GB server before it can OOM
>> anything.
>>
>> lowmem repair seems to be going still, but it's been days and -p seems
>> to do absolutely nothing.
>
> I'm afraid you hit a bug in the lowmem repair code.
> By all means, --repair shouldn't really be used unless you're pretty
> sure the problem is something btrfs check can handle.
>
> That's also why --repair is still marked as dangerous.
> Especially when it's combined with experimental lowmem mode.
>
>>
>> My filesystem is "only" 10TB or so, albeit with a lot of files.
>
> Unless you have tons of snapshots and reflinked (deduped) files, it
> shouldn't take so long.
>
>>
>> 2 things that come to mind
>> 1) can lowmem have some progress working so that I know if I'm looking
>> at days, weeks, or even months before it will be done?
>
> It's hard to estimate, especially when every cross check involves a lot
> of disk IO.
>
> But at least, we could add such indicator to show we're doing something.
Maybe we can count all the roots in the root tree first, then report
i/num_roots before checking each tree. That way users can see whether
the check is doing something meaningful or is just dead looping.
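A minimal sketch of that indicator (progress_pct() and report_progress() are hypothetical helpers, not existing btrfs-progs functions) could be:

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical sketch of the suggested indicator: count the roots in
 * the root tree once up front, then print "i/num_roots" as each root
 * is checked.  These helpers are illustrative names only. */
static double progress_pct(unsigned done, unsigned total)
{
	return 100.0 * done / total;
}

static void report_progress(unsigned done, unsigned total)
{
	/* \r keeps the indicator on one line instead of flooding the log */
	fprintf(stderr, "checking root %u/%u (%.1f%%)\r",
		done, total, progress_pct(done, total));
}
```

Printed on stderr, this would at least show the check is advancing through the roots, even if total runtime is still hard to estimate.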
Thanks,
Su
>>
>> 2) non lowmem is more efficient obviously when it doesn't completely
>> crash your machine, but could lowmem be given an amount of memory to use
>> for caching, or maybe use some heuristics based on RAM free so that it's
>> not so excruciatingly slow?
>
> IIRC a recent commit has added that ability.
> a5ce5d219822 ("btrfs-progs: extent-cache: actually cache extent buffers")
>
> That's already included in btrfs-progs v4.13.2.
> So it should be a dead loop which lowmem repair code can't handle.
>
> Thanks,
> Qu
>
>>
>> Thanks,
>> Marc
>>
>
* Re: So, does btrfs check lowmem take days? weeks?
2018-06-29 5:35 ` Su Yue
@ 2018-06-29 5:46 ` Marc MERLIN
0 siblings, 0 replies; 65+ messages in thread
From: Marc MERLIN @ 2018-06-29 5:46 UTC (permalink / raw)
To: Su Yue; +Cc: Qu Wenruo, linux-btrfs
On Fri, Jun 29, 2018 at 01:35:06PM +0800, Su Yue wrote:
> > It's hard to estimate, especially when every cross check involves a lot
> > of disk IO.
> >
> > But at least, we could add such indicator to show we're doing something.
> Maybe we can count all the roots in the root tree first, then report
> i/num_roots before checking each tree. That way users can see whether
> the check is doing something meaningful or is just dead looping.
Sounds reasonable.
Do you want to submit something to git master for btrfs-progs, so that
I can pull it and run my btrfs check again?
In the meantime, how sane does the output I just posted look?
Thanks,
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 7F55D5F27AAF9D08
* Re: So, does btrfs check lowmem take days? weeks?
2018-06-29 5:28 ` Marc MERLIN
@ 2018-06-29 5:48 ` Qu Wenruo
2018-06-29 6:06 ` Marc MERLIN
2018-06-29 6:02 ` Su Yue
1 sibling, 1 reply; 65+ messages in thread
From: Qu Wenruo @ 2018-06-29 5:48 UTC (permalink / raw)
To: Marc MERLIN; +Cc: linux-btrfs
[-- Attachment #1.1: Type: text/plain, Size: 7460 bytes --]
On 2018-06-29 13:28, Marc MERLIN wrote:
> On Fri, Jun 29, 2018 at 01:07:20PM +0800, Qu Wenruo wrote:
>>> lowmem repair seems to be going still, but it's been days and -p seems
>>> to do absolutely nothing.
>>
>> I'm afraid you hit a bug in the lowmem repair code.
>> By all means, --repair shouldn't really be used unless you're pretty
>> sure the problem is something btrfs check can handle.
>>
>> That's also why --repair is still marked as dangerous.
>> Especially when it's combined with experimental lowmem mode.
>
> Understood, but btrfs got corrupted (by itself or not, I don't know)
> I cannot mount the filesystem read/write
> I cannot btrfs check --repair it since that code will kill my machine
> What do I have left?
Just normal btrfs check, and post the output.
If normal check eats up all your memory, btrfs check --mode=lowmem.
--repair should be considered the last resort.
>
>>> My filesystem is "only" 10TB or so, albeit with a lot of files.
>>
>> Unless you have tons of snapshots and reflinked (deduped) files, it
>> shouldn't take so long.
>
> I may have a fair amount.
> gargamel:~# btrfs check --mode=lowmem --repair -p /dev/mapper/dshelf2
> enabling repair mode
> WARNING: low-memory mode repair support is only partial
> Checking filesystem on /dev/mapper/dshelf2
> UUID: 0f1a0c9f-4e54-4fa7-8736-fd50818ff73d
> Fixed 0 roots.
> ERROR: extent[84302495744, 69632] referencer count mismatch (root: 21872, owner: 374857, offset: 3407872) wanted: 3, have: 4
> Created new chunk [18457780224000 1073741824]
> Delete backref in extent [84302495744 69632]
> ERROR: extent[84302495744, 69632] referencer count mismatch (root: 22911, owner: 374857, offset: 3407872) wanted: 3, have: 4
> Delete backref in extent [84302495744 69632]
> ERROR: extent[125712527360, 12214272] referencer count mismatch (root: 21872, owner: 374857, offset: 114540544) wanted: 181, have: 240
> Delete backref in extent [125712527360 12214272]
> ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 21872, owner: 374857, offset: 126754816) wanted: 68, have: 115
> Delete backref in extent [125730848768 5111808]
> ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 22911, owner: 374857, offset: 126754816) wanted: 68, have: 115
> Delete backref in extent [125730848768 5111808]
> ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 21872, owner: 374857, offset: 131866624) wanted: 115, have: 143
> Delete backref in extent [125736914944 6037504]
> ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 22911, owner: 374857, offset: 131866624) wanted: 115, have: 143
> Delete backref in extent [125736914944 6037504]
> ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 21872, owner: 374857, offset: 148234240) wanted: 302, have: 431
> Delete backref in extent [129952120832 20242432]
> ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 22911, owner: 374857, offset: 148234240) wanted: 356, have: 433
> Delete backref in extent [129952120832 20242432]
> ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 21872, owner: 374857, offset: 180371456) wanted: 161, have: 240
> Delete backref in extent [134925357056 11829248]
> ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 22911, owner: 374857, offset: 180371456) wanted: 162, have: 240
> Delete backref in extent [134925357056 11829248]
> ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 21872, owner: 374857, offset: 192200704) wanted: 170, have: 249
> Delete backref in extent [147895111680 12345344]
> ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 22911, owner: 374857, offset: 192200704) wanted: 172, have: 251
> Delete backref in extent [147895111680 12345344]
> ERROR: extent[150850146304, 17522688] referencer count mismatch (root: 21872, owner: 374857, offset: 217653248) wanted: 348, have: 418
> Delete backref in extent [150850146304 17522688]
> ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 22911, owner: 374857, offset: 235175936) wanted: 555, have: 1449
> Deleted root 2 item[156909494272, 178, 5476627808561673095]
> ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 21872, owner: 374857, offset: 235175936) wanted: 556, have: 1452
> Deleted root 2 item[156909494272, 178, 7338474132555182983]
> ERROR: file extent[374857 235184128] root 21872 owner 21872 backref lost
> Add one extent data backref [156909494272 55320576]
> ERROR: file extent[374857 235184128] root 22911 owner 22911 backref lost
> Add one extent data backref [156909494272 55320576]
>
> The last two ERROR lines took over a day to get generated, so I'm not sure if it's still working, but just slowly.
OK, that explains something.
One extent is referred to hundreds of times; no wonder it takes a long time.
Just one tip here: there are really too many snapshots/reflinked files.
It's highly recommended to keep the number of snapshots to a reasonable
number (low double digits).
Although btrfs snapshotting is super fast, it puts a lot of pressure on
the extent tree, so there is no free lunch here.
> For what it's worth non lowmem check used to take 12 to 24H on that filesystem back when it still worked.
>
>>> 2 things that come to mind
>>> 1) can lowmem have some progress working so that I know if I'm looking
>>> at days, weeks, or even months before it will be done?
>>
>> It's hard to estimate, especially when every cross check involves a lot
>> of disk IO.
>> But at least, we could add such indicator to show we're doing something.
>
> Yes, anything to show that I should still wait is still good :)
>
>>> 2) non lowmem is more efficient obviously when it doesn't completely
>>> crash your machine, but could lowmem be given an amount of memory to use
>>> for caching, or maybe use some heuristics based on RAM free so that it's
>>> not so excruciatingly slow?
>>
>> IIRC a recent commit has added that ability.
>> a5ce5d219822 ("btrfs-progs: extent-cache: actually cache extent buffers")
>
> Oh, good.
>
>> That's already included in btrfs-progs v4.13.2.
>> So it should be a dead loop which lowmem repair code can't handle.
>
> I see. Is there any reasonably easy way to check on this running process?
GDB attach would be good.
Interrupt and check the inode number if it's checking fs tree.
Check the extent bytenr number if it's checking extent tree.
But considering how many snapshots there are, it's really hard to determine.
In this case, the super large extent tree is causing a lot of problems;
maybe it's a good idea to allow btrfs check to skip the extent tree check?
>
> Both top and iotop show that it's working, but of course I can't tell if
> it's looping, or not.
>
> Then again, maybe it already fixed enough that I can mount my filesystem again.
This needs the initial btrfs check report and the kernel messages
showing how it fails to mount.
>
> But back to the main point, it's sad that after so many years, the
> repair situation is still so suboptimal, especially when it's apparently
> pretty easy for btrfs to get damaged (through its own fault or not, hard
> to say).
Unfortunately, yes.
Especially the extent tree is pretty fragile and hard to repair.
Thanks,
Qu
>
> Thanks,
> Marc
>
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
* Re: So, does btrfs check lowmem take days? weeks?
2018-06-29 5:28 ` Marc MERLIN
2018-06-29 5:48 ` Qu Wenruo
@ 2018-06-29 6:02 ` Su Yue
2018-06-29 6:10 ` Marc MERLIN
1 sibling, 1 reply; 65+ messages in thread
From: Su Yue @ 2018-06-29 6:02 UTC (permalink / raw)
To: Marc MERLIN, Qu Wenruo; +Cc: linux-btrfs
On 06/29/2018 01:28 PM, Marc MERLIN wrote:
> On Fri, Jun 29, 2018 at 01:07:20PM +0800, Qu Wenruo wrote:
>>> lowmem repair seems to be going still, but it's been days and -p seems
>>> to do absolutely nothing.
>>
>> I'm afraid you hit a bug in the lowmem repair code.
>> By all means, --repair shouldn't really be used unless you're pretty
>> sure the problem is something btrfs check can handle.
>>
>> That's also why --repair is still marked as dangerous.
>> Especially when it's combined with experimental lowmem mode.
>
> Understood, but btrfs got corrupted (by itself or not, I don't know)
> I cannot mount the filesystem read/write
> I cannot btrfs check --repair it since that code will kill my machine
> What do I have left?
>
>>> My filesystem is "only" 10TB or so, albeit with a lot of files.
>>
>> Unless you have tons of snapshots and reflinked (deduped) files, it
>> shouldn't take so long.
>
> I may have a fair amount.
> gargamel:~# btrfs check --mode=lowmem --repair -p /dev/mapper/dshelf2
> enabling repair mode
> WARNING: low-memory mode repair support is only partial
> Checking filesystem on /dev/mapper/dshelf2
> UUID: 0f1a0c9f-4e54-4fa7-8736-fd50818ff73d
> Fixed 0 roots.
> ERROR: extent[84302495744, 69632] referencer count mismatch (root: 21872, owner: 374857, offset: 3407872) wanted: 3, have: 4
> Created new chunk [18457780224000 1073741824]
> Delete backref in extent [84302495744 69632]
> ERROR: extent[84302495744, 69632] referencer count mismatch (root: 22911, owner: 374857, offset: 3407872) wanted: 3, have: 4
> Delete backref in extent [84302495744 69632]
> ERROR: extent[125712527360, 12214272] referencer count mismatch (root: 21872, owner: 374857, offset: 114540544) wanted: 181, have: 240
> Delete backref in extent [125712527360 12214272]
> ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 21872, owner: 374857, offset: 126754816) wanted: 68, have: 115
> Delete backref in extent [125730848768 5111808]
> ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 22911, owner: 374857, offset: 126754816) wanted: 68, have: 115
> Delete backref in extent [125730848768 5111808]
> ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 21872, owner: 374857, offset: 131866624) wanted: 115, have: 143
> Delete backref in extent [125736914944 6037504]
> ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 22911, owner: 374857, offset: 131866624) wanted: 115, have: 143
> Delete backref in extent [125736914944 6037504]
> ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 21872, owner: 374857, offset: 148234240) wanted: 302, have: 431
> Delete backref in extent [129952120832 20242432]
> ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 22911, owner: 374857, offset: 148234240) wanted: 356, have: 433
> Delete backref in extent [129952120832 20242432]
> ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 21872, owner: 374857, offset: 180371456) wanted: 161, have: 240
> Delete backref in extent [134925357056 11829248]
> ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 22911, owner: 374857, offset: 180371456) wanted: 162, have: 240
> Delete backref in extent [134925357056 11829248]
> ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 21872, owner: 374857, offset: 192200704) wanted: 170, have: 249
> Delete backref in extent [147895111680 12345344]
> ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 22911, owner: 374857, offset: 192200704) wanted: 172, have: 251
> Delete backref in extent [147895111680 12345344]
> ERROR: extent[150850146304, 17522688] referencer count mismatch (root: 21872, owner: 374857, offset: 217653248) wanted: 348, have: 418
> Delete backref in extent [150850146304 17522688]
> ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 22911, owner: 374857, offset: 235175936) wanted: 555, have: 1449
> Deleted root 2 item[156909494272, 178, 5476627808561673095]
> ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 21872, owner: 374857, offset: 235175936) wanted: 556, have: 1452
> Deleted root 2 item[156909494272, 178, 7338474132555182983]
> ERROR: file extent[374857 235184128] root 21872 owner 21872 backref lost
> Add one extent data backref [156909494272 55320576]
> ERROR: file extent[374857 235184128] root 22911 owner 22911 backref lost
> Add one extent data backref [156909494272 55320576]
>
My bad.
This is almost certainly a bug in the extent checking part of lowmem
mode, which was reported by Chris too. The extent check was wrong, so
the repair did the wrong things.
I have figured out that the bug is that lowmem check can't deal with
shared tree blocks in a reloc tree. The fix is simple; you can try the
following repo:
https://github.com/Damenly/btrfs-progs/tree/tmp1
Please run lowmem check without "--repair" first, to be sure whether
the rest of your filesystem is fine.
Though the bug and its symptoms are clear enough, before sending my
patch I have to make a test image. I have spent a week studying btrfs
balance, but it seems a little hard for me.
Thanks,
Su
> The last two ERROR lines took over a day to get generated, so I'm not sure if it's still working, but just slowly.
> For what it's worth non lowmem check used to take 12 to 24H on that filesystem back when it still worked.
>
>>> 2 things that come to mind
>>> 1) can lowmem have some progress working so that I know if I'm looking
>>> at days, weeks, or even months before it will be done?
>>
>> It's hard to estimate, especially when every cross check involves a lot
>> of disk IO.
>> But at least, we could add such indicator to show we're doing something.
>
> Yes, anything to show that I should still wait is still good :)
>
>>> 2) non lowmem is more efficient obviously when it doesn't completely
>>> crash your machine, but could lowmem be given an amount of memory to use
>>> for caching, or maybe use some heuristics based on RAM free so that it's
>> not so excruciatingly slow?
>>
>> IIRC a recent commit has added that ability.
>> a5ce5d219822 ("btrfs-progs: extent-cache: actually cache extent buffers")
>
> Oh, good.
>
>> That's already included in btrfs-progs v4.13.2.
>> So it should be a dead loop which lowmem repair code can't handle.
>
> I see. Is there any reasonably easy way to check on this running process?
>
> Both top and iotop show that it's working, but of course I can't tell if
> it's looping, or not.
>
> Then again, maybe it already fixed enough that I can mount my filesystem again.
>
> But back to the main point, it's sad that after so many years, the
> repair situation is still so suboptimal, especially when it's apparently
> pretty easy for btrfs to get damaged (through its own fault or not, hard
> to say).
>
> Thanks,
> Marc
>
* Re: So, does btrfs check lowmem take days? weeks?
2018-06-29 5:48 ` Qu Wenruo
@ 2018-06-29 6:06 ` Marc MERLIN
2018-06-29 6:29 ` Qu Wenruo
0 siblings, 1 reply; 65+ messages in thread
From: Marc MERLIN @ 2018-06-29 6:06 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
On Fri, Jun 29, 2018 at 01:48:17PM +0800, Qu Wenruo wrote:
> Just normal btrfs check, and post the output.
> If normal check eats up all your memory, btrfs check --mode=lowmem.
Does check without --repair eat less RAM?
> --repair should be considered the last resort.
If --repair doesn't work, check is useless to me sadly. I know that for
FS analysis and bug reporting, you want to have the FS without changing
it to something maybe worse, but for my use, if it can't be mounted and
can't be fixed, then it gets deleted which is even worse than check
doing the wrong thing.
> > The last two ERROR lines took over a day to get generated, so I'm not sure if it's still working, but just slowly.
>
> OK, that explains something.
>
> One extent is referred to hundreds of times; no wonder it takes a long time.
>
> Just one tip here: there are really too many snapshots/reflinked files.
> It's highly recommended to keep the number of snapshots to a reasonable
> number (low double digits).
> Although btrfs snapshotting is super fast, it puts a lot of pressure on
> the extent tree, so there is no free lunch here.
Agreed, though I doubt I have much more than 100 snapshots (I can't
check right now).
Sadly I'm not allowed to mount even read only while check is running:
gargamel:~# mount -o ro /dev/mapper/dshelf2 /mnt/mnt2
mount: /dev/mapper/dshelf2 already mounted or /mnt/mnt2 busy
> > I see. Is there any reasonably easy way to check on this running process?
>
> GDB attach would be good.
> Interrupt and check the inode number if it's checking fs tree.
> Check the extent bytenr number if it's checking extent tree.
>
> But considering how many snapshots there are, it's really hard to determine.
>
> In this case, the super large extent tree is causing a lot of problems;
> maybe it's a good idea to allow btrfs check to skip the extent tree check?
I only see --init-extent-tree in the man page, which option did you have
in mind?
> > Then again, maybe it already fixed enough that I can mount my filesystem again.
>
> This needs the initial btrfs check report and the kernel messages
> showing how it fails to mount.
The mount command hangs; the kernel does not show anything special beyond disk access hanging.
Jun 23 17:23:26 gargamel kernel: [ 341.802696] BTRFS warning (device dm-2): 'recovery' is deprecated, use 'usebackuproot' instead
Jun 23 17:23:26 gargamel kernel: [ 341.828743] BTRFS info (device dm-2): trying to use backup root at mount time
Jun 23 17:23:26 gargamel kernel: [ 341.850180] BTRFS info (device dm-2): disk space caching is enabled
Jun 23 17:23:26 gargamel kernel: [ 341.869014] BTRFS info (device dm-2): has skinny extents
Jun 23 17:23:26 gargamel kernel: [ 342.206289] BTRFS info (device dm-2): bdev /dev/mapper/dshelf2 errs: wr 0, rd 0 , flush 0, corrupt 2, gen 0
Jun 23 17:26:26 gargamel kernel: [ 521.571392] BTRFS info (device dm-2): enabling ssd optimizations
Jun 23 17:55:58 gargamel kernel: [ 2293.914867] perf: interrupt took too long (2507 > 2500), lowering kernel.perf_event_max_sample_rate to 79750
Jun 23 17:56:22 gargamel kernel: [ 2317.718406] BTRFS info (device dm-2): disk space caching is enabled
Jun 23 17:56:22 gargamel kernel: [ 2317.737277] BTRFS info (device dm-2): has skinny extents
Jun 23 17:56:22 gargamel kernel: [ 2318.069461] BTRFS info (device dm-2): bdev /dev/mapper/dshelf2 errs: wr 0, rd 0 , flush 0, corrupt 2, gen 0
Jun 23 17:59:22 gargamel kernel: [ 2498.256167] BTRFS info (device dm-2): enabling ssd optimizations
Jun 23 18:05:23 gargamel kernel: [ 2859.107057] BTRFS info (device dm-2): disk space caching is enabled
Jun 23 18:05:23 gargamel kernel: [ 2859.125883] BTRFS info (device dm-2): has skinny extents
Jun 23 18:05:24 gargamel kernel: [ 2859.448018] BTRFS info (device dm-2): bdev /dev/mapper/dshelf2 errs: wr 0, rd 0 , flush 0, corrupt 2, gen 0
Jun 23 18:08:23 gargamel kernel: [ 3039.023305] BTRFS info (device dm-2): enabling ssd optimizations
Jun 23 18:13:41 gargamel kernel: [ 3356.626037] perf: interrupt took too long (3143 > 3133), lowering kernel.perf_event_max_sample_rate to 63500
Jun 23 18:17:23 gargamel kernel: [ 3578.937225] Process accounting resumed
Jun 23 18:33:47 gargamel kernel: [ 4563.356252] JFS: nTxBlock = 8192, nTxLock = 65536
Jun 23 18:33:48 gargamel kernel: [ 4563.446715] ntfs: driver 2.1.32 [Flags: R/W MODULE].
Jun 23 18:42:20 gargamel kernel: [ 5075.995254] INFO: task sync:20253 blocked for more than 120 seconds.
Jun 23 18:42:20 gargamel kernel: [ 5076.015729] Not tainted 4.17.2-amd64-preempt-sysrq-20180817 #1
Jun 23 18:42:20 gargamel kernel: [ 5076.036141] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 23 18:42:20 gargamel kernel: [ 5076.060637] sync D 0 20253 15327 0x20020080
Jun 23 18:42:20 gargamel kernel: [ 5076.078032] Call Trace:
Jun 23 18:42:20 gargamel kernel: [ 5076.086366] ? __schedule+0x53e/0x59b
Jun 23 18:42:20 gargamel kernel: [ 5076.098311] schedule+0x7f/0x98
Jun 23 18:42:20 gargamel kernel: [ 5076.108665] __rwsem_down_read_failed_common+0x127/0x1a8
Jun 23 18:42:20 gargamel kernel: [ 5076.125565] ? sync_fs_one_sb+0x20/0x20
Jun 23 18:42:20 gargamel kernel: [ 5076.137982] ? call_rwsem_down_read_failed+0x14/0x30
Jun 23 18:42:20 gargamel kernel: [ 5076.154081] call_rwsem_down_read_failed+0x14/0x30
Jun 23 18:42:20 gargamel kernel: [ 5076.169429] down_read+0x13/0x25
Jun 23 18:42:20 gargamel kernel: [ 5076.180444] iterate_supers+0x57/0xbe
Jun 23 18:42:20 gargamel kernel: [ 5076.192619] ksys_sync+0x40/0xa4
Jun 23 18:42:20 gargamel kernel: [ 5076.203192] __ia32_sys_sync+0xa/0xd
Jun 23 18:42:20 gargamel kernel: [ 5076.214774] do_fast_syscall_32+0xaf/0xf3
Jun 23 18:42:20 gargamel kernel: [ 5076.227740] entry_SYSENTER_compat+0x7f/0x91
Jun 23 18:44:21 gargamel kernel: [ 5196.828764] INFO: task sync:20253 blocked for more than 120 seconds.
Jun 23 18:44:21 gargamel kernel: [ 5196.848724] Not tainted 4.17.2-amd64-preempt-sysrq-20180817 #1
Jun 23 18:44:21 gargamel kernel: [ 5196.868789] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 23 18:44:21 gargamel kernel: [ 5196.893615] sync D 0 20253 15327 0x20020080
> > But back to the main point, it's sad that after so many years, the
> > repair situation is still so suboptimal, especially when it's apparently
> > pretty easy for btrfs to get damaged (through its own fault or not, hard
> > to say).
>
> Unfortunately, yes.
> Especially the extent tree is pretty fragile and hard to repair.
So, I don't know the code, but if I may make a suggestion (which maybe
is totally wrong, if so forgive me):
I would love a repair mode that gives me back a fixed
filesystem. I don't really care how much data is lost (although ideally
it would give me a list of files lost), but I want a working filesystem
at the end. I can then decide if there is enough data left on it to
restore what's missing or if I'm better off starting from scratch.
Is that possible at all?
Thanks,
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 7F55D5F27AAF9D08
* Re: So, does btrfs check lowmem take days? weeks?
2018-06-29 6:02 ` Su Yue
@ 2018-06-29 6:10 ` Marc MERLIN
2018-06-29 6:32 ` Su Yue
0 siblings, 1 reply; 65+ messages in thread
From: Marc MERLIN @ 2018-06-29 6:10 UTC (permalink / raw)
To: Su Yue; +Cc: Qu Wenruo, linux-btrfs
On Fri, Jun 29, 2018 at 02:02:19PM +0800, Su Yue wrote:
> I have figured out the bug is lowmem check can't deal with shared tree block
> in reloc tree. The fix is simple, you can try the follow repo:
>
> https://github.com/Damenly/btrfs-progs/tree/tmp1
Not sure I understand what you meant here.
> Please run lowmem check "without =--repair" first to be sure whether
> your filesystem is fine.
The filesystem is not fine, it caused btrfs balance to hang, whether
balance actually broke it further or caused the breakage, I can't say.
Then mount hangs, even with recovery, unless I use ro.
This filesystem is trash to me and will require over a week to rebuild
manually if I can't repair it.
Running check without repair for likely several days just to know that
my filesystem is not clean (I already know this) isn't useful :)
Or am I missing something?
> Though the bug and phenomenon are clear enough, before sending my patch,
> I have to make a test image. I have spent a week to study btrfs balance
but it seems a little hard for me.
thanks for having a look, either way.
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 7F55D5F27AAF9D08
* Re: So, does btrfs check lowmem take days? weeks?
2018-06-29 6:06 ` Marc MERLIN
@ 2018-06-29 6:29 ` Qu Wenruo
2018-06-29 6:59 ` Marc MERLIN
0 siblings, 1 reply; 65+ messages in thread
From: Qu Wenruo @ 2018-06-29 6:29 UTC (permalink / raw)
To: Marc MERLIN; +Cc: linux-btrfs
On 2018-06-29 12:27, Marc MERLIN wrote:
> On Fri, Jun 29, 2018 at 01:48:17PM +0800, Qu Wenruo wrote:
>> Just normal btrfs check, and post the output.
>> If normal check eats up all your memory, btrfs check --mode=lowmem.
>
> Does check without --repair eat less RAM?
Unfortunately, no.
>
>> --repair should be considered as the last method.
>
> If --repair doesn't work, check is useless to me sadly.
Not exactly.
Although it's time consuming, I have manually patched several users' filesystems,
which normally ends pretty well.
If it's not a widespread problem but some small fatal one, it may be fixed.
> I know that for
> FS analysis and bug reporting, you want to have the FS without changing
> it to something maybe worse, but for my use, if it can't be mounted and
> can't be fixed, then it gets deleted which is even worse than check
> doing the wrong thing.
>
>>> The last two ERROR lines took over a day to get generated, so I'm not sure if it's still working, but just slowly.
>>
>> OK, that explains something.
>>
>> One extent is referred hundreds times, no wonder it will take a long time.
>>
>> Just one tip here, there are really too many snapshots/reflinked files.
>> It's highly recommended to keep the number of snapshots to a reasonable
>> number (lower two digits).
>> Although btrfs snapshot is super fast, it puts a lot of pressure on its
>> extent tree, so there is no free lunch here.
>
> Agreed, I doubt I have over or much over 100 snapshots though (but I
> can't check right now).
> Sadly I'm not allowed to mount even read only while check is running:
> gargamel:~# mount -o ro /dev/mapper/dshelf2 /mnt/mnt2
> mount: /dev/mapper/dshelf2 already mounted or /mnt/mnt2 busy
>
>>> I see. Is there any reasonably easy way to check on this running process?
>>
>> GDB attach would be good.
>> Interrupt and check the inode number if it's checking fs tree.
>> Check the extent bytenr number if it's checking extent tree.
>>
>> But considering how many snapshots there are, it's really hard to determine.
>>
>> In this case, the super large extent tree is causing a lot of problem,
>> maybe it's a good idea to allow btrfs check to skip extent tree check?
>
> I only see --init-extent-tree in the man page, which option did you have
> in mind?
That feature is just in my mind, not even implemented yet.
>
>>> Then again, maybe it already fixed enough that I can mount my filesystem again.
>>
>> This needs the initial btrfs check report and the kernel messages how it
>> fails to mount.
>
> mount command hangs, kernel does not show anything special outside of disk access hanging.
>
> Jun 23 17:23:26 gargamel kernel: [  341.802696] BTRFS warning (device dm-2): 'recovery' is deprecated, use 'usebackuproot' instead
> Jun 23 17:23:26 gargamel kernel: [ 341.828743] BTRFS info (device dm-2): trying to use backup root at mount time
> Jun 23 17:23:26 gargamel kernel: [ 341.850180] BTRFS info (device dm-2): disk space caching is enabled
> Jun 23 17:23:26 gargamel kernel: [ 341.869014] BTRFS info (device dm-2): has skinny extents
> Jun 23 17:23:26 gargamel kernel: [ 342.206289] BTRFS info (device dm-2): bdev /dev/mapper/dshelf2 errs: wr 0, rd 0 , flush 0, corrupt 2, gen 0
> Jun 23 17:26:26 gargamel kernel: [ 521.571392] BTRFS info (device dm-2): enabling ssd optimizations
> Jun 23 17:55:58 gargamel kernel: [ 2293.914867] perf: interrupt took too long (2507 > 2500), lowering kernel.perf_event_max_sample_rate to 79750
> Jun 23 17:56:22 gargamel kernel: [ 2317.718406] BTRFS info (device dm-2): disk space caching is enabled
> Jun 23 17:56:22 gargamel kernel: [ 2317.737277] BTRFS info (device dm-2): has skinny extents
> Jun 23 17:56:22 gargamel kernel: [ 2318.069461] BTRFS info (device dm-2): bdev /dev/mapper/dshelf2 errs: wr 0, rd 0 , flush 0, corrupt 2, gen 0
> Jun 23 17:59:22 gargamel kernel: [ 2498.256167] BTRFS info (device dm-2): enabling ssd optimizations
> Jun 23 18:05:23 gargamel kernel: [ 2859.107057] BTRFS info (device dm-2): disk space caching is enabled
> Jun 23 18:05:23 gargamel kernel: [ 2859.125883] BTRFS info (device dm-2): has skinny extents
> Jun 23 18:05:24 gargamel kernel: [ 2859.448018] BTRFS info (device dm-2): bdev /dev/mapper/dshelf2 errs: wr 0, rd 0 , flush 0, corrupt 2, gen 0
This looks like super block corruption?
What about "btrfs inspect dump-super -fFa /dev/mapper/dshelf2"?
And what about "skip_balance" mount option?
Another problem is, with so many snapshots, balance is also hugely
slowed, thus I'm not 100% sure if it's really a hang.
> Jun 23 18:08:23 gargamel kernel: [ 3039.023305] BTRFS info (device dm-2): enabling ssd optimizations
> Jun 23 18:13:41 gargamel kernel: [ 3356.626037] perf: interrupt took too long (3143 > 3133), lowering kernel.perf_event_max_sample_rate to 63500
> Jun 23 18:17:23 gargamel kernel: [ 3578.937225] Process accounting resumed
> Jun 23 18:33:47 gargamel kernel: [ 4563.356252] JFS: nTxBlock = 8192, nTxLock = 65536
> Jun 23 18:33:48 gargamel kernel: [ 4563.446715] ntfs: driver 2.1.32 [Flags: R/W MODULE].
> Jun 23 18:42:20 gargamel kernel: [ 5075.995254] INFO: task sync:20253 blocked for more than 120 seconds.
> Jun 23 18:42:20 gargamel kernel: [ 5076.015729] Not tainted 4.17.2-amd64-preempt-sysrq-20180817 #1
> Jun 23 18:42:20 gargamel kernel: [ 5076.036141] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Jun 23 18:42:20 gargamel kernel: [ 5076.060637] sync D 0 20253 15327 0x20020080
> Jun 23 18:42:20 gargamel kernel: [ 5076.078032] Call Trace:
> Jun 23 18:42:20 gargamel kernel: [ 5076.086366] ? __schedule+0x53e/0x59b
> Jun 23 18:42:20 gargamel kernel: [ 5076.098311] schedule+0x7f/0x98
> Jun 23 18:42:20 gargamel kernel: [ 5076.108665] __rwsem_down_read_failed_common+0x127/0x1a8
> Jun 23 18:42:20 gargamel kernel: [ 5076.125565] ? sync_fs_one_sb+0x20/0x20
> Jun 23 18:42:20 gargamel kernel: [ 5076.137982] ? call_rwsem_down_read_failed+0x14/0x30
> Jun 23 18:42:20 gargamel kernel: [ 5076.154081] call_rwsem_down_read_failed+0x14/0x30
> Jun 23 18:42:20 gargamel kernel: [ 5076.169429] down_read+0x13/0x25
> Jun 23 18:42:20 gargamel kernel: [ 5076.180444] iterate_supers+0x57/0xbe
> Jun 23 18:42:20 gargamel kernel: [ 5076.192619] ksys_sync+0x40/0xa4
> Jun 23 18:42:20 gargamel kernel: [ 5076.203192] __ia32_sys_sync+0xa/0xd
> Jun 23 18:42:20 gargamel kernel: [ 5076.214774] do_fast_syscall_32+0xaf/0xf3
> Jun 23 18:42:20 gargamel kernel: [ 5076.227740] entry_SYSENTER_compat+0x7f/0x91
> Jun 23 18:44:21 gargamel kernel: [ 5196.828764] INFO: task sync:20253 blocked for more than 120 seconds.
> Jun 23 18:44:21 gargamel kernel: [ 5196.848724] Not tainted 4.17.2-amd64-preempt-sysrq-20180817 #1
> Jun 23 18:44:21 gargamel kernel: [ 5196.868789] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Jun 23 18:44:21 gargamel kernel: [ 5196.893615] sync D 0 20253 15327 0x20020080
>
>>> But back to the main point, it's sad that after so many years, the
>>> repair situation is still so suboptimal, especially when it's apparently
>>> pretty easy for btrfs to get damaged (through its own fault or not, hard
>>> to say).
>>
>> Unfortunately, yes.
>> Especially the extent tree is pretty fragile and hard to repair.
>
> So, I don't know the code, but if I may make a suggestion (which maybe
> is totally wrong, if so forgive me):
> I would love a repair mode that gives me back a fixed
> filesystem. I don't really care how much data is lost (although ideally
> it would give me a list of files lost), but I want a working filesystem
> at the end. I can then decide if there is enough data left on it to
> restore what's missing or if I'm better off starting from scratch.
If for that usage, btrfs-restore would fit your use case more,
Unfortunately it needs extra disk space and isn't good at restoring
subvolume/snapshots.
(Although it's much faster than repairing the possibly corrupted extent
tree)
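A sketch of what that restore path looks like in practice; the device name is the one from this thread, the destination is a placeholder, and it must be a separate writable filesystem with enough free space:

```shell
#!/bin/sh
# btrfs restore copies file data out of an unmountable filesystem without
# writing to it: -i ignores errors and keeps going, -v lists each file.
# Snapshots come out flattened into plain directories, which is why this
# cannot preserve send/receive relationships.
dev=/dev/mapper/dshelf2     # source; never written to
dst=/mnt/recovery           # placeholder; a different, writable filesystem
cmd="btrfs restore -iv $dev $dst"
printf '%s\n' "$cmd"        # printed here; run it by hand once $dst exists
```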
>
> Is that possible at all?
At least for file recovery (fs tree repair), we have such behavior.
However, the problem you hit (and a lot of users hit) is all about
extent tree repair, which doesn't even get to file recovery.
All the hassle is in the extent tree, and for the extent tree, it's just good
or bad. Any corruption in the extent tree may lead to later bugs.
The only way to avoid extent tree problems is to mount the fs RO.
So, I'm afraid it won't be possible for at least the next few years.
Thanks,
Qu
>
> Thanks,
> Marc
>
* Re: So, does btrfs check lowmem take days? weeks?
2018-06-29 6:10 ` Marc MERLIN
@ 2018-06-29 6:32 ` Su Yue
2018-06-29 6:43 ` Marc MERLIN
0 siblings, 1 reply; 65+ messages in thread
From: Su Yue @ 2018-06-29 6:32 UTC (permalink / raw)
To: Marc MERLIN; +Cc: Qu Wenruo, linux-btrfs
On 06/29/2018 02:10 PM, Marc MERLIN wrote:
> On Fri, Jun 29, 2018 at 02:02:19PM +0800, Su Yue wrote:
>> I have figured out the bug is lowmem check can't deal with shared tree block
>> in reloc tree. The fix is simple, you can try the follow repo:
>>
>> https://github.com/Damenly/btrfs-progs/tree/tmp1
>
> Not sure I understand what you meant here.
>
Sorry for my unclear words.
Simply put, I suggest you stop the currently running check.
Then clone the branch above, compile the binary, and run
'btrfs check --mode=lowmem $dev'.
>> Please run lowmem check "without =--repair" first to be sure whether
>> your filesystem is fine.
>
> The filesystem is not fine, it caused btrfs balance to hang, whether
> balance actually broke it further or caused the breakage, I can't say.
>
> Then mount hangs, even with recovery, unless I use ro.
>
> This filesystem is trash to me and will require over a week to rebuild
> manually if I can't repair it.
Understood your anxiety; a log from check without '--repair' will help
us figure out what's wrong with your filesystem.
Thanks,
Su
> Running check without repair for likely several days just to know that
> my filesystem is not clean (I already know this) isn't useful :)
> Or am I missing something?
>
>> Though the bug and phenomenon are clear enough, before sending my patch,
>> I have to make a test image. I have spent a week to study btrfs balance
>> but it seems a little hard for me.
>
> thanks for having a look, either way.
>
> Marc
>
* Re: So, does btrfs check lowmem take days? weeks?
2018-06-29 6:32 ` Su Yue
@ 2018-06-29 6:43 ` Marc MERLIN
2018-07-01 23:22 ` Marc MERLIN
0 siblings, 1 reply; 65+ messages in thread
From: Marc MERLIN @ 2018-06-29 6:43 UTC (permalink / raw)
To: Su Yue; +Cc: Qu Wenruo, linux-btrfs
On Fri, Jun 29, 2018 at 02:32:44PM +0800, Su Yue wrote:
> > > https://github.com/Damenly/btrfs-progs/tree/tmp1
> >
> > Not sure I understand what you meant here.
> >
> Sorry for my unclear words.
> Simply put, I suggest you stop the currently running check.
> Then clone the branch above, compile the binary, and run
> 'btrfs check --mode=lowmem $dev'.
I understand, I'll build and try it.
> > This filesystem is trash to me and will require over a week to rebuild
> > manually if I can't repair it.
>
> Understood your anxiety; a log from check without '--repair' will help
> us figure out what's wrong with your filesystem.
Ok, I'll run your new code without repair and report back. It will
likely take over a day though.
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 7F55D5F27AAF9D08
* Re: So, does btrfs check lowmem take days? weeks?
2018-06-29 6:29 ` Qu Wenruo
@ 2018-06-29 6:59 ` Marc MERLIN
2018-06-29 7:09 ` Roman Mamedov
2018-06-29 7:20 ` So, does btrfs check lowmem take days? weeks? Qu Wenruo
0 siblings, 2 replies; 65+ messages in thread
From: Marc MERLIN @ 2018-06-29 6:59 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
On Fri, Jun 29, 2018 at 02:29:10PM +0800, Qu Wenruo wrote:
> > If --repair doesn't work, check is useless to me sadly.
>
> Not exactly.
> Although it's time consuming, I have manually patched several users' filesystems,
> which normally ends pretty well.
Ok I understand now.
> > Agreed, I doubt I have over or much over 100 snapshots though (but I
> > can't check right now).
> > Sadly I'm not allowed to mount even read only while check is running:
> > gargamel:~# mount -o ro /dev/mapper/dshelf2 /mnt/mnt2
> > mount: /dev/mapper/dshelf2 already mounted or /mnt/mnt2 busy
Ok, so I just checked now, 270 snapshots, but not because I'm crazy,
because I use btrfs send a lot :)
> This looks like super block corruption?
>
> What about "btrfs inspect dump-super -fFa /dev/mapper/dshelf2"?
Sure, there you go: https://pastebin.com/uF1pHTsg
> And what about "skip_balance" mount option?
I have this in my fstab :)
> Another problem is, with so many snapshots, balance is also hugely
> slowed, thus I'm not 100% sure if it's really a hang.
I sent another thread about this last week, balance got hung after 2
days of doing nothing and just moving a single chunk.
Ok, I was able to remount the filesystem read only. I was wrong, I have
270 snapshots:
gargamel:/mnt/mnt# btrfs subvolume list . | grep -c 'path backup/'
74
gargamel:/mnt/mnt# btrfs subvolume list . | grep -c 'path backup-btrfssend/'
196
It's a backup server, I use btrfs send for many machines and for each btrfs
send, I keep history, maybe 10 or so backups. So it adds up in the end.
Is btrfs unable to deal with this well enough?
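The two grep -c calls above generalize to a one-pass count per snapshot prefix. A sketch; the awk logic is mine, and the sample lines follow the `btrfs subvolume list` output format:

```shell
#!/bin/sh
# Count snapshots per top-level prefix from `btrfs subvolume list` output,
# read on stdin so it works against a live system or a saved listing.
count_by_prefix() {
    awk '{
        # the last field is the subvolume path, e.g. backup/host1/20180623
        split($NF, parts, "/")
        counts[parts[1]]++
    }
    END { for (p in counts) print p, counts[p] }' | sort
}

count_by_prefix <<'EOF'
ID 257 gen 9 top level 5 path backup/host1
ID 258 gen 9 top level 5 path backup/host2
ID 259 gen 9 top level 5 path backup-btrfssend/host3
EOF
```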
> If for that usage, btrfs-restore would fit your use case more,
> Unfortunately it needs extra disk space and isn't good at restoring
> subvolume/snapshots.
> (Although it's much faster than repairing the possibly corrupted extent
> tree)
It's a backup server, it only contains data from other machines.
If the filesystem cannot be recovered to a working state, I will need
over a week to restart the many btrfs send commands from many servers.
This is why anything other than --repair is useless to me, I don't need
the data back, it's still on the original machines, I need the
filesystem to work again so that I don't waste a week recreating the
many btrfs send/receive relationships.
> > Is that possible at all?
>
> At least for file recovery (fs tree repair), we have such behavior.
>
> However, the problem you hit (and a lot of users hit) is all about
> extent tree repair, which doesn't even get to file recovery.
>
> All the hassle is in the extent tree, and for the extent tree, it's just good
> or bad. Any corruption in the extent tree may lead to later bugs.
> The only way to avoid extent tree problems is to mount the fs RO.
>
> So, I'm afraid it won't be possible for at least the next few years.
Understood, thanks for answering.
Does the pastebin help and is 270 snapshots ok enough?
Thanks,
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 7F55D5F27AAF9D08
* Re: So, does btrfs check lowmem take days? weeks?
2018-06-29 6:59 ` Marc MERLIN
@ 2018-06-29 7:09 ` Roman Mamedov
2018-06-29 7:22 ` Marc MERLIN
2018-06-29 7:20 ` So, does btrfs check lowmem take days? weeks? Qu Wenruo
1 sibling, 1 reply; 65+ messages in thread
From: Roman Mamedov @ 2018-06-29 7:09 UTC (permalink / raw)
To: Marc MERLIN; +Cc: linux-btrfs
On Thu, 28 Jun 2018 23:59:03 -0700
Marc MERLIN <marc@merlins.org> wrote:
> I don't waste a week recreating the many btrfs send/receive relationships.
Consider not using send/receive, and switching to regular rsync instead.
Send/receive is very limiting and cumbersome, including because of what you
described. And it doesn't gain you much over an incremental rsync. As for
snapshots on the backup server, you can either automate making one as soon as a
backup has finished, or simply make them once/twice a day, during a period
when no backups are ongoing.
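A minimal version of that automation might look like the following; the naming scheme, retention count, and the GNU head idiom are my assumptions:

```shell
#!/bin/sh
# After a backup finishes: take a read-only snapshot, then prune old ones.
# prune_list is pure: given snapshot paths on stdin (names that sort by
# date), it prints the entries that fall outside the retention window.
prune_list() {
    sort | head -n -"$1"    # GNU head: print all but the last (newest) $1
}

snapshot_and_prune() {
    src="$1"; keep="$2"
    btrfs subvolume snapshot -r "$src" "$src.$(date +%Y%m%d-%H%M%S)"
    ls -d "$src".* | prune_list "$keep" | while read -r old; do
        btrfs subvolume delete "$old"
    done
}
```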
--
With respect,
Roman
* Re: So, does btrfs check lowmem take days? weeks?
2018-06-29 6:59 ` Marc MERLIN
2018-06-29 7:09 ` Roman Mamedov
@ 2018-06-29 7:20 ` Qu Wenruo
2018-06-29 7:28 ` Marc MERLIN
1 sibling, 1 reply; 65+ messages in thread
From: Qu Wenruo @ 2018-06-29 7:20 UTC (permalink / raw)
To: Marc MERLIN; +Cc: linux-btrfs
On 2018-06-29 14:59, Marc MERLIN wrote:
> On Fri, Jun 29, 2018 at 02:29:10PM +0800, Qu Wenruo wrote:
>>> If --repair doesn't work, check is useless to me sadly.
>>
>> Not exactly.
>> Although it's time consuming, I have manually patched several users' filesystems,
>> which normally ends pretty well.
>
> Ok I understand now.
>
>>> Agreed, I doubt I have over or much over 100 snapshots though (but I
>>> can't check right now).
>>> Sadly I'm not allowed to mount even read only while check is running:
>>> gargamel:~# mount -o ro /dev/mapper/dshelf2 /mnt/mnt2
>>> mount: /dev/mapper/dshelf2 already mounted or /mnt/mnt2 busy
>
> Ok, so I just checked now, 270 snapshots, but not because I'm crazy,
> because I use btrfs send a lot :)
>
>> This looks like super block corruption?
>>
>> What about "btrfs inspect dump-super -fFa /dev/mapper/dshelf2"?
>
> Sure, there you go: https://pastebin.com/uF1pHTsg
>
>> And what about "skip_balance" mount option?
>
> I have this in my fstab :)
>
>> Another problem is, with so many snapshots, balance is also hugely
>> slowed, thus I'm not 100% sure if it's really a hang.
>
> I sent another thread about this last week, balance got hung after 2
> days of doing nothing and just moving a single chunk.
>
> Ok, I was able to remount the filesystem read only. I was wrong, I have
> 270 snapshots:
> gargamel:/mnt/mnt# btrfs subvolume list . | grep -c 'path backup/'
> 74
> gargamel:/mnt/mnt# btrfs subvolume list . | grep -c 'path backup-btrfssend/'
> 196
>
> It's a backup server, I use btrfs send for many machines and for each btrfs
> send, I keep history, maybe 10 or so backups. So it adds up in the end.
>
> Is btrfs unable to deal with this well enough?
It depends.
For certain rare cases, if the only operations on the filesystem are
non-btrfs-specific ones (POSIX file operations), then you're fine.
(You might reach thousands of snapshots before any obvious performance
degradation.)
If certain btrfs specific operations are involved, it's definitely not OK:
1) Balance
2) Quota
3) Btrfs check
>
>> If for that usage, btrfs-restore would fit your use case more,
>> Unfortunately it needs extra disk space and isn't good at restoring
>> subvolume/snapshots.
>> (Although it's much faster than repairing the possibly corrupted extent
>> tree)
>
> It's a backup server, it only contains data from other machines.
> If the filesystem cannot be recovered to a working state, I will need
> over a week to restart the many btrfs send commands from many servers.
> This is why anything other than --repair is useless to me, I don't need
> the data back, it's still on the original machines, I need the
> filesystem to work again so that I don't waste a week recreating the
> many btrfs send/receive relationships.
Now totally understand why you need to repair the fs.
>
>>> Is that possible at all?
>>
>> At least for file recovery (fs tree repair), we have such behavior.
>>
>> However, the problem you hit (and a lot of users hit) is all about
>> extent tree repair, which doesn't even get to file recovery.
>>
>> All the hassle is in the extent tree, and for the extent tree, it's just good
>> or bad. Any corruption in the extent tree may lead to later bugs.
>> The only way to avoid extent tree problems is to mount the fs RO.
>>
>> So, I'm afraid it won't be possible for at least the next few years.
>
> Understood, thanks for answering.
>
> Does the pastebin help and is 270 snapshots ok enough?
The super dump doesn't show anything wrong.
So the problem may be in the super large extent tree.
In this case, a plain check result with Su's patch would help more
than the not-so-interesting super dump.
Thanks,
Qu
>
> Thanks,
> Marc
>
* Re: So, does btrfs check lowmem take days? weeks?
2018-06-29 7:09 ` Roman Mamedov
@ 2018-06-29 7:22 ` Marc MERLIN
2018-06-29 7:34 ` Roman Mamedov
2018-06-29 8:04 ` Lionel Bouton
0 siblings, 2 replies; 65+ messages in thread
From: Marc MERLIN @ 2018-06-29 7:22 UTC (permalink / raw)
To: Roman Mamedov; +Cc: linux-btrfs
On Fri, Jun 29, 2018 at 12:09:54PM +0500, Roman Mamedov wrote:
> On Thu, 28 Jun 2018 23:59:03 -0700
> Marc MERLIN <marc@merlins.org> wrote:
>
> > I don't waste a week recreating the many btrfs send/receive relationships.
>
> Consider not using send/receive, and switching to regular rsync instead.
> Send/receive is very limiting and cumbersome, including because of what you
> described. And it doesn't gain you much over an incremental rsync. As for
Err, sorry but I cannot agree with you here, at all :)
btrfs send/receive is pretty much the only reason I use btrfs.
rsync takes hours on big filesystems scanning every single inode on both
sides and then seeing what changed, and only then sends the differences.
It's super inefficient.
btrfs send knows in seconds what needs to be sent, and works on it right
away.
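In outline, that incremental flow is: find the newest snapshot both sides already share, then send only the delta against it. A sketch; the parent-picking helper is mine and the paths are illustrative:

```shell
#!/bin/sh
# Pick the newest snapshot present on BOTH ends, to pass to `btrfs send -p`
# as the incremental parent. Inputs: two sorted files, one snapshot name
# per line, with names that sort by date (e.g. snap.2018-06-23).
latest_common() {
    comm -12 "$1" "$2" | tail -n 1
}

# Then, roughly:
#   parent=$(latest_common local.list remote.list)
#   btrfs send -p "/data/$parent" /data/snap.new | ssh backuphost btrfs receive /backup/data
```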
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 7F55D5F27AAF9D08
* Re: So, does btrfs check lowmem take days? weeks?
2018-06-29 7:20 ` So, does btrfs check lowmem take days? weeks? Qu Wenruo
@ 2018-06-29 7:28 ` Marc MERLIN
2018-06-29 17:10 ` Marc MERLIN
0 siblings, 1 reply; 65+ messages in thread
From: Marc MERLIN @ 2018-06-29 7:28 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
On Fri, Jun 29, 2018 at 03:20:42PM +0800, Qu Wenruo wrote:
> If certain btrfs specific operations are involved, it's definitely not OK:
> 1) Balance
> 2) Quota
> 3) Btrfs check
Ok, I understand. I'll try to balance almost never then. My problems did
indeed start because I ran balance and it got stuck for 2 days with 0
progress.
That still seems like a bug though. I'm ok with slow, but stuck for 2
days with only 270 snapshots or so means there is a bug, or the
algorithm is so expensive that 270 snapshots could cause it to take days
or weeks to proceed?
> > It's a backup server, it only contains data from other machines.
> > If the filesystem cannot be recovered to a working state, I will need
> > over a week to restart the many btrfs send commands from many servers.
> > This is why anything other than --repair is useless ot me, I don't need
> > the data back, it's still on the original machines, I need the
> > filesystem to work again so that I don't waste a week recreating the
> > many btrfs send/receive relationships.
>
> Now totally understand why you need to repair the fs.
I also understand that my use case is atypical :)
But I guess this also means that using btrfs for a lot of send/receive
on a backup server is not going to work well unfortunately :-/
Now I'm wondering if I'm the only person even doing this.
> > Does the pastebin help and is 270 snapshots ok enough?
>
> The super dump doesn't show anything wrong.
>
> So the problem may be in the super large extent tree.
>
> In this case, a plain check result with Su's patch would help more
> than the not-so-interesting super dump.
First I tried to mount with skip_balance after the partial repair, and
it hung a long time:
[445635.716318] BTRFS info (device dm-2): disk space caching is enabled
[445635.736229] BTRFS info (device dm-2): has skinny extents
[445636.101999] BTRFS info (device dm-2): bdev /dev/mapper/dshelf2 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
[445825.053205] BTRFS info (device dm-2): enabling ssd optimizations
[446511.006588] BTRFS info (device dm-2): disk space caching is enabled
[446511.026737] BTRFS info (device dm-2): has skinny extents
[446511.325470] BTRFS info (device dm-2): bdev /dev/mapper/dshelf2 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
[446699.593501] BTRFS info (device dm-2): enabling ssd optimizations
[446964.077045] INFO: task btrfs-transacti:9211 blocked for more than 120 seconds.
[446964.099802] Not tainted 4.17.2-amd64-preempt-sysrq-20180818 #3
[446964.120004] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
So, I rebooted, and will now run Su's btrfs check without repair and
report back.
Thanks both for your help.
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 7F55D5F27AAF9D08
* Re: So, does btrfs check lowmem take days? weeks?
2018-06-29 7:22 ` Marc MERLIN
@ 2018-06-29 7:34 ` Roman Mamedov
2018-06-29 8:04 ` Lionel Bouton
1 sibling, 0 replies; 65+ messages in thread
From: Roman Mamedov @ 2018-06-29 7:34 UTC (permalink / raw)
To: Marc MERLIN; +Cc: linux-btrfs
On Fri, 29 Jun 2018 00:22:10 -0700
Marc MERLIN <marc@merlins.org> wrote:
> On Fri, Jun 29, 2018 at 12:09:54PM +0500, Roman Mamedov wrote:
> > On Thu, 28 Jun 2018 23:59:03 -0700
> > Marc MERLIN <marc@merlins.org> wrote:
> >
> > > I don't waste a week recreating the many btrfs send/receive relationships.
> >
> > Consider not using send/receive, and switching to regular rsync instead.
> > Send/receive is very limiting and cumbersome, including because of what you
> > described. And it doesn't gain you much over an incremental rsync. As for
>
> Err, sorry but I cannot agree with you here, at all :)
>
> btrfs send/receive is pretty much the only reason I use btrfs.
> rsync takes hours on big filesystems scanning every single inode on both
> sides and then seeing what changed, and only then sends the differences
I use it for backing up root filesystems of about 20 hosts, and for syncing
large multi-terabyte media collections -- it's fast enough in both.
Admittedly neither of those cases has millions of subdirs or files where
scanning may take a long time. And in the former case it's also all from and
to SSDs. Maybe your use case is different where it doesn't work as well. But
perhaps then general day-to-day performance is not great either, so I'd suggest
looking into SSD-based LVM caching, it really works wonders with Btrfs.
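For readers unfamiliar with lvmcache, attaching an SSD cache to an existing logical volume is roughly the following (the device, VG, and LV names are made up for illustration, not taken from this thread):

```shell
# Add the SSD to the volume group holding the slow LV.
vgextend vg0 /dev/nvme0n1
# Create a cache pool on the SSD, then attach it to the LV btrfs sits on.
lvcreate --type cache-pool -L 100G -n cachepool vg0 /dev/nvme0n1
lvconvert --type cache --cachepool vg0/cachepool vg0/slowlv
```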
--
With respect,
Roman
* Re: So, does btrfs check lowmem take days? weeks?
2018-06-29 7:22 ` Marc MERLIN
2018-06-29 7:34 ` Roman Mamedov
@ 2018-06-29 8:04 ` Lionel Bouton
2018-06-29 16:24 ` btrfs send/receive vs rsync Marc MERLIN
1 sibling, 1 reply; 65+ messages in thread
From: Lionel Bouton @ 2018-06-29 8:04 UTC (permalink / raw)
To: Marc MERLIN, Roman Mamedov; +Cc: linux-btrfs
Hi,
On 29/06/2018 09:22, Marc MERLIN wrote:
> On Fri, Jun 29, 2018 at 12:09:54PM +0500, Roman Mamedov wrote:
>> On Thu, 28 Jun 2018 23:59:03 -0700
>> Marc MERLIN <marc@merlins.org> wrote:
>>
>>> I don't waste a week recreating the many btrfs send/receive relationships.
>> Consider not using send/receive, and switching to regular rsync instead.
>> Send/receive is very limiting and cumbersome, including because of what you
>> described. And it doesn't gain you much over an incremental rsync. As for
> Err, sorry but I cannot agree with you here, at all :)
>
> btrfs send/receive is pretty much the only reason I use btrfs.
> rsync takes hours on big filesystems scanning every single inode on both
> sides and then seeing what changed, and only then sends the differences
> It's super inefficient.
> btrfs send knows in seconds what needs to be sent, and works on it right
> away.
I've not yet tried send/receive but I feel the pain of rsyncing millions
of files (I had to use lsyncd to limit the problem to the time the
origin servers reboot which is a relatively rare event) so this thread
piqued my attention. Looking at the whole thread I wonder if you could
get a more manageable solution by splitting the filesystem.
If instead of using a single BTRFS filesystem you used LVM volumes
(maybe with Thin provisioning and monitoring of the volume group free
space) for each of your servers to backup with one BTRFS filesystem per
volume you would have less snapshots per filesystem and isolate problems
in case of corruption. If you eventually decide to start from scratch
again this might help a lot in your case.
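A minimal sketch of that layout, assuming an LVM thin pool (all names and sizes below are hypothetical):

```shell
# One thin pool on the backup array; one thin LV + btrfs per backed-up host.
lvcreate --type thin-pool -L 9T -n backuppool vg_backup
lvcreate -V 1T --thinpool vg_backup/backuppool -n host1 vg_backup
mkfs.btrfs -L host1-backup /dev/vg_backup/host1
mount /dev/vg_backup/host1 /mnt/backup/host1

# Thin LVs overcommit the pool, so monitor real usage:
lvs -o lv_name,data_percent,metadata_percent vg_backup
```

Each filesystem then carries only one host's snapshots, so any corruption (and an eventual mkfs-and-restart) is contained to that host's backups.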
Lionel
* Re: btrfs send/receive vs rsync
2018-06-29 8:04 ` Lionel Bouton
@ 2018-06-29 16:24 ` Marc MERLIN
2018-06-30 8:18 ` Duncan
0 siblings, 1 reply; 65+ messages in thread
From: Marc MERLIN @ 2018-06-29 16:24 UTC (permalink / raw)
To: Lionel Bouton; +Cc: Roman Mamedov, linux-btrfs
On Fri, Jun 29, 2018 at 10:04:02AM +0200, Lionel Bouton wrote:
> Hi,
>
> On 29/06/2018 09:22, Marc MERLIN wrote:
> > On Fri, Jun 29, 2018 at 12:09:54PM +0500, Roman Mamedov wrote:
> >> On Thu, 28 Jun 2018 23:59:03 -0700
> >> Marc MERLIN <marc@merlins.org> wrote:
> >>
> >>> I don't waste a week recreating the many btrfs send/receive relationships.
> >> Consider not using send/receive, and switching to regular rsync instead.
> >> Send/receive is very limiting and cumbersome, including because of what you
> >> described. And it doesn't gain you much over an incremental rsync. As for
> > Err, sorry but I cannot agree with you here, at all :)
> >
> > btrfs send/receive is pretty much the only reason I use btrfs.
> > rsync takes hours on big filesystems scanning every single inode on both
> > sides and then seeing what changed, and only then sends the differences
> > It's super inefficient.
> > btrfs send knows in seconds what needs to be sent, and works on it right
> > away.
>
> I've not yet tried send/receive but I feel the pain of rsyncing millions
> of files (I had to use lsyncd to limit the problem to the time the
> origin servers reboot which is a relatively rare event) so this thread
> piqued my attention. Looking at the whole thread I wonder if you could
> get a more manageable solution by splitting the filesystem.
So, let's be clear. I did backups with rsync for 10+ years. It was slow
and painful. On my laptop an hourly rsync between 2 drives slowed down
my machine to a crawl while everything was being stat'ed, it took
forever.
Now with btrfs send/receive, it just works; I don't even see it
happening in the background.
Here is a page I wrote about it in 2014:
http://marc.merlins.org/perso/btrfs/2014-03.html#Btrfs-Tips_-Doing-Fast-Incremental-Backups-With-Btrfs-Send-and-Receive
Here is a talk I gave in 2014 too, scroll to the bottom of the page, and
the bottom of the talk outline:
http://marc.merlins.org/perso/btrfs/2014-05.html#My-Btrfs-Talk-at-Linuxcon-JP-2014
and click on 'Btrfs send/receive'
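For reference, the send/receive cycle described above boils down to a few commands (the paths and snapshot names here are illustrative, not Marc's actual setup):

```shell
# Initial full transfer: take a read-only snapshot, send it whole.
btrfs subvolume snapshot -r /mnt/src/vol /mnt/src/vol.snap1
btrfs send /mnt/src/vol.snap1 | btrfs receive /mnt/backup/

# Incremental transfer: with -p, only the delta between the two
# snapshots is generated from metadata, with no stat() scan of either side.
btrfs subvolume snapshot -r /mnt/src/vol /mnt/src/vol.snap2
btrfs send -p /mnt/src/vol.snap1 /mnt/src/vol.snap2 | btrfs receive /mnt/backup/
```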
> If instead of using a single BTRFS filesystem you used LVM volumes
> (maybe with Thin provisioning and monitoring of the volume group free
> space) for each of your servers to backup with one BTRFS filesystem per
> volume you would have less snapshots per filesystem and isolate problems
> in case of corruption. If you eventually decide to start from scratch
> again this might help a lot in your case.
So, I already have problems due to too many block layers:
- raid 5 + ssd
- bcache
- dmcrypt
- btrfs
I get occasional deadlocks due to upper layers sending more data to the
lower layer (bcache) than it can process. I'm a bit wary of adding yet
another layer (LVM), but you're otherwise correct that keeping smaller
btrfs filesystems would help with performance and containing possible
damage.
Has anyone actually done this? :)
Marc
* Re: So, does btrfs check lowmem take days? weeks?
2018-06-29 7:28 ` Marc MERLIN
@ 2018-06-29 17:10 ` Marc MERLIN
2018-06-30 0:04 ` Chris Murphy
2018-06-30 2:44 ` Marc MERLIN
0 siblings, 2 replies; 65+ messages in thread
From: Marc MERLIN @ 2018-06-29 17:10 UTC (permalink / raw)
To: Qu Wenruo, suy.fnst; +Cc: linux-btrfs
On Fri, Jun 29, 2018 at 12:28:31AM -0700, Marc MERLIN wrote:
> So, I rebooted, and will now run Su's btrfs check without repair and
> report back.
As expected, it will likely still take days, here's the start:
gargamel:~# btrfs check --mode=lowmem -p /dev/mapper/dshelf2
Checking filesystem on /dev/mapper/dshelf2
UUID: 0f1a0c9f-4e54-4fa7-8736-fd50818ff73d
ERROR: extent[84302495744, 69632] referencer count mismatch (root: 21872, owner: 374857, offset: 3407872) wanted: 2, have: 4
ERROR: extent[84302495744, 69632] referencer count mismatch (root: 22911, owner: 374857, offset: 3407872) wanted: 2, have: 4
ERROR: extent[125712527360, 12214272] referencer count mismatch (root: 21872, owner: 374857, offset: 114540544) wanted: 180, have: 240
ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 21872, owner: 374857, offset: 126754816) wanted: 67, have: 115
ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 22911, owner: 374857, offset: 126754816) wanted: 67, have: 115
ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 21872, owner: 374857, offset: 131866624) wanted: 114, have: 143
ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 22911, owner: 374857, offset: 131866624) wanted: 114, have: 143
ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 21872, owner: 374857, offset: 148234240) wanted: 301, have: 431
ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 22911, owner: 374857, offset: 148234240) wanted: 355, have: 433
ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 21872, owner: 374857, offset: 180371456) wanted: 160, have: 240
ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 22911, owner: 374857, offset: 180371456) wanted: 161, have: 240
ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 21872, owner: 374857, offset: 192200704) wanted: 169, have: 249
ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 22911, owner: 374857, offset: 192200704) wanted: 171, have: 251
ERROR: extent[150850146304, 17522688] referencer count mismatch (root: 21872, owner: 374857, offset: 217653248) wanted: 347, have: 418
ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 22911, owner: 374857, offset: 235175936) wanted: 1, have: 1449
ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 21872, owner: 374857, offset: 235175936) wanted: 1, have: 1452
Mmmh, these look similar (but not identical) to the last run earlier in this thread:
ERROR: extent[84302495744, 69632] referencer count mismatch (root: 21872, owner: 374857, offset: 3407872) wanted: 3, have: 4
Created new chunk [18457780224000 1073741824]
Delete backref in extent [84302495744 69632]
ERROR: extent[84302495744, 69632] referencer count mismatch (root: 22911, owner: 374857, offset: 3407872) wanted: 3, have: 4
Delete backref in extent [84302495744 69632]
ERROR: extent[125712527360, 12214272] referencer count mismatch (root: 21872, owner: 374857, offset: 114540544) wanted: 181, have: 240
Delete backref in extent [125712527360 12214272]
ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 21872, owner: 374857, offset: 126754816) wanted: 68, have: 115
Delete backref in extent [125730848768 5111808]
ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 22911, owner: 374857, offset: 126754816) wanted: 68, have: 115
Delete backref in extent [125730848768 5111808]
ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 21872, owner: 374857, offset: 131866624) wanted: 115, have: 143
Delete backref in extent [125736914944 6037504]
ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 22911, owner: 374857, offset: 131866624) wanted: 115, have: 143
Delete backref in extent [125736914944 6037504]
ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 21872, owner: 374857, offset: 148234240) wanted: 302, have: 431
Delete backref in extent [129952120832 20242432]
ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 22911, owner: 374857, offset: 148234240) wanted: 356, have: 433
Delete backref in extent [129952120832 20242432]
ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 21872, owner: 374857, offset: 180371456) wanted: 161, have: 240
Delete backref in extent [134925357056 11829248]
ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 22911, owner: 374857, offset: 180371456) wanted: 162, have: 240
Delete backref in extent [134925357056 11829248]
ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 21872, owner: 374857, offset: 192200704) wanted: 170, have: 249
Delete backref in extent [147895111680 12345344]
ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 22911, owner: 374857, offset: 192200704) wanted: 172, have: 251
Delete backref in extent [147895111680 12345344]
ERROR: extent[150850146304, 17522688] referencer count mismatch (root: 21872, owner: 374857, offset: 217653248) wanted: 348, have: 418
Delete backref in extent [150850146304 17522688]
ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 22911, owner: 374857, offset: 235175936) wanted: 555, have: 1449
Deleted root 2 item[156909494272, 178, 5476627808561673095]
ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 21872, owner: 374857, offset: 235175936) wanted: 556, have: 1452
Deleted root 2 item[156909494272, 178, 7338474132555182983]
I guess the last repair didn't fix things in a way that stuck?
Thanks,
Marc
* Re: So, does btrfs check lowmem take days? weeks?
2018-06-29 17:10 ` Marc MERLIN
@ 2018-06-30 0:04 ` Chris Murphy
2018-06-30 2:44 ` Marc MERLIN
1 sibling, 0 replies; 65+ messages in thread
From: Chris Murphy @ 2018-06-30 0:04 UTC (permalink / raw)
To: Marc MERLIN; +Cc: Qu Wenruo, Su Yue, Btrfs BTRFS
I've got about 1/2 the snapshots and less than 1/10th the data...but
my btrfs check times are much shorter than either: 15 minutes and 65
minutes (lowmem).
[chris@f28s ~]$ sudo btrfs fi us /mnt/first
Overall:
Device size: 1024.00GiB
Device allocated: 774.12GiB
Device unallocated: 249.87GiB
Device missing: 0.00B
Used: 760.48GiB
Free (estimated): 256.95GiB (min: 132.01GiB)
Data ratio: 1.00
Metadata ratio: 2.00
Global reserve: 512.00MiB (used: 0.00B)
Data,single: Size:761.00GiB, Used:753.93GiB
/dev/mapper/first 761.00GiB
Metadata,DUP: Size:6.50GiB, Used:3.28GiB
/dev/mapper/first 13.00GiB
System,DUP: Size:64.00MiB, Used:112.00KiB
/dev/mapper/first 128.00MiB
Unallocated:
/dev/mapper/first 249.87GiB
146 subvolumes
137 snapshots
total csum bytes: 790549924
total tree bytes: 3519250432
total fs tree bytes: 2546073600
total extent tree bytes: 131350528
Original mode check takes ~15 minutes
Lowmem mode takes ~65 minutes
RAM: 4G
CPU: Intel(R) Pentium(R) CPU N3700 @ 1.60GHz
Chris Murphy
* Re: So, does btrfs check lowmem take days? weeks?
2018-06-29 17:10 ` Marc MERLIN
2018-06-30 0:04 ` Chris Murphy
@ 2018-06-30 2:44 ` Marc MERLIN
2018-06-30 14:49 ` Qu Wenruo
1 sibling, 1 reply; 65+ messages in thread
From: Marc MERLIN @ 2018-06-30 2:44 UTC (permalink / raw)
To: Qu Wenruo, suy.fnst; +Cc: linux-btrfs
Well, there goes that. After about 18H:
ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 21872, owner: 374857, offset: 235175936) wanted: 1, have: 1452
backref.c:466: __add_missing_keys: Assertion `ref->root_id` failed, value 0
btrfs(+0x3a232)[0x56091704f232]
btrfs(+0x3ab46)[0x56091704fb46]
btrfs(+0x3b9f5)[0x5609170509f5]
btrfs(btrfs_find_all_roots+0x9)[0x560917050a45]
btrfs(+0x572ff)[0x56091706c2ff]
btrfs(+0x60b13)[0x560917075b13]
btrfs(cmd_check+0x2634)[0x56091707d431]
btrfs(main+0x88)[0x560917027260]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1)[0x7f93aa508561]
btrfs(_start+0x2a)[0x560917026dfa]
Aborted
That's https://github.com/Damenly/btrfs-progs.git
Whoops, I didn't use the tmp1 branch, let me try again with that and
report back, although the problem above is still going to be there since
I think the only difference will be this, correct?
https://github.com/Damenly/btrfs-progs/commit/b5851513a12237b3e19a3e71f3ad00b966d25b3a
Marc
* Re: btrfs send/receive vs rsync
2018-06-29 16:24 ` btrfs send/receive vs rsync Marc MERLIN
@ 2018-06-30 8:18 ` Duncan
0 siblings, 0 replies; 65+ messages in thread
From: Duncan @ 2018-06-30 8:18 UTC (permalink / raw)
To: linux-btrfs
Marc MERLIN posted on Fri, 29 Jun 2018 09:24:20 -0700 as excerpted:
>> If instead of using a single BTRFS filesystem you used LVM volumes
>> (maybe with Thin provisioning and monitoring of the volume group free
>> space) for each of your servers to backup with one BTRFS filesystem per
>> volume you would have less snapshots per filesystem and isolate
>> problems in case of corruption. If you eventually decide to start from
>> scratch again this might help a lot in your case.
>
> So, I already have problems due to too many block layers:
> - raid 5 + ssd
> - bcache
> - dmcrypt
> - btrfs
>
> I get occasional deadlocks due to upper layers sending more data to the
> lower layer (bcache) than it can process. I'm a bit wary of adding yet
> another layer (LVM), but you're otherwise correct that keeping smaller
> btrfs filesystems would help with performance and containing possible
> damage.
>
> Has anyone actually done this? :)
So I definitely use (and advocate!) the split-em-up strategy, and I use
btrfs, but that's pretty much all the similarity we have.
I'm all ssd, having left spinning rust behind. My strategy avoids
unnecessary layers like lvm (tho crypt can arguably be necessary),
preferring direct on-device (gpt) partitioning for simplicity of
management and disaster recovery. And my backup and recovery strategy is
an equally simple mkfs and full-filesystem-fileset copy to an identically
sized filesystem, with backups easily bootable/mountable in place of the
working copy if necessary, and multiple backups so if disaster takes out
the backup I was writing at the same time as the working copy, I still
have a backup to fall back to.
So it's different enough I'm not sure how much my experience will help
you. But I /can/ say the subdivision is nice, as it means I can keep my
root filesystem read-only by default for reliability, my most-at-risk log
filesystem tiny for near-instant scrub/balance/check, and my also at risk
home small as well, with the big media files being on a different
filesystem that's mostly read-only, so less at risk and needing less
frequent backups. The tiny boot and large updates (distro repo, sources,
ccache) are also separate, and mounted only for boot maintenance or
updates.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: So, does btrfs check lowmem take days? weeks?
2018-06-30 2:44 ` Marc MERLIN
@ 2018-06-30 14:49 ` Qu Wenruo
2018-06-30 21:06 ` Marc MERLIN
0 siblings, 1 reply; 65+ messages in thread
From: Qu Wenruo @ 2018-06-30 14:49 UTC (permalink / raw)
To: Marc MERLIN, suy.fnst; +Cc: linux-btrfs
On 2018年06月30日 10:44, Marc MERLIN wrote:
> Well, there goes that. After about 18H:
> ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 21872, owner: 374857, offset: 235175936) wanted: 1, have: 1452
> backref.c:466: __add_missing_keys: Assertion `ref->root_id` failed, value 0
> btrfs(+0x3a232)[0x56091704f232]
> btrfs(+0x3ab46)[0x56091704fb46]
> btrfs(+0x3b9f5)[0x5609170509f5]
> btrfs(btrfs_find_all_roots+0x9)[0x560917050a45]
> btrfs(+0x572ff)[0x56091706c2ff]
> btrfs(+0x60b13)[0x560917075b13]
> btrfs(cmd_check+0x2634)[0x56091707d431]
> btrfs(main+0x88)[0x560917027260]
> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1)[0x7f93aa508561]
> btrfs(_start+0x2a)[0x560917026dfa]
> Aborted
I think that's the root cause.
Some invalid extent tree backref or bad tree block is blowing up the backref code.
All previous error messages may be garbage unless you're using Su's
latest branch, as lowmem mode tends to report false alerts on referencer
count mismatches.
But the last abort looks quite likely to be the culprit.
Would you try to dump the extent tree?
# btrfs inspect dump-tree -t extent <device> | grep -A50 156909494272
It should help us locate the culprit and hopefully get some chance to
fix it.
Thanks,
Qu
>
> That's https://github.com/Damenly/btrfs-progs.git
>
> Whoops, I didn't use the tmp1 branch, let me try again with that and
> report back, although the problem above is still going to be there since
> I think the only difference will be this, correct?
> https://github.com/Damenly/btrfs-progs/commit/b5851513a12237b3e19a3e71f3ad00b966d25b3a
>
> Marc
>
* Re: So, does btrfs check lowmem take days? weeks?
2018-06-30 14:49 ` Qu Wenruo
@ 2018-06-30 21:06 ` Marc MERLIN
0 siblings, 0 replies; 65+ messages in thread
From: Marc MERLIN @ 2018-06-30 21:06 UTC (permalink / raw)
To: Qu Wenruo; +Cc: suy.fnst, linux-btrfs
On Sat, Jun 30, 2018 at 10:49:07PM +0800, Qu Wenruo wrote:
> But the last abort looks pretty possible to be the culprit.
>
> Would you try to dump the extent tree?
> # btrfs inspect dump-tree -t extent <device> | grep -A50 156909494272
Sure, there you go:
item 25 key (156909494272 EXTENT_ITEM 55320576) itemoff 14943 itemsize 24
refs 19715 gen 31575 flags DATA
item 26 key (156909494272 EXTENT_DATA_REF 571620086735451015) itemoff 14915 itemsize 28
extent data backref root 21641 objectid 374857 offset 235175936 count 1452
item 27 key (156909494272 EXTENT_DATA_REF 1765833482087969671) itemoff 14887 itemsize 28
extent data backref root 23094 objectid 374857 offset 235175936 count 1442
item 28 key (156909494272 EXTENT_DATA_REF 1807626434455810951) itemoff 14859 itemsize 28
extent data backref root 21503 objectid 374857 offset 235175936 count 1454
item 29 key (156909494272 EXTENT_DATA_REF 1879818091602916231) itemoff 14831 itemsize 28
extent data backref root 21462 objectid 374857 offset 235175936 count 1454
item 30 key (156909494272 EXTENT_DATA_REF 3610854505775117191) itemoff 14803 itemsize 28
extent data backref root 23134 objectid 374857 offset 235175936 count 1442
item 31 key (156909494272 EXTENT_DATA_REF 3754675454231458695) itemoff 14775 itemsize 28
extent data backref root 23052 objectid 374857 offset 235175936 count 1442
item 32 key (156909494272 EXTENT_DATA_REF 5060494667839714183) itemoff 14747 itemsize 28
extent data backref root 23174 objectid 374857 offset 235175936 count 1440
item 33 key (156909494272 EXTENT_DATA_REF 5476627808561673095) itemoff 14719 itemsize 28
extent data backref root 22911 objectid 374857 offset 235175936 count 1
item 34 key (156909494272 EXTENT_DATA_REF 6378484416458011527) itemoff 14691 itemsize 28
extent data backref root 23012 objectid 374857 offset 235175936 count 1442
item 35 key (156909494272 EXTENT_DATA_REF 7338474132555182983) itemoff 14663 itemsize 28
extent data backref root 21872 objectid 374857 offset 235175936 count 1
item 36 key (156909494272 EXTENT_DATA_REF 7516565391717970823) itemoff 14635 itemsize 28
extent data backref root 21826 objectid 374857 offset 235175936 count 1452
item 37 key (156909494272 SHARED_DATA_REF 14871537025024) itemoff 14631 itemsize 4
shared data backref count 10
item 38 key (156909494272 SHARED_DATA_REF 14871617568768) itemoff 14627 itemsize 4
shared data backref count 73
item 39 key (156909494272 SHARED_DATA_REF 14871619846144) itemoff 14623 itemsize 4
shared data backref count 59
item 40 key (156909494272 SHARED_DATA_REF 14871623270400) itemoff 14619 itemsize 4
shared data backref count 68
item 41 key (156909494272 SHARED_DATA_REF 14871623532544) itemoff 14615 itemsize 4
shared data backref count 70
item 42 key (156909494272 SHARED_DATA_REF 14871626383360) itemoff 14611 itemsize 4
shared data backref count 76
item 43 key (156909494272 SHARED_DATA_REF 14871635132416) itemoff 14607 itemsize 4
shared data backref count 60
item 44 key (156909494272 SHARED_DATA_REF 14871649533952) itemoff 14603 itemsize 4
shared data backref count 79
item 45 key (156909494272 SHARED_DATA_REF 14871862378496) itemoff 14599 itemsize 4
shared data backref count 70
item 46 key (156909494272 SHARED_DATA_REF 14909667098624) itemoff 14595 itemsize 4
shared data backref count 72
item 47 key (156909494272 SHARED_DATA_REF 14909669720064) itemoff 14591 itemsize 4
shared data backref count 58
item 48 key (156909494272 SHARED_DATA_REF 14909734567936) itemoff 14587 itemsize 4
shared data backref count 73
item 49 key (156909494272 SHARED_DATA_REF 14909920477184) itemoff 14583 itemsize 4
shared data backref count 79
item 50 key (156909494272 SHARED_DATA_REF 14942279335936) itemoff 14579 itemsize 4
shared data backref count 79
item 51 key (156909494272 SHARED_DATA_REF 14942304862208) itemoff 14575 itemsize 4
shared data backref count 72
item 52 key (156909494272 SHARED_DATA_REF 14942348378112) itemoff 14571 itemsize 4
shared data backref count 67
item 53 key (156909494272 SHARED_DATA_REF 14942366138368) itemoff 14567 itemsize 4
shared data backref count 51
item 54 key (156909494272 SHARED_DATA_REF 14942384799744) itemoff 14563 itemsize 4
shared data backref count 64
item 55 key (156909494272 SHARED_DATA_REF 14978234613760) itemoff 14559 itemsize 4
shared data backref count 61
item 56 key (156909494272 SHARED_DATA_REF 14978246459392) itemoff 14555 itemsize 4
shared data backref count 56
item 57 key (156909494272 SHARED_DATA_REF 14978256879616) itemoff 14551 itemsize 4
shared data backref count 75
item 58 key (156909494272 SHARED_DATA_REF 15001465749504) itemoff 14547 itemsize 4
shared data backref count 77
item 59 key (156909494272 SHARED_DATA_REF 18215010877440) itemoff 14543 itemsize 4
shared data backref count 79
item 60 key (156909494272 SHARED_DATA_REF 18215045660672) itemoff 14539 itemsize 4
shared data backref count 10
item 61 key (156909494272 SHARED_DATA_REF 18215099023360) itemoff 14535 itemsize 4
shared data backref count 56
item 62 key (156909494272 SHARED_DATA_REF 18215114522624) itemoff 14531 itemsize 4
shared data backref count 70
item 63 key (156909494272 SHARED_DATA_REF 18215129874432) itemoff 14527 itemsize 4
shared data backref count 68
item 64 key (156909494272 SHARED_DATA_REF 18215130267648) itemoff 14523 itemsize 4
shared data backref count 72
item 65 key (156909494272 SHARED_DATA_REF 18215136264192) itemoff 14519 itemsize 4
shared data backref count 64
item 66 key (156909494272 SHARED_DATA_REF 18215138623488) itemoff 14515 itemsize 4
shared data backref count 72
item 67 key (156909494272 SHARED_DATA_REF 18215188414464) itemoff 14511 itemsize 4
shared data backref count 58
item 68 key (156909494272 SHARED_DATA_REF 18215188447232) itemoff 14507 itemsize 4
shared data backref count 74
item 69 key (156909494272 SHARED_DATA_REF 18215188529152) itemoff 14503 itemsize 4
shared data backref count 69
item 70 key (156909494272 SHARED_DATA_REF 18215204896768) itemoff 14499 itemsize 4
shared data backref count 67
item 71 key (156909494272 SHARED_DATA_REF 18215228358656) itemoff 14495 itemsize 4
shared data backref count 68
item 72 key (156909494272 SHARED_DATA_REF 18215228899328) itemoff 14491 itemsize 4
shared data backref count 81
item 73 key (156909494272 SHARED_DATA_REF 18215240892416) itemoff 14487 itemsize 4
shared data backref count 78
item 74 key (156909494272 SHARED_DATA_REF 18215244251136) itemoff 14483 itemsize 4
shared data backref count 58
item 75 key (156909494272 SHARED_DATA_REF 18215244365824) itemoff 14479 itemsize 4
shared data backref count 63
item 76 key (156909494272 SHARED_DATA_REF 18215252770816) itemoff 14475 itemsize 4
shared data backref count 76
item 77 key (156909494272 SHARED_DATA_REF 18215264337920) itemoff 14471 itemsize 4
shared data backref count 76
item 78 key (156909494272 SHARED_DATA_REF 18215270055936) itemoff 14467 itemsize 4
shared data backref count 73
item 79 key (156909494272 SHARED_DATA_REF 18215290601472) itemoff 14463 itemsize 4
shared data backref count 63
item 80 key (156909494272 SHARED_DATA_REF 18215290617856) itemoff 14459 itemsize 4
shared data backref count 54
item 81 key (156909494272 SHARED_DATA_REF 18244453154816) itemoff 14455 itemsize 4
shared data backref count 79
item 82 key (156909494272 SHARED_DATA_REF 18244454383616) itemoff 14451 itemsize 4
shared data backref count 71
item 83 key (156909494272 SHARED_DATA_REF 18249494151168) itemoff 14447 itemsize 4
shared data backref count 79
item 84 key (156909494272 SHARED_DATA_REF 18249500721152) itemoff 14443 itemsize 4
shared data backref count 71
item 85 key (156909494272 SHARED_DATA_REF 18249523789824) itemoff 14439 itemsize 4
shared data backref count 51
item 86 key (156909494272 SHARED_DATA_REF 18249586802688) itemoff 14435 itemsize 4
shared data backref count 68
item 87 key (156909494272 SHARED_DATA_REF 18249587703808) itemoff 14431 itemsize 4
shared data backref count 70
item 88 key (156909494272 SHARED_DATA_REF 18249588178944) itemoff 14427 itemsize 4
shared data backref count 72
item 89 key (156909494272 SHARED_DATA_REF 18249591291904) itemoff 14423 itemsize 4
shared data backref count 67
item 90 key (156909494272 SHARED_DATA_REF 18249598238720) itemoff 14419 itemsize 4
shared data backref count 74
item 91 key (156909494272 SHARED_DATA_REF 18249602285568) itemoff 14415 itemsize 4
shared data backref count 79
item 92 key (156909494272 SHARED_DATA_REF 18249611378688) itemoff 14411 itemsize 4
shared data backref count 65
item 93 key (156909494272 SHARED_DATA_REF 18249613082624) itemoff 14407 itemsize 4
shared data backref count 55
item 94 key (156909494272 SHARED_DATA_REF 18249642229760) itemoff 14403 itemsize 4
shared data backref count 75
item 95 key (156909494272 SHARED_DATA_REF 18249643458560) itemoff 14399 itemsize 4
shared data backref count 68
item 96 key (156909494272 SHARED_DATA_REF 18250800021504) itemoff 14395 itemsize 4
shared data backref count 79
item 97 key (156909494272 SHARED_DATA_REF 18250814963712) itemoff 14391 itemsize 4
shared data backref count 71
item 98 key (156909494272 SHARED_DATA_REF 18252047237120) itemoff 14387 itemsize 4
shared data backref count 55
item 99 key (156909494272 SHARED_DATA_REF 18252132515840) itemoff 14383 itemsize 4
shared data backref count 68
item 100 key (156909494272 SHARED_DATA_REF 18252134236160) itemoff 14379 itemsize 4
shared data backref count 72
item 101 key (156909494272 SHARED_DATA_REF 18252274827264) itemoff 14375 itemsize 4
shared data backref count 68
item 102 key (156909494272 SHARED_DATA_REF 18252313460736) itemoff 14371 itemsize 4
shared data backref count 67
item 103 key (156909494272 SHARED_DATA_REF 18252335906816) itemoff 14367 itemsize 4
shared data backref count 79
item 104 key (156909494272 SHARED_DATA_REF 18252336742400) itemoff 14363 itemsize 4
shared data backref count 74
item 105 key (156909494272 SHARED_DATA_REF 18254150631424) itemoff 14359 itemsize 4
shared data backref count 56
item 106 key (156909494272 SHARED_DATA_REF 18254342537216) itemoff 14355 itemsize 4
shared data backref count 67
item 107 key (156909494272 SHARED_DATA_REF 18255671017472) itemoff 14351 itemsize 4
shared data backref count 72
item 108 key (156909494272 SHARED_DATA_REF 18255806038016) itemoff 14347 itemsize 4
shared data backref count 69
item 109 key (156909494272 SHARED_DATA_REF 18255821996032) itemoff 14343 itemsize 4
shared data backref count 67
item 110 key (156909494272 SHARED_DATA_REF 18256006414336) itemoff 14339 itemsize 4
shared data backref count 79
item 111 key (156909494272 SHARED_DATA_REF 18256021012480) itemoff 14335 itemsize 4
shared data backref count 74
item 112 key (156909494272 SHARED_DATA_REF 18260113752064) itemoff 14331 itemsize 4
shared data backref count 75
item 113 key (156909494272 SHARED_DATA_REF 18260113883136) itemoff 14327 itemsize 4
shared data backref count 65
item 114 key (156909494272 SHARED_DATA_REF 18260114849792) itemoff 14323 itemsize 4
shared data backref count 51
item 115 key (156909494272 SHARED_DATA_REF 18260115013632) itemoff 14319 itemsize 4
shared data backref count 70
item 116 key (156909494272 SHARED_DATA_REF 18261625552896) itemoff 14315 itemsize 4
shared data backref count 75
item 117 key (156909494272 SHARED_DATA_REF 18261631107072) itemoff 14311 itemsize 4
shared data backref count 65
item 118 key (156909494272 SHARED_DATA_REF 18261652078592) itemoff 14307 itemsize 4
shared data backref count 52
item 119 key (156909494272 SHARED_DATA_REF 18261658025984) itemoff 14303 itemsize 4
shared data backref count 70
item 120 key (156964814848 EXTENT_ITEM 7487488) itemoff 13856 itemsize 447
refs 2505 gen 31575 flags DATA
extent data backref root 21826 objectid 374857 offset 290496512 count 192
extent data backref root 21872 objectid 374857 offset 290496512 count 192
extent data backref root 23012 objectid 374857 offset 290496512 count 193
extent data backref root 22911 objectid 374857 offset 290496512 count 192
extent data backref root 23174 objectid 374857 offset 290496512 count 193
extent data backref root 23052 objectid 374857 offset 290496512 count 193
extent data backref root 23134 objectid 374857 offset 290496512 count 193
extent data backref root 21462 objectid 374857 offset 290496512 count 192
extent data backref root 21503 objectid 374857 offset 290496512 count 192
extent data backref root 23094 objectid 374857 offset 290496512 count 193
extent data backref root 21641 objectid 374857 offset 290496512 count 192
shared data backref parent 18215389659136 count 55
shared data backref parent 18215388102656 count 63
shared data backref parent 18215294795776 count 69
shared data backref parent 18215244365824 count 7
shared data backref parent 14978251440128 count 55
shared data backref parent 14978250768384 count 63
shared data backref parent 14978248212480 count 69
shared data backref parent 14978246459392 count 7
item 121 key (156972302336 EXTENT_ITEM 8192) itemoff 13487 itemsize 369
refs 13 gen 31575 flags DATA
extent data backref root 21826 objectid 374857 offset 297984000 count 1
extent data backref root 21872 objectid 374857 offset 297984000 count 1
extent data backref root 23012 objectid 374857 offset 297984000 count 1
extent data backref root 22911 objectid 374857 offset 297984000 count 1
extent data backref root 23174 objectid 374857 offset 297984000 count 1
extent data backref root 23052 objectid 374857 offset 297984000 count 1
extent data backref root 23134 objectid 374857 offset 297984000 count 1
extent data backref root 21462 objectid 374857 offset 297984000 count 1
extent data backref root 21503 objectid 374857 offset 297984000 count 1
extent data backref root 23094 objectid 374857 offset 297984000 count 1
extent data backref root 21641 objectid 374857 offset 297984000 count 1
shared data backref parent 18215389659136 count 1
shared data backref parent 14978251440128 count 1
item 122 key (156972310528 EXTENT_ITEM 102400) itemoff 13450 itemsize 37
refs 1 gen 31631 flags DATA
shared data backref parent 17763118120960 count 1
item 123 key (156972412928 EXTENT_ITEM 102400) itemoff 13413 itemsize 37
refs 1 gen 31631 flags DATA
shared data backref parent 17763118120960 count 1
item 124 key (156972515328 EXTENT_ITEM 102400) itemoff 13376 itemsize 37
refs 1 gen 31631 flags DATA
shared data backref parent 17763118120960 count 1
item 125 key (156972617728 EXTENT_ITEM 102400) itemoff 13339 itemsize 37
refs 1 gen 31631 flags DATA
shared data backref parent 17763118120960 count 1
item 126 key (156972720128 EXTENT_ITEM 98304) itemoff 13302 itemsize 37
--
item 30 key (1569094942720 EXTENT_ITEM 24576) itemoff 14678 itemsize 53
refs 1 gen 97048 flags DATA
extent data backref root 21462 objectid 374857 offset 90849280 count 1
item 31 key (1569094967296 EXTENT_ITEM 94208) itemoff 14625 itemsize 53
refs 1 gen 94313 flags DATA
extent data backref root 19852 objectid 67985779 offset 0 count 1
item 32 key (1569095061504 EXTENT_ITEM 299008) itemoff 14572 itemsize 53
refs 1 gen 136347 flags DATA
extent data backref root 19852 objectid 129958928 offset 0 count 1
item 33 key (1569095360512 EXTENT_ITEM 40960) itemoff 14519 itemsize 53
refs 1 gen 95673 flags DATA
extent data backref root 19852 objectid 70844817 offset 0 count 1
item 34 key (1569095475200 EXTENT_ITEM 36864) itemoff 14466 itemsize 53
refs 1 gen 134400 flags DATA
extent data backref root 19852 objectid 123134122 offset 0 count 1
item 35 key (1569095536640 EXTENT_ITEM 16384) itemoff 14413 itemsize 53
refs 1 gen 134270 flags DATA
extent data backref root 19852 objectid 122565390 offset 0 count 1
item 36 key (1569095557120 EXTENT_ITEM 286720) itemoff 14360 itemsize 53
refs 1 gen 97139 flags DATA
extent data backref root 19852 objectid 75280458 offset 0 count 1
item 37 key (1569095843840 EXTENT_ITEM 8192) itemoff 14323 itemsize 37
refs 1 gen 88571 flags DATA
shared data backref parent 14909069754368 count 1
item 38 key (1569095852032 EXTENT_ITEM 122880) itemoff 14270 itemsize 53
refs 1 gen 76214 flags DATA
extent data backref root 19852 objectid 35849748 offset 0 count 1
item 39 key (1569095974912 EXTENT_ITEM 8192) itemoff 14220 itemsize 50
refs 2 gen 88571 flags DATA
shared data backref parent 18214784647168 count 1
shared data backref parent 14909069754368 count 1
item 40 key (1569095983104 EXTENT_ITEM 8192) itemoff 14170 itemsize 50
refs 2 gen 88571 flags DATA
shared data backref parent 18214784647168 count 1
shared data backref parent 14909069754368 count 1
item 41 key (1569096114176 EXTENT_ITEM 286720) itemoff 14117 itemsize 53
refs 1 gen 95205 flags DATA
extent data backref root 19852 objectid 69436429 offset 0 count 1
item 42 key (1569096400896 EXTENT_ITEM 122880) itemoff 14064 itemsize 53
refs 1 gen 92983 flags DATA
extent data backref root 19852 objectid 66052505 offset 0 count 1
item 43 key (1569096523776 EXTENT_ITEM 270336) itemoff 14011 itemsize 53
refs 1 gen 94720 flags DATA
extent data backref root 19852 objectid 68432863 offset 0 count 1
item 44 key (1569097105408 EXTENT_ITEM 45056) itemoff 13958 itemsize 53
refs 1 gen 96865 flags DATA
extent data backref root 19852 objectid 74357290 offset 0 count 1
item 45 key (1569097150464 EXTENT_ITEM 8192) itemoff 13905 itemsize 53
refs 1 gen 97048 flags DATA
extent data backref root 21462 objectid 374857 offset 99221504 count 1
item 46 key (1569097158656 EXTENT_ITEM 110592) itemoff 13868 itemsize 37
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 7F55D5F27AAF9D08
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: So, does btrfs check lowmem take days? weeks?
2018-06-29 6:43 ` Marc MERLIN
@ 2018-07-01 23:22 ` Marc MERLIN
2018-07-02 2:02 ` Su Yue
0 siblings, 1 reply; 65+ messages in thread
From: Marc MERLIN @ 2018-07-01 23:22 UTC (permalink / raw)
To: Su Yue; +Cc: Qu Wenruo, linux-btrfs
On Thu, Jun 28, 2018 at 11:43:54PM -0700, Marc MERLIN wrote:
> On Fri, Jun 29, 2018 at 02:32:44PM +0800, Su Yue wrote:
> > > > https://github.com/Damenly/btrfs-progs/tree/tmp1
> > >
> > > Not sure if I understand what you meant, here.
> > >
> > Sorry for my unclear words.
> > Simply speaking, I suggest you stop the currently running check.
> > Then clone the above branch, compile the binary, and run
> > 'btrfs check --mode=lowmem $dev'.
>
> I understand, I'll build and try it.
>
> > > This filesystem is trash to me and will require over a week to rebuild
> > > manually if I can't repair it.
> >
> > Understood your anxiety, a log of check without '--repair' will help
> > us to make clear what's wrong with your filesystem.
>
> Ok, I'll run your new code without repair and report back. It will
> likely take over a day though.
Well, it got stuck for over a day, and then I had to reboot :(
saruman:/var/local/src/btrfs-progs.sy# git remote -v
origin https://github.com/Damenly/btrfs-progs.git (fetch)
origin https://github.com/Damenly/btrfs-progs.git (push)
saruman:/var/local/src/btrfs-progs.sy# git branch
master
* tmp1
saruman:/var/local/src/btrfs-progs.sy# git pull
Already up to date.
saruman:/var/local/src/btrfs-progs.sy# make
Making all in Documentation
make[1]: Nothing to be done for 'all'.
However, it still got stuck here:
gargamel:~# btrfs check --mode=lowmem -p /dev/mapper/dshelf2
Checking filesystem on /dev/mapper/dshelf2
UUID: 0f1a0c9f-4e54-4fa7-8736-fd50818ff73d
ERROR: extent[84302495744, 69632] referencer count mismatch (root: 21872, owner: 374857, offset: 3407872) wanted: 2, have: 3
ERROR: extent[84302495744, 69632] referencer count mismatch (root: 22911, owner: 374857, offset: 3407872) wanted: 2, have: 4
ERROR: extent[125712527360, 12214272] referencer count mismatch (root: 21872, owner: 374857, offset: 114540544) wanted: 180, have: 181
ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 21872, owner: 374857, offset: 126754816) wanted: 67, have: 68
ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 22911, owner: 374857, offset: 126754816) wanted: 67, have: 115
ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 21872, owner: 374857, offset: 131866624) wanted: 114, have: 115
ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 22911, owner: 374857, offset: 131866624) wanted: 114, have: 143
ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 21872, owner: 374857, offset: 148234240) wanted: 301, have: 302
ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 22911, owner: 374857, offset: 148234240) wanted: 355, have: 433
ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 21872, owner: 374857, offset: 180371456) wanted: 160, have: 161
ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 22911, owner: 374857, offset: 180371456) wanted: 161, have: 240
ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 21872, owner: 374857, offset: 192200704) wanted: 169, have: 170
ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 22911, owner: 374857, offset: 192200704) wanted: 171, have: 251
ERROR: extent[150850146304, 17522688] referencer count mismatch (root: 21872, owner: 374857, offset: 217653248) wanted: 347, have: 348
ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 22911, owner: 374857, offset: 235175936) wanted: 1, have: 1449
ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 21872, owner: 374857, offset: 235175936) wanted: 1, have: 556
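A quick way to see which roots a referencer-mismatch list like the one above concentrates on is to tally the errors per root id. A sketch with standard tools, run here on a few abbreviated sample lines rather than the full log:

```shell
# Tally 'referencer count mismatch' errors per root id.
log=$(mktemp)
cat > "$log" <<'EOF'
ERROR: extent[84302495744, 69632] referencer count mismatch (root: 21872, owner: 374857, offset: 3407872) wanted: 2, have: 3
ERROR: extent[84302495744, 69632] referencer count mismatch (root: 22911, owner: 374857, offset: 3407872) wanted: 2, have: 4
ERROR: extent[125712527360, 12214272] referencer count mismatch (root: 21872, owner: 374857, offset: 114540544) wanted: 180, have: 181
EOF
# Extract the root id from each ERROR line and count occurrences.
sed -n 's/.*referencer count mismatch (root: \([0-9]*\),.*/\1/p' "$log" | sort | uniq -c
```

In the log above the mismatches cluster on roots 21872 and 22911 with the same owner (374857), i.e. two snapshots of the same subvolume, which is what the later discussion of heavily shared extents is about.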
What should I try next?
Thanks,
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 7F55D5F27AAF9D08
* Re: So, does btrfs check lowmem take days? weeks?
2018-07-01 23:22 ` Marc MERLIN
@ 2018-07-02 2:02 ` Su Yue
2018-07-02 3:22 ` Marc MERLIN
0 siblings, 1 reply; 65+ messages in thread
From: Su Yue @ 2018-07-02 2:02 UTC (permalink / raw)
To: Marc MERLIN; +Cc: Qu Wenruo, linux-btrfs
On 07/02/2018 07:22 AM, Marc MERLIN wrote:
> On Thu, Jun 28, 2018 at 11:43:54PM -0700, Marc MERLIN wrote:
>> On Fri, Jun 29, 2018 at 02:32:44PM +0800, Su Yue wrote:
>>>>> https://github.com/Damenly/btrfs-progs/tree/tmp1
>>>>
>>>> Not sure if I understand what you meant, here.
>>>>
>>> Sorry for my unclear words.
>>> Simply speaking, I suggest you stop the currently running check.
>>> Then clone the above branch, compile the binary, and run
>>> 'btrfs check --mode=lowmem $dev'.
>>
>> I understand, I'll build and try it.
>>
>>>> This filesystem is trash to me and will require over a week to rebuild
>>>> manually if I can't repair it.
>>>
>>> Understood your anxiety, a log of check without '--repair' will help
>>> us to make clear what's wrong with your filesystem.
>>
>> Ok, I'll run your new code without repair and report back. It will
>> likely take over a day though.
>
> Well, it got stuck for over a day, and then I had to reboot :(
>
> saruman:/var/local/src/btrfs-progs.sy# git remote -v
> origin https://github.com/Damenly/btrfs-progs.git (fetch)
> origin https://github.com/Damenly/btrfs-progs.git (push)
> saruman:/var/local/src/btrfs-progs.sy# git branch
> master
> * tmp1
> saruman:/var/local/src/btrfs-progs.sy# git pull
> Already up to date.
> saruman:/var/local/src/btrfs-progs.sy# make
> Making all in Documentation
> make[1]: Nothing to be done for 'all'.
>
> However, it still got stuck here:
Thanks, I saw. Some clues found.
Could you try the following dumps? They shouldn't cost much time.
#btrfs inspect dump-tree -t 21872 <device> | grep -C 50 "374857 EXTENT_DATA "
#btrfs inspect dump-tree -t 22911 <device> | grep -C 50 "374857 EXTENT_DATA "
Thanks,
Su
> gargamel:~# btrfs check --mode=lowmem -p /dev/mapper/dshelf2
> Checking filesystem on /dev/mapper/dshelf2
> UUID: 0f1a0c9f-4e54-4fa7-8736-fd50818ff73d
> ERROR: extent[84302495744, 69632] referencer count mismatch (root: 21872, owner: 374857, offset: 3407872) wanted: 2, have: 3
> ERROR: extent[84302495744, 69632] referencer count mismatch (root: 22911, owner: 374857, offset: 3407872) wanted: 2, have: 4
> ERROR: extent[125712527360, 12214272] referencer count mismatch (root: 21872, owner: 374857, offset: 114540544) wanted: 180, have: 181
> ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 21872, owner: 374857, offset: 126754816) wanted: 67, have: 68
> ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 22911, owner: 374857, offset: 126754816) wanted: 67, have: 115
> ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 21872, owner: 374857, offset: 131866624) wanted: 114, have: 115
> ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 22911, owner: 374857, offset: 131866624) wanted: 114, have: 143
> ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 21872, owner: 374857, offset: 148234240) wanted: 301, have: 302
> ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 22911, owner: 374857, offset: 148234240) wanted: 355, have: 433
> ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 21872, owner: 374857, offset: 180371456) wanted: 160, have: 161
> ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 22911, owner: 374857, offset: 180371456) wanted: 161, have: 240
> ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 21872, owner: 374857, offset: 192200704) wanted: 169, have: 170
> ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 22911, owner: 374857, offset: 192200704) wanted: 171, have: 251
> ERROR: extent[150850146304, 17522688] referencer count mismatch (root: 21872, owner: 374857, offset: 217653248) wanted: 347, have: 348
> ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 22911, owner: 374857, offset: 235175936) wanted: 1, have: 1449
> ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 21872, owner: 374857, offset: 235175936) wanted: 1, have: 556
>
> What should I try next?
>
> Thanks,
> Marc
>
* Re: So, does btrfs check lowmem take days? weeks?
2018-07-02 2:02 ` Su Yue
@ 2018-07-02 3:22 ` Marc MERLIN
2018-07-02 6:22 ` Su Yue
0 siblings, 1 reply; 65+ messages in thread
From: Marc MERLIN @ 2018-07-02 3:22 UTC (permalink / raw)
To: Su Yue; +Cc: Qu Wenruo, linux-btrfs
On Mon, Jul 02, 2018 at 10:02:33AM +0800, Su Yue wrote:
> Could you try the following dumps? They shouldn't cost much time.
>
> #btrfs inspect dump-tree -t 21872 <device> | grep -C 50 "374857 EXTENT_DATA "
>
> #btrfs inspect dump-tree -t 22911 <device> | grep -C 50 "374857 EXTENT_DATA "
Ok, that's 29MB, so it doesn't fit on pastebin:
http://marc.merlins.org/tmp/dshelf2_inspect.txt
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/
* Re: So, does btrfs check lowmem take days? weeks?
2018-07-02 3:22 ` Marc MERLIN
@ 2018-07-02 6:22 ` Su Yue
2018-07-02 14:05 ` Marc MERLIN
0 siblings, 1 reply; 65+ messages in thread
From: Su Yue @ 2018-07-02 6:22 UTC (permalink / raw)
To: Marc MERLIN; +Cc: Qu Wenruo, linux-btrfs
On 07/02/2018 11:22 AM, Marc MERLIN wrote:
> On Mon, Jul 02, 2018 at 10:02:33AM +0800, Su Yue wrote:
>> Could you try the following dumps? They shouldn't cost much time.
>>
>> #btrfs inspect dump-tree -t 21872 <device> | grep -C 50 "374857 EXTENT_DATA "
>>
>> #btrfs inspect dump-tree -t 22911 <device> | grep -C 50 "374857 EXTENT_DATA "
>
> Ok, that's 29MB, so it doesn't fit on pastebin:
> http://marc.merlins.org/tmp/dshelf2_inspect.txt
>
Sorry Marc. After offline communication with Qu, both
of us think the filesystem is hard to repair.
The filesystem is too large to debug step by step;
each check-and-debug round is too expensive,
and it has already cost several days.
Sadly, I am afraid that you have to recreate the filesystem
and redo your backups. :(
Sorry again, and thanks for your reports and patience.
Su
> Marc
>
* Re: So, does btrfs check lowmem take days? weeks?
2018-07-02 6:22 ` Su Yue
@ 2018-07-02 14:05 ` Marc MERLIN
2018-07-02 14:42 ` Qu Wenruo
0 siblings, 1 reply; 65+ messages in thread
From: Marc MERLIN @ 2018-07-02 14:05 UTC (permalink / raw)
To: Su Yue; +Cc: Qu Wenruo, linux-btrfs
On Mon, Jul 02, 2018 at 02:22:20PM +0800, Su Yue wrote:
> > Ok, that's 29MB, so it doesn't fit on pastebin:
> > http://marc.merlins.org/tmp/dshelf2_inspect.txt
> >
> Sorry Marc. After offline communication with Qu, both
> of us think the filesystem is hard to repair.
> The filesystem is too large to debug step by step;
> each check-and-debug round is too expensive,
> and it has already cost several days.
>
> Sadly, I am afraid that you have to recreate the filesystem
> and redo your backups. :(
>
> Sorry again, and thanks for your reports and patience.
I appreciate your help. Honestly, I only wanted to help you find why the
tools aren't working. Fixing filesystems by hand (and remotely via email
on top of that) is way too time consuming, like you said.
Is the btrfs design flawed in a way that repair tools just cannot repair
on their own?
I understand that data can be lost, but I don't understand how the tools
just either keep crashing for me, go in infinite loops, or otherwise
fail to give me back a stable filesystem, even if some data is missing
after that.
Thanks,
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 7F55D5F27AAF9D08
* Re: So, does btrfs check lowmem take days? weeks?
2018-07-02 14:05 ` Marc MERLIN
@ 2018-07-02 14:42 ` Qu Wenruo
2018-07-02 15:18 ` how to best segment a big block device in resizeable btrfs filesystems? Marc MERLIN
` (2 more replies)
0 siblings, 3 replies; 65+ messages in thread
From: Qu Wenruo @ 2018-07-02 14:42 UTC (permalink / raw)
To: Marc MERLIN, Su Yue; +Cc: linux-btrfs
On 2018年07月02日 22:05, Marc MERLIN wrote:
> On Mon, Jul 02, 2018 at 02:22:20PM +0800, Su Yue wrote:
>>> Ok, that's 29MB, so it doesn't fit on pastebin:
>>> http://marc.merlins.org/tmp/dshelf2_inspect.txt
>>>
>> Sorry Marc. After offline communication with Qu, both
>> of us think the filesystem is hard to repair.
>> The filesystem is too large to debug step by step;
>> each check-and-debug round is too expensive,
>> and it has already cost several days.
>>
>> Sadly, I am afraid that you have to recreate the filesystem
>> and redo your backups. :(
>>
>> Sorry again, and thanks for your reports and patience.
>
> I appreciate your help. Honestly I only wanted to help you find why the
> tools aren't working. Fixing filesystems by hand (and remotely via Email
> on top of that), is way too time consuming like you said.
>
> Is the btrfs design flawed in a way that repair tools just cannot repair
> on their own?
For short, and for your case: yes, you can consider the repair tools
just garbage and shouldn't use them on any production system.
For the full answer, it depends (but for most real-world cases, they're
still flawed).
We have small, crafted images as test cases, which btrfs check can
repair without any problem.
But such images are *SMALL* and only have *ONE* type of corruption,
which can't represent real-world cases at all.
> I understand that data can be lost, but I don't understand how the tools
> just either keep crashing for me, go in infinite loops, or otherwise
> fail to give me back a stable filesystem, even if some data is missing
> after that.
There are several reasons the repair tools can't help much here:
1) Too large a fs (especially too many snapshots)
The use case (too many snapshots and shared extents, with a lot of
extents shared over 1000 times) is in fact a super large challenge
for lowmem mode check/repair.
It needs O(n^2) or even O(n^3) work to check each backref, which hugely
slows the progress and makes it hard for us to locate the real bug.
2) Corruption in the extent tree, while our objective is to mount RW
The extent tree is almost useless if we just want to read data.
But when we do any write, we need it, and if it goes wrong even a
tiny bit, your fs could be damaged really badly.
For other corruption, like some fs tree corruption, we could do
something to discard some corrupted files; but if it's the extent
tree, we either mount RO and grab anything we have, or hope the
almost-never-working --init-extent-tree can work (that's mostly a
miracle).
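To make point 1 concrete: every "extent data backref ... count N" line in the dump-tree output posted earlier is a set of references that lowmem check must verify by walking the fs tree again, for every extent, for every root. A small sketch that just tallies the per-root reference counts from dump-tree-style lines (the sample lines are adapted from the dump above; the tally is only an illustration of the workload, not part of btrfs-progs):

```shell
# Sum 'extent data backref root R ... count N' entries per root.
# Each counted reference is something lowmem check re-verifies.
dump=$(mktemp)
cat > "$dump" <<'EOF'
extent data backref root 21826 objectid 374857 offset 290496512 count 192
extent data backref root 21872 objectid 374857 offset 290496512 count 192
extent data backref root 23012 objectid 374857 offset 290496512 count 193
EOF
awk '/extent data backref/ { refs[$5] += $11 } END { for (r in refs) print r, refs[r] }' "$dump" | sort -n
```

With 11 roots each holding ~190 references to one extent, as in the real dump, the cross-checking work multiplies quickly, which is where the O(n^2)/O(n^3) behavior bites.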
So, I feel very sorry that we can't provide enough help for your case.
But still, we hope to provide some tips for your next build if you
still want to choose btrfs.
1) Don't keep too many snapshots.
Really, this is the core.
For send/receive backup, IIRC it only needs the parent subvolume to
exist; there is no need to keep the whole history of all those
snapshots.
Keeping the number of snapshots minimal greatly improves the
possibility of a successful repair (whether by manual patch or by
check --repair).
Normally I would suggest 4 hourly snapshots, 7 daily snapshots, and
12 monthly snapshots.
2) Don't keep unrelated snapshots in one btrfs.
I totally understand that maintaining separate btrfs filesystems
would hugely add maintenance pressure, but as explained, all
snapshots share one fragile extent tree.
If each fs has its own extent tree, it's less likely that a single
extent tree corruption takes down the whole fs.
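A retention scheme like the hourly/daily/monthly one suggested in tip 1 is easy to automate. A dry-run sketch; the snapshot directory layout and naming are assumptions, and `echo` stands in for the real `btrfs subvolume delete`:

```shell
# Keep only the newest KEEP snapshots; names are chosen so that lexical
# order equals chronological order. 'echo' keeps this a dry run.
KEEP=4
snapdir=$(mktemp -d)
for ts in 2018-06-29_00 2018-06-29_06 2018-06-29_12 2018-06-29_18 \
          2018-06-30_00 2018-06-30_06; do
  mkdir "$snapdir/home_$ts"     # stand-ins for real snapshots
done
# All but the newest KEEP entries are candidates for deletion.
ls -1 "$snapdir" | sort | head -n -"$KEEP" | while read -r snap; do
  echo btrfs subvolume delete "$snapdir/$snap"
done
```

Dropping the `echo` (and pointing `snapdir` at a real snapshot directory) turns this into an actual pruning pass; running it from cron after each `btrfs send` keeps the snapshot count bounded.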
Thanks,
Qu
>
> Thanks,
> Marc
>
* Re: how to best segment a big block device in resizeable btrfs filesystems?
2018-07-02 14:42 ` Qu Wenruo
@ 2018-07-02 15:18 ` Marc MERLIN
2018-07-02 16:59 ` Austin S. Hemmelgarn
` (2 more replies)
2018-07-02 15:19 ` So, does btrfs check lowmem take days? weeks? Marc MERLIN
2018-07-03 0:31 ` Chris Murphy
2 siblings, 3 replies; 65+ messages in thread
From: Marc MERLIN @ 2018-07-02 15:18 UTC (permalink / raw)
To: Qu Wenruo; +Cc: Su Yue, linux-btrfs
Hi Qu,
I'll split this part into a new thread:
> 2) Don't keep unrelated snapshots in one btrfs.
> I totally understand that maintaining separate btrfs filesystems
> would hugely add maintenance pressure, but as explained, all
> snapshots share one fragile extent tree.
Yes, I understand that this is what I should do given what you
explained.
My main problem is knowing how to segment things so I don't end up with
filesystems that are full while others are almost empty :)
Am I supposed to put LVM thin volumes underneath so that I can share
the same single 10TB raid5?
If I do this, I would have
software raid 5 < dmcrypt < bcache < lvm < btrfs
That's a lot of layers, and that's also starting to make me nervous :)
Is there any other way that does not involve me creating smaller block
devices for multiple btrfs filesystems and hope that they are the right
size because I won't be able to change it later?
Thanks,
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 7F55D5F27AAF9D08
* Re: So, does btrfs check lowmem take days? weeks?
2018-07-02 14:42 ` Qu Wenruo
2018-07-02 15:18 ` how to best segment a big block device in resizeable btrfs filesystems? Marc MERLIN
@ 2018-07-02 15:19 ` Marc MERLIN
2018-07-02 17:08 ` Austin S. Hemmelgarn
2018-07-02 17:33 ` Roman Mamedov
2018-07-03 0:31 ` Chris Murphy
2 siblings, 2 replies; 65+ messages in thread
From: Marc MERLIN @ 2018-07-02 15:19 UTC (permalink / raw)
To: Qu Wenruo; +Cc: Su Yue, linux-btrfs
Hi Qu,
thanks for the detailled and honest answer.
A few comments inline.
On Mon, Jul 02, 2018 at 10:42:40PM +0800, Qu Wenruo wrote:
> For the full answer, it depends (but for most real-world cases, they're
> still flawed).
> We have small, crafted images as test cases, which btrfs check can
> repair without any problem.
> But such images are *SMALL* and only have *ONE* type of corruption,
> which can't represent real-world cases at all.
right, they're just unittest images, I understand.
> 1) Too large a fs (especially too many snapshots)
> The use case (too many snapshots and shared extents, with a lot of
> extents shared over 1000 times) is in fact a super large challenge
> for lowmem mode check/repair.
> It needs O(n^2) or even O(n^3) work to check each backref, which hugely
> slows the progress and makes it hard for us to locate the real bug.
So, the non lowmem version would work better, but it's a problem if it
doesn't fit in RAM.
I've always considered it a grave bug that btrfs check repair can use so
much kernel memory that it will crash the entire system. This should not
be possible.
While it won't help me here, can btrfs check be improved not to suck all
the kernel memory, and ideally even allow using swap space if the RAM is
not enough?
Is btrfs check regular mode still being maintained? I think it's still
better than lowmem, correct?
> 2) Corruption in the extent tree, while our objective is to mount RW
> The extent tree is almost useless if we just want to read data.
> But when we do any write, we need it, and if it goes wrong even a
> tiny bit, your fs could be damaged really badly.
>
> For other corruption, like some fs tree corruption, we could do
> something to discard some corrupted files; but if it's the extent
> tree, we either mount RO and grab anything we have, or hope the
> almost-never-working --init-extent-tree can work (that's mostly a
> miracle).
I understand that it's the weak point of btrfs, thanks for explaining.
> 1) Don't keep too many snapshots.
> Really, this is the core.
> For send/receive backup, IIRC it only needs the parent subvolume to
> exist; there is no need to keep the whole history of all those
> snapshots.
You are correct on history. The reason I keep history is because I may
want to recover a file from last week or 2 weeks ago after I finally
notice that it's gone.
I have terabytes of space on the backup server, so it's easier to keep
history there than on the client which may not have enough space to keep
a month's worth of history.
As you know, back when we did tape backups, we also kept history of at
least several weeks (usually several months, but that's too much for
btrfs snapshots).
> Keeping the number of snapshots minimal greatly improves the
> possibility of a successful repair (whether by manual patch or by
> check --repair).
> Normally I would suggest 4 hourly snapshots, 7 daily snapshots, 12
> monthly snapshots.
I actually have fewer snapshots than this per filesystem, but I backup
more than 10 filesystems.
If I used as many snapshots as you recommend, that would already be 230
snapshots for 10 filesystems :)
Thanks,
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 7F55D5F27AAF9D08
* Re: how to best segment a big block device in resizeable btrfs filesystems?
2018-07-02 15:18 ` how to best segment a big block device in resizeable btrfs filesystems? Marc MERLIN
@ 2018-07-02 16:59 ` Austin S. Hemmelgarn
2018-07-02 17:34 ` Marc MERLIN
2018-07-03 0:51 ` Paul Jones
2018-07-03 1:37 ` Qu Wenruo
2 siblings, 1 reply; 65+ messages in thread
From: Austin S. Hemmelgarn @ 2018-07-02 16:59 UTC (permalink / raw)
To: Marc MERLIN, Qu Wenruo; +Cc: Su Yue, linux-btrfs
On 2018-07-02 11:18, Marc MERLIN wrote:
> Hi Qu,
>
> I'll split this part into a new thread:
>
>> 2) Don't keep unrelated snapshots in one btrfs.
>> I totally understand that maintaining separate btrfs filesystems
>> would hugely add maintenance pressure, but as explained, all
>> snapshots share one fragile extent tree.
>
> Yes, I understand that this is what I should do given what you
> explained.
> My main problem is knowing how to segment things so I don't end up with
> filesystems that are full while others are almost empty :)
>
> Am I supposed to put LVM thin volumes underneath so that I can share
> the same single 10TB raid5?
Actually, because of the online resize ability in BTRFS, you don't
technically _need_ to use thin provisioning here. It makes the
maintenance a bit easier, but it also adds a much more complicated layer
of indirection than just doing regular volumes.
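With plain (non-thin) LVs, shifting space between two mounted btrfs filesystems is a four-step dance: shrink the filesystem before its LV, grow the LV before its filesystem. A dry-run sketch; the VG/LV names and mount points are hypothetical, and `run` echoes instead of executing:

```shell
# Dry-run: move 100G from backups1 to backups2. Replace 'echo' with
# actual execution (and the names with real ones) to apply.
run() { echo "$@"; }
run btrfs filesystem resize -100G /mnt/backups1   # 1. shrink the fs first
run lvresize -L -100G /dev/vg0/backups1           # 2. then shrink its LV
run lvresize -L +100G /dev/vg0/backups2           # 3. grow the other LV
run btrfs filesystem resize max /mnt/backups2     # 4. then grow its fs
```

The ordering is what matters: a filesystem must never be larger than the device under it, so shrinks go top-down and grows go bottom-up. Both btrfs resize directions work online, which is what makes the non-thin layout workable here.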
>
> If I do this, I would have
> software raid 5 < dmcrypt < bcache < lvm < btrfs
> That's a lot of layers, and that's also starting to make me nervous :)
>
> Is there any other way that does not involve me creating smaller block
> devices for multiple btrfs filesystems and hope that they are the right
> size because I won't be able to change it later?
You could (in theory) merge the LVM and software RAID5 layers, though
that may make handling of the RAID5 layer a bit complicated if you
choose to use thin provisioning (for some reason, LVM is unable to do
on-line checks and rebuilds of RAID arrays that are acting as thin pool
data or metadata).
Alternatively, you could increase your array size, remove the software
RAID layer, and switch to using BTRFS in raid10 mode so that you could
eliminate one of the layers, though that would probably reduce the
effectiveness of bcache (you might want to get a bigger cache device if
you do this).
* Re: So, does btrfs check lowmem take days? weeks?
2018-07-02 15:19 ` So, does btrfs check lowmem take days? weeks? Marc MERLIN
@ 2018-07-02 17:08 ` Austin S. Hemmelgarn
2018-07-02 17:33 ` Roman Mamedov
1 sibling, 0 replies; 65+ messages in thread
From: Austin S. Hemmelgarn @ 2018-07-02 17:08 UTC (permalink / raw)
To: Marc MERLIN, Qu Wenruo; +Cc: Su Yue, linux-btrfs
On 2018-07-02 11:19, Marc MERLIN wrote:
> Hi Qu,
>
> thanks for the detailled and honest answer.
> A few comments inline.
>
> On Mon, Jul 02, 2018 at 10:42:40PM +0800, Qu Wenruo wrote:
>> For the full answer, it depends (but for most real-world cases, they're
>> still flawed).
>> We have small, crafted images as test cases, which btrfs check can
>> repair without any problem.
>> But such images are *SMALL* and only have *ONE* type of corruption,
>> which can't represent real-world cases at all.
>
> right, they're just unittest images, I understand.
>
>> 1) Too large a fs (especially too many snapshots)
>> The use case (too many snapshots and shared extents, with a lot of
>> extents shared over 1000 times) is in fact a super large challenge
>> for lowmem mode check/repair.
>> It needs O(n^2) or even O(n^3) work to check each backref, which hugely
>> slows the progress and makes it hard for us to locate the real bug.
>
> So, the non lowmem version would work better, but it's a problem if it
> doesn't fit in RAM.
> I've always considered it a grave bug that btrfs check repair can use so
> much kernel memory that it will crash the entire system. This should not
> be possible.
> While it won't help me here, can btrfs check be improved not to suck all
> the kernel memory, and ideally even allow using swap space if the RAM is
> not enough?
>
> Is btrfs check regular mode still being maintained? I think it's still
> better than lowmem, correct?
>
>> 2) Corruption in extent tree and our objective is to mount RW
>> Extent tree is almost useless if we just want to read data.
>> But when we do any write, we needs it and if it goes wrong even a
>> tiny bit, your fs could be damaged really badly.
>>
>> For other corruption, like some fs tree corruption, we could do
>> something to discard some corrupted files, but if it's extent tree,
>> we either mount RO and grab anything we have, or hopes the
>> almost-never-working --init-extent-tree can work (that would mostly
>> be a miracle).
>
> I understand that it's the weak point of btrfs, thanks for explaining.
>
>> 1) Don't keep too many snapshots.
>> Really, this is the core.
>> For send/receive backup, IIRC it only needs the parent subvolume
>> exists, there is no need to keep the whole history of all those
>> snapshots.
>
> You are correct on history. The reason I keep history is because I may
> want to recover a file from last week or 2 weeks ago after I finally
> notice that it's gone.
> I have terabytes of space on the backup server, so it's easier to keep
> history there than on the client which may not have enough space to keep
> a month's worth of history.
> As you know, back when we did tape backups, we also kept history of at
> least several weeks (usually several months, but that's too much for
> btrfs snapshots).
Bit of a case-study here, but it may be of interest. We do something
kind of similar where I work for our internal file servers. We've got
daily snapshots of the whole server kept on the server itself for 7 days
(we usually see less than 5% of the total amount of data in changes on
weekdays, and essentially 0 on weekends, so the snapshots rarely take up
more than about 25% of the size of the live data), and then we
additionally do daily backups which we retain for 6 months. I've
written up a short (albeit rather system-specific) script for recovering
old versions of a file that first scans the snapshots, and then pulls it
out of the backups if it's not there. I've found this works remarkably
well for our use case (almost all the data on the file server follows a
WORM access pattern with most of the files being between 100kB and 100MB
in size).
We actually did try moving it all over to BTRFS for a while before we
finally ended up with the setup we currently have, but aside from the
whole issue with massive numbers of snapshots, we found that for us at
least, Amanda actually outperforms BTRFS send/receive for everything
except full backups and uses less storage space (though that last bit is
largely because we use really aggressive compression).
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: So, does btrfs check lowmem take days? weeks?
2018-07-02 15:19 ` So, does btrfs check lowmem take days? weeks? Marc MERLIN
2018-07-02 17:08 ` Austin S. Hemmelgarn
@ 2018-07-02 17:33 ` Roman Mamedov
2018-07-02 17:39 ` Marc MERLIN
1 sibling, 1 reply; 65+ messages in thread
From: Roman Mamedov @ 2018-07-02 17:33 UTC (permalink / raw)
To: Marc MERLIN; +Cc: linux-btrfs
On Mon, 2 Jul 2018 08:19:03 -0700
Marc MERLIN <marc@merlins.org> wrote:
> I actually have fewer snapshots than this per filesystem, but I backup
> more than 10 filesystems.
> If I used as many snapshots as you recommend, that would already be 230
> snapshots for 10 filesystems :)
(...once again me with my rsync :)
If you didn't use send/receive, you wouldn't be required to keep a separate
snapshot trail per filesystem backed up, one trail of snapshots for the entire
backup server would be enough. Rsync everything to subdirs within one
subvolume, then do timed or event-based snapshots of it. You only need more
than one trail if you want different retention policies for different datasets
(e.g. in my case I have 91 and 31 days).
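That layout is easy to sketch (hostnames, paths, and the retention window below are all made up for illustration):

```shell
# All rsync targets live under one subvolume; a single read-only
# snapshot per day then covers every host at once.
rsync -aHAXS --delete host1:/ /backup/all/host1/
rsync -aHAXS --delete host2:/ /backup/all/host2/
btrfs subvolume snapshot -r /backup/all "/backup/snaps/$(date +%Y-%m-%d)"

# One retention policy for the whole trail; ISO dates sort
# lexicographically, so plain string comparison works.
cutoff="/backup/snaps/$(date -d '31 days ago' +%Y-%m-%d)"
for s in /backup/snaps/*; do
    [ "$s" \< "$cutoff" ] && btrfs subvolume delete "$s"
done
```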
--
With respect,
Roman
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: how to best segment a big block device in resizeable btrfs filesystems?
2018-07-02 16:59 ` Austin S. Hemmelgarn
@ 2018-07-02 17:34 ` Marc MERLIN
2018-07-02 18:35 ` Austin S. Hemmelgarn
0 siblings, 1 reply; 65+ messages in thread
From: Marc MERLIN @ 2018-07-02 17:34 UTC (permalink / raw)
To: Austin S. Hemmelgarn; +Cc: Qu Wenruo, Su Yue, linux-btrfs
On Mon, Jul 02, 2018 at 12:59:02PM -0400, Austin S. Hemmelgarn wrote:
> > Am I supposed to put LVM thin volumes underneath so that I can share
> > the same single 10TB raid5?
>
> Actually, because of the online resize ability in BTRFS, you don't
> technically _need_ to use thin provisioning here. It makes the maintenance
> a bit easier, but it also adds a much more complicated layer of indirection
> than just doing regular volumes.
You're right that I can use btrfs resize, but then I still need an LVM
device underneath, correct?
So, if I have 10 backup targets, I need 10 LVM LVs, I give them 10%
each of the full size available (as a guess), and then I'd have to
- btrfs resize down one that's bigger than I need
- LVM shrink the LV
- LVM grow the other LV
- LVM resize up the other btrfs
and I think LVM resize and btrfs resize are not linked so I have to do
them separately and hope to type the right numbers each time, correct?
(or is that easier now?)
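For the record, the dance looks something like this (names and sizes invented; `lvresize --resizefs` goes through fsadm, which doesn't handle btrfs, so the two resizes really are separate steps):

```shell
# Shrink: filesystem FIRST, then the LV underneath it.
btrfs filesystem resize -100g /mnt/backup-a
lvreduce -L -100g vg0/backup-a

# Grow: LV first, then let btrfs claim whatever is there.
lvextend -L +100g vg0/backup-b
btrfs filesystem resize max /mnt/backup-b
```

Using `max` on the grow side removes one of the two places to mistype a number; on the shrink side it is safer to shrink btrfs a little more than the LV, then resize the filesystem back up to `max`.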
I kind of liked the thin provisioning idea because it's hands off,
which is appealing. Any reason against it?
> You could (in theory) merge the LVM and software RAID5 layers, though that
> may make handling of the RAID5 layer a bit complicated if you choose to use
> thin provisioning (for some reason, LVM is unable to do on-line checks and
> rebuilds of RAID arrays that are acting as thin pool data or metadata).
Does LVM do built in raid5 now? Is it as good/trustworthy as mdadm
raid5?
But yeah, if it's incompatible with thin provisioning, it's not that
useful.
> Alternatively, you could increase your array size, remove the software RAID
> layer, and switch to using BTRFS in raid10 mode so that you could eliminate
> one of the layers, though that would probably reduce the effectiveness of
> bcache (you might want to get a bigger cache device if you do this).
Sadly that won't work. I have more data than will fit on raid10
Thanks for your suggestions though.
Still need to read up on whether I should do thin provisioning, or not.
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 7F55D5F27AAF9D08
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: So, does btrfs check lowmem take days? weeks?
2018-07-02 17:33 ` Roman Mamedov
@ 2018-07-02 17:39 ` Marc MERLIN
0 siblings, 0 replies; 65+ messages in thread
From: Marc MERLIN @ 2018-07-02 17:39 UTC (permalink / raw)
To: Roman Mamedov; +Cc: linux-btrfs
On Mon, Jul 02, 2018 at 10:33:09PM +0500, Roman Mamedov wrote:
> On Mon, 2 Jul 2018 08:19:03 -0700
> Marc MERLIN <marc@merlins.org> wrote:
>
> > I actually have fewer snapshots than this per filesystem, but I backup
> > more than 10 filesystems.
> > If I used as many snapshots as you recommend, that would already be 230
> > snapshots for 10 filesystems :)
>
> (...once again me with my rsync :)
>
> If you didn't use send/receive, you wouldn't be required to keep a separate
> snapshot trail per filesystem backed up, one trail of snapshots for the entire
> backup server would be enough. Rsync everything to subdirs within one
> subvolume, then do timed or event-based snapshots of it. You only need more
> than one trail if you want different retention policies for different datasets
> (e.g. in my case I have 91 and 31 days).
This is exactly how I used to do backups before btrfs.
I did
cp -al backup.olddate backup.newdate
rsync -avSH src/ backup.newdate/
You don't even need snapshots or btrfs anymore.
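Spelled out a little more, that rotation scheme (paths hypothetical) is:

```shell
today=$(date +%Y-%m-%d)
latest=$(ls -1d /backup/daily.* | tail -n 1)

# Hardlink copy: cheap, and unchanged files share inodes, so a new
# backup only costs space for files that actually changed.
cp -al "$latest" "/backup/daily.$today"
rsync -aSH --delete src/ "/backup/daily.$today/"

# Retention is just directory removal, per dataset if needed.
find /backup -maxdepth 1 -name 'daily.*' -mtime +31 -exec rm -rf {} +
```

The caveat is exactly the one mentioned below: every run still stats each file on both sides, which is what makes this slow on large trees.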
Also, sorry to say, but I have different data retention needs for
different backups. Some need to rotate more quickly than others, but if
you're using rsync, the method I gave above works fine at any rotation
interval you need.
It is almost as efficient as btrfs on space, but as I said, the time
penalty on all those stats for many files was what killed it for me.
If I go back to rsync backups (and I'm really unlikely to), then I'd
also go back to ext4. There would be no point in dealing with the
complexity and fragility of btrfs anymore.
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 7F55D5F27AAF9D08
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: how to best segment a big block device in resizeable btrfs filesystems?
2018-07-02 17:34 ` Marc MERLIN
@ 2018-07-02 18:35 ` Austin S. Hemmelgarn
2018-07-02 19:40 ` Marc MERLIN
2018-07-03 4:25 ` Andrei Borzenkov
0 siblings, 2 replies; 65+ messages in thread
From: Austin S. Hemmelgarn @ 2018-07-02 18:35 UTC (permalink / raw)
To: Marc MERLIN; +Cc: Qu Wenruo, Su Yue, linux-btrfs
On 2018-07-02 13:34, Marc MERLIN wrote:
> On Mon, Jul 02, 2018 at 12:59:02PM -0400, Austin S. Hemmelgarn wrote:
>>> Am I supposed to put LVM thin volumes underneath so that I can share
>>> the same single 10TB raid5?
>>
>> Actually, because of the online resize ability in BTRFS, you don't
>> technically _need_ to use thin provisioning here. It makes the maintenance
>> a bit easier, but it also adds a much more complicated layer of indirection
>> than just doing regular volumes.
>
> You're right that I can use btrfs resize, but then I still need an LVM
> device underneath, correct?
> So, if I have 10 backup targets, I need 10 LVM LVs, I give them 10%
> each of the full size available (as a guess), and then I'd have to
> - btrfs resize down one that's bigger than I need
> - LVM shrink the LV
> - LVM grow the other LV
> - LVM resize up the other btrfs
>
> and I think LVM resize and btrfs resize are not linked so I have to do
> them separately and hope to type the right numbers each time, correct?
> (or is that easier now?)
>
> I kind of liked the thin provisioning idea because it's hands off,
> which is appealing. Any reason against it?
No, not currently, except that it adds a whole lot more stuff between
BTRFS and whatever layer is below it. That increase in what's being
done adds some overhead (it's noticeable on 7200 RPM consumer SATA
drives, but not on decent consumer SATA SSD's).
There used to be issues running BTRFS on top of LVM thin targets which
had zero mode turned off, but AFAIK, all of those problems were fixed
long ago (before 4.0).
>
>> You could (in theory) merge the LVM and software RAID5 layers, though that
>> may make handling of the RAID5 layer a bit complicated if you choose to use
>> thin provisioning (for some reason, LVM is unable to do on-line checks and
>> rebuilds of RAID arrays that are acting as thin pool data or metadata).
>
> Does LVM do built in raid5 now? Is it as good/trustworthy as mdadm
> raid5?
Actually, it uses MD's RAID5 implementation as a back-end. Same for
RAID6, and optionally for RAID0, RAID1, and RAID10.
> But yeah, if it's incompatible with thin provisioning, it's not that
> useful.
It's technically not incompatible, just a bit of a pain. Last time I
tried to use it, you had to jump through hoops to repair a damaged RAID
volume that was serving as an underlying volume in a thin pool, and it
required keeping the thin pool offline for the entire duration of the
rebuild.
>
>> Alternatively, you could increase your array size, remove the software RAID
>> layer, and switch to using BTRFS in raid10 mode so that you could eliminate
>> one of the layers, though that would probably reduce the effectiveness of
>> bcache (you might want to get a bigger cache device if you do this).
>
> Sadly that won't work. I have more data than will fit on raid10
>
> Thanks for your suggestions though.
> Still need to read up on whether I should do thin provisioning, or not.
If you do go with thin provisioning, I would encourage you to make
certain to call fstrim on the BTRFS volumes on a semi regular basis so
that the thin pool doesn't get filled up with old unused blocks,
preferably when you are 100% certain that there are no ongoing writes on
them (trimming blocks on BTRFS gets rid of old root trees, so it's a bit
dangerous to do it while writes are happening).
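A minimal sketch of that maintenance step (mount points and VG name are hypothetical):

```shell
# Run from a quiet window, e.g. after the nightly backups finish, to
# return unused btrfs blocks to the thin pool.
fstrim -v /mnt/backup-a
fstrim -v /mnt/backup-b

# Keep an eye on the pool itself: Data% or Meta% approaching 100 means
# the pool is about to run out underneath the filesystems, which is
# the main extra failure mode thin provisioning introduces.
lvs -o lv_name,data_percent,metadata_percent vg0
```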
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: how to best segment a big block device in resizeable btrfs filesystems?
2018-07-02 18:35 ` Austin S. Hemmelgarn
@ 2018-07-02 19:40 ` Marc MERLIN
2018-07-03 4:25 ` Andrei Borzenkov
1 sibling, 0 replies; 65+ messages in thread
From: Marc MERLIN @ 2018-07-02 19:40 UTC (permalink / raw)
To: Austin S. Hemmelgarn; +Cc: Qu Wenruo, Su Yue, linux-btrfs
On Mon, Jul 02, 2018 at 02:35:19PM -0400, Austin S. Hemmelgarn wrote:
> >I kind of liked the thin provisioning idea because it's hands off,
> >which is appealing. Any reason against it?
> No, not currently, except that it adds a whole lot more stuff between
> BTRFS and whatever layer is below it. That increase in what's being
> done adds some overhead (it's noticeable on 7200 RPM consumer SATA
> drives, but not on decent consumer SATA SSD's).
>
> There used to be issues running BTRFS on top of LVM thin targets which
> had zero mode turned off, but AFAIK, all of those problems were fixed
> long ago (before 4.0).
I see, thanks for the heads up.
> >Does LVM do built in raid5 now? Is it as good/trustworthy as mdadm
> >raid5?
> Actually, it uses MD's RAID5 implementation as a back-end. Same for
> RAID6, and optionally for RAID0, RAID1, and RAID10.
Ok, that makes me feel a bit better :)
> >But yeah, if it's incompatible with thin provisioning, it's not that
> >useful.
> It's technically not incompatible, just a bit of a pain. Last time I
> tried to use it, you had to jump through hoops to repair a damaged RAID
> volume that was serving as an underlying volume in a thin pool, and it
> required keeping the thin pool offline for the entire duration of the
> rebuild.
Argh, not good :( / thanks for the heads up.
> If you do go with thin provisioning, I would encourage you to make
> certain to call fstrim on the BTRFS volumes on a semi regular basis so
> that the thin pool doesn't get filled up with old unused blocks,
That's a very good point/reminder, thanks for that. I guess it's like
running on an ssd :)
> preferably when you are 100% certain that there are no ongoing writes on
> them (trimming blocks on BTRFS gets rid of old root trees, so it's a bit
> dangerous to do it while writes are happening).
Argh, that will be harder, but I'll try.
Given what you said, it sounds like I'll still be best off with separate
layers to avoid the rebuild problem you mentioned.
So it'll be
swraid5 / dmcrypt / bcache / lvm dm thin / btrfs
Hopefully that will work well enough.
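For illustration only, the bottom-up assembly of that stack might look like this (device names, sizes, and key handling are all invented):

```shell
mdadm --create /dev/md0 --level=5 --raid-devices=5 /dev/sd[b-f]1
cryptsetup luksFormat /dev/md0
cryptsetup open /dev/md0 cryptmd0
make-bcache -C /dev/nvme0n1 -B /dev/mapper/cryptmd0  # cache + backing
pvcreate /dev/bcache0
vgcreate vg0 /dev/bcache0
lvcreate --type thin-pool -l 90%FREE -n pool0 vg0
lvcreate --thin -V 2T -n backup-a vg0/pool0   # thin volumes may
lvcreate --thin -V 2T -n backup-b vg0/pool0   # overcommit pool0
mkfs.btrfs /dev/vg0/backup-a
mkfs.btrfs /dev/vg0/backup-b
```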
Thanks,
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: So, does btrfs check lowmem take days? weeks?
2018-07-02 14:42 ` Qu Wenruo
2018-07-02 15:18 ` how to best segment a big block device in resizeable btrfs filesystems? Marc MERLIN
2018-07-02 15:19 ` So, does btrfs check lowmem take days? weeks? Marc MERLIN
@ 2018-07-03 0:31 ` Chris Murphy
2018-07-03 4:22 ` Marc MERLIN
2 siblings, 1 reply; 65+ messages in thread
From: Chris Murphy @ 2018-07-03 0:31 UTC (permalink / raw)
To: Qu Wenruo; +Cc: Marc MERLIN, Su Yue, Btrfs BTRFS
On Mon, Jul 2, 2018 at 8:42 AM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
> On 2018-07-02 22:05, Marc MERLIN wrote:
>> On Mon, Jul 02, 2018 at 02:22:20PM +0800, Su Yue wrote:
>>>> Ok, that's 29MB, so it doesn't fit on pastebin:
>>>> http://marc.merlins.org/tmp/dshelf2_inspect.txt
>>>>
>>> Sorry Marc. After offline communication with Qu, both
>>> of us think the filesystem is hard to repair.
>>> The filesystem is too large to debug step by step.
>>> Every check and debug cycle is too expensive,
>>> and it has already cost several days.
>>>
>>> Sadly, I am afraid that you have to recreate the filesystem
>>> and back up your data again. :(
>>>
>>> Sorry again, and thanks for your reports and patience.
>>
>> I appreciate your help. Honestly I only wanted to help you find why the
>> tools aren't working. Fixing filesystems by hand (and remotely via Email
>> on top of that) is way too time-consuming, like you said.
>>
>> Is the btrfs design flawed in a way that repair tools just cannot repair
>> on their own?
>
> For short and for your case, yes, you can consider repair tool just a
> garbage and don't use them at any production system.
So the idea behind journaled file systems is that journal replay
enables mount-time "repair" that's faster than an fsck. Already Btrfs
use cases with big, but not huge, file systems make btrfs check a
problem. Either running out of memory or it takes too long. So already
it isn't scaling as well as ext4 or XFS in this regard.
So what's the future hold? It seems like the goal is that the problems
must be avoided in the first place rather than to repair them after
the fact.
Are the problem's Marc is running into understood well enough that
there can eventually be a fix, maybe even an on-disk format change,
that prevents such problems from happening in the first place?
Or does it make sense for him to be running with btrfs debug or some
subset of btrfs integrity checking mask to try to catch the problems
in the act of them happening?
--
Chris Murphy
^ permalink raw reply [flat|nested] 65+ messages in thread
* RE: how to best segment a big block device in resizeable btrfs filesystems?
2018-07-02 15:18 ` how to best segment a big block device in resizeable btrfs filesystems? Marc MERLIN
2018-07-02 16:59 ` Austin S. Hemmelgarn
@ 2018-07-03 0:51 ` Paul Jones
2018-07-03 4:06 ` Marc MERLIN
2018-07-03 1:37 ` Qu Wenruo
2 siblings, 1 reply; 65+ messages in thread
From: Paul Jones @ 2018-07-03 0:51 UTC (permalink / raw)
To: Marc MERLIN; +Cc: linux-btrfs
> -----Original Message-----
> From: linux-btrfs-owner@vger.kernel.org <linux-btrfs-
> owner@vger.kernel.org> On Behalf Of Marc MERLIN
> Sent: Tuesday, 3 July 2018 1:19 AM
> To: Qu Wenruo <quwenruo.btrfs@gmx.com>
> Cc: Su Yue <suy.fnst@cn.fujitsu.com>; linux-btrfs@vger.kernel.org
> Subject: Re: how to best segment a big block device in resizeable btrfs
> filesystems?
>
> Hi Qu,
>
> I'll split this part into a new thread:
>
> > 2) Don't keep unrelated snapshots in one btrfs.
> > I totally understand that maintaining different btrfs filesystems would
> > hugely add maintenance pressure, but as explained, all snapshots share one
> > fragile extent tree.
>
> Yes, I understand that this is what I should do given what you explained.
> My main problem is knowing how to segment things so I don't end up with
> filesystems that are full while others are almost empty :)
>
> Am I supposed to put LVM thin volumes underneath so that I can share the
> same single 10TB raid5?
>
> If I do this, I would have
> software raid 5 < dmcrypt < bcache < lvm < btrfs That's a lot of layers, and
> that's also starting to make me nervous :)
You could combine bcache and lvm if you are happy to use dm-cache instead (which lvm uses).
I use it myself (but without thin provisioning) and it works well.
>
> Is there any other way that does not involve me creating smaller block
> devices for multiple btrfs filesystems and hope that they are the right size
> because I won't be able to change it later?
>
> Thanks,
> Marc
> --
> "A mouse is a device used to point at the xterm you want to type in" - A.S.R.
> Microsoft is to operating systems ....
> .... what McDonalds is to gourmet cooking
> Home page: http://marc.merlins.org/ | PGP 7F55D5F27AAF9D08
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the
> body of a message to majordomo@vger.kernel.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: how to best segment a big block device in resizeable btrfs filesystems?
2018-07-02 15:18 ` how to best segment a big block device in resizeable btrfs filesystems? Marc MERLIN
2018-07-02 16:59 ` Austin S. Hemmelgarn
2018-07-03 0:51 ` Paul Jones
@ 2018-07-03 1:37 ` Qu Wenruo
2018-07-03 4:15 ` Marc MERLIN
2018-07-03 4:23 ` Andrei Borzenkov
2 siblings, 2 replies; 65+ messages in thread
From: Qu Wenruo @ 2018-07-03 1:37 UTC (permalink / raw)
To: Marc MERLIN; +Cc: Su Yue, linux-btrfs
On 2018-07-02 23:18, Marc MERLIN wrote:
> Hi Qu,
>
> I'll split this part into a new thread:
>
>> 2) Don't keep unrelated snapshots in one btrfs.
>> I totally understand that maintaining different btrfs filesystems would
>> hugely add maintenance pressure, but as explained, all snapshots share one
>> fragile extent tree.
>
> Yes, I understand that this is what I should do given what you
> explained.
> My main problem is knowing how to segment things so I don't end up with
> filesystems that are full while others are almost empty :)
>
> Am I supposed to put LVM thin volumes underneath so that I can share
> the same single 10TB raid5?
>
> If I do this, I would have
> software raid 5 < dmcrypt < bcache < lvm < btrfs
> That's a lot of layers, and that's also starting to make me nervous :)
If you could keep the number of snapshots to a minimum (fewer than 10) for
each btrfs (and the number of send sources under 5), one big btrfs
may work in that case.
BTW, IMHO the bcache is not really helping for a backup system, which is
more write-oriented.
Thanks,
Qu
>
> Is there any other way that does not involve me creating smaller block
> devices for multiple btrfs filesystems and hope that they are the right
> size because I won't be able to change it later?
>
> Thanks,
> Marc
>
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: how to best segment a big block device in resizeable btrfs filesystems?
2018-07-03 0:51 ` Paul Jones
@ 2018-07-03 4:06 ` Marc MERLIN
2018-07-03 4:26 ` Paul Jones
0 siblings, 1 reply; 65+ messages in thread
From: Marc MERLIN @ 2018-07-03 4:06 UTC (permalink / raw)
To: Paul Jones; +Cc: linux-btrfs
On Tue, Jul 03, 2018 at 12:51:30AM +0000, Paul Jones wrote:
> You could combine bcache and lvm if you are happy to use dm-cache instead (which lvm uses).
> I use it myself (but without thin provisioning) and it works well.
Interesting point. So, I used to use lvm and then lvm2 many years ago until
I got tired of its performance, especially as soon as I took even a
single snapshot.
But that was a long time ago now, just saying that I'm a bit rusty on LVM
itself.
That being said, if I have
raid5
dm-cache
dm-crypt
dm-thin
That's still 4 block layers under btrfs.
Am I any better off using dm-cache instead of bcache, my understanding is
that it only replaces one block layer with another one and one codebase with
another.
Mmmh, a bit of reading shows that dm-cache is now used as lvmcache, which
might change things, or not.
I'll admit that setting up and maintaining bcache is a bit of a pain, I only
used it at the time because it seemed more ready then, but we're a few years
later now.
So, what do you recommend nowadays, assuming you've used both?
(given that it's literally going to take days to recreate my array, I'd
rather do it once and the right way the first time :) )
Thanks,
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: how to best segment a big block device in resizeable btrfs filesystems?
2018-07-03 1:37 ` Qu Wenruo
@ 2018-07-03 4:15 ` Marc MERLIN
2018-07-03 9:55 ` Paul Jones
2018-07-03 4:23 ` Andrei Borzenkov
1 sibling, 1 reply; 65+ messages in thread
From: Marc MERLIN @ 2018-07-03 4:15 UTC (permalink / raw)
To: Qu Wenruo; +Cc: Su Yue, linux-btrfs
On Tue, Jul 03, 2018 at 09:37:47AM +0800, Qu Wenruo wrote:
> > If I do this, I would have
> > software raid 5 < dmcrypt < bcache < lvm < btrfs
> > That's a lot of layers, and that's also starting to make me nervous :)
>
> If you could keep the number of snapshots to a minimum (fewer than 10) for
> each btrfs (and the number of send sources under 5), one big btrfs
> may work in that case.
Well, we kind of discussed this already. If btrfs falls over when you reach
100 snapshots or so, and it sure seems to in my case, I won't be much better
off.
Having btrfs check --repair fail because 32GB of RAM is not enough, and it's
unable to use swap, is a big deal in my case. You also confirmed that btrfs
check lowmem does not scale to filesystems like mine, so this translates
into "if regular btrfs check repair can't fit in 32GB, I am completely out
of luck if anything happens to the filesystem"
You're correct that I could tweak my backups and snapshot rotation to get
from 250 or so down to 100, but it seems that I'll just be hoping to avoid
the problem by staying just under the limit, until I'm not, and it'll
be too late to do anything about it next time I'm in trouble, putting me
back right in the same spot I'm in now.
Is all this fair to say, or did I misunderstand?
> BTW, IMHO the bcache is not really helping for a backup system, which is
> more write-oriented.
That's a good point. So, what I didn't explain is that I still have some old
filesystem that do get backed up with rsync instead of btrfs send (going
into the same filesystem, but not same subvolume).
Because rsync is so painfully slow when it needs to scan both sides before
it'll even start doing any work, bcache helps there.
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: So, does btrfs check lowmem take days? weeks?
2018-07-03 0:31 ` Chris Murphy
@ 2018-07-03 4:22 ` Marc MERLIN
2018-07-03 8:34 ` Su Yue
2018-07-03 8:50 ` Qu Wenruo
0 siblings, 2 replies; 65+ messages in thread
From: Marc MERLIN @ 2018-07-03 4:22 UTC (permalink / raw)
To: Chris Murphy; +Cc: Qu Wenruo, Su Yue, Btrfs BTRFS
On Mon, Jul 02, 2018 at 06:31:43PM -0600, Chris Murphy wrote:
> So the idea behind journaled file systems is that journal replay
> enables mount-time "repair" that's faster than an fsck. Already Btrfs
> use cases with big, but not huge, file systems make btrfs check a
> problem. Either running out of memory or it takes too long. So already
> it isn't scaling as well as ext4 or XFS in this regard.
>
> So what's the future hold? It seems like the goal is that the problems
> must be avoided in the first place rather than to repair them after
> the fact.
>
> Are the problem's Marc is running into understood well enough that
> there can eventually be a fix, maybe even an on-disk format change,
> that prevents such problems from happening in the first place?
>
> Or does it make sense for him to be running with btrfs debug or some
> subset of btrfs integrity checking mask to try to catch the problems
> in the act of them happening?
Those are all good questions.
To be fair, I cannot claim that btrfs was at fault for whatever filesystem
damage I ended up with. It's very possible that it happened due to a flaky
SATA card that kicked drives off the bus when it shouldn't have.
Sure in theory a journaling filesystem can recover from unexpected power
loss and drives dropping off at bad times, but I'm going to guess that
btrfs' complexity also means that it has data structures (extent tree?) that
need to be updated completely "or else".
I'm obviously ok with a filesystem check being necessary to recover in cases
like this; after all, I still occasionally have to run e2fsck on ext4 too, but
I'm a lot less thrilled with the btrfs situation where basically the repair
tools can either completely crash your kernel, or take days and then either
get stuck in an infinite loop or hit an algorithm that can't scale if you
have too many hardlinks/snapshots.
It sounds like there may not be a fix to this problem with the filesystem's
design, outside of "do not get there, or else".
It would even be useful for btrfs tools to start computing heuristics and
output warnings like "you have more than 100 snapshots on this filesystem,
this is not recommended, please read http://url/"
Qu, Su, does that sound both reasonable and doable?
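A userspace approximation of such a warning is already possible today (the threshold of 100 and the mount point are just the made-up numbers from this thread):

```shell
# -s lists only snapshot subvolumes, so wc -l counts snapshots.
count=$(btrfs subvolume list -s /mnt/backup | wc -l)
if [ "$count" -gt 100 ]; then
    echo "WARNING: $count snapshots on /mnt/backup;" \
         "check/repair may not scale, please read http://url/" >&2
fi
```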
Thanks,
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: how to best segment a big block device in resizeable btrfs filesystems?
2018-07-03 1:37 ` Qu Wenruo
2018-07-03 4:15 ` Marc MERLIN
@ 2018-07-03 4:23 ` Andrei Borzenkov
1 sibling, 0 replies; 65+ messages in thread
From: Andrei Borzenkov @ 2018-07-03 4:23 UTC (permalink / raw)
To: Qu Wenruo, Marc MERLIN; +Cc: Su Yue, linux-btrfs
03.07.2018 04:37, Qu Wenruo wrote:
>
> BTW, IMHO the bcache is not really helping for a backup system, which is
> more write-oriented.
>
There is new writecache target which may help in this case.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: how to best segment a big block device in resizeable btrfs filesystems?
2018-07-02 18:35 ` Austin S. Hemmelgarn
2018-07-02 19:40 ` Marc MERLIN
@ 2018-07-03 4:25 ` Andrei Borzenkov
2018-07-03 7:15 ` Duncan
1 sibling, 1 reply; 65+ messages in thread
From: Andrei Borzenkov @ 2018-07-03 4:25 UTC (permalink / raw)
To: Austin S. Hemmelgarn, Marc MERLIN; +Cc: Qu Wenruo, Su Yue, linux-btrfs
02.07.2018 21:35, Austin S. Hemmelgarn wrote:
> them (trimming blocks on BTRFS gets rid of old root trees, so it's a bit
> dangerous to do it while writes are happening).
Could you please elaborate? Do you mean btrfs can trim data before new
writes are actually committed to disk?
^ permalink raw reply [flat|nested] 65+ messages in thread
* RE: how to best segment a big block device in resizeable btrfs filesystems?
2018-07-03 4:06 ` Marc MERLIN
@ 2018-07-03 4:26 ` Paul Jones
2018-07-03 5:42 ` Marc MERLIN
0 siblings, 1 reply; 65+ messages in thread
From: Paul Jones @ 2018-07-03 4:26 UTC (permalink / raw)
To: Marc MERLIN; +Cc: linux-btrfs
> -----Original Message-----
> From: Marc MERLIN <marc@merlins.org>
> Sent: Tuesday, 3 July 2018 2:07 PM
> To: Paul Jones <paul@pauljones.id.au>
> Cc: linux-btrfs@vger.kernel.org
> Subject: Re: how to best segment a big block device in resizeable btrfs
> filesystems?
>
> On Tue, Jul 03, 2018 at 12:51:30AM +0000, Paul Jones wrote:
> > You could combine bcache and lvm if you are happy to use dm-cache
> instead (which lvm uses).
> > I use it myself (but without thin provisioning) and it works well.
>
> Interesting point. So, I used to use lvm and then lvm2 many years ago until I
> got tired of its performance, especially as soon as I took even a single
> snapshot.
> But that was a long time ago now, just saying that I'm a bit rusty on LVM
> itself.
>
> That being said, if I have
> raid5
> dm-cache
> dm-crypt
> dm-thin
>
> That's still 4 block layers under btrfs.
> Am I any better off using dm-cache instead of bcache, my understanding is
> that it only replaces one block layer with another one and one codebase with
> another.
True, I didn't think of it like that.
> Mmmh, a bit of reading shows that dm-cache is now used as lvmcache, which
> might change things, or not.
> I'll admit that setting up and maintaining bcache is a bit of a pain, I only used it
> at the time because it seemed more ready then, but we're a few years later
> now.
>
> So, what do you recommend nowadays, assuming you've used both?
> (given that it's literally going to take days to recreate my array, I'd rather do it
> once and the right way the first time :) )
I don't have any experience with this, but since it's the internet let me tell you how I'd do it anyway 😝
raid5
dm-crypt
lvm (using thin provisioning + cache)
btrfs
The cache mode on lvm requires you to set up all your volumes first, then add caching to those volumes last. If you need to modify the volume then you have to remove the cache, make your changes, then re-add the cache. It sounds like a pain, but having the cache separate from the data is quite handy.
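As a sketch (volume and pool names hypothetical), that detach/modify/re-attach cycle is:

```shell
# Detach the cache; the origin LV keeps working, just uncached.
lvconvert --splitcache vg0/backup-a
lvextend -L +100g vg0/backup-a        # make the change
# Re-attach the same cache pool afterwards.
lvconvert --type cache --cachepool vg0/cachepool vg0/backup-a
```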
Given you are running a backup server I don't think the cache would really do much unless you enable writeback mode. If you can split up your filesystem a bit, to the point that btrfs check doesn't OOM, that will seriously help performance as well. Rsync might be feasible again.
Paul.
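The attach/detach cycle described in Paul's mail can be sketched with lvmcache commands roughly like the following. The device, VG, and LV names (`vg0`, `backup_a`, `/dev/fast_ssd`) are hypothetical, and the `run` helper only echoes each command, so the sketch is safe to execute as-is:

```shell
# Hypothetical names: vg0 (VG), backup_a (data LV), /dev/fast_ssd (cache device).
# "run" only prints the command instead of executing it.
run() { echo "+ $*"; }

# Create a cache pool on the fast device and attach it to the volume:
run lvcreate --type cache-pool -L 100G -n backup_cache vg0 /dev/fast_ssd
run lvconvert --type cache --cachepool vg0/backup_cache vg0/backup_a

# To modify the volume, detach the cache first (--uncache flushes and
# removes the cache pool), make the change, then recreate the cache:
run lvconvert --uncache vg0/backup_a
run lvresize -L +500G vg0/backup_a
```

The detach step is what makes the workflow feel like a pain; the upside, as noted, is that the cache and the data stay cleanly separated.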
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: how to best segment a big block device in resizeable btrfs filesystems?
2018-07-03 4:26 ` Paul Jones
@ 2018-07-03 5:42 ` Marc MERLIN
0 siblings, 0 replies; 65+ messages in thread
From: Marc MERLIN @ 2018-07-03 5:42 UTC (permalink / raw)
To: Paul Jones; +Cc: linux-btrfs
On Tue, Jul 03, 2018 at 04:26:37AM +0000, Paul Jones wrote:
> I don't have any experience with this, but since it's the internet let me tell you how I'd do it anyway 😝
That's the spirit :)
> raid5
> dm-crypt
> lvm (using thin provisioning + cache)
> btrfs
>
> The cache mode on lvm requires you to set up all your volumes first, then
> add caching to those volumes last. If you need to modify the volume then
> you have to remove the cache, make your changes, then re-add the cache. It
> sounds like a pain, but having the cache separate from the data is quite
> handy.
I'm ok enough with that.
> Given you are running a backup server I don't think the cache would
> really do much unless you enable writeback mode. If you can split up your
> filesystem a bit to the point that btrfs check doesn't OOM that will
> seriously help performance as well. Rsync might be feasible again.
I'm a bit wary of write caching with the issues I've had. I may do
write-through, but not writeback :)
But caching helps indeed for my older filesystems that are still backed up
via rsync because the source fs is ext4 and not btrfs.
Thanks for the suggestions
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: how to best segment a big block device in resizeable btrfs filesystems?
2018-07-03 4:25 ` Andrei Borzenkov
@ 2018-07-03 7:15 ` Duncan
2018-07-06 4:28 ` Andrei Borzenkov
0 siblings, 1 reply; 65+ messages in thread
From: Duncan @ 2018-07-03 7:15 UTC (permalink / raw)
To: linux-btrfs
Andrei Borzenkov posted on Tue, 03 Jul 2018 07:25:14 +0300 as excerpted:
> 02.07.2018 21:35, Austin S. Hemmelgarn wrote:
>> them (trimming blocks on BTRFS gets rid of old root trees, so it's a
>> bit dangerous to do it while writes are happening).
>
> Could you please elaborate? Do you mean btrfs can trim data before new
> writes are actually committed to disk?
No.
But normally old roots aren't rewritten for some time simply due to odds
(fuller filesystems will of course recycle them sooner), and the btrfs
mount option usebackuproot (formerly recovery, until the norecovery mount
option that parallels that of other filesystems was added and this option
was renamed to avoid confusion) can be used to try an older root if the
current root is too damaged to successfully mount.
But other than simply by odds not using them again immediately, btrfs has
no special protection for those old roots, and trim/discard will recover
them to hardware-unused as it does any other unused space, tho whether it
simply marks them for later processing or actually processes them
immediately is up to the individual implementation -- some do it
immediately, killing all chances at using the backup root because it's
already zeroed out, some don't.
In the context of the discard mount option, that can mean there's never
any old roots available ever, as they've already been cleaned up by the
hardware due to the discard option telling the hardware to do it.
But even not using that mount option, and simply doing the trims
periodically, as done weekly by for instance the systemd fstrim timer and
service units, or done manually if you prefer, obviously potentially
wipes the old roots at that point. If the system's effectively idle at
the time, not much risk as the current commit is likely to represent a
filesystem in full stasis, but if there's lots of writes going on at that
moment *AND* the system happens to crash at just the wrong time, before
additional commits have recreated at least a bit of root history, again,
you'll potentially be left without any old roots for the usebackuproot
mount option to try to fall back to, should it actually be necessary.
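A minimal sketch of the fallback path Duncan describes. The device and mountpoint are hypothetical, and the `run` helper only echoes each command:

```shell
# "run" only prints the command instead of executing it.
run() { echo "+ $*"; }

# Inspect the backup root slots recorded in the superblock:
run btrfs inspect-internal dump-super -f /dev/sdX

# Try mounting read-only with the backup roots
# (usebackuproot was formerly called "recovery"):
run mount -o ro,usebackuproot /dev/sdX /mnt

# Caveat from the discussion: if a trim/discard has already zeroed the
# old roots, usebackuproot has nothing left to fall back to.
```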
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: So, does btrfs check lowmem take days? weeks?
2018-07-03 4:22 ` Marc MERLIN
@ 2018-07-03 8:34 ` Su Yue
2018-07-03 21:34 ` Chris Murphy
2018-07-03 8:50 ` Qu Wenruo
1 sibling, 1 reply; 65+ messages in thread
From: Su Yue @ 2018-07-03 8:34 UTC (permalink / raw)
To: Marc MERLIN, Chris Murphy; +Cc: Qu Wenruo, Btrfs BTRFS
On 07/03/2018 12:22 PM, Marc MERLIN wrote:
> On Mon, Jul 02, 2018 at 06:31:43PM -0600, Chris Murphy wrote:
>> So the idea behind journaled file systems is that journal replay
>> enabled mount time "repair" that's faster than an fsck. Already Btrfs
>> use cases with big, but not huge, file systems makes btrfs check a
>> problem. Either running out of memory or it takes too long. So already
>> it isn't scaling as well as ext4 or XFS in this regard.
>>
>> So what's the future hold? It seems like the goal is that the problems
>> must be avoided in the first place rather than to repair them after
>> the fact.
>>
>> Are the problems Marc is running into understood well enough that
>> there can eventually be a fix, maybe even an on-disk format change,
>> that prevents such problems from happening in the first place?
>>
>> Or does it make sense for him to be running with btrfs debug or some
>> subset of btrfs integrity checking mask to try to catch the problems
>> in the act of them happening?
>
> Those are all good questions.
> To be fair, I cannot claim that btrfs was at fault for whatever filesystem
> damage I ended up with. It's very possible that it happened due to a flaky
> SATA card that kicked drives off the bus when it shouldn't have.
> Sure in theory a journaling filesystem can recover from unexpected power
> loss and drives dropping off at bad times, but I'm going to guess that
> btrfs' complexity also means that it has data structures (extent tree?) that
> need to be updated completely "or else".
>
Yes, extent tree is the hardest part for lowmem mode. I'm quite
confident the tool can deal well with file trees (which record metadata
about file and directory names and relationships).
As for extent tree, I have little confidence due to its complexity.
> I'm obviously ok with a filesystem check being necessary to recover in cases
> like this, afterall I still occasionally have to run e2fsck on ext4 too, but
> I'm a lot less thrilled with the btrfs situation where basically the repair
> tools can either completely crash your kernel, or take days and then either
> get stuck in an infinite loop or hit an algorithm that can't scale if you
> have too many hardlinks/snapshots.
>
It's not surprising that real-world filesystems have many snapshots.
Original mode repair eats a large amount of memory, so lowmem mode was
created to save memory at the cost of time. The latter is just not robust
enough to handle complex situations.
> It sounds like there may not be a fix to this problem with the filesystem's
> design, outside of "do not get there, or else".
> It would even be useful for btrfs tools to start computing heuristics and
> output warnings like "you have more than 100 snapshots on this filesystem,
> this is not recommended, please read http://url/"
>
> Qu, Su, does that sound both reasonable and doable?
>
> Thanks,
> Marc
>
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: So, does btrfs check lowmem take days? weeks?
2018-07-03 4:22 ` Marc MERLIN
2018-07-03 8:34 ` Su Yue
@ 2018-07-03 8:50 ` Qu Wenruo
2018-07-03 14:38 ` Marc MERLIN
2018-07-03 21:46 ` Chris Murphy
1 sibling, 2 replies; 65+ messages in thread
From: Qu Wenruo @ 2018-07-03 8:50 UTC (permalink / raw)
To: Marc MERLIN, Chris Murphy; +Cc: Su Yue, Btrfs BTRFS
On 2018年07月03日 12:22, Marc MERLIN wrote:
> On Mon, Jul 02, 2018 at 06:31:43PM -0600, Chris Murphy wrote:
>> So the idea behind journaled file systems is that journal replay
>> enabled mount time "repair" that's faster than an fsck. Already Btrfs
>> use cases with big, but not huge, file systems makes btrfs check a
>> problem. Either running out of memory or it takes too long. So already
>> it isn't scaling as well as ext4 or XFS in this regard.
>>
>> So what's the future hold? It seems like the goal is that the problems
>> must be avoided in the first place rather than to repair them after
>> the fact.
>>
>> Are the problems Marc is running into understood well enough that
>> there can eventually be a fix, maybe even an on-disk format change,
>> that prevents such problems from happening in the first place?
>>
>> Or does it make sense for him to be running with btrfs debug or some
>> subset of btrfs integrity checking mask to try to catch the problems
>> in the act of them happening?
>
> Those are all good questions.
> To be fair, I cannot claim that btrfs was at fault for whatever filesystem
> damage I ended up with. It's very possible that it happened due to a flaky
> SATA card that kicked drives off the bus when it shouldn't have.
However, this still doesn't explain the problem you hit.
In theory (well, it's theory by all means), btrfs is fully atomic for
its transaction, even for its data (with csum and cow).
So even a powerloss/data corruption happens between transactions, we
should get the previous trans.
There must be something wrong, however due to the size of the fs, and
the complexity of extent tree, I can't tell.
> Sure in theory a journaling filesystem can recover from unexpected power
> loss and drives dropping off at bad times, but I'm going to guess that
> btrfs' complexity also means that it has data structures (extent tree?) that
> need to be updated completely "or else".
I'm wondering if we have some hidden bug somewhere.
For extent tree, it's metadata, and is protected by mandatory CoW, it
shouldn't be corrupted, unless we have bug in the already complex
delayed reference code, or some unexpected behavior (flush/fua failure)
due to so many layers (dmcrypt + mdraid).
Anyway, if we can't reproduce it in a controlled environment (my VM with
pretty small and plain fs), it's really hard to locate the bug.
>
> I'm obviously ok with a filesystem check being necessary to recover in cases
> like this, afterall I still occasionally have to run e2fsck on ext4 too, but
> I'm a lot less thrilled with the btrfs situation where basically the repair
> tools can either completely crash your kernel, or take days and then either
> get stuck in an infinite loop or hit an algorithm that can't scale if you
> have too many hardlinks/snapshots.
Unfortunately, this is the price paid for the super-fast snapshot creation.
The tradeoff cannot be easily solved.
(Another way to implement snapshot is like LVM thin provision, each time
a snapshot is created we need to iterate all allocated blocks of the
thin LV, which can't scale very well when the fs grows, but makes its
mapping management pretty easy. But I think LVM guys have done some
trick to improve the performance)
>
> It sounds like there may not be a fix to this problem with the filesystem's
> design, outside of "do not get there, or else".
> It would even be useful for btrfs tools to start computing heuristics and
> output warnings like "you have more than 100 snapshots on this filesystem,
> this is not recommended, please read http://url/"
This looks pretty doable, but maybe it's better to add some warning at
btrfs progs (both "subvolume snapshot" and "receive").
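Such a heuristic could look roughly like this shell sketch. The threshold and message wording are made up for illustration; a real check would live inside btrfs-progs itself:

```shell
# Warn when a filesystem carries more snapshots than recommended.
# The threshold of 100 is an assumption taken from the discussion above.
warn_snapshot_count() {
    count=$1
    threshold=${2:-100}
    if [ "$count" -gt "$threshold" ]; then
        echo "warning: $count snapshots exceeds recommended $threshold" >&2
        return 1
    fi
    return 0
}

# Real usage would feed it the live count (needs root), e.g.:
#   warn_snapshot_count "$(btrfs subvolume list -s /mnt | wc -l)"
```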
Thanks,
Qu
>
> Qu, Su, does that sound both reasonable and doable?
>
> Thanks,
> Marc
>
^ permalink raw reply [flat|nested] 65+ messages in thread
* RE: how to best segment a big block device in resizeable btrfs filesystems?
2018-07-03 4:15 ` Marc MERLIN
@ 2018-07-03 9:55 ` Paul Jones
2018-07-03 11:29 ` Qu Wenruo
0 siblings, 1 reply; 65+ messages in thread
From: Paul Jones @ 2018-07-03 9:55 UTC (permalink / raw)
To: Marc MERLIN, Qu Wenruo; +Cc: Su Yue, linux-btrfs
> -----Original Message-----
> From: linux-btrfs-owner@vger.kernel.org <linux-btrfs-
> owner@vger.kernel.org> On Behalf Of Marc MERLIN
> Sent: Tuesday, 3 July 2018 2:16 PM
> To: Qu Wenruo <quwenruo.btrfs@gmx.com>
> Cc: Su Yue <suy.fnst@cn.fujitsu.com>; linux-btrfs@vger.kernel.org
> Subject: Re: how to best segment a big block device in resizeable btrfs
> filesystems?
>
> On Tue, Jul 03, 2018 at 09:37:47AM +0800, Qu Wenruo wrote:
> > > If I do this, I would have
> > > software raid 5 < dmcrypt < bcache < lvm < btrfs
> > > That's a lot of layers, and that's also starting to make me nervous :)
> >
> > If you could keep the number of snapshots to minimal (less than 10)
> > for each btrfs (and the number of send source is less than 5), one big
> > btrfs may work in that case.
>
> Well, we kind of discussed this already. If btrfs falls over once you reach
> 100 snapshots or so (and it sure seems to in my case), I won't be much
> better off.
> Having btrfs check --repair fail because 32GB of RAM is not enough, and it's
> unable to use swap, is a big deal in my case. You also confirmed that btrfs
> check lowmem does not scale to filesystems like mine, so this translates into
> "if regular btrfs check repair can't fit in 32GB, I am completely out of luck if
> anything happens to the filesystem"
Just out of curiosity I had a look at my backup filesystem.
vm-server /media/backup # btrfs fi us /media/backup/
Overall:
Device size: 5.46TiB
Device allocated: 3.42TiB
Device unallocated: 2.04TiB
Device missing: 0.00B
Used: 1.80TiB
Free (estimated): 1.83TiB (min: 1.83TiB)
Data ratio: 2.00
Metadata ratio: 2.00
Global reserve: 512.00MiB (used: 0.00B)
Data,RAID1: Size:1.69TiB, Used:906.26GiB
/dev/mapper/a-backup--a 1.69TiB
/dev/mapper/b-backup--b 1.69TiB
Metadata,RAID1: Size:19.00GiB, Used:16.90GiB
/dev/mapper/a-backup--a 19.00GiB
/dev/mapper/b-backup--b 19.00GiB
System,RAID1: Size:64.00MiB, Used:336.00KiB
/dev/mapper/a-backup--a 64.00MiB
/dev/mapper/b-backup--b 64.00MiB
Unallocated:
/dev/mapper/a-backup--a 1.02TiB
/dev/mapper/b-backup--b 1.02TiB
compress=zstd,space_cache=v2
202 snapshots, heavily de-duplicated
551G / 361,000 files in latest snapshot
Btrfs check normal mode took 12 mins and 11.5G ram
Lowmem mode I stopped after 4 hours, max memory usage was around 3.9G
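Numbers like these can be gathered with GNU time, which reports wall-clock time and peak RSS. The device path is hypothetical, the filesystem must be unmounted, and the `run` helper only echoes the commands:

```shell
# "run" only prints the command instead of executing it.
run() { echo "+ $*"; }

# Original (in-memory) mode:
run /usr/bin/time -v btrfs check /dev/mapper/a-backup--a

# Lowmem mode; look for "Maximum resident set size" in the report:
run /usr/bin/time -v btrfs check --mode=lowmem /dev/mapper/a-backup--a
```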
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: how to best segment a big block device in resizeable btrfs filesystems?
2018-07-03 9:55 ` Paul Jones
@ 2018-07-03 11:29 ` Qu Wenruo
0 siblings, 0 replies; 65+ messages in thread
From: Qu Wenruo @ 2018-07-03 11:29 UTC (permalink / raw)
To: Paul Jones, Marc MERLIN; +Cc: Su Yue, linux-btrfs
On 2018年07月03日 17:55, Paul Jones wrote:
>> -----Original Message-----
>> From: linux-btrfs-owner@vger.kernel.org <linux-btrfs-
>> owner@vger.kernel.org> On Behalf Of Marc MERLIN
>> Sent: Tuesday, 3 July 2018 2:16 PM
>> To: Qu Wenruo <quwenruo.btrfs@gmx.com>
>> Cc: Su Yue <suy.fnst@cn.fujitsu.com>; linux-btrfs@vger.kernel.org
>> Subject: Re: how to best segment a big block device in resizeable btrfs
>> filesystems?
>>
>> On Tue, Jul 03, 2018 at 09:37:47AM +0800, Qu Wenruo wrote:
>>>> If I do this, I would have
>>>> software raid 5 < dmcrypt < bcache < lvm < btrfs
>>>> That's a lot of layers, and that's also starting to make me nervous :)
>>>
>>> If you could keep the number of snapshots to minimal (less than 10)
>>> for each btrfs (and the number of send source is less than 5), one big
>>> btrfs may work in that case.
>>
>> Well, we kind of discussed this already. If btrfs falls over once you reach
>> 100 snapshots or so (and it sure seems to in my case), I won't be much
>> better off.
>> Having btrfs check --repair fail because 32GB of RAM is not enough, and it's
>> unable to use swap, is a big deal in my case. You also confirmed that btrfs
>> check lowmem does not scale to filesystems like mine, so this translates into
>> "if regular btrfs check repair can't fit in 32GB, I am completely out of luck if
>> anything happens to the filesystem"
>
> Just out of curiosity I had a look at my backup filesystem.
> vm-server /media/backup # btrfs fi us /media/backup/
> Overall:
> Device size: 5.46TiB
> Device allocated: 3.42TiB
> Device unallocated: 2.04TiB
> Device missing: 0.00B
> Used: 1.80TiB
> Free (estimated): 1.83TiB (min: 1.83TiB)
> Data ratio: 2.00
> Metadata ratio: 2.00
> Global reserve: 512.00MiB (used: 0.00B)
>
> Data,RAID1: Size:1.69TiB, Used:906.26GiB
It doesn't affect how fast check runs at all.
Unless --check-data-csum is specified.
And even if --check-data-csum is specified, most reads will still be
sequential, and deduped/reflinked extents won't affect the csum verification speed.
> /dev/mapper/a-backup--a 1.69TiB
> /dev/mapper/b-backup--b 1.69TiB
>
> Metadata,RAID1: Size:19.00GiB, Used:16.90GiB
This is the main factor contributing to btrfs check time.
Just consider it as the minimal amount of data btrfs check needs to read.
> /dev/mapper/a-backup--a 19.00GiB
> /dev/mapper/b-backup--b 19.00GiB
>
> System,RAID1: Size:64.00MiB, Used:336.00KiB
> /dev/mapper/a-backup--a 64.00MiB
> /dev/mapper/b-backup--b 64.00MiB
>
> Unallocated:
> /dev/mapper/a-backup--a 1.02TiB
> /dev/mapper/b-backup--b 1.02TiB
>
> compress=zstd,space_cache=v2
> 202 snapshots, heavily de-duplicated
> 551G / 361,000 files in latest snapshot
No wonder it's so slow for lowmem mode.
>
> Btrfs check normal mode took 12 mins and 11.5G ram
> Lowmem mode I stopped after 4 hours, max memory usage was around 3.9G
For lowmem mode, btrfs check will use 25% of your total memory as a cache
to speed it up a little (but as you can see, it's still slow).
Maybe we could add an option to control how much memory we could use for
lowmem mode.
Thanks,
Qu
>
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: So, does btrfs check lowmem take days? weeks?
2018-07-03 8:50 ` Qu Wenruo
@ 2018-07-03 14:38 ` Marc MERLIN
2018-07-03 21:46 ` Chris Murphy
1 sibling, 0 replies; 65+ messages in thread
From: Marc MERLIN @ 2018-07-03 14:38 UTC (permalink / raw)
To: Qu Wenruo; +Cc: Chris Murphy, Su Yue, Btrfs BTRFS
On Tue, Jul 03, 2018 at 04:50:48PM +0800, Qu Wenruo wrote:
> > It sounds like there may not be a fix to this problem with the filesystem's
> > design, outside of "do not get there, or else".
> > It would even be useful for btrfs tools to start computing heuristics and
> > output warnings like "you have more than 100 snapshots on this filesystem,
> > this is not recommended, please read http://url/"
>
> This looks pretty doable, but maybe it's better to add some warning at
> btrfs progs (both "subvolume snapshot" and "receive").
This is what I meant to say, correct.
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: So, does btrfs check lowmem take days? weeks?
2018-07-03 8:34 ` Su Yue
@ 2018-07-03 21:34 ` Chris Murphy
2018-07-03 21:40 ` Marc MERLIN
0 siblings, 1 reply; 65+ messages in thread
From: Chris Murphy @ 2018-07-03 21:34 UTC (permalink / raw)
To: Su Yue; +Cc: Marc MERLIN, Chris Murphy, Qu Wenruo, Btrfs BTRFS
On Tue, Jul 3, 2018 at 2:34 AM, Su Yue <suy.fnst@cn.fujitsu.com> wrote:
> Yes, extent tree is the hardest part for lowmem mode. I'm quite
> confident the tool can deal well with file trees(which records metadata
> about file and directory name, relationships).
> As for extent tree, I have few confidence due to its complexity.
I have to ask again if there's some metadata integrity mask option Marc
should use to try to catch the corruption cause in the first place?
His use case really can't afford either mode of btrfs check. Also,
check is only backward looking; it doesn't show what was happening at
the time. And for big file systems, check rapidly doesn't scale at all
anyway.
And now he's modifying his layout to avoid the problem from happening
again which makes it less likely to catch the cause, and get it fixed.
I think if he's willing to build a kernel with integrity checker
enabled, it should be considered but only if it's likely to reveal why
the problem is happening, even if it can't repair the problem once
it's happened. He's already in that situation so masked integrity
checking is no worse, at least it gives a chance to improve Btrfs
rather than it being a mystery how it got corrupt.
--
Chris Murphy
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: So, does btrfs check lowmem take days? weeks?
2018-07-03 21:34 ` Chris Murphy
@ 2018-07-03 21:40 ` Marc MERLIN
2018-07-04 1:37 ` Su Yue
0 siblings, 1 reply; 65+ messages in thread
From: Marc MERLIN @ 2018-07-03 21:40 UTC (permalink / raw)
To: Chris Murphy; +Cc: Su Yue, Qu Wenruo, Btrfs BTRFS
On Tue, Jul 03, 2018 at 03:34:45PM -0600, Chris Murphy wrote:
> On Tue, Jul 3, 2018 at 2:34 AM, Su Yue <suy.fnst@cn.fujitsu.com> wrote:
>
> > Yes, extent tree is the hardest part for lowmem mode. I'm quite
> > confident the tool can deal well with file trees(which records metadata
> > about file and directory name, relationships).
> > As for extent tree, I have few confidence due to its complexity.
>
> I have to ask again if there's some metadata integrity mask option Marc
> should use to try to catch the corruption cause in the first place?
>
> His use case really can't afford either mode of btrfs check. And also
> check is only backward looking, it doesn't show what was happening at
> the time. And for big file systems, check rapidly doesn't scale at all
> anyway.
>
> And now he's modifying his layout to avoid the problem from happening
> again which makes it less likely to catch the cause, and get it fixed.
> I think if he's willing to build a kernel with integrity checker
> enabled, it should be considered but only if it's likely to reveal why
> the problem is happening, even if it can't repair the problem once
> it's happened. He's already in that situation so masked integrity
> checking is no worse, at least it gives a chance to improve Btrfs
> rather than it being a mystery how it got corrupt.
Yeah, I'm fine waiting a few more days with this down and gathering data if
that helps.
But due to the size, a full btrfs image may be a bit larger than we
want, not counting some confidential data in some filenames.
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 7F55D5F27AAF9D08
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: So, does btrfs check lowmem take days? weeks?
2018-07-03 8:50 ` Qu Wenruo
2018-07-03 14:38 ` Marc MERLIN
@ 2018-07-03 21:46 ` Chris Murphy
2018-07-03 22:00 ` Marc MERLIN
1 sibling, 1 reply; 65+ messages in thread
From: Chris Murphy @ 2018-07-03 21:46 UTC (permalink / raw)
To: Qu Wenruo; +Cc: Marc MERLIN, Chris Murphy, Su Yue, Btrfs BTRFS
On Tue, Jul 3, 2018 at 2:50 AM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
> There must be something wrong, however due to the size of the fs, and
> the complexity of extent tree, I can't tell.
Right, which is why I'm asking if any of the metadata integrity
checker mask options might reveal what's going wrong?
I guess the big issues are:
a. compile kernel with CONFIG_BTRFS_FS_CHECK_INTEGRITY=y is necessary
b. it can come with a high resource burden depending on the mask and
where the log is being written (write system logs to a different file
system for sure)
c. the granularity offered in the integrity checker might not be enough.
d. it might take a while after corruptions are injected before the
corruption is noticed and flagged.
So it might be pointless, no idea.
--
Chris Murphy
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: So, does btrfs check lowmem take days? weeks?
2018-07-03 21:46 ` Chris Murphy
@ 2018-07-03 22:00 ` Marc MERLIN
2018-07-03 22:52 ` Qu Wenruo
0 siblings, 1 reply; 65+ messages in thread
From: Marc MERLIN @ 2018-07-03 22:00 UTC (permalink / raw)
To: Chris Murphy; +Cc: Qu Wenruo, Su Yue, Btrfs BTRFS
On Tue, Jul 03, 2018 at 03:46:59PM -0600, Chris Murphy wrote:
> On Tue, Jul 3, 2018 at 2:50 AM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> >
> >
> > There must be something wrong, however due to the size of the fs, and
> > the complexity of extent tree, I can't tell.
>
> Right, which is why I'm asking if any of the metadata integrity
> checker mask options might reveal what's going wrong?
>
> I guess the big issues are:
> a. compile kernel with CONFIG_BTRFS_FS_CHECK_INTEGRITY=y is necessary
> b. it can come with a high resource burden depending on the mask and
> where the log is being written (write system logs to a different file
> system for sure)
> c. the granularity offered in the integrity checker might not be enough.
> d. might take a while before corruptions are injected before
> corruption is noticed and flagged.
Back to where I'm at right now. I'm going to delete this filesystem and
start over very soon. Tomorrow or the day after.
I'm happy to get more data off it if someone wants it for posterity, but
I indeed need to recover soon, since sitting with a dead backup server is
not a good place to be :)
Thanks,
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 7F55D5F27AAF9D08
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: So, does btrfs check lowmem take days? weeks?
2018-07-03 22:00 ` Marc MERLIN
@ 2018-07-03 22:52 ` Qu Wenruo
0 siblings, 0 replies; 65+ messages in thread
From: Qu Wenruo @ 2018-07-03 22:52 UTC (permalink / raw)
To: Marc MERLIN, Chris Murphy; +Cc: Su Yue, Btrfs BTRFS
On 2018年07月04日 06:00, Marc MERLIN wrote:
> On Tue, Jul 03, 2018 at 03:46:59PM -0600, Chris Murphy wrote:
>> On Tue, Jul 3, 2018 at 2:50 AM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>>
>>>
>>> There must be something wrong, however due to the size of the fs, and
>>> the complexity of extent tree, I can't tell.
>>
>> Right, which is why I'm asking if any of the metadata integrity
>> checker mask options might reveal what's going wrong?
>>
>> I guess the big issues are:
>> a. compile kernel with CONFIG_BTRFS_FS_CHECK_INTEGRITY=y is necessary
>> b. it can come with a high resource burden depending on the mask and
>> where the log is being written (write system logs to a different file
>> system for sure)
>> c. the granularity offered in the integrity checker might not be enough.
>> d. might take a while before corruptions are injected before
>> corruption is noticed and flagged.
>
> Back to where I'm at right now. I'm going to delete this filesystem and
> start over very soon. Tomorrow or the day after.
> I'm happy to get more data off it if someone wants it for posterity, but
> I indeed need to recover soon since being with a dead backup server is
> not a good place to be in :)
Feel free to recover asap, as the extent tree is really too large for a
human to analyse manually.
Thanks,
Qu
>
> Thanks,
> Marc
>
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: So, does btrfs check lowmem take days? weeks?
2018-07-03 21:40 ` Marc MERLIN
@ 2018-07-04 1:37 ` Su Yue
0 siblings, 0 replies; 65+ messages in thread
From: Su Yue @ 2018-07-04 1:37 UTC (permalink / raw)
To: Marc MERLIN, Chris Murphy; +Cc: Qu Wenruo, Btrfs BTRFS
On 07/04/2018 05:40 AM, Marc MERLIN wrote:
> On Tue, Jul 03, 2018 at 03:34:45PM -0600, Chris Murphy wrote:
>> On Tue, Jul 3, 2018 at 2:34 AM, Su Yue <suy.fnst@cn.fujitsu.com> wrote:
>>
>>> Yes, extent tree is the hardest part for lowmem mode. I'm quite
>>> confident the tool can deal well with file trees(which records metadata
>>> about file and directory name, relationships).
>>> As for extent tree, I have few confidence due to its complexity.
>>
>> I have to ask again if there's some metadata integrity mask option Marc
>> should use to try to catch the corruption cause in the first place?
>>
>> His use case really can't afford either mode of btrfs check. And also
>> check is only backward looking, it doesn't show what was happening at
>> the time. And for big file systems, check rapidly doesn't scale at all
>> anyway.
>>
>> And now he's modifying his layout to avoid the problem from happening
>> again which makes it less likely to catch the cause, and get it fixed.
>> I think if he's willing to build a kernel with integrity checker
>> enabled, it should be considered but only if it's likely to reveal why
>> the problem is happening, even if it can't repair the problem once
>> it's happened. He's already in that situation so masked integrity
>> checking is no worse, at least it gives a chance to improve Btrfs
>> rather than it being a mystery how it got corrupt.
>
> Yeah, I'm fine waiting a few more ays with this down and gather data if
> that helps.
Thanks! I will write a special version which skips checking wrong extent
items and prints debug logs.
It must also run faster, to help us locate the stuck problem.
Su
> But due to the size, a full btrfs image may be a bit larger than we
> want, not counting some confidential data in some filenames.
>
> Marc
>
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: how to best segment a big block device in resizeable btrfs filesystems?
2018-07-03 7:15 ` Duncan
@ 2018-07-06 4:28 ` Andrei Borzenkov
2018-07-08 8:05 ` Duncan
0 siblings, 1 reply; 65+ messages in thread
From: Andrei Borzenkov @ 2018-07-06 4:28 UTC (permalink / raw)
To: Duncan, linux-btrfs
03.07.2018 10:15, Duncan wrote:
> Andrei Borzenkov posted on Tue, 03 Jul 2018 07:25:14 +0300 as excerpted:
>
>> 02.07.2018 21:35, Austin S. Hemmelgarn wrote:
>>> them (trimming blocks on BTRFS gets rid of old root trees, so it's a
>>> bit dangerous to do it while writes are happening).
>>
>> Could you please elaborate? Do you mean btrfs can trim data before new
>> writes are actually committed to disk?
>
> No.
>
> But normally old roots aren't rewritten for some time simply due to odds
> (fuller filesystems will of course recycle them sooner), and the btrfs
> mount option usebackuproot (formerly recovery, until the norecovery mount
> option that parallels that of other filesystems was added and this option
> was renamed to avoid confusion) can be used to try an older root if the
> current root is too damaged to successfully mount.
>
> But other than simply by odds not using them again immediately, btrfs has
> no special protection for those old roots, and trim/discard will recover
> them to hardware-unused as it does any other unused space, tho whether it
> simply marks them for later processing or actually processes them
> immediately is up to the individual implementation -- some do it
> immediately, killing all chances at using the backup root because it's
> already zeroed out, some don't.
>
How is it relevant to "while writes are happening"? Will trimming old
trees immediately after writes have stopped be any different? Why?
> In the context of the discard mount option, that can mean there's never
> any old roots available ever, as they've already been cleaned up by the
> hardware due to the discard option telling the hardware to do it.
>
> But even not using that mount option, and simply doing the trims
> periodically, as done weekly by for instance the systemd fstrim timer and
> service units, or done manually if you prefer, obviously potentially
> wipes the old roots at that point. If the system's effectively idle at
> the time, not much risk as the current commit is likely to represent a
> filesystem in full stasis, but if there's lots of writes going on at that
> moment *AND* the system happens to crash at just the wrong time, before
> additional commits have recreated at least a bit of root history, again,
> you'll potentially be left without any old roots for the usebackuproot
> mount option to try to fall back to, should it actually be necessary.
>
Sorry? You are just saying that "previous state can be discarded before
new state is committed", just more verbosely.
* Re: how to best segment a big block device in resizeable btrfs filesystems?
2018-07-06 4:28 ` Andrei Borzenkov
@ 2018-07-08 8:05 ` Duncan
0 siblings, 0 replies; 65+ messages in thread
From: Duncan @ 2018-07-08 8:05 UTC (permalink / raw)
To: linux-btrfs
Andrei Borzenkov posted on Fri, 06 Jul 2018 07:28:48 +0300 as excerpted:
> 03.07.2018 10:15, Duncan wrote:
>> Andrei Borzenkov posted on Tue, 03 Jul 2018 07:25:14 +0300 as
>> excerpted:
>>
>>> 02.07.2018 21:35, Austin S. Hemmelgarn wrote:
>>>> them (trimming blocks on BTRFS gets rid of old root trees, so it's a
>>>> bit dangerous to do it while writes are happening).
>>>
>>> Could you please elaborate? Do you mean btrfs can trim data before new
>>> writes are actually committed to disk?
>>
>> No.
>>
>> But normally old roots aren't rewritten for some time simply due to
>> odds (fuller filesystems will of course recycle them sooner), and the
>> btrfs mount option usebackuproot (formerly recovery, until the
>> norecovery mount option that parallels that of other filesystems was
>> added and this option was renamed to avoid confusion) can be used to
>> try an older root if the current root is too damaged to successfully
>> mount.
>> But other than simply by odds not using them again immediately, btrfs
>> has no special protection for those old roots, and trim/discard will
>> recover them to hardware-unused as it does any other unused space, tho
>> whether it simply marks them for later processing or actually processes
>> them immediately is up to the individual implementation -- some do it
>> immediately, killing all chances at using the backup root because it's
>> already zeroed out, some don't.
>>
>>
> How is it relevant to "while writes are happening"? Will trimming old
> trees immediately after writes have stopped be any different? Why?
Define "while writes are happening" vs. "immediately after writes have
stopped". How soon is "immediately", and does the "writes stopped"
condition account for data that has reached the device-hardware write
buffer (so is no longer being transmitted to the device across the bus)
but has not actually been written to media?
On a reasonably quiescent system, multiple empty write cycles are likely
to have occurred since the last write barrier, and anything in-process is
likely to have made it to media even if software is missing a write
barrier it needs (software bug) or the hardware lies about honoring the
write barrier (hardware bug, allegedly sometimes deliberate on hardware
willing to gamble with your data that a crash won't happen in a critical
moment, a somewhat rare occurrence, in order to improve normal
operation performance metrics).
On an IO-maxed system, data and write-barriers are coming down as fast as
the system can handle them, and write-barriers become critical -- if a
crash comes after something was supposed to get to media but didn't,
either because of a missing write barrier or because the
hardware/firmware lied about the barrier and claimed data it was supposed
to ensure was on-media when it wasn't, then the btrfs atomic-cow commit
guarantees of consistent state at each commit go out the window.
At this point it becomes useful to have a number of previous "guaranteed
consistent state" roots to fall back on, with the /hope/ being that at
least /one/ of them is usably consistent. If all but the last one are
wiped due to trim...
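The fallback chain described here -- a short ring of backup roots that
trim can wipe out from under you -- can be sketched as a toy Python model.
This is not btrfs code; the class and slot count are illustrative (though
btrfs does keep a small fixed number of backup root slots in the
superblock), and the mount method only mimics what -o usebackuproot
attempts:

```python
# Toy model of the backup-root fallback discussed above -- not btrfs code.
# Assumptions: a small fixed ring of backup-root slots, a trim that
# discards everything but the current root, and a mount that walks the
# history to the newest usable root, as -o usebackuproot tries to.

class ToyFs:
    SLOTS = 4  # illustrative backup-root slot count

    def __init__(self):
        self.roots = []       # (generation, usable), newest last
        self.generation = 0

    def commit(self, usable=True):
        """Write a new root, keeping only the last few as backups."""
        self.generation += 1
        self.roots = (self.roots + [(self.generation, usable)])[-self.SLOTS:]

    def trim(self):
        """Discard all old roots, as fstrim or the discard option might."""
        self.roots = self.roots[-1:]

    def mount_usebackuproot(self):
        """Fall back through root history to the newest usable root."""
        for gen, usable in reversed(self.roots):
            if usable:
                return gen
        raise OSError("no usable root left to fall back to")

fs = ToyFs()
for _ in range(4):
    fs.commit()
fs.commit(usable=False)           # a crash left the newest root damaged
print(fs.mount_usebackuproot())   # prints 4: an older root still works

fs.trim()                         # but if trim ran first...
try:
    fs.mount_usebackuproot()
except OSError as e:
    print(e)                      # ...there is nothing to fall back to
```

The model makes the same point as the prose: the root history only
protects you for as long as nothing has discarded it.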
When the system isn't write-maxed the write will have almost certainly
made it regardless of whether the barrier is there or not, because
there's enough idle time to finish the current write before another one
comes down the pipe, so the last-written root is almost certain to be
fine regardless of barriers, and the history of past roots doesn't matter
even if there's a crash.
If "immediately after writes have stopped" is strictly defined as a
condition when all writes including the btrfs commit updating the current
root and the superblock pointers to the current root have completed, with
no new writes coming down the pipe in the mean time that might have
delayed a critical update if a barrier was missed, then trimming old
roots in this state should be entirely safe, and the distinction between
that state and the "while writes are happening" is clear.
But if "immediately after writes have stopped" is less strictly defined,
then the distinction between that state and "while writes are happening"
remains blurry at best, and having old roots around to fall back on in
case a write-barrier was missed (for whatever reason, hardware or
software) becomes a very good thing.
Of course the fact that trim/discard itself is an instruction written to
the device in the combined command/data stream complexifies the picture
substantially. If those write barriers get missed who knows what state
the new root is in, and if the old ones got erased... But again, on a
mostly idle system, it'll probably all "just work", because the writes
will likely all make it to media, regardless, because there's not a bunch
of other writes competing for limited write bandwidth and making ordering
critical.
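The missed-barrier failure mode in the last few paragraphs can also be
sketched with a toy model (not real block-layer code; the cache behavior
is a deliberate simplification): writes queued after the last honored
barrier may reach media in any partial order, so without a barrier
between the tree blocks and the superblock update, a crash can leave a
superblock pointing at blocks that never landed.

```python
# Toy model of write barriers vs. crash ordering -- not real block-layer
# code. Writes before a barrier are guaranteed on media; writes after the
# last barrier land in a random partial order when the crash hits.

import random

def crash_flush(queue, rng):
    """Return which queued writes made it to media at crash time."""
    landed, tail = [], []
    for op in queue:
        if op == "BARRIER":
            landed += tail   # barrier: everything queued so far is on media
            tail = []
        else:
            tail.append(op)
    rng.shuffle(tail)        # post-barrier writes race the crash...
    cut = rng.randrange(len(tail) + 1)
    return landed + tail[:cut]   # ...and only a prefix of them lands

def consistent(on_media):
    # Consistent iff the superblock never landed without its root blocks.
    return "superblock" not in on_media or "root-blocks" in on_media

rng = random.Random(0)
with_barrier = ["root-blocks", "BARRIER", "superblock"]
no_barrier = ["root-blocks", "superblock"]

# With the barrier, every simulated crash leaves a consistent state:
assert all(consistent(crash_flush(with_barrier, rng)) for _ in range(1000))
# Without it, some crash orderings strand the superblock:
bad = sum(not consistent(crash_flush(no_barrier, rng)) for _ in range(1000))
assert bad > 0   # some orderings left the superblock without its root
```

Which is exactly why a history of older roots is the safety net when a
barrier is missed: the stranded superblock is useless, but an earlier,
fully-landed root may still mount.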
>> In the context of the discard mount option, that can mean there's never
>> any old roots available ever, as they've already been cleaned up by the
>> hardware due to the discard option telling the hardware to do it.
>>
>> But even not using that mount option, and simply doing the trims
>> periodically, as done weekly by for instance the systemd fstrim timer
>> and service units, or done manually if you prefer, obviously
>> potentially wipes the old roots at that point. If the system's
>> effectively idle at the time, not much risk as the current commit is
>> likely to represent a filesystem in full stasis, but if there's lots of
>> writes going on at that moment *AND* the system happens to crash at
>> just the wrong time, before additional commits have recreated at least
>> a bit of root history, again, you'll potentially be left without any
>> old roots for the usebackuproot mount option to try to fall back to,
>> should it actually be necessary.
>>
>>
> Sorry? You are just saying that "previous state can be discarded before
> new state is committed", just more verbosely.
No, it's more that the new state gets committed before the old is trimmed, but
should it turn out to be unusable (due to missing write barriers, etc,
which is more of an issue on a write-bottlenecked system), having a
history of old roots/states around to fall back to can be very useful.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
end of thread, other threads:[~2018-07-08 8:07 UTC | newest]
Thread overview: 65+ messages
2018-06-29 4:27 So, does btrfs check lowmem take days? weeks? Marc MERLIN
2018-06-29 5:07 ` Qu Wenruo
2018-06-29 5:28 ` Marc MERLIN
2018-06-29 5:48 ` Qu Wenruo
2018-06-29 6:06 ` Marc MERLIN
2018-06-29 6:29 ` Qu Wenruo
2018-06-29 6:59 ` Marc MERLIN
2018-06-29 7:09 ` Roman Mamedov
2018-06-29 7:22 ` Marc MERLIN
2018-06-29 7:34 ` Roman Mamedov
2018-06-29 8:04 ` Lionel Bouton
2018-06-29 16:24 ` btrfs send/receive vs rsync Marc MERLIN
2018-06-30 8:18 ` Duncan
2018-06-29 7:20 ` So, does btrfs check lowmem take days? weeks? Qu Wenruo
2018-06-29 7:28 ` Marc MERLIN
2018-06-29 17:10 ` Marc MERLIN
2018-06-30 0:04 ` Chris Murphy
2018-06-30 2:44 ` Marc MERLIN
2018-06-30 14:49 ` Qu Wenruo
2018-06-30 21:06 ` Marc MERLIN
2018-06-29 6:02 ` Su Yue
2018-06-29 6:10 ` Marc MERLIN
2018-06-29 6:32 ` Su Yue
2018-06-29 6:43 ` Marc MERLIN
2018-07-01 23:22 ` Marc MERLIN
2018-07-02 2:02 ` Su Yue
2018-07-02 3:22 ` Marc MERLIN
2018-07-02 6:22 ` Su Yue
2018-07-02 14:05 ` Marc MERLIN
2018-07-02 14:42 ` Qu Wenruo
2018-07-02 15:18 ` how to best segment a big block device in resizeable btrfs filesystems? Marc MERLIN
2018-07-02 16:59 ` Austin S. Hemmelgarn
2018-07-02 17:34 ` Marc MERLIN
2018-07-02 18:35 ` Austin S. Hemmelgarn
2018-07-02 19:40 ` Marc MERLIN
2018-07-03 4:25 ` Andrei Borzenkov
2018-07-03 7:15 ` Duncan
2018-07-06 4:28 ` Andrei Borzenkov
2018-07-08 8:05 ` Duncan
2018-07-03 0:51 ` Paul Jones
2018-07-03 4:06 ` Marc MERLIN
2018-07-03 4:26 ` Paul Jones
2018-07-03 5:42 ` Marc MERLIN
2018-07-03 1:37 ` Qu Wenruo
2018-07-03 4:15 ` Marc MERLIN
2018-07-03 9:55 ` Paul Jones
2018-07-03 11:29 ` Qu Wenruo
2018-07-03 4:23 ` Andrei Borzenkov
2018-07-02 15:19 ` So, does btrfs check lowmem take days? weeks? Marc MERLIN
2018-07-02 17:08 ` Austin S. Hemmelgarn
2018-07-02 17:33 ` Roman Mamedov
2018-07-02 17:39 ` Marc MERLIN
2018-07-03 0:31 ` Chris Murphy
2018-07-03 4:22 ` Marc MERLIN
2018-07-03 8:34 ` Su Yue
2018-07-03 21:34 ` Chris Murphy
2018-07-03 21:40 ` Marc MERLIN
2018-07-04 1:37 ` Su Yue
2018-07-03 8:50 ` Qu Wenruo
2018-07-03 14:38 ` Marc MERLIN
2018-07-03 21:46 ` Chris Murphy
2018-07-03 22:00 ` Marc MERLIN
2018-07-03 22:52 ` Qu Wenruo
2018-06-29 5:35 ` Su Yue
2018-06-29 5:46 ` Marc MERLIN