* So, does btrfs check lowmem take days? weeks?
@ 2018-06-29  4:27 Marc MERLIN
  2018-06-29  5:07 ` Qu Wenruo
  0 siblings, 1 reply; 65+ messages in thread
From: Marc MERLIN @ 2018-06-29  4:27 UTC (permalink / raw)
  To: linux-btrfs

Regular btrfs check --repair has a nice progress option. It wasn't
perfect, but it showed something.

But then it also takes all your memory quicker than the linux kernel can
defend itself and reliably completely kills my 32GB server quicker than
it can OOM anything.

lowmem repair seems to be going still, but it's been days and -p seems
to do absolutely nothing.

My filesystem is "only" 10TB or so, albeit with a lot of files.

2 things that come to mind
1) can lowmem have some progress working so that I know if I'm looking
at days, weeks, or even months before it will be done?

2) non lowmem is more efficient obviously when it doesn't completely
crash your machine, but could lowmem be given an amount of memory to use
for caching, or maybe use some heuristics based on RAM free so that it's
not so excruciatingly slow?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08


* Re: So, does btrfs check lowmem take days? weeks?
  2018-06-29  4:27 So, does btrfs check lowmem take days? weeks? Marc MERLIN
@ 2018-06-29  5:07 ` Qu Wenruo
  2018-06-29  5:28   ` Marc MERLIN
  2018-06-29  5:35   ` Su Yue
  0 siblings, 2 replies; 65+ messages in thread
From: Qu Wenruo @ 2018-06-29  5:07 UTC (permalink / raw)
  To: Marc MERLIN, linux-btrfs





On 2018-06-29 12:27, Marc MERLIN wrote:
> Regular btrfs check --repair has a nice progress option. It wasn't
> perfect, but it showed something.
> 
> But then it also takes all your memory quicker than the linux kernel can
> defend itself and reliably completely kills my 32GB server quicker than
> it can OOM anything.
> 
> lowmem repair seems to be going still, but it's been days and -p seems
> to do absolutely nothing.

I'm afraid you hit a bug in the lowmem repair code.
By all means, --repair shouldn't really be used unless you're pretty
sure the problem is something btrfs check can handle.

That's also why --repair is still marked as dangerous.
Especially when it's combined with experimental lowmem mode.

> 
> My filesystem is "only" 10TB or so, albeit with a lot of files.

Unless you have tons of snapshots and reflinked (deduped) files, it
shouldn't take so long.

> 
> 2 things that come to mind
> 1) can lowmem have some progress working so that I know if I'm looking
> at days, weeks, or even months before it will be done?

It's hard to estimate, especially when every cross check involves a lot
of disk IO.

But at least, we could add such indicator to show we're doing something.

> 
> 2) non lowmem is more efficient obviously when it doesn't completely
> crash your machine, but could lowmem be given an amount of memory to use
> for caching, or maybe use some heuristics based on RAM free so that it's
> not so excruciatingly slow?

IIRC recent commit has added the ability.
a5ce5d219822 ("btrfs-progs: extent-cache: actually cache extent buffers")

That's already included in btrfs-progs v4.13.2.
So it should be a dead loop which lowmem repair code can't handle.
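
(To verify which btrfs-progs build is in use, and whether it already contains
that commit, something like the following works; the tag name here is only an
example:

  btrfs --version
  # inside a btrfs-progs git checkout:
  git merge-base --is-ancestor a5ce5d219822 v4.13.2 && echo "commit included")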

Thanks,
Qu

> 
> Thanks,
> Marc
> 




* Re: So, does btrfs check lowmem take days? weeks?
  2018-06-29  5:07 ` Qu Wenruo
@ 2018-06-29  5:28   ` Marc MERLIN
  2018-06-29  5:48     ` Qu Wenruo
  2018-06-29  6:02     ` Su Yue
  2018-06-29  5:35   ` Su Yue
  1 sibling, 2 replies; 65+ messages in thread
From: Marc MERLIN @ 2018-06-29  5:28 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Fri, Jun 29, 2018 at 01:07:20PM +0800, Qu Wenruo wrote:
> > lowmem repair seems to be going still, but it's been days and -p seems
> > to do absolutely nothing.
> 
> I'm afraid you hit a bug in the lowmem repair code.
> By all means, --repair shouldn't really be used unless you're pretty
> sure the problem is something btrfs check can handle.
> 
> That's also why --repair is still marked as dangerous.
> Especially when it's combined with experimental lowmem mode.

Understood, but btrfs got corrupted (by itself or not, I don't know)
I cannot mount the filesystem read/write
I cannot btrfs check --repair it since that code will kill my machine
What do I have left?

> > My filesystem is "only" 10TB or so, albeit with a lot of files.
> 
> Unless you have tons of snapshots and reflinked (deduped) files, it
> shouldn't take so long.

I may have a fair amount.
gargamel:~# btrfs check --mode=lowmem --repair -p /dev/mapper/dshelf2 
enabling repair mode
WARNING: low-memory mode repair support is only partial
Checking filesystem on /dev/mapper/dshelf2
UUID: 0f1a0c9f-4e54-4fa7-8736-fd50818ff73d
Fixed 0 roots.
ERROR: extent[84302495744, 69632] referencer count mismatch (root: 21872, owner: 374857, offset: 3407872) wanted: 3, have: 4
Created new chunk [18457780224000 1073741824]
Delete backref in extent [84302495744 69632]
ERROR: extent[84302495744, 69632] referencer count mismatch (root: 22911, owner: 374857, offset: 3407872) wanted: 3, have: 4
Delete backref in extent [84302495744 69632]
ERROR: extent[125712527360, 12214272] referencer count mismatch (root: 21872, owner: 374857, offset: 114540544) wanted: 181, have: 240
Delete backref in extent [125712527360 12214272]
ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 21872, owner: 374857, offset: 126754816) wanted: 68, have: 115
Delete backref in extent [125730848768 5111808]
ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 22911, owner: 374857, offset: 126754816) wanted: 68, have: 115
Delete backref in extent [125730848768 5111808]
ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 21872, owner: 374857, offset: 131866624) wanted: 115, have: 143
Delete backref in extent [125736914944 6037504]
ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 22911, owner: 374857, offset: 131866624) wanted: 115, have: 143
Delete backref in extent [125736914944 6037504]
ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 21872, owner: 374857, offset: 148234240) wanted: 302, have: 431
Delete backref in extent [129952120832 20242432]
ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 22911, owner: 374857, offset: 148234240) wanted: 356, have: 433
Delete backref in extent [129952120832 20242432]
ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 21872, owner: 374857, offset: 180371456) wanted: 161, have: 240
Delete backref in extent [134925357056 11829248]
ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 22911, owner: 374857, offset: 180371456) wanted: 162, have: 240
Delete backref in extent [134925357056 11829248]
ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 21872, owner: 374857, offset: 192200704) wanted: 170, have: 249
Delete backref in extent [147895111680 12345344]
ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 22911, owner: 374857, offset: 192200704) wanted: 172, have: 251
Delete backref in extent [147895111680 12345344]
ERROR: extent[150850146304, 17522688] referencer count mismatch (root: 21872, owner: 374857, offset: 217653248) wanted: 348, have: 418
Delete backref in extent [150850146304 17522688]
ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 22911, owner: 374857, offset: 235175936) wanted: 555, have: 1449
Deleted root 2 item[156909494272, 178, 5476627808561673095]
ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 21872, owner: 374857, offset: 235175936) wanted: 556, have: 1452
Deleted root 2 item[156909494272, 178, 7338474132555182983]
ERROR: file extent[374857 235184128] root 21872 owner 21872 backref lost
Add one extent data backref [156909494272 55320576]
ERROR: file extent[374857 235184128] root 22911 owner 22911 backref lost
Add one extent data backref [156909494272 55320576]

The last two ERROR lines took over a day to get generated, so I'm not sure if it's still working, but just slowly.
For what it's worth non lowmem check used to take 12 to 24H on that filesystem back when it still worked.

> > 2 things that come to mind
> > 1) can lowmem have some progress working so that I know if I'm looking
> > at days, weeks, or even months before it will be done?
> 
> It's hard to estimate, especially when every cross check involves a lot
> of disk IO.
> But at least, we could add such indicator to show we're doing something.

Yes, anything to show that I should still wait is still good :)

> > 2) non lowmem is more efficient obviously when it doesn't completely
> > crash your machine, but could lowmem be given an amount of memory to use
> > for caching, or maybe use some heuristics based on RAM free so that it's
> > not so excruciatingly slow?
> 
> IIRC recent commit has added the ability.
> a5ce5d219822 ("btrfs-progs: extent-cache: actually cache extent buffers")
 
Oh, good.

> That's already included in btrfs-progs v4.13.2.
> So it should be a dead loop which lowmem repair code can't handle.

I see. Is there any reasonably easy way to check on this running process?

Both top and iotop show that it's working, but of course I can't tell if
it's looping, or not.

Then again, maybe it already fixed enough that I can mount my filesystem again.

But back to the main point, it's sad that after so many years, the
repair situation is still so suboptimal, especially when it's apparently
pretty easy for btrfs to get damaged (through its own fault or not, hard
to say).

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08


* Re: So, does btrfs check lowmem take days? weeks?
  2018-06-29  5:07 ` Qu Wenruo
  2018-06-29  5:28   ` Marc MERLIN
@ 2018-06-29  5:35   ` Su Yue
  2018-06-29  5:46     ` Marc MERLIN
  1 sibling, 1 reply; 65+ messages in thread
From: Su Yue @ 2018-06-29  5:35 UTC (permalink / raw)
  To: Qu Wenruo, Marc MERLIN, linux-btrfs



On 06/29/2018 01:07 PM, Qu Wenruo wrote:
> 
> 
> On 2018年06月29日 12:27, Marc MERLIN wrote:
>> Regular btrfs check --repair has a nice progress option. It wasn't
>> perfect, but it showed something.
>>
>> But then it also takes all your memory quicker than the linux kernel can
>> defend itself and reliably completely kills my 32GB server quicker than
>> it can OOM anything.
>>
>> lowmem repair seems to be going still, but it's been days and -p seems
>> to do absolutely nothing.
> 
> I'm afraid you hit a bug in the lowmem repair code.
> By all means, --repair shouldn't really be used unless you're pretty
> sure the problem is something btrfs check can handle.
> 
> That's also why --repair is still marked as dangerous.
> Especially when it's combined with experimental lowmem mode.
>
>>
>> My filesystem is "only" 10TB or so, albeit with a lot of files.
> 
> Unless you have tons of snapshots and reflinked (deduped) files, it
> shouldn't take so long.
> 
>>
>> 2 things that come to mind
>> 1) can lowmem have some progress working so that I know if I'm looking
>> at days, weeks, or even months before it will be done?
> 
> It's hard to estimate, especially when every cross check involves a lot
> of disk IO.
> 
> But at least, we could add such indicator to show we're doing something.

Maybe we can count all roots in the root tree first, then report
i/num_roots before checking each tree. So users can see whether the
check is doing something meaningful or is stuck in a dead loop.

Thanks,
Su

>>
>> 2) non lowmem is more efficient obviously when it doesn't completely
>> crash your machine, but could lowmem be given an amount of memory to use
>> for caching, or maybe use some heuristics based on RAM free so that it's
>> not so excruciatingly slow?
> 
> IIRC recent commit has added the ability.
> a5ce5d219822 ("btrfs-progs: extent-cache: actually cache extent buffers")
> 
> That's already included in btrfs-progs v4.13.2.
> So it should be a dead loop which lowmem repair code can't handle.
> 
> Thanks,
> Qu
> 
>>
>> Thanks,
>> Marc
>>
> 




* Re: So, does btrfs check lowmem take days? weeks?
  2018-06-29  5:35   ` Su Yue
@ 2018-06-29  5:46     ` Marc MERLIN
  0 siblings, 0 replies; 65+ messages in thread
From: Marc MERLIN @ 2018-06-29  5:46 UTC (permalink / raw)
  To: Su Yue; +Cc: Qu Wenruo, linux-btrfs

On Fri, Jun 29, 2018 at 01:35:06PM +0800, Su Yue wrote:
> > It's hard to estimate, especially when every cross check involves a lot
> > of disk IO.
> > 
> > But at least, we could add such indicator to show we're doing something.
> Maybe we can count all roots in the root tree first, then report
> i/num_roots before checking each tree. So users can see whether the
> check is doing something meaningful or is stuck in a dead loop.

Sounds reasonable.
Do you want to submit something to git master for btrfs-progs, so I can pull
it and just run my btrfs check again?

In the meantime, how sane does the output I just posted look?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08


* Re: So, does btrfs check lowmem take days? weeks?
  2018-06-29  5:28   ` Marc MERLIN
@ 2018-06-29  5:48     ` Qu Wenruo
  2018-06-29  6:06       ` Marc MERLIN
  2018-06-29  6:02     ` Su Yue
  1 sibling, 1 reply; 65+ messages in thread
From: Qu Wenruo @ 2018-06-29  5:48 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: linux-btrfs





On 2018-06-29 13:28, Marc MERLIN wrote:
> On Fri, Jun 29, 2018 at 01:07:20PM +0800, Qu Wenruo wrote:
>>> lowmem repair seems to be going still, but it's been days and -p seems
>>> to do absolutely nothing.
>>
>> I'm afraid you hit a bug in the lowmem repair code.
>> By all means, --repair shouldn't really be used unless you're pretty
>> sure the problem is something btrfs check can handle.
>>
>> That's also why --repair is still marked as dangerous.
>> Especially when it's combined with experimental lowmem mode.
> 
> Understood, but btrfs got corrupted (by itself or not, I don't know)
> I cannot mount the filesystem read/write
> I cannot btrfs check --repair it since that code will kill my machine
> What do I have left?

Just normal btrfs check, and post the output.
If normal check eats up all your memory, btrfs check --mode=lowmem.

--repair should be considered as the last method.

> 
>>> My filesystem is "only" 10TB or so, albeit with a lot of files.
>>
>> Unless you have tons of snapshots and reflinked (deduped) files, it
>> shouldn't take so long.
> 
> I may have a fair amount.
> gargamel:~# btrfs check --mode=lowmem --repair -p /dev/mapper/dshelf2 
> enabling repair mode
> WARNING: low-memory mode repair support is only partial
> Checking filesystem on /dev/mapper/dshelf2
> UUID: 0f1a0c9f-4e54-4fa7-8736-fd50818ff73d
> Fixed 0 roots.
> ERROR: extent[84302495744, 69632] referencer count mismatch (root: 21872, owner: 374857, offset: 3407872) wanted: 3, have: 4
> Created new chunk [18457780224000 1073741824]
> Delete backref in extent [84302495744 69632]
> ERROR: extent[84302495744, 69632] referencer count mismatch (root: 22911, owner: 374857, offset: 3407872) wanted: 3, have: 4
> Delete backref in extent [84302495744 69632]
> ERROR: extent[125712527360, 12214272] referencer count mismatch (root: 21872, owner: 374857, offset: 114540544) wanted: 181, have: 240
> Delete backref in extent [125712527360 12214272]
> ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 21872, owner: 374857, offset: 126754816) wanted: 68, have: 115
> Delete backref in extent [125730848768 5111808]
> ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 22911, owner: 374857, offset: 126754816) wanted: 68, have: 115
> Delete backref in extent [125730848768 5111808]
> ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 21872, owner: 374857, offset: 131866624) wanted: 115, have: 143
> Delete backref in extent [125736914944 6037504]
> ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 22911, owner: 374857, offset: 131866624) wanted: 115, have: 143
> Delete backref in extent [125736914944 6037504]
> ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 21872, owner: 374857, offset: 148234240) wanted: 302, have: 431
> Delete backref in extent [129952120832 20242432]
> ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 22911, owner: 374857, offset: 148234240) wanted: 356, have: 433
> Delete backref in extent [129952120832 20242432]
> ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 21872, owner: 374857, offset: 180371456) wanted: 161, have: 240
> Delete backref in extent [134925357056 11829248]
> ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 22911, owner: 374857, offset: 180371456) wanted: 162, have: 240
> Delete backref in extent [134925357056 11829248]
> ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 21872, owner: 374857, offset: 192200704) wanted: 170, have: 249
> Delete backref in extent [147895111680 12345344]
> ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 22911, owner: 374857, offset: 192200704) wanted: 172, have: 251
> Delete backref in extent [147895111680 12345344]
> ERROR: extent[150850146304, 17522688] referencer count mismatch (root: 21872, owner: 374857, offset: 217653248) wanted: 348, have: 418
> Delete backref in extent [150850146304 17522688]
> ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 22911, owner: 374857, offset: 235175936) wanted: 555, have: 1449
> Deleted root 2 item[156909494272, 178, 5476627808561673095]
> ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 21872, owner: 374857, offset: 235175936) wanted: 556, have: 1452
> Deleted root 2 item[156909494272, 178, 7338474132555182983]
> ERROR: file extent[374857 235184128] root 21872 owner 21872 backref lost
> Add one extent data backref [156909494272 55320576]
> ERROR: file extent[374857 235184128] root 22911 owner 22911 backref lost
> Add one extent data backref [156909494272 55320576]
> 
> The last two ERROR lines took over a day to get generated, so I'm not sure if it's still working, but just slowly.

OK, that explains something.

One extent is referenced hundreds of times, no wonder it will take a long time.

Just one tip here, there are really too many snapshots/reflinked files.
It's highly recommended to keep the number of snapshots to a reasonable
number (lower two digits).
Although btrfs snapshot is super fast, it puts a lot of pressure on its
extent tree, so there is no free lunch here.
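
Once the filesystem can be mounted (even read-only), a quick way to count
the snapshots is something like the following (the mountpoint is just an
example):

  btrfs subvolume list -s /mnt/mnt2 | wc -l   # -s lists only snapshots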

> For what it's worth non lowmem check used to take 12 to 24H on that filesystem back when it still worked.
> 
>>> 2 things that come to mind
>>> 1) can lowmem have some progress working so that I know if I'm looking
>>> at days, weeks, or even months before it will be done?
>>
>> It's hard to estimate, especially when every cross check involves a lot
>> of disk IO.
>> But at least, we could add such indicator to show we're doing something.
> 
> Yes, anything to show that I should still wait is still good :)
> 
>>> 2) non lowmem is more efficient obviously when it doesn't completely
>>> crash your machine, but could lowmem be given an amount of memory to use
>>> for caching, or maybe use some heuristics based on RAM free so that it's
>>> not so excruciatingly slow?
>>
>> IIRC recent commit has added the ability.
>> a5ce5d219822 ("btrfs-progs: extent-cache: actually cache extent buffers")
>  
> Oh, good.
> 
>> That's already included in btrfs-progs v4.13.2.
>> So it should be a dead loop which lowmem repair code can't handle.
> 
> I see. Is there any reasonably easy way to check on this running process?

GDB attach would be good.
Interrupt and check the inode number if it's checking fs tree.
Check the extent bytenr number if it's checking extent tree.
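
To illustrate the idea, a rough sketch of such a GDB session (the process
name, frame number and variable layout here are assumptions and will differ
depending on the btrfs-progs version and how it was built):

  gdb -p "$(pidof btrfs)"   # attach to the running check
  (gdb) bt                  # backtrace shows which tree is being walked
  (gdb) frame 2             # pick a frame inside the checking function
  (gdb) info locals         # look for the current inode number / extent bytenr
  (gdb) detach              # detaching lets the check continue running

Repeating this a few minutes apart and comparing those numbers gives a rough
idea whether the check is advancing or looping.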

But considering how many snapshots there are, it's really hard to determine.

In this case, the super large extent tree is causing a lot of problem,
maybe it's a good idea to allow btrfs check to skip extent tree check?

> 
> Both top and iotop show that it's working, but of course I can't tell if
> it's looping, or not.
> 
> Then again, maybe it already fixed enough that I can mount my filesystem again.

This needs the initial btrfs check report and the kernel messages how it
fails to mount.

> 
> But back to the main point, it's sad that after so many years, the
> repair situation is still so suboptimal, especially when it's apparently
> pretty easy for btrfs to get damaged (through its own fault or not, hard
> to say).

Unfortunately, yes.
Especially the extent tree is pretty fragile and hard to repair.

Thanks,
Qu

> 
> Thanks,
> Marc
> 




* Re: So, does btrfs check lowmem take days? weeks?
  2018-06-29  5:28   ` Marc MERLIN
  2018-06-29  5:48     ` Qu Wenruo
@ 2018-06-29  6:02     ` Su Yue
  2018-06-29  6:10       ` Marc MERLIN
  1 sibling, 1 reply; 65+ messages in thread
From: Su Yue @ 2018-06-29  6:02 UTC (permalink / raw)
  To: Marc MERLIN, Qu Wenruo; +Cc: linux-btrfs



On 06/29/2018 01:28 PM, Marc MERLIN wrote:
> On Fri, Jun 29, 2018 at 01:07:20PM +0800, Qu Wenruo wrote:
>>> lowmem repair seems to be going still, but it's been days and -p seems
>>> to do absolutely nothing.
>>
>> I'm afraid you hit a bug in the lowmem repair code.
>> By all means, --repair shouldn't really be used unless you're pretty
>> sure the problem is something btrfs check can handle.
>>
>> That's also why --repair is still marked as dangerous.
>> Especially when it's combined with experimental lowmem mode.
> 
> Understood, but btrfs got corrupted (by itself or not, I don't know)
> I cannot mount the filesystem read/write
> I cannot btrfs check --repair it since that code will kill my machine
> What do I have left?
> 
>>> My filesystem is "only" 10TB or so, albeit with a lot of files.
>>
>> Unless you have tons of snapshots and reflinked (deduped) files, it
>> shouldn't take so long.
> 
> I may have a fair amount.
> gargamel:~# btrfs check --mode=lowmem --repair -p /dev/mapper/dshelf2
> enabling repair mode
> WARNING: low-memory mode repair support is only partial
> Checking filesystem on /dev/mapper/dshelf2
> UUID: 0f1a0c9f-4e54-4fa7-8736-fd50818ff73d
> Fixed 0 roots.
> ERROR: extent[84302495744, 69632] referencer count mismatch (root: 21872, owner: 374857, offset: 3407872) wanted: 3, have: 4
> Created new chunk [18457780224000 1073741824]
> Delete backref in extent [84302495744 69632]
> ERROR: extent[84302495744, 69632] referencer count mismatch (root: 22911, owner: 374857, offset: 3407872) wanted: 3, have: 4
> Delete backref in extent [84302495744 69632]
> ERROR: extent[125712527360, 12214272] referencer count mismatch (root: 21872, owner: 374857, offset: 114540544) wanted: 181, have: 240
> Delete backref in extent [125712527360 12214272]
> ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 21872, owner: 374857, offset: 126754816) wanted: 68, have: 115
> Delete backref in extent [125730848768 5111808]
> ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 22911, owner: 374857, offset: 126754816) wanted: 68, have: 115
> Delete backref in extent [125730848768 5111808]
> ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 21872, owner: 374857, offset: 131866624) wanted: 115, have: 143
> Delete backref in extent [125736914944 6037504]
> ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 22911, owner: 374857, offset: 131866624) wanted: 115, have: 143
> Delete backref in extent [125736914944 6037504]
> ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 21872, owner: 374857, offset: 148234240) wanted: 302, have: 431
> Delete backref in extent [129952120832 20242432]
> ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 22911, owner: 374857, offset: 148234240) wanted: 356, have: 433
> Delete backref in extent [129952120832 20242432]
> ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 21872, owner: 374857, offset: 180371456) wanted: 161, have: 240
> Delete backref in extent [134925357056 11829248]
> ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 22911, owner: 374857, offset: 180371456) wanted: 162, have: 240
> Delete backref in extent [134925357056 11829248]
> ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 21872, owner: 374857, offset: 192200704) wanted: 170, have: 249
> Delete backref in extent [147895111680 12345344]
> ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 22911, owner: 374857, offset: 192200704) wanted: 172, have: 251
> Delete backref in extent [147895111680 12345344]
> ERROR: extent[150850146304, 17522688] referencer count mismatch (root: 21872, owner: 374857, offset: 217653248) wanted: 348, have: 418
> Delete backref in extent [150850146304 17522688]
> ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 22911, owner: 374857, offset: 235175936) wanted: 555, have: 1449
> Deleted root 2 item[156909494272, 178, 5476627808561673095]
> ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 21872, owner: 374857, offset: 235175936) wanted: 556, have: 1452
> Deleted root 2 item[156909494272, 178, 7338474132555182983]
> ERROR: file extent[374857 235184128] root 21872 owner 21872 backref lost
> Add one extent data backref [156909494272 55320576]
> ERROR: file extent[374857 235184128] root 22911 owner 22911 backref lost
> Add one extent data backref [156909494272 55320576]
> 
My bad.
It's very possibly a bug in the extent checking of lowmem mode, which
was reported by Chris too.
The extent check was wrong, so the repair did the wrong things.

I have figured out that the bug is that lowmem check can't deal with shared
tree blocks in a reloc tree. The fix is simple; you can try the following repo:

https://github.com/Damenly/btrfs-progs/tree/tmp1

Please run lowmem check "without =--repair" first to be sure whether
your filesystem is fine.

Though the bug and phenomenon are clear enough, before sending my patch,
I have to make a test image. I have spent a week studying btrfs balance
but it seems a little hard for me.

Thanks,
Su

> The last two ERROR lines took over a day to get generated, so I'm not sure if it's still working, but just slowly.
> For what it's worth non lowmem check used to take 12 to 24H on that filesystem back when it still worked.
> 
>>> 2 things that come to mind
>>> 1) can lowmem have some progress working so that I know if I'm looking
>>> at days, weeks, or even months before it will be done?
>>
>> It's hard to estimate, especially when every cross check involves a lot
>> of disk IO.
>> But at least, we could add such indicator to show we're doing something.
> 
> Yes, anything to show that I should still wait is still good :)
> 
>>> 2) non lowmem is more efficient obviously when it doesn't completely
>>> crash your machine, but could lowmem be given an amount of memory to use
>>> for caching, or maybe use some heuristics based on RAM free so that it's
>>> not so excruciatingly slow?
>>
>> IIRC recent commit has added the ability.
>> a5ce5d219822 ("btrfs-progs: extent-cache: actually cache extent buffers")
>   
> Oh, good.
> 
>> That's already included in btrfs-progs v4.13.2.
>> So it should be a dead loop which lowmem repair code can't handle.
> 
> I see. Is there any reasonably easy way to check on this running process?
> 
> Both top and iotop show that it's working, but of course I can't tell if
> it's looping, or not.
> 
> Then again, maybe it already fixed enough that I can mount my filesystem again.
> 
> But back to the main point, it's sad that after so many years, the
> repair situation is still so suboptimal, especially when it's apparently
> pretty easy for btrfs to get damaged (through its own fault or not, hard
> to say).
> 
> Thanks,
> Marc
> 




* Re: So, does btrfs check lowmem take days? weeks?
  2018-06-29  5:48     ` Qu Wenruo
@ 2018-06-29  6:06       ` Marc MERLIN
  2018-06-29  6:29         ` Qu Wenruo
  0 siblings, 1 reply; 65+ messages in thread
From: Marc MERLIN @ 2018-06-29  6:06 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Fri, Jun 29, 2018 at 01:48:17PM +0800, Qu Wenruo wrote:
> Just normal btrfs check, and post the output.
> If normal check eats up all your memory, btrfs check --mode=lowmem.
 
Does check without --repair eat less RAM?

> --repair should be considered as the last method.

If --repair doesn't work, check is useless to me sadly. I know that for
FS analysis and bug reporting, you want to have the FS without changing
it to something maybe worse, but for my use, if it can't be mounted and
can't be fixed, then it gets deleted which is even worse than check
doing the wrong thing.

> > The last two ERROR lines took over a day to get generated, so I'm not sure if it's still working, but just slowly.
> 
> OK, that explains something.
> 
> One extent is referred hundreds times, no wonder it will take a long time.
> 
> Just one tip here, there are really too many snapshots/reflinked files.
> It's highly recommended to keep the number of snapshots to a reasonable
> number (lower two digits).
> Although btrfs snapshot is super fast, it puts a lot of pressure on its
> extent tree, so there is no free lunch here.
 
Agreed, I doubt I have over or much over 100 snapshots though (but I
can't check right now).
Sadly I'm not allowed to mount even read only while check is running:
gargamel:~# mount -o ro /dev/mapper/dshelf2 /mnt/mnt2
mount: /dev/mapper/dshelf2 already mounted or /mnt/mnt2 busy

> > I see. Is there any reasonably easy way to check on this running process?
> 
> GDB attach would be good.
> Interrupt and check the inode number if it's checking fs tree.
> Check the extent bytenr number if it's checking extent tree.
> 
> But considering how many snapshots there are, it's really hard to determine.
> 
> In this case, the super large extent tree is causing a lot of problem,
> maybe it's a good idea to allow btrfs check to skip extent tree check?

I only see --init-extent-tree in the man page, which option did you have
in mind?

> > Then again, maybe it already fixed enough that I can mount my filesystem again.
> 
> This needs the initial btrfs check report and the kernel messages how it
> fails to mount.

mount command hangs, kernel does not show anything special outside of disk access hanging.

Jun 23 17:23:26 gargamel kernel: [  341.802696] BTRFS warning (device dm-2): 'recovery' is deprecated, use 'usebackuproot' instead
Jun 23 17:23:26 gargamel kernel: [  341.828743] BTRFS info (device dm-2): trying to use backup root at mount time
Jun 23 17:23:26 gargamel kernel: [  341.850180] BTRFS info (device dm-2): disk space caching is enabled
Jun 23 17:23:26 gargamel kernel: [  341.869014] BTRFS info (device dm-2): has skinny extents
Jun 23 17:23:26 gargamel kernel: [  342.206289] BTRFS info (device dm-2): bdev /dev/mapper/dshelf2 errs: wr 0, rd 0 , flush 0, corrupt 2, gen 0
Jun 23 17:26:26 gargamel kernel: [  521.571392] BTRFS info (device dm-2): enabling ssd optimizations
Jun 23 17:55:58 gargamel kernel: [ 2293.914867] perf: interrupt took too long (2507 > 2500), lowering kernel.perf_event_max_sample_rate to 79750
Jun 23 17:56:22 gargamel kernel: [ 2317.718406] BTRFS info (device dm-2): disk space caching is enabled
Jun 23 17:56:22 gargamel kernel: [ 2317.737277] BTRFS info (device dm-2): has skinny extents
Jun 23 17:56:22 gargamel kernel: [ 2318.069461] BTRFS info (device dm-2): bdev /dev/mapper/dshelf2 errs: wr 0, rd 0 , flush 0, corrupt 2, gen 0
Jun 23 17:59:22 gargamel kernel: [ 2498.256167] BTRFS info (device dm-2): enabling ssd optimizations
Jun 23 18:05:23 gargamel kernel: [ 2859.107057] BTRFS info (device dm-2): disk space caching is enabled
Jun 23 18:05:23 gargamel kernel: [ 2859.125883] BTRFS info (device dm-2): has skinny extents
Jun 23 18:05:24 gargamel kernel: [ 2859.448018] BTRFS info (device dm-2): bdev /dev/mapper/dshelf2 errs: wr 0, rd 0 , flush 0, corrupt 2, gen 0
Jun 23 18:08:23 gargamel kernel: [ 3039.023305] BTRFS info (device dm-2): enabling ssd optimizations
Jun 23 18:13:41 gargamel kernel: [ 3356.626037] perf: interrupt took too long (3143 > 3133), lowering kernel.perf_event_max_sample_rate to 63500
Jun 23 18:17:23 gargamel kernel: [ 3578.937225] Process accounting resumed
Jun 23 18:33:47 gargamel kernel: [ 4563.356252] JFS: nTxBlock = 8192, nTxLock = 65536
Jun 23 18:33:48 gargamel kernel: [ 4563.446715] ntfs: driver 2.1.32 [Flags: R/W MODULE].
Jun 23 18:42:20 gargamel kernel: [ 5075.995254] INFO: task sync:20253 blocked for more than 120 seconds.
Jun 23 18:42:20 gargamel kernel: [ 5076.015729]       Not tainted 4.17.2-amd64-preempt-sysrq-20180817 #1
Jun 23 18:42:20 gargamel kernel: [ 5076.036141] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 23 18:42:20 gargamel kernel: [ 5076.060637] sync            D    0 20253  15327 0x20020080
Jun 23 18:42:20 gargamel kernel: [ 5076.078032] Call Trace:
Jun 23 18:42:20 gargamel kernel: [ 5076.086366]  ? __schedule+0x53e/0x59b
Jun 23 18:42:20 gargamel kernel: [ 5076.098311]  schedule+0x7f/0x98
Jun 23 18:42:20 gargamel kernel: [ 5076.108665]  __rwsem_down_read_failed_common+0x127/0x1a8
Jun 23 18:42:20 gargamel kernel: [ 5076.125565]  ? sync_fs_one_sb+0x20/0x20
Jun 23 18:42:20 gargamel kernel: [ 5076.137982]  ? call_rwsem_down_read_failed+0x14/0x30
Jun 23 18:42:20 gargamel kernel: [ 5076.154081]  call_rwsem_down_read_failed+0x14/0x30
Jun 23 18:42:20 gargamel kernel: [ 5076.169429]  down_read+0x13/0x25
Jun 23 18:42:20 gargamel kernel: [ 5076.180444]  iterate_supers+0x57/0xbe
Jun 23 18:42:20 gargamel kernel: [ 5076.192619]  ksys_sync+0x40/0xa4
Jun 23 18:42:20 gargamel kernel: [ 5076.203192]  __ia32_sys_sync+0xa/0xd
Jun 23 18:42:20 gargamel kernel: [ 5076.214774]  do_fast_syscall_32+0xaf/0xf3
Jun 23 18:42:20 gargamel kernel: [ 5076.227740]  entry_SYSENTER_compat+0x7f/0x91
Jun 23 18:44:21 gargamel kernel: [ 5196.828764] INFO: task sync:20253 blocked for more than 120 seconds.
Jun 23 18:44:21 gargamel kernel: [ 5196.848724]       Not tainted 4.17.2-amd64-preempt-sysrq-20180817 #1
Jun 23 18:44:21 gargamel kernel: [ 5196.868789] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 23 18:44:21 gargamel kernel: [ 5196.893615] sync            D    0 20253  15327 0x20020080

> > But back to the main point, it's sad that after so many years, the
> > repair situation is still so suboptimal, especially when it's apparently
> > pretty easy for btrfs to get damaged (through its own fault or not, hard
> > to say).
> 
> Unfortunately, yes.
> Especially the extent tree is pretty fragile and hard to repair.

So, I don't know the code, but if I may make a suggestion (which maybe
is totally wrong, if so forgive me):
I would love a repair mode that gives me back a fixed
filesystem. I don't really care how much data is lost (although ideally
it would give me a list of files lost), but I want a working filesystem
at the end. I can then decide if there is enough data left on it to
restore what's missing or if I'm better off starting from scratch.

Is that possible at all?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08


* Re: So, does btrfs check lowmem take days? weeks?
  2018-06-29  6:02     ` Su Yue
@ 2018-06-29  6:10       ` Marc MERLIN
  2018-06-29  6:32         ` Su Yue
  0 siblings, 1 reply; 65+ messages in thread
From: Marc MERLIN @ 2018-06-29  6:10 UTC (permalink / raw)
  To: Su Yue; +Cc: Qu Wenruo, linux-btrfs

On Fri, Jun 29, 2018 at 02:02:19PM +0800, Su Yue wrote:
> I have figured out that the bug is that lowmem check can't deal with shared tree blocks
> in a reloc tree. The fix is simple; you can try the following repo:
> 
> https://github.com/Damenly/btrfs-progs/tree/tmp1

Not sure if I understand what you meant here.

> Please run lowmem check without --repair first to be sure whether
> your filesystem is fine.
 
The filesystem is not fine, it caused btrfs balance to hang, whether
balance actually broke it further or caused the breakage, I can't say.

Then mount hangs, even with recovery, unless I use ro.

This filesystem is trash to me and will require over a week to rebuild
manually if I can't repair it.
Running check without repair for likely several days just to know that
my filesystem is not clean (I already know this) isn't useful :)
Or am I missing something?

> Though the bug and phenomenon are clear enough, before sending my patch,
> I have to make a test image. I have spent a week studying btrfs balance
> but it seems a little hard for me.

thanks for having a look, either way.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08


* Re: So, does btrfs check lowmem take days? weeks?
  2018-06-29  6:06       ` Marc MERLIN
@ 2018-06-29  6:29         ` Qu Wenruo
  2018-06-29  6:59           ` Marc MERLIN
  0 siblings, 1 reply; 65+ messages in thread
From: Qu Wenruo @ 2018-06-29  6:29 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: linux-btrfs





On 2018-06-29 14:06, Marc MERLIN wrote:
> On Fri, Jun 29, 2018 at 01:48:17PM +0800, Qu Wenruo wrote:
>> Just normal btrfs check, and post the output.
>> If normal check eats up all your memory, btrfs check --mode=lowmem.
>  
> Does check without --repair eat less RAM?

Unfortunately, no.

> 
>> --repair should be considered as the last method.
> 
> If --repair doesn't work, check is useless to me sadly.

Not exactly.
Although it's time-consuming, I have manually patched several users' filesystems,
which normally ends pretty well.

If it's not a wide-spread problem but some small fatal one, it may be fixed.

> I know that for
> FS analysis and bug reporting, you want to have the FS without changing
> it to something maybe worse, but for my use, if it can't be mounted and
> can't be fixed, then it gets deleted which is even worse than check
> doing the wrong thing.
> 
>>> The last two ERROR lines took over a day to get generated, so I'm not sure if it's still working, but just slowly.
>>
>> OK, that explains something.
>>
>> One extent is referenced hundreds of times, no wonder it will take a long time.
>>
>> Just one tip here, there are really too many snapshots/reflinked files.
>> It's highly recommended to keep the number of snapshots to a reasonable
>> number (lower two digits).
>> Although btrfs snapshot is super fast, it puts a lot of pressure on its
>> extent tree, so there is no free lunch here.
>  
> Agreed, I doubt I have over or much over 100 snapshots though (but I
> can't check right now).
> Sadly I'm not allowed to mount even read only while check is running:
> gargamel:~# mount -o ro /dev/mapper/dshelf2 /mnt/mnt2
> mount: /dev/mapper/dshelf2 already mounted or /mnt/mnt2 busy
> 
>>> I see. Is there any reasonably easy way to check on this running process?
>>
>> GDB attach would be good.
>> Interrupt and check the inode number if it's checking fs tree.
>> Check the extent bytenr number if it's checking extent tree.
>>
>> But considering how many snapshots there are, it's really hard to determine.
>>
>> In this case, the super large extent tree is causing a lot of problem,
>> maybe it's a good idea to allow btrfs check to skip extent tree check?
> 
> I only see --init-extent-tree in the man page, which option did you have
> in mind?

That feature is just in my mind, not even implemented yet.

> 
>>> Then again, maybe it already fixed enough that I can mount my filesystem again.
>>
>> This needs the initial btrfs check report and the kernel messages how it
>> fails to mount.
> 
> mount command hangs, kernel does not show anything special outside of disk access hanging.
> 
> Jun 23 17:23:26 gargamel kernel: [  341.802696] BTRFS warning (device dm-2): 'recovery' is deprecated, use 'usebackuproot' instead
> Jun 23 17:23:26 gargamel kernel: [  341.828743] BTRFS info (device dm-2): trying to use backup root at mount time
> Jun 23 17:23:26 gargamel kernel: [  341.850180] BTRFS info (device dm-2): disk space caching is enabled
> Jun 23 17:23:26 gargamel kernel: [  341.869014] BTRFS info (device dm-2): has skinny extents
> Jun 23 17:23:26 gargamel kernel: [  342.206289] BTRFS info (device dm-2): bdev /dev/mapper/dshelf2 errs: wr 0, rd 0 , flush 0, corrupt 2, gen 0
> Jun 23 17:26:26 gargamel kernel: [  521.571392] BTRFS info (device dm-2): enabling ssd optimizations
> Jun 23 17:55:58 gargamel kernel: [ 2293.914867] perf: interrupt took too long (2507 > 2500), lowering kernel.perf_event_max_sample_rate to 79750
> Jun 23 17:56:22 gargamel kernel: [ 2317.718406] BTRFS info (device dm-2): disk space caching is enabled
> Jun 23 17:56:22 gargamel kernel: [ 2317.737277] BTRFS info (device dm-2): has skinny extents
> Jun 23 17:56:22 gargamel kernel: [ 2318.069461] BTRFS info (device dm-2): bdev /dev/mapper/dshelf2 errs: wr 0, rd 0 , flush 0, corrupt 2, gen 0
> Jun 23 17:59:22 gargamel kernel: [ 2498.256167] BTRFS info (device dm-2): enabling ssd optimizations
> Jun 23 18:05:23 gargamel kernel: [ 2859.107057] BTRFS info (device dm-2): disk space caching is enabled
> Jun 23 18:05:23 gargamel kernel: [ 2859.125883] BTRFS info (device dm-2): has skinny extents
> Jun 23 18:05:24 gargamel kernel: [ 2859.448018] BTRFS info (device dm-2): bdev /dev/mapper/dshelf2 errs: wr 0, rd 0 , flush 0, corrupt 2, gen 0

This looks like super block corruption?

What about "btrfs inspect dump-super -fFa /dev/mapper/dshelf2"?

And what about "skip_balance" mount option?
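
(skip_balance is a plain mount option; for illustration only, reusing the
device and mountpoint seen earlier in this thread, it can be applied as:

  mount -o ro,skip_balance /dev/mapper/dshelf2 /mnt/mnt2

or added to the options field of the fstab entry. It keeps a previously
interrupted balance from resuming automatically at mount time.)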

Another problem is, with so many snapshots, balance is also hugely
slowed, thus I'm not 100% sure if it's really a hang.

> Jun 23 18:08:23 gargamel kernel: [ 3039.023305] BTRFS info (device dm-2): enabling ssd optimizations
> Jun 23 18:13:41 gargamel kernel: [ 3356.626037] perf: interrupt took too long (3143 > 3133), lowering kernel.perf_event_max_sample_rate to 63500
> Jun 23 18:17:23 gargamel kernel: [ 3578.937225] Process accounting resumed
> Jun 23 18:33:47 gargamel kernel: [ 4563.356252] JFS: nTxBlock = 8192, nTxLock = 65536
> Jun 23 18:33:48 gargamel kernel: [ 4563.446715] ntfs: driver 2.1.32 [Flags: R/W MODULE].
> Jun 23 18:42:20 gargamel kernel: [ 5075.995254] INFO: task sync:20253 blocked for more than 120 seconds.
> Jun 23 18:42:20 gargamel kernel: [ 5076.015729]       Not tainted 4.17.2-amd64-preempt-sysrq-20180817 #1
> Jun 23 18:42:20 gargamel kernel: [ 5076.036141] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Jun 23 18:42:20 gargamel kernel: [ 5076.060637] sync            D    0 20253  15327 0x20020080
> Jun 23 18:42:20 gargamel kernel: [ 5076.078032] Call Trace:
> Jun 23 18:42:20 gargamel kernel: [ 5076.086366]  ? __schedule+0x53e/0x59b
> Jun 23 18:42:20 gargamel kernel: [ 5076.098311]  schedule+0x7f/0x98
> Jun 23 18:42:20 gargamel kernel: [ 5076.108665]  __rwsem_down_read_failed_common+0x127/0x1a8
> Jun 23 18:42:20 gargamel kernel: [ 5076.125565]  ? sync_fs_one_sb+0x20/0x20
> Jun 23 18:42:20 gargamel kernel: [ 5076.137982]  ? call_rwsem_down_read_failed+0x14/0x30
> Jun 23 18:42:20 gargamel kernel: [ 5076.154081]  call_rwsem_down_read_failed+0x14/0x30
> Jun 23 18:42:20 gargamel kernel: [ 5076.169429]  down_read+0x13/0x25
> Jun 23 18:42:20 gargamel kernel: [ 5076.180444]  iterate_supers+0x57/0xbe
> Jun 23 18:42:20 gargamel kernel: [ 5076.192619]  ksys_sync+0x40/0xa4
> Jun 23 18:42:20 gargamel kernel: [ 5076.203192]  __ia32_sys_sync+0xa/0xd
> Jun 23 18:42:20 gargamel kernel: [ 5076.214774]  do_fast_syscall_32+0xaf/0xf3
> Jun 23 18:42:20 gargamel kernel: [ 5076.227740]  entry_SYSENTER_compat+0x7f/0x91
> Jun 23 18:44:21 gargamel kernel: [ 5196.828764] INFO: task sync:20253 blocked for more than 120 seconds.
> Jun 23 18:44:21 gargamel kernel: [ 5196.848724]       Not tainted 4.17.2-amd64-preempt-sysrq-20180817 #1
> Jun 23 18:44:21 gargamel kernel: [ 5196.868789] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Jun 23 18:44:21 gargamel kernel: [ 5196.893615] sync            D    0 20253  15327 0x20020080
> 
>>> But back to the main point, it's sad that after so many years, the
>>> repair situation is still so suboptimal, especially when it's apparently
>>> pretty easy for btrfs to get damaged (through its own fault or not, hard
>>> to say).
>>
>> Unfortunately, yes.
>> Especially the extent tree is pretty fragile and hard to repair.
> 
> So, I don't know the code, but if I may make a suggestion (which maybe
> is totally wrong, if so forgive me):
> I would love a repair mode that gives me back a fixed
> filesystem. I don't really care how much data is lost (although ideally
> it would give me a list of files lost), but I want a working filesystem
> at the end. I can then decide if there is enough data left on it to
> restore what's missing or if I'm better off starting from scratch.

For that usage, btrfs-restore would fit your use case better.
Unfortunately it needs extra disk space and isn't good at restoring
subvolume/snapshots.
(Although it's much faster than repairing the possible corrupted extent
tree)
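
For illustration, a typical invocation looks roughly like this (the target
directory is hypothetical and must be on a different, large enough filesystem;
double-check the flags against your btrfs-progs version):

  btrfs restore -D -v /dev/mapper/dshelf2 /mnt/scratch/recovered/   # dry run first
  btrfs restore -v -i -s /dev/mapper/dshelf2 /mnt/scratch/recovered/
  # -i ignores errors, -s also restores snapshots; the source fs is never written to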

> 
> Is that possible at all?

At least for file recovery (fs tree repair), we have such behavior.

However, the problem you hit (and a lot of users hit) is all about
extent tree repair, which doesn't even get to file recovery.

All the hassle is in the extent tree, and for the extent tree, it's just good
or bad. Any corruption in extent tree may lead to later bugs.
The only way to avoid extent tree problems is to mount the fs RO.

So, I'm afraid it won't be possible for at least the next few years.

Thanks,
Qu

> 
> Thanks,
> Marc
> 




* Re: So, does btrfs check lowmem take days? weeks?
  2018-06-29  6:10       ` Marc MERLIN
@ 2018-06-29  6:32         ` Su Yue
  2018-06-29  6:43           ` Marc MERLIN
  0 siblings, 1 reply; 65+ messages in thread
From: Su Yue @ 2018-06-29  6:32 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: Qu Wenruo, linux-btrfs



On 06/29/2018 02:10 PM, Marc MERLIN wrote:
> On Fri, Jun 29, 2018 at 02:02:19PM +0800, Su Yue wrote:
>> I have figured out that the bug is that lowmem check can't deal with shared tree blocks
>> in a reloc tree. The fix is simple; you can try the following repo:
>>
>> https://github.com/Damenly/btrfs-progs/tree/tmp1
> 
> Not sure if I understand what you meant here.
> 
Sorry for my unclear words.
Simply speaking, I suggest you stop the currently running check.
Then clone the above branch, compile the binary, and run
'btrfs check --mode=lowmem $dev'.
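
A rough sketch of those steps, assuming a typical btrfs-progs build
environment (dependency packages and configure options vary by distribution):

  git clone -b tmp1 https://github.com/Damenly/btrfs-progs.git
  cd btrfs-progs
  ./autogen.sh && ./configure --disable-documentation
  make
  # run the freshly built binary, without --repair, and keep a log
  ./btrfs check --mode=lowmem /dev/mapper/dshelf2 2>&1 | tee /tmp/lowmem-check.log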

> Please run lowmem check without --repair first to be sure whether
>> your filesystem is fine.
>   
> The filesystem is not fine, it caused btrfs balance to hang, whether
> balance actually broke it further or caused the breakage, I can't say.
> 
> Then mount hangs, even with recovery, unless I use ro.
> 
> This filesystem is trash to me and will require over a week to rebuild
> manually if I can't repair it.

I understand your anxiety; a log of check without '--repair' will help
us figure out what's wrong with your filesystem.

Thanks,
Su
> Running check without repair for likely several days just to know that
> my filesystem is not clean (I already know this) isn't useful :)
> Or am I missing something?
> 
>> Though the bug and phenomenon are clear enough, before sending my patch,
>> I have to make a test image. I have spent a week studying btrfs balance
>> but it seems a little hard for me.
> 
> thanks for having a look, either way.
> 
> Marc
> 




* Re: So, does btrfs check lowmem take days? weeks?
  2018-06-29  6:32         ` Su Yue
@ 2018-06-29  6:43           ` Marc MERLIN
  2018-07-01 23:22             ` Marc MERLIN
  0 siblings, 1 reply; 65+ messages in thread
From: Marc MERLIN @ 2018-06-29  6:43 UTC (permalink / raw)
  To: Su Yue; +Cc: Qu Wenruo, linux-btrfs

On Fri, Jun 29, 2018 at 02:32:44PM +0800, Su Yue wrote:
> > > https://github.com/Damenly/btrfs-progs/tree/tmp1
> > 
> > Not sure if I understand what you meant here.
> > 
> Sorry for my unclear words.
> Simply speaking, I suggest you stop the currently running check.
> Then clone the above branch, compile the binary, and run
> 'btrfs check --mode=lowmem $dev'.
 
I understand, I'll build and try it.

> > This filesystem is trash to me and will require over a week to rebuild
> > manually if I can't repair it.
> 
> I understand your anxiety; a log of check without '--repair' will help
> us figure out what's wrong with your filesystem.

Ok, I'll run your new code without repair and report back. It will
likely take over a day though.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08


* Re: So, does btrfs check lowmem take days? weeks?
  2018-06-29  6:29         ` Qu Wenruo
@ 2018-06-29  6:59           ` Marc MERLIN
  2018-06-29  7:09             ` Roman Mamedov
  2018-06-29  7:20             ` So, does btrfs check lowmem take days? weeks? Qu Wenruo
  0 siblings, 2 replies; 65+ messages in thread
From: Marc MERLIN @ 2018-06-29  6:59 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Fri, Jun 29, 2018 at 02:29:10PM +0800, Qu Wenruo wrote:
> > If --repair doesn't work, check is useless to me sadly.
> 
> Not exactly.
>> Although it's time-consuming, I have manually patched several users' filesystems,
> which normally ends pretty well.
 
Ok I understand now.

> > Agreed, I doubt I have over or much over 100 snapshots though (but I
> > can't check right now).
> > Sadly I'm not allowed to mount even read only while check is running:
> > gargamel:~# mount -o ro /dev/mapper/dshelf2 /mnt/mnt2
> > mount: /dev/mapper/dshelf2 already mounted or /mnt/mnt2 busy

Ok, so I just checked now, 270 snapshots, but not because I'm crazy,
because I use btrfs send a lot :)

> This looks like super block corruption?
> 
> What about "btrfs inspect dump-super -fFa /dev/mapper/dshelf2"?

Sure, there you go: https://pastebin.com/uF1pHTsg

> And what about "skip_balance" mount option?
 
I have this in my fstab :)

> Another problem is, with so many snapshots, balance is also hugely
> slowed, thus I'm not 100% sure if it's really a hang.

I sent another thread about this last week, balance got hung after 2
days of doing nothing and just moving a single chunk.

Ok, I was able to remount the filesystem read only. I was wrong, I have
270 snapshots:
gargamel:/mnt/mnt# btrfs subvolume list . | grep -c 'path backup/'
74
gargamel:/mnt/mnt# btrfs subvolume list . | grep -c 'path backup-btrfssend/'
196

It's a backup server, I use btrfs send for many machines and for each btrfs
send, I keep history, maybe 10 or so backups. So it adds up in the end.

Is btrfs unable to deal with this well enough?

> For that usage, btrfs-restore would fit your use case better.
> Unfortunately it needs extra disk space and isn't good at restoring
> subvolume/snapshots.
> (Although it's much faster than repairing the possible corrupted extent
> tree)

It's a backup server, it only contains data from other machines.
If the filesystem cannot be recovered to a working state, I will need
over a week to restart the many btrfs send commands from many servers.
This is why anything other than --repair is useless to me, I don't need
the data back, it's still on the original machines, I need the
filesystem to work again so that I don't waste a week recreating the
many btrfs send/receive relationships.

> > Is that possible at all?
> 
> At least for file recovery (fs tree repair), we have such behavior.
> 
> However, the problem you hit (and a lot of users hit) is all about
>> extent tree repair, which doesn't even get to file recovery.
> 
>> All the hassle is in the extent tree, and for the extent tree, it's just good
> or bad. Any corruption in extent tree may lead to later bugs.
> The only way to avoid extent tree problems is to mount the fs RO.
> 
>> So, I'm afraid it won't be possible for at least the next few years.

Understood, thanks for answering.

Does the pastebin help and is 270 snapshots ok enough?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08


* Re: So, does btrfs check lowmem take days? weeks?
  2018-06-29  6:59           ` Marc MERLIN
@ 2018-06-29  7:09             ` Roman Mamedov
  2018-06-29  7:22               ` Marc MERLIN
  2018-06-29  7:20             ` So, does btrfs check lowmem take days? weeks? Qu Wenruo
  1 sibling, 1 reply; 65+ messages in thread
From: Roman Mamedov @ 2018-06-29  7:09 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: linux-btrfs

On Thu, 28 Jun 2018 23:59:03 -0700
Marc MERLIN <marc@merlins.org> wrote:

> I don't waste a week recreating the many btrfs send/receive relationships.

Consider not using send/receive, and switching to regular rsync instead.
Send/receive is very limiting and cumbersome, including because of what you
described. And it doesn't gain you much over an incremental rsync. As for
snapshots on the backup server, you can either automate making one as soon as a
backup has finished, or simply make them once/twice a day, during a period
when no backups are ongoing.
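
A minimal sketch of that workflow (the hostname, paths and snapshot naming
scheme are only examples):

  # incremental pull of one client's data into a plain subvolume
  rsync -aHAX --delete root@client:/home/ /mnt/backup/client/home/
  # then freeze the result in a read-only snapshot on the backup server
  btrfs subvolume snapshot -r /mnt/backup/client /mnt/backup/client.$(date +%Y%m%d)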

-- 
With respect,
Roman


* Re: So, does btrfs check lowmem take days? weeks?
  2018-06-29  6:59           ` Marc MERLIN
  2018-06-29  7:09             ` Roman Mamedov
@ 2018-06-29  7:20             ` Qu Wenruo
  2018-06-29  7:28               ` Marc MERLIN
  1 sibling, 1 reply; 65+ messages in thread
From: Qu Wenruo @ 2018-06-29  7:20 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: linux-btrfs





On 2018-06-29 14:59, Marc MERLIN wrote:
> On Fri, Jun 29, 2018 at 02:29:10PM +0800, Qu Wenruo wrote:
>>> If --repair doesn't work, check is useless to me sadly.
>>
>> Not exactly.
>> Although it's time consuming, I have manually patched several users fs,
>> which normally ends pretty well.
>  
> Ok I understand now.
> 
>>> Agreed, I doubt I have over or much over 100 snapshots though (but I
>>> can't check right now).
>>> Sadly I'm not allowed to mount even read only while check is running:
>>> gargamel:~# mount -o ro /dev/mapper/dshelf2 /mnt/mnt2
>>> mount: /dev/mapper/dshelf2 already mounted or /mnt/mnt2 busy
> 
> Ok, so I just checked now, 270 snapshots, but not because I'm crazy,
> because I use btrfs send a lot :)
> 
>> This looks like super block corruption?
>>
>> What about "btrfs inspect dump-super -fFa /dev/mapper/dshelf2"?
> 
> Sure, there you go: https://pastebin.com/uF1pHTsg
> 
>> And what about "skip_balance" mount option?
>  
> I have this in my fstab :)
> 
>> Another problem is, with so many snapshots, balance is also hugely
>> slowed, thus I'm not 100% sure if it's really a hang.
> 
> I sent another thread about this last week, balance got hung after 2
> days of doing nothing and just moving a single chunk.
> 
> Ok, I was able to remount the filesystem read only. I was wrong, I have
> 270 snapshots:
> gargamel:/mnt/mnt# btrfs subvolume list . | grep -c 'path backup/'
> 74
> gargamel:/mnt/mnt# btrfs subvolume list . | grep -c 'path backup-btrfssend/'
> 196
> 
> It's a backup server, I use btrfs send for many machines and for each btrfs
> send, I keep history, maybe 10 or so backups. So it adds up in the end.
> 
> Is btrfs unable to deal with this well enough?

It depends.
For certain (and rare) cases, if the only operations on the filesystem are
non-btrfs-specific operations (plain POSIX file operations), then you're fine.
(Maybe you can go to thousands of snapshots before any obvious performance
degradation.)

If certain btrfs specific operations are involved, it's definitely not OK:
1) Balance
2) Quota
3) Btrfs check

> 
>> For that usage, btrfs-restore might fit your use case better.
>> Unfortunately it needs extra disk space and isn't good at restoring
>> subvolumes/snapshots.
>> (Although it's much faster than repairing the possibly corrupted extent
>> tree.)
> 
> It's a backup server, it only contains data from other machines.
> If the filesystem cannot be recovered to a working state, I will need
> over a week to restart the many btrfs send commands from many servers.
> This is why anything other than --repair is useless to me, I don't need
> the data back, it's still on the original machines, I need the
> filesystem to work again so that I don't waste a week recreating the
> many btrfs send/receive relationships.

Now I totally understand why you need to repair the fs.

> 
>>> Is that possible at all?
>>
>> At least for file recovery (fs tree repair), we have such behavior.
>>
>> However, the problem you hit (and a lot of users hit) is all about
>> extent tree repair, which doesn't even get to file recovery.
>>
>> All the hassle is in the extent tree, and for the extent tree it's just
>> good or bad. Any corruption in the extent tree may lead to later bugs.
>> The only way to avoid extent tree problems is to mount the fs RO.
>>
>> So, I'm afraid that won't be possible for at least a few more years.
> 
> Understood, thanks for answering.
> 
> Does the pastebin help, and is 270 snapshots an acceptable number?

The super dump doesn't show anything wrong.

So the problem may be in the super large extent tree.

In this case, a plain check result with Su's patch would help more, rather
than the not-so-interesting super dump.

Thanks,
Qu

> 
> Thanks,
> Marc
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-06-29  7:09             ` Roman Mamedov
@ 2018-06-29  7:22               ` Marc MERLIN
  2018-06-29  7:34                 ` Roman Mamedov
  2018-06-29  8:04                 ` Lionel Bouton
  0 siblings, 2 replies; 65+ messages in thread
From: Marc MERLIN @ 2018-06-29  7:22 UTC (permalink / raw)
  To: Roman Mamedov; +Cc: linux-btrfs

On Fri, Jun 29, 2018 at 12:09:54PM +0500, Roman Mamedov wrote:
> On Thu, 28 Jun 2018 23:59:03 -0700
> Marc MERLIN <marc@merlins.org> wrote:
> 
> > I don't waste a week recreating the many btrfs send/receive relationships.
> 
> Consider not using send/receive, and switching to regular rsync instead.
> Send/receive is very limiting and cumbersome, including because of what you
> described. And it doesn't gain you much over an incremental rsync. As for

Err, sorry but I cannot agree with you here, at all :)

btrfs send/receive is pretty much the only reason I use btrfs. 
rsync takes hours on big filesystems scanning every single inode on both
sides and then seeing what changed, and only then sends the differences.
It's super inefficient.
btrfs send knows in seconds what needs to be sent, and works on it right
away.
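
For illustration only (a rough sketch, snapshot and path names are made
up), one incremental cycle looks roughly like this:

  # on the source machine: take a new read-only snapshot,
  # then send only the delta against the previous one
  btrfs subvolume snapshot -r /data /data/snap-new
  btrfs send -p /data/snap-old /data/snap-new | \
      ssh backupserver btrfs receive /backup/host1
  # after a successful receive, snap-new becomes the parent for the next run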

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-06-29  7:20             ` So, does btrfs check lowmem take days? weeks? Qu Wenruo
@ 2018-06-29  7:28               ` Marc MERLIN
  2018-06-29 17:10                 ` Marc MERLIN
  0 siblings, 1 reply; 65+ messages in thread
From: Marc MERLIN @ 2018-06-29  7:28 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Fri, Jun 29, 2018 at 03:20:42PM +0800, Qu Wenruo wrote:
> If certain btrfs specific operations are involved, it's definitely not OK:
> 1) Balance
> 2) Quota
> 3) Btrfs check

Ok, I understand. I'll try to almost never run balance, then. My problems did
indeed start because I ran balance and it got stuck 2 days with 0
progress.
That still seems like a bug though. I'm ok with slow, but stuck for 2
days with only 270 snapshots or so means there is a bug, or the
algorithm is so expensive that 270 snapshots could cause it to take days
or weeks to proceed?

> > It's a backup server, it only contains data from other machines.
> > If the filesystem cannot be recovered to a working state, I will need
> > over a week to restart the many btrfs send commands from many servers.
> > This is why anything other than --repair is useless to me, I don't need
> > the data back, it's still on the original machines, I need the
> > filesystem to work again so that I don't waste a week recreating the
> > many btrfs send/receive relationships.
> 
> Now totally understand why you need to repair the fs.

I also understand that my use case is atypical :)
But I guess this also means that using btrfs for a lot of send/receive
on a backup server is not going to work well unfortunately :-/

Now I'm wondering if I'm the only person even doing this.

> > Does the pastebin help, and is 270 snapshots an acceptable number?
> 
> The super dump doesn't show anything wrong.
> 
> So the problem may be in the super large extent tree.
> 
> In this case, plain check result with Su's patch would help more, other
> than the not so interesting super dump.

First I tried to mount with skip_balance after the partial repair, and
it hung for a long time:
[445635.716318] BTRFS info (device dm-2): disk space caching is enabled
[445635.736229] BTRFS info (device dm-2): has skinny extents
[445636.101999] BTRFS info (device dm-2): bdev /dev/mapper/dshelf2 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
[445825.053205] BTRFS info (device dm-2): enabling ssd optimizations
[446511.006588] BTRFS info (device dm-2): disk space caching is enabled
[446511.026737] BTRFS info (device dm-2): has skinny extents
[446511.325470] BTRFS info (device dm-2): bdev /dev/mapper/dshelf2 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
[446699.593501] BTRFS info (device dm-2): enabling ssd optimizations
[446964.077045] INFO: task btrfs-transacti:9211 blocked for more than 120 seconds.
[446964.099802]       Not tainted 4.17.2-amd64-preempt-sysrq-20180818 #3
[446964.120004] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

So, I rebooted, and will now run Su's btrfs check without repair and
report back.

Thanks both for your help.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-06-29  7:22               ` Marc MERLIN
@ 2018-06-29  7:34                 ` Roman Mamedov
  2018-06-29  8:04                 ` Lionel Bouton
  1 sibling, 0 replies; 65+ messages in thread
From: Roman Mamedov @ 2018-06-29  7:34 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: linux-btrfs

On Fri, 29 Jun 2018 00:22:10 -0700
Marc MERLIN <marc@merlins.org> wrote:

> On Fri, Jun 29, 2018 at 12:09:54PM +0500, Roman Mamedov wrote:
> > On Thu, 28 Jun 2018 23:59:03 -0700
> > Marc MERLIN <marc@merlins.org> wrote:
> > 
> > > I don't waste a week recreating the many btrfs send/receive relationships.
> > 
> > Consider not using send/receive, and switching to regular rsync instead.
> > Send/receive is very limiting and cumbersome, including because of what you
> > described. And it doesn't gain you much over an incremental rsync. As for
> 
> Err, sorry but I cannot agree with you here, at all :)
> 
> btrfs send/receive is pretty much the only reason I use btrfs. 
> rsync takes hours on big filesystems scanning every single inode on both
> sides and then seeing what changed, and only then sends the differences

I use it for backing up root filesystems of about 20 hosts, and for syncing
large multi-terabyte media collections -- it's fast enough in both.
Admittedly neither of those cases has millions of subdirs or files where
scanning may take a long time. And in the former case it's also all from and
to SSDs. Maybe your use case is different enough that it doesn't work as well.
But perhaps then general day-to-day performance is not great either, so I'd
suggest looking into SSD-based LVM caching; it really works wonders with Btrfs.
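
Roughly, the setup is just attaching a cache LV on the SSD to the slow
origin LV before creating the filesystem (a sketch only; device names and
sizes below are made up, see lvmcache(7) for the real procedure):

  pvcreate /dev/sda2 /dev/nvme0n1p2
  vgcreate vg0 /dev/sda2 /dev/nvme0n1p2
  lvcreate -n backups -L 900G vg0 /dev/sda2            # slow origin LV
  lvcreate --type cache -L 100G -n backups_cache \
           vg0/backups /dev/nvme0n1p2                  # attach SSD cache pool
  mkfs.btrfs /dev/vg0/backups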

-- 
With respect,
Roman

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-06-29  7:22               ` Marc MERLIN
  2018-06-29  7:34                 ` Roman Mamedov
@ 2018-06-29  8:04                 ` Lionel Bouton
  2018-06-29 16:24                   ` btrfs send/receive vs rsync Marc MERLIN
  1 sibling, 1 reply; 65+ messages in thread
From: Lionel Bouton @ 2018-06-29  8:04 UTC (permalink / raw)
  To: Marc MERLIN, Roman Mamedov; +Cc: linux-btrfs

Hi,

On 29/06/2018 09:22, Marc MERLIN wrote:
> On Fri, Jun 29, 2018 at 12:09:54PM +0500, Roman Mamedov wrote:
>> On Thu, 28 Jun 2018 23:59:03 -0700
>> Marc MERLIN <marc@merlins.org> wrote:
>>
>>> I don't waste a week recreating the many btrfs send/receive relationships.
>> Consider not using send/receive, and switching to regular rsync instead.
>> Send/receive is very limiting and cumbersome, including because of what you
>> described. And it doesn't gain you much over an incremental rsync. As for
> Err, sorry but I cannot agree with you here, at all :)
>
> btrfs send/receive is pretty much the only reason I use btrfs. 
> rsync takes hours on big filesystems scanning every single inode on both
> sides and then seeing what changed, and only then sends the differences.
> It's super inefficient.
> btrfs send knows in seconds what needs to be sent, and works on it right
> away.

I've not yet tried send/receive but I feel the pain of rsyncing millions
of files (I had to use lsyncd to limit the problem to the time the
origin servers reboot, which is a relatively rare event) so this thread
piqued my attention. Looking at the whole thread I wonder if you could
get a more manageable solution by splitting the filesystem.

If instead of using a single BTRFS filesystem you used LVM volumes
(maybe with thin provisioning and monitoring of the volume group free
space) for each of your servers to back up, with one BTRFS filesystem per
volume, you would have fewer snapshots per filesystem and isolate problems
in case of corruption. If you eventually decide to start from scratch
again this might help a lot in your case.
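
For what it's worth, a rough sketch of that layout (volume group, LV names
and sizes are made up for the example):

  lvcreate --type thin-pool -L 8T -n backups vg0
  lvcreate --thin -V 1T -n host1 vg0/backups     # one thin LV per server
  mkfs.btrfs -L host1 /dev/vg0/host1             # one btrfs per thin LV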

Lionel

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: btrfs send/receive vs rsync
  2018-06-29  8:04                 ` Lionel Bouton
@ 2018-06-29 16:24                   ` Marc MERLIN
  2018-06-30  8:18                     ` Duncan
  0 siblings, 1 reply; 65+ messages in thread
From: Marc MERLIN @ 2018-06-29 16:24 UTC (permalink / raw)
  To: Lionel Bouton; +Cc: Roman Mamedov, linux-btrfs

On Fri, Jun 29, 2018 at 10:04:02AM +0200, Lionel Bouton wrote:
> Hi,
> 
> On 29/06/2018 09:22, Marc MERLIN wrote:
> > On Fri, Jun 29, 2018 at 12:09:54PM +0500, Roman Mamedov wrote:
> >> On Thu, 28 Jun 2018 23:59:03 -0700
> >> Marc MERLIN <marc@merlins.org> wrote:
> >>
> >>> I don't waste a week recreating the many btrfs send/receive relationships.
> >> Consider not using send/receive, and switching to regular rsync instead.
> >> Send/receive is very limiting and cumbersome, including because of what you
> >> described. And it doesn't gain you much over an incremental rsync. As for
> > Err, sorry but I cannot agree with you here, at all :)
> >
> > btrfs send/receive is pretty much the only reason I use btrfs. 
> > rsync takes hours on big filesystems scanning every single inode on both
> > sides and then seeing what changed, and only then sends the differences.
> > It's super inefficient.
> > btrfs send knows in seconds what needs to be sent, and works on it right
> > away.
> 
> I've not yet tried send/receive but I feel the pain of rsyncing millions
> of files (I had to use lsyncd to limit the problem to the time the
> origin servers reboot, which is a relatively rare event) so this thread
> piqued my attention. Looking at the whole thread I wonder if you could
> get a more manageable solution by splitting the filesystem.

So, let's be clear. I did backups with rsync for 10+ years. It was slow
and painful. On my laptop an hourly rsync between 2 drives slowed down
my machine to a crawl while everything was being stat'ed; it took
forever.
Now with btrfs send/receive, it just works; I don't even notice it
happening in the background.

Here is a page I wrote about it in 2014:
http://marc.merlins.org/perso/btrfs/2014-03.html#Btrfs-Tips_-Doing-Fast-Incremental-Backups-With-Btrfs-Send-and-Receive

Here is a talk I gave in 2014 too, scroll to the bottom of the page, and
the bottom of the talk outline:
http://marc.merlins.org/perso/btrfs/2014-05.html#My-Btrfs-Talk-at-Linuxcon-JP-2014
and click on 'Btrfs send/receive'

> If instead of using a single BTRFS filesystem you used LVM volumes
> (maybe with thin provisioning and monitoring of the volume group free
> space) for each of your servers to back up, with one BTRFS filesystem per
> volume, you would have fewer snapshots per filesystem and isolate problems
> in case of corruption. If you eventually decide to start from scratch
> again this might help a lot in your case.

So, I already have problems due to too many block layers:
- raid 5 + ssd
- bcache
- dmcrypt
- btrfs

I get occasional deadlocks due to upper layers sending more data to the
lower layer (bcache) than it can process. I'm a bit wary of adding yet
another layer (LVM), but you're otherwise correct that keeping smaller
btrfs filesystems would help with performance and containing possible
damage.

Has anyone actually done this? :)

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-06-29  7:28               ` Marc MERLIN
@ 2018-06-29 17:10                 ` Marc MERLIN
  2018-06-30  0:04                   ` Chris Murphy
  2018-06-30  2:44                   ` Marc MERLIN
  0 siblings, 2 replies; 65+ messages in thread
From: Marc MERLIN @ 2018-06-29 17:10 UTC (permalink / raw)
  To: Qu Wenruo, suy.fnst; +Cc: linux-btrfs

On Fri, Jun 29, 2018 at 12:28:31AM -0700, Marc MERLIN wrote:
> So, I rebooted, and will now run Su's btrfs check without repair and
> report back.

As expected, it will likely still take days; here's the start:

gargamel:~# btrfs check --mode=lowmem  -p /dev/mapper/dshelf2  
Checking filesystem on /dev/mapper/dshelf2 
UUID: 0f1a0c9f-4e54-4fa7-8736-fd50818ff73d 
ERROR: extent[84302495744, 69632] referencer count mismatch (root: 21872, owner: 374857, offset: 3407872) wanted: 2, have: 4
ERROR: extent[84302495744, 69632] referencer count mismatch (root: 22911, owner: 374857, offset: 3407872) wanted: 2, have: 4
ERROR: extent[125712527360, 12214272] referencer count mismatch (root: 21872, owner: 374857, offset: 114540544) wanted: 180, have: 240
ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 21872, owner: 374857, offset: 126754816) wanted: 67, have: 115
ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 22911, owner: 374857, offset: 126754816) wanted: 67, have: 115
ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 21872, owner: 374857, offset: 131866624) wanted: 114, have: 143
ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 22911, owner: 374857, offset: 131866624) wanted: 114, have: 143
ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 21872, owner: 374857, offset: 148234240) wanted: 301, have: 431
ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 22911, owner: 374857, offset: 148234240) wanted: 355, have: 433
ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 21872, owner: 374857, offset: 180371456) wanted: 160, have: 240
ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 22911, owner: 374857, offset: 180371456) wanted: 161, have: 240
ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 21872, owner: 374857, offset: 192200704) wanted: 169, have: 249
ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 22911, owner: 374857, offset: 192200704) wanted: 171, have: 251
ERROR: extent[150850146304, 17522688] referencer count mismatch (root: 21872, owner: 374857, offset: 217653248) wanted: 347, have: 418
ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 22911, owner: 374857, offset: 235175936) wanted: 1, have: 1449
ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 21872, owner: 374857, offset: 235175936) wanted: 1, have: 1452

Mmmh, these look similar (but not identical) to the last run earlier in this thread:
ERROR: extent[84302495744, 69632] referencer count mismatch (root: 21872, owner: 374857, offset: 3407872) wanted: 3, have: 4
Created new chunk [18457780224000 1073741824]
Delete backref in extent [84302495744 69632]
ERROR: extent[84302495744, 69632] referencer count mismatch (root: 22911, owner: 374857, offset: 3407872) wanted: 3, have: 4
Delete backref in extent [84302495744 69632]
ERROR: extent[125712527360, 12214272] referencer count mismatch (root: 21872, owner: 374857, offset: 114540544) wanted: 181, have: 240
Delete backref in extent [125712527360 12214272]
ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 21872, owner: 374857, offset: 126754816) wanted: 68, have: 115
Delete backref in extent [125730848768 5111808]
ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 22911, owner: 374857, offset: 126754816) wanted: 68, have: 115
Delete backref in extent [125730848768 5111808]
ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 21872, owner: 374857, offset: 131866624) wanted: 115, have: 143
Delete backref in extent [125736914944 6037504]
ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 22911, owner: 374857, offset: 131866624) wanted: 115, have: 143
Delete backref in extent [125736914944 6037504]
ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 21872, owner: 374857, offset: 148234240) wanted: 302, have: 431
Delete backref in extent [129952120832 20242432]
ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 22911, owner: 374857, offset: 148234240) wanted: 356, have: 433
Delete backref in extent [129952120832 20242432]
ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 21872, owner: 374857, offset: 180371456) wanted: 161, have: 240
Delete backref in extent [134925357056 11829248]
ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 22911, owner: 374857, offset: 180371456) wanted: 162, have: 240
Delete backref in extent [134925357056 11829248]
ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 21872, owner: 374857, offset: 192200704) wanted: 170, have: 249
Delete backref in extent [147895111680 12345344]
ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 22911, owner: 374857, offset: 192200704) wanted: 172, have: 251
Delete backref in extent [147895111680 12345344]
ERROR: extent[150850146304, 17522688] referencer count mismatch (root: 21872, owner: 374857, offset: 217653248) wanted: 348, have: 418
Delete backref in extent [150850146304 17522688]
ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 22911, owner: 374857, offset: 235175936) wanted: 555, have: 1449
Deleted root 2 item[156909494272, 178, 5476627808561673095]
ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 21872, owner: 374857, offset: 235175936) wanted: 556, have: 1452
Deleted root 2 item[156909494272, 178, 7338474132555182983]

I guess the last repair didn't fix things in a way that leaves them working now?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-06-29 17:10                 ` Marc MERLIN
@ 2018-06-30  0:04                   ` Chris Murphy
  2018-06-30  2:44                   ` Marc MERLIN
  1 sibling, 0 replies; 65+ messages in thread
From: Chris Murphy @ 2018-06-30  0:04 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: Qu Wenruo, Su Yue, Btrfs BTRFS

I've got about 1/2 the snapshots and less than 1/10th the data...but
my btrfs check times are much shorter than either: 15 minutes and 65
minutes (lowmem).


[chris@f28s ~]$ sudo btrfs fi us /mnt/first
Overall:
    Device size:        1024.00GiB
    Device allocated:         774.12GiB
    Device unallocated:         249.87GiB
    Device missing:             0.00B
    Used:             760.48GiB
    Free (estimated):         256.95GiB    (min: 132.01GiB)
    Data ratio:                  1.00
    Metadata ratio:              2.00
    Global reserve:         512.00MiB    (used: 0.00B)

Data,single: Size:761.00GiB, Used:753.93GiB
   /dev/mapper/first     761.00GiB

Metadata,DUP: Size:6.50GiB, Used:3.28GiB
   /dev/mapper/first      13.00GiB

System,DUP: Size:64.00MiB, Used:112.00KiB
   /dev/mapper/first     128.00MiB

Unallocated:
   /dev/mapper/first     249.87GiB


146 subvolumes
137 snapshots

total csum bytes: 790549924
total tree bytes: 3519250432
total fs tree bytes: 2546073600
total extent tree bytes: 131350528


Original mode check takes ~15 minutes
Lowmem mode takes ~65 minutes

RAM: 4G
CPU: Intel(R) Pentium(R) CPU  N3700  @ 1.60GHz



Chris Murphy

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-06-29 17:10                 ` Marc MERLIN
  2018-06-30  0:04                   ` Chris Murphy
@ 2018-06-30  2:44                   ` Marc MERLIN
  2018-06-30 14:49                     ` Qu Wenruo
  1 sibling, 1 reply; 65+ messages in thread
From: Marc MERLIN @ 2018-06-30  2:44 UTC (permalink / raw)
  To: Qu Wenruo, suy.fnst; +Cc: linux-btrfs

Well, there goes that. After about 18H:
ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 21872, owner: 374857, offset: 235175936) wanted: 1, have: 1452 
backref.c:466: __add_missing_keys: Assertion `ref->root_id` failed, value 0 
btrfs(+0x3a232)[0x56091704f232] 
btrfs(+0x3ab46)[0x56091704fb46] 
btrfs(+0x3b9f5)[0x5609170509f5] 
btrfs(btrfs_find_all_roots+0x9)[0x560917050a45] 
btrfs(+0x572ff)[0x56091706c2ff] 
btrfs(+0x60b13)[0x560917075b13] 
btrfs(cmd_check+0x2634)[0x56091707d431] 
btrfs(main+0x88)[0x560917027260] 
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1)[0x7f93aa508561] 
btrfs(_start+0x2a)[0x560917026dfa] 
Aborted 

That's https://github.com/Damenly/btrfs-progs.git

Whoops, I didn't use the tmp1 branch, let me try again with that and
report back, although the problem above is still going to be there since
I think the only difference will be this, correct?
https://github.com/Damenly/btrfs-progs/commit/b5851513a12237b3e19a3e71f3ad00b966d25b3a

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: btrfs send/receive vs rsync
  2018-06-29 16:24                   ` btrfs send/receive vs rsync Marc MERLIN
@ 2018-06-30  8:18                     ` Duncan
  0 siblings, 0 replies; 65+ messages in thread
From: Duncan @ 2018-06-30  8:18 UTC (permalink / raw)
  To: linux-btrfs

Marc MERLIN posted on Fri, 29 Jun 2018 09:24:20 -0700 as excerpted:

>> If instead of using a single BTRFS filesystem you used LVM volumes
>> (maybe with thin provisioning and monitoring of the volume group free
>> space) for each of your servers to back up, with one BTRFS filesystem
>> per volume, you would have fewer snapshots per filesystem and isolate
>> problems in case of corruption. If you eventually decide to start from
>> scratch again this might help a lot in your case.
> 
> So, I already have problems due to too many block layers:
> - raid 5 + ssd - bcache - dmcrypt - btrfs
> 
> I get occasional deadlocks due to upper layers sending more data to the
> lower layer (bcache) than it can process. I'm a bit wary of adding yet
> another layer (LVM), but you're otherwise correct that keeping smaller
> btrfs filesystems would help with performance and containing possible
> damage.
> 
> Has anyone actually done this? :)

So I definitely use (and advocate!) the split-em-up strategy, and I use 
btrfs, but that's pretty much all the similarity we have.

I'm all ssd, having left spinning rust behind.  My strategy avoids 
unnecessary layers like lvm (tho crypt can arguably be necessary), 
preferring direct on-device (gpt) partitioning for simplicity of 
management and disaster recovery.  And my backup and recovery strategy is 
an equally simple mkfs and full-filesystem-fileset copy to an identically 
sized filesystem, with backups easily bootable/mountable in place of the 
working copy if necessary, and multiple backups so if disaster takes out 
the backup I was writing at the same time as the working copy, I still 
have a backup to fall back to.

So it's different enough I'm not sure how much my experience will help 
you.  But I /can/ say the subdivision is nice, as it means I can keep my 
root filesystem read-only by default for reliability, my most-at-risk log 
filesystem tiny for near-instant scrub/balance/check, and my also at risk 
home small as well, with the big media files being on a different 
filesystem that's mostly read-only, so less at risk and needing less 
frequent backups.  The tiny boot and large updates (distro repo, sources, 
ccache) are also separate, and mounted only for boot maintenance or 
updates.
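
Purely as an illustration of that kind of split (labels and options are
made up, this is not my actual fstab):

  LABEL=root   /         btrfs  ro,noatime          0 0
  LABEL=log    /var/log  btrfs  rw,noatime          0 0
  LABEL=home   /home     btrfs  rw,noatime          0 0
  LABEL=media  /media    btrfs  ro,noatime,noauto   0 0
  LABEL=pkg    /srv/pkg  btrfs  rw,noatime,noauto   0 0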

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-06-30  2:44                   ` Marc MERLIN
@ 2018-06-30 14:49                     ` Qu Wenruo
  2018-06-30 21:06                       ` Marc MERLIN
  0 siblings, 1 reply; 65+ messages in thread
From: Qu Wenruo @ 2018-06-30 14:49 UTC (permalink / raw)
  To: Marc MERLIN, suy.fnst; +Cc: linux-btrfs



On 2018年06月30日 10:44, Marc MERLIN wrote:
> Well, there goes that. After about 18H:
> ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 21872, owner: 374857, offset: 235175936) wanted: 1, have: 1452 
> backref.c:466: __add_missing_keys: Assertion `ref->root_id` failed, value 0 
> btrfs(+0x3a232)[0x56091704f232] 
> btrfs(+0x3ab46)[0x56091704fb46] 
> btrfs(+0x3b9f5)[0x5609170509f5] 
> btrfs(btrfs_find_all_roots+0x9)[0x560917050a45] 
> btrfs(+0x572ff)[0x56091706c2ff] 
> btrfs(+0x60b13)[0x560917075b13] 
> btrfs(cmd_check+0x2634)[0x56091707d431] 
> btrfs(main+0x88)[0x560917027260] 
> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1)[0x7f93aa508561] 
> btrfs(_start+0x2a)[0x560917026dfa] 
> Aborted 

I think that's the root cause.
Some invalid extent tree backref or bad tree block blows up the backref code.

All previous error messages may be garbage unless you're using Su's
latest branch, as lowmem mode tends to report false alerts on referencer
count mismatch.

But the last abort looks pretty likely to be the culprit.

Would you try to dump the extent tree?
# btrfs inspect dump-tree -t extent <device> | grep -A50 156909494272

It should help us locate the culprit and hopefully give us a chance to
fix it.

Thanks,
Qu

> 
> That's https://github.com/Damenly/btrfs-progs.git
> 
> Whoops, I didn't use the tmp1 branch, let me try again with that and
> report back, although the problem above is still going to be there since
> I think the only difference will be this, correct?
> https://github.com/Damenly/btrfs-progs/commit/b5851513a12237b3e19a3e71f3ad00b966d25b3a
> 
> Marc
> 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-06-30 14:49                     ` Qu Wenruo
@ 2018-06-30 21:06                       ` Marc MERLIN
  0 siblings, 0 replies; 65+ messages in thread
From: Marc MERLIN @ 2018-06-30 21:06 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: suy.fnst, linux-btrfs

On Sat, Jun 30, 2018 at 10:49:07PM +0800, Qu Wenruo wrote:
> But the last abort looks pretty likely to be the culprit.
> 
> Would you try to dump the extent tree?
> # btrfs inspect dump-tree -t extent <device> | grep -A50 156909494272

Sure, there you go:

	item 25 key (156909494272 EXTENT_ITEM 55320576) itemoff 14943 itemsize 24
		refs 19715 gen 31575 flags DATA
	item 26 key (156909494272 EXTENT_DATA_REF 571620086735451015) itemoff 14915 itemsize 28
		extent data backref root 21641 objectid 374857 offset 235175936 count 1452
	item 27 key (156909494272 EXTENT_DATA_REF 1765833482087969671) itemoff 14887 itemsize 28
		extent data backref root 23094 objectid 374857 offset 235175936 count 1442
	item 28 key (156909494272 EXTENT_DATA_REF 1807626434455810951) itemoff 14859 itemsize 28
		extent data backref root 21503 objectid 374857 offset 235175936 count 1454
	item 29 key (156909494272 EXTENT_DATA_REF 1879818091602916231) itemoff 14831 itemsize 28
		extent data backref root 21462 objectid 374857 offset 235175936 count 1454
	item 30 key (156909494272 EXTENT_DATA_REF 3610854505775117191) itemoff 14803 itemsize 28
		extent data backref root 23134 objectid 374857 offset 235175936 count 1442
	item 31 key (156909494272 EXTENT_DATA_REF 3754675454231458695) itemoff 14775 itemsize 28
		extent data backref root 23052 objectid 374857 offset 235175936 count 1442
	item 32 key (156909494272 EXTENT_DATA_REF 5060494667839714183) itemoff 14747 itemsize 28
		extent data backref root 23174 objectid 374857 offset 235175936 count 1440
	item 33 key (156909494272 EXTENT_DATA_REF 5476627808561673095) itemoff 14719 itemsize 28
		extent data backref root 22911 objectid 374857 offset 235175936 count 1
	item 34 key (156909494272 EXTENT_DATA_REF 6378484416458011527) itemoff 14691 itemsize 28
		extent data backref root 23012 objectid 374857 offset 235175936 count 1442
	item 35 key (156909494272 EXTENT_DATA_REF 7338474132555182983) itemoff 14663 itemsize 28
		extent data backref root 21872 objectid 374857 offset 235175936 count 1
	item 36 key (156909494272 EXTENT_DATA_REF 7516565391717970823) itemoff 14635 itemsize 28
		extent data backref root 21826 objectid 374857 offset 235175936 count 1452
	item 37 key (156909494272 SHARED_DATA_REF 14871537025024) itemoff 14631 itemsize 4
		shared data backref count 10
	item 38 key (156909494272 SHARED_DATA_REF 14871617568768) itemoff 14627 itemsize 4
		shared data backref count 73
	item 39 key (156909494272 SHARED_DATA_REF 14871619846144) itemoff 14623 itemsize 4
		shared data backref count 59
	item 40 key (156909494272 SHARED_DATA_REF 14871623270400) itemoff 14619 itemsize 4
		shared data backref count 68
	item 41 key (156909494272 SHARED_DATA_REF 14871623532544) itemoff 14615 itemsize 4
		shared data backref count 70
	item 42 key (156909494272 SHARED_DATA_REF 14871626383360) itemoff 14611 itemsize 4
		shared data backref count 76
	item 43 key (156909494272 SHARED_DATA_REF 14871635132416) itemoff 14607 itemsize 4
		shared data backref count 60
	item 44 key (156909494272 SHARED_DATA_REF 14871649533952) itemoff 14603 itemsize 4
		shared data backref count 79
	item 45 key (156909494272 SHARED_DATA_REF 14871862378496) itemoff 14599 itemsize 4
		shared data backref count 70
	item 46 key (156909494272 SHARED_DATA_REF 14909667098624) itemoff 14595 itemsize 4
		shared data backref count 72
	item 47 key (156909494272 SHARED_DATA_REF 14909669720064) itemoff 14591 itemsize 4
		shared data backref count 58
	item 48 key (156909494272 SHARED_DATA_REF 14909734567936) itemoff 14587 itemsize 4
		shared data backref count 73
	item 49 key (156909494272 SHARED_DATA_REF 14909920477184) itemoff 14583 itemsize 4
		shared data backref count 79
	item 50 key (156909494272 SHARED_DATA_REF 14942279335936) itemoff 14579 itemsize 4
		shared data backref count 79
	item 51 key (156909494272 SHARED_DATA_REF 14942304862208) itemoff 14575 itemsize 4
		shared data backref count 72
	item 52 key (156909494272 SHARED_DATA_REF 14942348378112) itemoff 14571 itemsize 4
		shared data backref count 67
	item 53 key (156909494272 SHARED_DATA_REF 14942366138368) itemoff 14567 itemsize 4
		shared data backref count 51
	item 54 key (156909494272 SHARED_DATA_REF 14942384799744) itemoff 14563 itemsize 4
		shared data backref count 64
	item 55 key (156909494272 SHARED_DATA_REF 14978234613760) itemoff 14559 itemsize 4
		shared data backref count 61
	item 56 key (156909494272 SHARED_DATA_REF 14978246459392) itemoff 14555 itemsize 4
		shared data backref count 56
	item 57 key (156909494272 SHARED_DATA_REF 14978256879616) itemoff 14551 itemsize 4
		shared data backref count 75
	item 58 key (156909494272 SHARED_DATA_REF 15001465749504) itemoff 14547 itemsize 4
		shared data backref count 77
	item 59 key (156909494272 SHARED_DATA_REF 18215010877440) itemoff 14543 itemsize 4
		shared data backref count 79
	item 60 key (156909494272 SHARED_DATA_REF 18215045660672) itemoff 14539 itemsize 4
		shared data backref count 10
	item 61 key (156909494272 SHARED_DATA_REF 18215099023360) itemoff 14535 itemsize 4
		shared data backref count 56
	item 62 key (156909494272 SHARED_DATA_REF 18215114522624) itemoff 14531 itemsize 4
		shared data backref count 70
	item 63 key (156909494272 SHARED_DATA_REF 18215129874432) itemoff 14527 itemsize 4
		shared data backref count 68
	item 64 key (156909494272 SHARED_DATA_REF 18215130267648) itemoff 14523 itemsize 4
		shared data backref count 72
	item 65 key (156909494272 SHARED_DATA_REF 18215136264192) itemoff 14519 itemsize 4
		shared data backref count 64
	item 66 key (156909494272 SHARED_DATA_REF 18215138623488) itemoff 14515 itemsize 4
		shared data backref count 72
	item 67 key (156909494272 SHARED_DATA_REF 18215188414464) itemoff 14511 itemsize 4
		shared data backref count 58
	item 68 key (156909494272 SHARED_DATA_REF 18215188447232) itemoff 14507 itemsize 4
		shared data backref count 74
	item 69 key (156909494272 SHARED_DATA_REF 18215188529152) itemoff 14503 itemsize 4
		shared data backref count 69
	item 70 key (156909494272 SHARED_DATA_REF 18215204896768) itemoff 14499 itemsize 4
		shared data backref count 67
	item 71 key (156909494272 SHARED_DATA_REF 18215228358656) itemoff 14495 itemsize 4
		shared data backref count 68
	item 72 key (156909494272 SHARED_DATA_REF 18215228899328) itemoff 14491 itemsize 4
		shared data backref count 81
	item 73 key (156909494272 SHARED_DATA_REF 18215240892416) itemoff 14487 itemsize 4
		shared data backref count 78
	item 74 key (156909494272 SHARED_DATA_REF 18215244251136) itemoff 14483 itemsize 4
		shared data backref count 58
	item 75 key (156909494272 SHARED_DATA_REF 18215244365824) itemoff 14479 itemsize 4
		shared data backref count 63
	item 76 key (156909494272 SHARED_DATA_REF 18215252770816) itemoff 14475 itemsize 4
		shared data backref count 76
	item 77 key (156909494272 SHARED_DATA_REF 18215264337920) itemoff 14471 itemsize 4
		shared data backref count 76
	item 78 key (156909494272 SHARED_DATA_REF 18215270055936) itemoff 14467 itemsize 4
		shared data backref count 73
	item 79 key (156909494272 SHARED_DATA_REF 18215290601472) itemoff 14463 itemsize 4
		shared data backref count 63
	item 80 key (156909494272 SHARED_DATA_REF 18215290617856) itemoff 14459 itemsize 4
		shared data backref count 54
	item 81 key (156909494272 SHARED_DATA_REF 18244453154816) itemoff 14455 itemsize 4
		shared data backref count 79
	item 82 key (156909494272 SHARED_DATA_REF 18244454383616) itemoff 14451 itemsize 4
		shared data backref count 71
	item 83 key (156909494272 SHARED_DATA_REF 18249494151168) itemoff 14447 itemsize 4
		shared data backref count 79
	item 84 key (156909494272 SHARED_DATA_REF 18249500721152) itemoff 14443 itemsize 4
		shared data backref count 71
	item 85 key (156909494272 SHARED_DATA_REF 18249523789824) itemoff 14439 itemsize 4
		shared data backref count 51
	item 86 key (156909494272 SHARED_DATA_REF 18249586802688) itemoff 14435 itemsize 4
		shared data backref count 68
	item 87 key (156909494272 SHARED_DATA_REF 18249587703808) itemoff 14431 itemsize 4
		shared data backref count 70
	item 88 key (156909494272 SHARED_DATA_REF 18249588178944) itemoff 14427 itemsize 4
		shared data backref count 72
	item 89 key (156909494272 SHARED_DATA_REF 18249591291904) itemoff 14423 itemsize 4
		shared data backref count 67
	item 90 key (156909494272 SHARED_DATA_REF 18249598238720) itemoff 14419 itemsize 4
		shared data backref count 74
	item 91 key (156909494272 SHARED_DATA_REF 18249602285568) itemoff 14415 itemsize 4
		shared data backref count 79
	item 92 key (156909494272 SHARED_DATA_REF 18249611378688) itemoff 14411 itemsize 4
		shared data backref count 65
	item 93 key (156909494272 SHARED_DATA_REF 18249613082624) itemoff 14407 itemsize 4
		shared data backref count 55
	item 94 key (156909494272 SHARED_DATA_REF 18249642229760) itemoff 14403 itemsize 4
		shared data backref count 75
	item 95 key (156909494272 SHARED_DATA_REF 18249643458560) itemoff 14399 itemsize 4
		shared data backref count 68
	item 96 key (156909494272 SHARED_DATA_REF 18250800021504) itemoff 14395 itemsize 4
		shared data backref count 79
	item 97 key (156909494272 SHARED_DATA_REF 18250814963712) itemoff 14391 itemsize 4
		shared data backref count 71
	item 98 key (156909494272 SHARED_DATA_REF 18252047237120) itemoff 14387 itemsize 4
		shared data backref count 55
	item 99 key (156909494272 SHARED_DATA_REF 18252132515840) itemoff 14383 itemsize 4
		shared data backref count 68
	item 100 key (156909494272 SHARED_DATA_REF 18252134236160) itemoff 14379 itemsize 4
		shared data backref count 72
	item 101 key (156909494272 SHARED_DATA_REF 18252274827264) itemoff 14375 itemsize 4
		shared data backref count 68
	item 102 key (156909494272 SHARED_DATA_REF 18252313460736) itemoff 14371 itemsize 4
		shared data backref count 67
	item 103 key (156909494272 SHARED_DATA_REF 18252335906816) itemoff 14367 itemsize 4
		shared data backref count 79
	item 104 key (156909494272 SHARED_DATA_REF 18252336742400) itemoff 14363 itemsize 4
		shared data backref count 74
	item 105 key (156909494272 SHARED_DATA_REF 18254150631424) itemoff 14359 itemsize 4
		shared data backref count 56
	item 106 key (156909494272 SHARED_DATA_REF 18254342537216) itemoff 14355 itemsize 4
		shared data backref count 67
	item 107 key (156909494272 SHARED_DATA_REF 18255671017472) itemoff 14351 itemsize 4
		shared data backref count 72
	item 108 key (156909494272 SHARED_DATA_REF 18255806038016) itemoff 14347 itemsize 4
		shared data backref count 69
	item 109 key (156909494272 SHARED_DATA_REF 18255821996032) itemoff 14343 itemsize 4
		shared data backref count 67
	item 110 key (156909494272 SHARED_DATA_REF 18256006414336) itemoff 14339 itemsize 4
		shared data backref count 79
	item 111 key (156909494272 SHARED_DATA_REF 18256021012480) itemoff 14335 itemsize 4
		shared data backref count 74
	item 112 key (156909494272 SHARED_DATA_REF 18260113752064) itemoff 14331 itemsize 4
		shared data backref count 75
	item 113 key (156909494272 SHARED_DATA_REF 18260113883136) itemoff 14327 itemsize 4
		shared data backref count 65
	item 114 key (156909494272 SHARED_DATA_REF 18260114849792) itemoff 14323 itemsize 4
		shared data backref count 51
	item 115 key (156909494272 SHARED_DATA_REF 18260115013632) itemoff 14319 itemsize 4
		shared data backref count 70
	item 116 key (156909494272 SHARED_DATA_REF 18261625552896) itemoff 14315 itemsize 4
		shared data backref count 75
	item 117 key (156909494272 SHARED_DATA_REF 18261631107072) itemoff 14311 itemsize 4
		shared data backref count 65
	item 118 key (156909494272 SHARED_DATA_REF 18261652078592) itemoff 14307 itemsize 4
		shared data backref count 52
	item 119 key (156909494272 SHARED_DATA_REF 18261658025984) itemoff 14303 itemsize 4
		shared data backref count 70
	item 120 key (156964814848 EXTENT_ITEM 7487488) itemoff 13856 itemsize 447
		refs 2505 gen 31575 flags DATA
		extent data backref root 21826 objectid 374857 offset 290496512 count 192
		extent data backref root 21872 objectid 374857 offset 290496512 count 192
		extent data backref root 23012 objectid 374857 offset 290496512 count 193
		extent data backref root 22911 objectid 374857 offset 290496512 count 192
		extent data backref root 23174 objectid 374857 offset 290496512 count 193
		extent data backref root 23052 objectid 374857 offset 290496512 count 193
		extent data backref root 23134 objectid 374857 offset 290496512 count 193
		extent data backref root 21462 objectid 374857 offset 290496512 count 192
		extent data backref root 21503 objectid 374857 offset 290496512 count 192
		extent data backref root 23094 objectid 374857 offset 290496512 count 193
		extent data backref root 21641 objectid 374857 offset 290496512 count 192
		shared data backref parent 18215389659136 count 55
		shared data backref parent 18215388102656 count 63
		shared data backref parent 18215294795776 count 69
		shared data backref parent 18215244365824 count 7
		shared data backref parent 14978251440128 count 55
		shared data backref parent 14978250768384 count 63
		shared data backref parent 14978248212480 count 69
		shared data backref parent 14978246459392 count 7
	item 121 key (156972302336 EXTENT_ITEM 8192) itemoff 13487 itemsize 369
		refs 13 gen 31575 flags DATA
		extent data backref root 21826 objectid 374857 offset 297984000 count 1
		extent data backref root 21872 objectid 374857 offset 297984000 count 1
		extent data backref root 23012 objectid 374857 offset 297984000 count 1
		extent data backref root 22911 objectid 374857 offset 297984000 count 1
		extent data backref root 23174 objectid 374857 offset 297984000 count 1
		extent data backref root 23052 objectid 374857 offset 297984000 count 1
		extent data backref root 23134 objectid 374857 offset 297984000 count 1
		extent data backref root 21462 objectid 374857 offset 297984000 count 1
		extent data backref root 21503 objectid 374857 offset 297984000 count 1
		extent data backref root 23094 objectid 374857 offset 297984000 count 1
		extent data backref root 21641 objectid 374857 offset 297984000 count 1
		shared data backref parent 18215389659136 count 1
		shared data backref parent 14978251440128 count 1
	item 122 key (156972310528 EXTENT_ITEM 102400) itemoff 13450 itemsize 37
		refs 1 gen 31631 flags DATA
		shared data backref parent 17763118120960 count 1
	item 123 key (156972412928 EXTENT_ITEM 102400) itemoff 13413 itemsize 37
		refs 1 gen 31631 flags DATA
		shared data backref parent 17763118120960 count 1
	item 124 key (156972515328 EXTENT_ITEM 102400) itemoff 13376 itemsize 37
		refs 1 gen 31631 flags DATA
		shared data backref parent 17763118120960 count 1
	item 125 key (156972617728 EXTENT_ITEM 102400) itemoff 13339 itemsize 37
		refs 1 gen 31631 flags DATA
		shared data backref parent 17763118120960 count 1
	item 126 key (156972720128 EXTENT_ITEM 98304) itemoff 13302 itemsize 37
--
	item 30 key (1569094942720 EXTENT_ITEM 24576) itemoff 14678 itemsize 53
		refs 1 gen 97048 flags DATA
		extent data backref root 21462 objectid 374857 offset 90849280 count 1
	item 31 key (1569094967296 EXTENT_ITEM 94208) itemoff 14625 itemsize 53
		refs 1 gen 94313 flags DATA
		extent data backref root 19852 objectid 67985779 offset 0 count 1
	item 32 key (1569095061504 EXTENT_ITEM 299008) itemoff 14572 itemsize 53
		refs 1 gen 136347 flags DATA
		extent data backref root 19852 objectid 129958928 offset 0 count 1
	item 33 key (1569095360512 EXTENT_ITEM 40960) itemoff 14519 itemsize 53
		refs 1 gen 95673 flags DATA
		extent data backref root 19852 objectid 70844817 offset 0 count 1
	item 34 key (1569095475200 EXTENT_ITEM 36864) itemoff 14466 itemsize 53
		refs 1 gen 134400 flags DATA
		extent data backref root 19852 objectid 123134122 offset 0 count 1
	item 35 key (1569095536640 EXTENT_ITEM 16384) itemoff 14413 itemsize 53
		refs 1 gen 134270 flags DATA
		extent data backref root 19852 objectid 122565390 offset 0 count 1
	item 36 key (1569095557120 EXTENT_ITEM 286720) itemoff 14360 itemsize 53
		refs 1 gen 97139 flags DATA
		extent data backref root 19852 objectid 75280458 offset 0 count 1
	item 37 key (1569095843840 EXTENT_ITEM 8192) itemoff 14323 itemsize 37
		refs 1 gen 88571 flags DATA
		shared data backref parent 14909069754368 count 1
	item 38 key (1569095852032 EXTENT_ITEM 122880) itemoff 14270 itemsize 53
		refs 1 gen 76214 flags DATA
		extent data backref root 19852 objectid 35849748 offset 0 count 1
	item 39 key (1569095974912 EXTENT_ITEM 8192) itemoff 14220 itemsize 50
		refs 2 gen 88571 flags DATA
		shared data backref parent 18214784647168 count 1
		shared data backref parent 14909069754368 count 1
	item 40 key (1569095983104 EXTENT_ITEM 8192) itemoff 14170 itemsize 50
		refs 2 gen 88571 flags DATA
		shared data backref parent 18214784647168 count 1
		shared data backref parent 14909069754368 count 1
	item 41 key (1569096114176 EXTENT_ITEM 286720) itemoff 14117 itemsize 53
		refs 1 gen 95205 flags DATA
		extent data backref root 19852 objectid 69436429 offset 0 count 1
	item 42 key (1569096400896 EXTENT_ITEM 122880) itemoff 14064 itemsize 53
		refs 1 gen 92983 flags DATA
		extent data backref root 19852 objectid 66052505 offset 0 count 1
	item 43 key (1569096523776 EXTENT_ITEM 270336) itemoff 14011 itemsize 53
		refs 1 gen 94720 flags DATA
		extent data backref root 19852 objectid 68432863 offset 0 count 1
	item 44 key (1569097105408 EXTENT_ITEM 45056) itemoff 13958 itemsize 53
		refs 1 gen 96865 flags DATA
		extent data backref root 19852 objectid 74357290 offset 0 count 1
	item 45 key (1569097150464 EXTENT_ITEM 8192) itemoff 13905 itemsize 53
		refs 1 gen 97048 flags DATA
		extent data backref root 21462 objectid 374857 offset 99221504 count 1
	item 46 key (1569097158656 EXTENT_ITEM 110592) itemoff 13868 itemsize 37

-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-06-29  6:43           ` Marc MERLIN
@ 2018-07-01 23:22             ` Marc MERLIN
  2018-07-02  2:02               ` Su Yue
  0 siblings, 1 reply; 65+ messages in thread
From: Marc MERLIN @ 2018-07-01 23:22 UTC (permalink / raw)
  To: Su Yue; +Cc: Qu Wenruo, linux-btrfs

On Thu, Jun 28, 2018 at 11:43:54PM -0700, Marc MERLIN wrote:
> On Fri, Jun 29, 2018 at 02:32:44PM +0800, Su Yue wrote:
> > > > https://github.com/Damenly/btrfs-progs/tree/tmp1
> > > 
> > > Not sure if I understand what you meant here.
> > > 
> > Sorry for my unclear words.
> > Simply speaking, I suggest you stop the currently running check.
> > Then, clone the above branch, compile the binary, and run
> > 'btrfs check --mode=lowmem $dev'.
>  
> I understand, I'll build and try it.
> 
> > > This filesystem is trash to me and will require over a week to rebuild
> > > manually if I can't repair it.
> > 
> > Understood your anxiety; a log of check without '--repair' will help
> > us figure out what's wrong with your filesystem.
> 
> Ok, I'll run your new code without repair and report back. It will
> likely take over a day though.

Well, it got stuck for over a day, and then I had to reboot :(

saruman:/var/local/src/btrfs-progs.sy# git remote -v
origin	https://github.com/Damenly/btrfs-progs.git (fetch)
origin	https://github.com/Damenly/btrfs-progs.git (push)
saruman:/var/local/src/btrfs-progs.sy# git branch
  master
* tmp1
saruman:/var/local/src/btrfs-progs.sy# git pull
Already up to date.
saruman:/var/local/src/btrfs-progs.sy# make
Making all in Documentation
make[1]: Nothing to be done for 'all'.

However, it still got stuck here:
gargamel:~# btrfs check --mode=lowmem  -p /dev/mapper/dshelf2   
Checking filesystem on /dev/mapper/dshelf2  
UUID: 0f1a0c9f-4e54-4fa7-8736-fd50818ff73d  
ERROR: extent[84302495744, 69632] referencer count mismatch (root: 21872, owner: 374857, offset: 3407872) wanted: 2, have: 3
ERROR: extent[84302495744, 69632] referencer count mismatch (root: 22911, owner: 374857, offset: 3407872) wanted: 2, have: 4
ERROR: extent[125712527360, 12214272] referencer count mismatch (root: 21872, owner: 374857, offset: 114540544) wanted: 180, have: 181
ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 21872, owner: 374857, offset: 126754816) wanted: 67, have: 68
ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 22911, owner: 374857, offset: 126754816) wanted: 67, have: 115
ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 21872, owner: 374857, offset: 131866624) wanted: 114, have: 115
ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 22911, owner: 374857, offset: 131866624) wanted: 114, have: 143
ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 21872, owner: 374857, offset: 148234240) wanted: 301, have: 302
ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 22911, owner: 374857, offset: 148234240) wanted: 355, have: 433
ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 21872, owner: 374857, offset: 180371456) wanted: 160, have: 161
ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 22911, owner: 374857, offset: 180371456) wanted: 161, have: 240
ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 21872, owner: 374857, offset: 192200704) wanted: 169, have: 170
ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 22911, owner: 374857, offset: 192200704) wanted: 171, have: 251
ERROR: extent[150850146304, 17522688] referencer count mismatch (root: 21872, owner: 374857, offset: 217653248) wanted: 347, have: 348
ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 22911, owner: 374857, offset: 235175936) wanted: 1, have: 1449
ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 21872, owner: 374857, offset: 235175936) wanted: 1, have: 556

What should I try next?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-07-01 23:22             ` Marc MERLIN
@ 2018-07-02  2:02               ` Su Yue
  2018-07-02  3:22                 ` Marc MERLIN
  0 siblings, 1 reply; 65+ messages in thread
From: Su Yue @ 2018-07-02  2:02 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: Qu Wenruo, linux-btrfs



On 07/02/2018 07:22 AM, Marc MERLIN wrote:
> On Thu, Jun 28, 2018 at 11:43:54PM -0700, Marc MERLIN wrote:
>> On Fri, Jun 29, 2018 at 02:32:44PM +0800, Su Yue wrote:
>>>>> https://github.com/Damenly/btrfs-progs/tree/tmp1
>>>>
>>>> Not sure if I understand what you meant here.
>>>>
>>> Sorry for my unclear words.
>>> Simply speaking, I suggest you stop the currently running check.
>>> Then, clone the above branch, compile the binary, and run
>>> 'btrfs check --mode=lowmem $dev'.
>>   
>> I understand, I'll build and try it.
>>
>>>> This filesystem is trash to me and will require over a week to rebuild
>>>> manually if I can't repair it.
>>>
>>> Understood your anxiety; a log of check without '--repair' will help
>>> us figure out what's wrong with your filesystem.
>>
>> Ok, I'll run your new code without repair and report back. It will
>> likely take over a day though.
> 
> Well, it got stuck for over a day, and then I had to reboot :(
> 
> saruman:/var/local/src/btrfs-progs.sy# git remote -v
> origin	https://github.com/Damenly/btrfs-progs.git (fetch)
> origin	https://github.com/Damenly/btrfs-progs.git (push)
> saruman:/var/local/src/btrfs-progs.sy# git branch
>    master
> * tmp1
> saruman:/var/local/src/btrfs-progs.sy# git pull
> Already up to date.
> saruman:/var/local/src/btrfs-progs.sy# make
> Making all in Documentation
> make[1]: Nothing to be done for 'all'.
> 
> However, it still got stuck here:
Thanks, I saw. Some clues found.

Could you try the following dumps? They shouldn't take much time.

#btrfs inspect dump-tree -t 21872 <device> | grep -C 50 "374857 EXTENT_DATA "

#btrfs inspect dump-tree -t 22911 <device> | grep -C 50 "374857 EXTENT_DATA "

Thanks,
Su

> gargamel:~# btrfs check --mode=lowmem  -p /dev/mapper/dshelf2
> Checking filesystem on /dev/mapper/dshelf2
> UUID: 0f1a0c9f-4e54-4fa7-8736-fd50818ff73d
> ERROR: extent[84302495744, 69632] referencer count mismatch (root: 21872, owner: 374857, offset: 3407872) wanted: 2, have: 3
> ERROR: extent[84302495744, 69632] referencer count mismatch (root: 22911, owner: 374857, offset: 3407872) wanted: 2, have: 4
> ERROR: extent[125712527360, 12214272] referencer count mismatch (root: 21872, owner: 374857, offset: 114540544) wanted: 180, have: 181
> ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 21872, owner: 374857, offset: 126754816) wanted: 67, have: 68
> ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 22911, owner: 374857, offset: 126754816) wanted: 67, have: 115
> ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 21872, owner: 374857, offset: 131866624) wanted: 114, have: 115
> ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 22911, owner: 374857, offset: 131866624) wanted: 114, have: 143
> ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 21872, owner: 374857, offset: 148234240) wanted: 301, have: 302
> ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 22911, owner: 374857, offset: 148234240) wanted: 355, have: 433
> ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 21872, owner: 374857, offset: 180371456) wanted: 160, have: 161
> ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 22911, owner: 374857, offset: 180371456) wanted: 161, have: 240
> ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 21872, owner: 374857, offset: 192200704) wanted: 169, have: 170
> ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 22911, owner: 374857, offset: 192200704) wanted: 171, have: 251
> ERROR: extent[150850146304, 17522688] referencer count mismatch (root: 21872, owner: 374857, offset: 217653248) wanted: 347, have: 348
> ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 22911, owner: 374857, offset: 235175936) wanted: 1, have: 1449
> ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 21872, owner: 374857, offset: 235175936) wanted: 1, have: 556
> 
> What should I try next?
> 
> Thanks,
> Marc
> 



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-07-02  2:02               ` Su Yue
@ 2018-07-02  3:22                 ` Marc MERLIN
  2018-07-02  6:22                   ` Su Yue
  0 siblings, 1 reply; 65+ messages in thread
From: Marc MERLIN @ 2018-07-02  3:22 UTC (permalink / raw)
  To: Su Yue; +Cc: Qu Wenruo, linux-btrfs

On Mon, Jul 02, 2018 at 10:02:33AM +0800, Su Yue wrote:
> Could you try follow dumps? They shouldn't cost much time.
> 
> #btrfs inspect dump-tree -t 21872 <device> | grep -C 50 "374857 
> EXTENT_DATA "
> 
> #btrfs inspect dump-tree -t 22911 <device> | grep -C 50 "374857 
> EXTENT_DATA "

Ok, that's 29MB, so it doesn't fit on pastebin:
http://marc.merlins.org/tmp/dshelf2_inspect.txt

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-07-02  3:22                 ` Marc MERLIN
@ 2018-07-02  6:22                   ` Su Yue
  2018-07-02 14:05                     ` Marc MERLIN
  0 siblings, 1 reply; 65+ messages in thread
From: Su Yue @ 2018-07-02  6:22 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: Qu Wenruo, linux-btrfs



On 07/02/2018 11:22 AM, Marc MERLIN wrote:
> On Mon, Jul 02, 2018 at 10:02:33AM +0800, Su Yue wrote:
>> Could you try follow dumps? They shouldn't cost much time.
>>
>> #btrfs inspect dump-tree -t 21872 <device> | grep -C 50 "374857
>> EXTENT_DATA "
>>
>> #btrfs inspect dump-tree -t 22911 <device> | grep -C 50 "374857
>> EXTENT_DATA "
> 
> Ok, that's 29MB, so it doesn't fit on pastebin:
> http://marc.merlins.org/tmp/dshelf2_inspect.txt
> 
Sorry Marc. After offline communication with Qu, both
of us think the filesystem is hard to repair.
The filesystem is too large to debug step by step.
Every round of check and debug is too expensive,
and it has already cost several days.

Sadly, I am afraid you will have to recreate the filesystem
and re-back up your data. :(

Sorry again, and thanks for your reports and patience.

Su
> Marc
> 



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-07-02  6:22                   ` Su Yue
@ 2018-07-02 14:05                     ` Marc MERLIN
  2018-07-02 14:42                       ` Qu Wenruo
  0 siblings, 1 reply; 65+ messages in thread
From: Marc MERLIN @ 2018-07-02 14:05 UTC (permalink / raw)
  To: Su Yue; +Cc: Qu Wenruo, linux-btrfs

On Mon, Jul 02, 2018 at 02:22:20PM +0800, Su Yue wrote:
> > Ok, that's 29MB, so it doesn't fit on pastebin:
> > http://marc.merlins.org/tmp/dshelf2_inspect.txt
> > 
> Sorry Marc. After offline communication with Qu, both
> of us think the filesystem is hard to repair.
> The filesystem is too large to debug step by step.
> Every time check and debug spent is too expensive.
> And it already costs serveral days.
> 
> Sadly, I am afarid that you have to recreate filesystem
> and reback up your data. :(
> 
> Sorry again and thanks for you reports and patient.

I appreciate your help. Honestly I only wanted to help you find why the
tools aren't working. Fixing filesystems by hand (and remotely via email
on top of that) is way too time consuming, like you said.

Is the btrfs design flawed in a way that repair tools just cannot repair
on their own? 
I understand that data can be lost, but I don't understand how the tools
just either keep crashing for me, go in infinite loops, or otherwise
fail to give me back a stable filesystem, even if some data is missing
after that.

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-07-02 14:05                     ` Marc MERLIN
@ 2018-07-02 14:42                       ` Qu Wenruo
  2018-07-02 15:18                         ` how to best segment a big block device in resizeable btrfs filesystems? Marc MERLIN
                                           ` (2 more replies)
  0 siblings, 3 replies; 65+ messages in thread
From: Qu Wenruo @ 2018-07-02 14:42 UTC (permalink / raw)
  To: Marc MERLIN, Su Yue; +Cc: linux-btrfs



On 2018年07月02日 22:05, Marc MERLIN wrote:
> On Mon, Jul 02, 2018 at 02:22:20PM +0800, Su Yue wrote:
>>> Ok, that's 29MB, so it doesn't fit on pastebin:
>>> http://marc.merlins.org/tmp/dshelf2_inspect.txt
>>>
>> Sorry Marc. After offline communication with Qu, both
>> of us think the filesystem is hard to repair.
>> The filesystem is too large to debug step by step.
>> Every time check and debug spent is too expensive.
>> And it already costs serveral days.
>>
>> Sadly, I am afarid that you have to recreate filesystem
>> and reback up your data. :(
>>
>> Sorry again and thanks for you reports and patient.
> 
> I appreciate your help. Honestly I only wanted to help you find why the
> tools aren't working. Fixing filesystems by hand (and remotely via Email
> on top of that), is way too time consuming like you said.
> 
> Is the btrfs design flawed in a way that repair tools just cannot repair
> on their own? 

In short, and for your case: yes, you can consider the repair tools
garbage, and don't use them on any production system.

The full answer is: it depends (but for most real-world cases, it's still flawed).
We have small, crafted images as test cases, which btrfs check can
repair without problem at all.
But such images are *SMALL* and only have *ONE* type of corruption,
which can't represent real-world cases at all.

> I understand that data can be lost, but I don't understand how the tools
> just either keep crashing for me, go in infinite loops, or otherwise
> fail to give me back a stable filesystem, even if some data is missing
> after that.

There are several reasons here why the repair tools can't help much:

1) Too large a fs (especially too many snapshots)
   The use case (too many snapshots and shared extents, a lot of extents
   shared over 1000 times) is in fact a super large challenge for
   lowmem mode check/repair.
   It needs O(n^2) or even O(n^3) work to check each backref, which hugely
   slows the progress and makes it hard to locate the real bug.

2) Corruption in the extent tree while our objective is to mount RW
   The extent tree is almost useless if we just want to read data.
   But when we do any write, we need it, and if it goes wrong even a
   tiny bit, your fs could be damaged really badly.

   For other corruption, like some fs tree corruption, we could do
   something to discard some corrupted files, but if it's the extent tree,
   we either mount RO and grab everything we have, or hope the
   almost-never-working --init-extent-tree can work (that would mostly
   be a miracle).

So, I feel very sorry that we can't provide enough help for your case.

But still, we hope to provide some tips for your next build if you still
want to choose btrfs.

1) Don't keep too many snapshots.
   Really, this is the core.
   For send/receive backups, IIRC only the parent subvolume needs to
   exist; there is no need to keep the whole history of all those
   snapshots.
   Keeping the number of snapshots minimal greatly improves the
   possibility (for both a manual patch and a check repair) of a
   successful repair.
   Normally I would suggest 4 hourly snapshots, 7 daily snapshots, and
   12 monthly snapshots (see the sketch after this list).

2) Don't keep unrelated snapshots in one btrfs.
   I totally understand that maintaining different btrfs filesystems
   would hugely add maintenance pressure, but as explained, all
   snapshots share one fragile extent tree.
   If each filesystem has its own extent tree, a single extent tree
   corruption is less likely to take down everything.
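
A minimal sketch of such a rotation, assuming snapshots live under
/mnt/pool and are named <subvol>.hourly_<timestamp> and so on (the paths
and naming scheme here are only an example, not from Marc's setup):

  # keep only the 4 newest hourly snapshots of "home"
  # (names with sortable timestamps assumed, so ls sorts oldest first)
  cd /mnt/pool
  ls -d home.hourly_* | head -n -4 | while read snap; do
      btrfs subvolume delete "$snap"
  done
  # repeat with different keep-counts for the daily_ and monthly_ sets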

Thanks,
Qu

> 
> Thanks,
> Marc
> 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: how to best segment a big block device in resizeable btrfs filesystems?
  2018-07-02 14:42                       ` Qu Wenruo
@ 2018-07-02 15:18                         ` Marc MERLIN
  2018-07-02 16:59                           ` Austin S. Hemmelgarn
                                             ` (2 more replies)
  2018-07-02 15:19                         ` So, does btrfs check lowmem take days? weeks? Marc MERLIN
  2018-07-03  0:31                         ` Chris Murphy
  2 siblings, 3 replies; 65+ messages in thread
From: Marc MERLIN @ 2018-07-02 15:18 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Su Yue, linux-btrfs

Hi Qu,

I'll split this part into a new thread:

> 2) Don't keep unrelated snapshots in one btrfs.
>    I totally understand that maintain different btrfs would hugely add
>    maintenance pressure, but as explains, all snapshots share one
>    fragile extent tree.

Yes, I understand that this is what I should do given what you
explained.
My main problem is knowing how to segment things so I don't end up with
filesystems that are full while others are almost empty :)

Am I supposed to put LVM thin volumes underneath so that I can share
the same single 10TB raid5?

If I do this, I would have
software raid 5 < dmcrypt < bcache < lvm < btrfs
That's a lot of layers, and that's also starting to make me nervous :)

Is there any other way that does not involve me creating smaller block
devices for multiple btrfs filesystems and hope that they are the right
size because I won't be able to change it later?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-07-02 14:42                       ` Qu Wenruo
  2018-07-02 15:18                         ` how to best segment a big block device in resizeable btrfs filesystems? Marc MERLIN
@ 2018-07-02 15:19                         ` Marc MERLIN
  2018-07-02 17:08                           ` Austin S. Hemmelgarn
  2018-07-02 17:33                           ` Roman Mamedov
  2018-07-03  0:31                         ` Chris Murphy
  2 siblings, 2 replies; 65+ messages in thread
From: Marc MERLIN @ 2018-07-02 15:19 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Su Yue, linux-btrfs

Hi Qu,

thanks for the detailled and honest answer.
A few comments inline.

On Mon, Jul 02, 2018 at 10:42:40PM +0800, Qu Wenruo wrote:
> For full, it depends. (but for most real world case, it's still flawed)
> We have small and crafted images as test cases, which btrfs check can
> repair without problem at all.
> But such images are *SMALL*, and only have *ONE* type of corruption,
> which can represent real world case at all.
 
right, they're just unittest images, I understand.

> 1) Too large fs (especially too many snapshots)
>    The use case (too many snapshots and shared extents, a lot of extents
>    get shared over 1000 times) is in fact a super large challenge for
>    lowmem mode check/repair.
>    It needs O(n^2) or even O(n^3) to check each backref, which hugely
>    slow the progress and make us hard to locate the real bug.
 
So, the non lowmem version would work better, but it's a problem if it
doesn't fit in RAM.
I've always considered it a grave bug that btrfs check repair can use so
much kernel memory that it will crash the entire system. This should not
be possible.
While it won't help me here, can btrfs check be improved not to suck all
the kernel memory, and ideally even allow using swap space if the RAM is
not enough?

Is btrfs check regular mode still being maintained? I think it's still
better than lowmem, correct?

> 2) Corruption in extent tree and our objective is to mount RW
>    Extent tree is almost useless if we just want to read data.
>    But when we do any write, we needs it and if it goes wrong even a
>    tiny bit, your fs could be damaged really badly.
> 
>    For other corruption, like some fs tree corruption, we could do
>    something to discard some corrupted files, but if it's extent tree,
>    we either mount RO and grab anything we have, or hopes the
>    almost-never-working --init-extent-tree can work (that's mostly
>    miracle).
 
I understand that it's the weak point of btrfs, thanks for explaining.

> 1) Don't keep too many snapshots.
>    Really, this is the core.
>    For send/receive backup, IIRC it only needs the parent subvolume
>    exists, there is no need to keep the whole history of all those
>    snapshots.

You are correct on history. The reason I keep history is because I may
want to recover a file from last week or 2 weeks ago after I finally
notice that it's gone. 
I have terabytes of space on the backup server, so it's easier to keep
history there than on the client which may not have enough space to keep
a month's worth of history.
As you know, back when we did tape backups, we also kept history of at
least several weeks (usually several months, but that's too much for
btrfs snapshots).

>    Keep the number of snapshots to minimal does greatly improve the
>    possibility (both manual patch or check repair) of a successful
>    repair.
>    Normally I would suggest 4 hourly snapshots, 7 daily snapshots, 12
>    monthly snapshots.

I actually have fewer snapshots than this per filesystem, but I backup
more than 10 filesystems.
If I used as many snapshots as you recommend, that would already be 230
snapshots for 10 filesystems :)

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: how to best segment a big block device in resizeable btrfs filesystems?
  2018-07-02 15:18                         ` how to best segment a big block device in resizeable btrfs filesystems? Marc MERLIN
@ 2018-07-02 16:59                           ` Austin S. Hemmelgarn
  2018-07-02 17:34                             ` Marc MERLIN
  2018-07-03  0:51                           ` Paul Jones
  2018-07-03  1:37                           ` Qu Wenruo
  2 siblings, 1 reply; 65+ messages in thread
From: Austin S. Hemmelgarn @ 2018-07-02 16:59 UTC (permalink / raw)
  To: Marc MERLIN, Qu Wenruo; +Cc: Su Yue, linux-btrfs

On 2018-07-02 11:18, Marc MERLIN wrote:
> Hi Qu,
> 
> I'll split this part into a new thread:
> 
>> 2) Don't keep unrelated snapshots in one btrfs.
>>     I totally understand that maintain different btrfs would hugely add
>>     maintenance pressure, but as explains, all snapshots share one
>>     fragile extent tree.
> 
> Yes, I understand that this is what I should do given what you
> explained.
> My main problem is knowing how to segment things so I don't end up with
> filesystems that are full while others are almost empty :)
> 
> Am I supposed to put LVM thin volumes underneath so that I can share
> the same single 10TB raid5?
Actually, because of the online resize ability in BTRFS, you don't 
technically _need_ to use thin provisioning here.  It makes the 
maintenance a bit easier, but it also adds a much more complicated layer 
of indirection than just doing regular volumes.
> 
> If I do this, I would have
> software raid 5 < dmcrypt < bcache < lvm < btrfs
> That's a lot of layers, and that's also starting to make me nervous :)
> 
> Is there any other way that does not involve me creating smaller block
> devices for multiple btrfs filesystems and hope that they are the right
> size because I won't be able to change it later?
You could (in theory) merge the LVM and software RAID5 layers, though 
that may make handling of the RAID5 layer a bit complicated if you 
choose to use thin provisioning (for some reason, LVM is unable to do 
on-line checks and rebuilds of RAID arrays that are acting as thin pool 
data or metadata).

Alternatively, you could increase your array size, remove the software 
RAID layer, and switch to using BTRFS in raid10 mode so that you could 
eliminate one of the layers, though that would probably reduce the 
effectiveness of bcache (you might want to get a bigger cache device if 
you do this).

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-07-02 15:19                         ` So, does btrfs check lowmem take days? weeks? Marc MERLIN
@ 2018-07-02 17:08                           ` Austin S. Hemmelgarn
  2018-07-02 17:33                           ` Roman Mamedov
  1 sibling, 0 replies; 65+ messages in thread
From: Austin S. Hemmelgarn @ 2018-07-02 17:08 UTC (permalink / raw)
  To: Marc MERLIN, Qu Wenruo; +Cc: Su Yue, linux-btrfs

On 2018-07-02 11:19, Marc MERLIN wrote:
> Hi Qu,
> 
> thanks for the detailled and honest answer.
> A few comments inline.
> 
> On Mon, Jul 02, 2018 at 10:42:40PM +0800, Qu Wenruo wrote:
>> For full, it depends. (but for most real world case, it's still flawed)
>> We have small and crafted images as test cases, which btrfs check can
>> repair without problem at all.
>> But such images are *SMALL*, and only have *ONE* type of corruption,
>> which can represent real world case at all.
>   
> right, they're just unittest images, I understand.
> 
>> 1) Too large fs (especially too many snapshots)
>>     The use case (too many snapshots and shared extents, a lot of extents
>>     get shared over 1000 times) is in fact a super large challenge for
>>     lowmem mode check/repair.
>>     It needs O(n^2) or even O(n^3) to check each backref, which hugely
>>     slow the progress and make us hard to locate the real bug.
>   
> So, the non lowmem version would work better, but it's a problem if it
> doesn't fit in RAM.
> I've always considered it a grave bug that btrfs check repair can use so
> much kernel memory that it will crash the entire system. This should not
> be possible.
> While it won't help me here, can btrfs check be improved not to suck all
> the kernel memory, and ideally even allow using swap space if the RAM is
> not enough?
> 
> Is btrfs check regular mode still being maintained? I think it's still
> better than lowmem, correct?
> 
>> 2) Corruption in extent tree and our objective is to mount RW
>>     Extent tree is almost useless if we just want to read data.
>>     But when we do any write, we needs it and if it goes wrong even a
>>     tiny bit, your fs could be damaged really badly.
>>
>>     For other corruption, like some fs tree corruption, we could do
>>     something to discard some corrupted files, but if it's extent tree,
>>     we either mount RO and grab anything we have, or hopes the
>>     almost-never-working --init-extent-tree can work (that's mostly
>>     miracle).
>   
> I understand that it's the weak point of btrfs, thanks for explaining.
> 
>> 1) Don't keep too many snapshots.
>>     Really, this is the core.
>>     For send/receive backup, IIRC it only needs the parent subvolume
>>     exists, there is no need to keep the whole history of all those
>>     snapshots.
> 
> You are correct on history. The reason I keep history is because I may
> want to recover a file from last week or 2 weeks ago after I finally
> notice that it's gone.
> I have terabytes of space on the backup server, so it's easier to keep
> history there than on the client which may not have enough space to keep
> a month's worth of history.
> As you know, back when we did tape backups, we also kept history of at
> least several weeks (usually several months, but that's too much for
> btrfs snapshots).
Bit of a case-study here, but it may be of interest.  We do something 
kind of similar where I work for our internal file servers.  We've got 
daily snapshots of the whole server kept on the server itself for 7 days 
(we usually see less than 5% of the total amount of data in changes on 
weekdays, and essentially 0 on weekends, so the snapshots rarely take up 
more than about 25% of the size of the live data), and then we 
additionally do daily backups which we retain for 6 months.  I've 
written up a short (albeit rather system specific script) for recovering 
old versions of a file that first scans the snapshots, and then pulls it 
out of the backups if it's not there.  I've found this works remarkably 
well for our use case (almost all the data on the file server follows a 
WORM access pattern with most of the files being between 100kB and 100MB 
in size).

We actually did try moving it all over to BTRFS for a while before we 
finally ended up with the setup we currently have, but aside from the 
whole issue with massive numbers of snapshots, we found that for us at 
least, Amanda actually outperforms BTRFS send/receive for everything 
except full backups and uses less storage space (though that last bit is 
largely because we use really aggressive compression).


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-07-02 15:19                         ` So, does btrfs check lowmem take days? weeks? Marc MERLIN
  2018-07-02 17:08                           ` Austin S. Hemmelgarn
@ 2018-07-02 17:33                           ` Roman Mamedov
  2018-07-02 17:39                             ` Marc MERLIN
  1 sibling, 1 reply; 65+ messages in thread
From: Roman Mamedov @ 2018-07-02 17:33 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: linux-btrfs

On Mon, 2 Jul 2018 08:19:03 -0700
Marc MERLIN <marc@merlins.org> wrote:

> I actually have fewer snapshots than this per filesystem, but I backup
> more than 10 filesystems.
> If I used as many snapshots as you recommend, that would already be 230
> snapshots for 10 filesystems :)

(...once again me with my rsync :)

If you didn't use send/receive, you wouldn't be required to keep a separate
snapshot trail per filesystem backed up; one trail of snapshots for the entire
backup server would be enough. Rsync everything to subdirs within one
subvolume, then do timed or event-based snapshots of it. You only need more
than one trail if you want different retention policies for different datasets
(e.g. in my case I have 91 and 31 days).
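
A rough sketch of that layout (all paths here are just an example):

  # everything rsyncs into subdirectories of one subvolume...
  rsync -aHAX --delete client1:/ /backup/current/client1/
  rsync -aHAX --delete client2:/ /backup/current/client2/
  # ...and one read-only snapshot per run covers the whole backup server
  btrfs subvolume snapshot -r /backup/current /backup/snapshots/$(date +%Y-%m-%d)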

-- 
With respect,
Roman

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: how to best segment a big block device in resizeable btrfs filesystems?
  2018-07-02 16:59                           ` Austin S. Hemmelgarn
@ 2018-07-02 17:34                             ` Marc MERLIN
  2018-07-02 18:35                               ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 65+ messages in thread
From: Marc MERLIN @ 2018-07-02 17:34 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Qu Wenruo, Su Yue, linux-btrfs

On Mon, Jul 02, 2018 at 12:59:02PM -0400, Austin S. Hemmelgarn wrote:
> > Am I supposed to put LVM thin volumes underneath so that I can share
> > the same single 10TB raid5?
>
> Actually, because of the online resize ability in BTRFS, you don't
> technically _need_ to use thin provisioning here.  It makes the maintenance
> a bit easier, but it also adds a much more complicated layer of indirection
> than just doing regular volumes.

You're right that I can use btrfs resize, but then I still need an LVM
device underneath, correct?
So, if I have 10 backup targets, I need 10 LVM LVs, I give them 10%
each of the full size available (as a guess), and then I'd have to 
- btrfs resize down one that's bigger than I need
- LVM shrink the LV
- LVM grow the other LV
- LVM resize up the other btrfs

and I think LVM resize and btrfs resize are not linked so I have to do
them separately and hope to type the right numbers each time, correct?
(or is that easier now?)
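
For illustration, that shuffle would look roughly like this (the VG name,
mount points and sizes are made up, not a tested recipe):

  # shrink the filesystem first, then its LV; never reduce the LV below
  # the new filesystem size
  btrfs filesystem resize -200g /mnt/backup1
  lvreduce -L -200g vg0/backup1
  # grow the other LV, then let btrfs claim the new space
  lvextend -L +200g vg0/backup2
  btrfs filesystem resize max /mnt/backup2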

I kind of liked the thin provisioning idea because it's hands off,
which is appealing. Any reason against it?

> You could (in theory) merge the LVM and software RAID5 layers, though that
> may make handling of the RAID5 layer a bit complicated if you choose to use
> thin provisioning (for some reason, LVM is unable to do on-line checks and
> rebuilds of RAID arrays that are acting as thin pool data or metadata).
 
Does LVM do built-in raid5 now? Is it as good/trustworthy as mdadm
raid5?
But yeah, if it's incompatible with thin provisioning, it's not that
useful.

> Alternatively, you could increase your array size, remove the software RAID
> layer, and switch to using BTRFS in raid10 mode so that you could eliminate
> one of the layers, though that would probably reduce the effectiveness of
> bcache (you might want to get a bigger cache device if you do this).

Sadly that won't work. I have more data than will fit on raid10

Thanks for your suggestions though.
Still need to read up on whether I should do thin provisioning, or not.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-07-02 17:33                           ` Roman Mamedov
@ 2018-07-02 17:39                             ` Marc MERLIN
  0 siblings, 0 replies; 65+ messages in thread
From: Marc MERLIN @ 2018-07-02 17:39 UTC (permalink / raw)
  To: Roman Mamedov; +Cc: linux-btrfs

On Mon, Jul 02, 2018 at 10:33:09PM +0500, Roman Mamedov wrote:
> On Mon, 2 Jul 2018 08:19:03 -0700
> Marc MERLIN <marc@merlins.org> wrote:
> 
> > I actually have fewer snapshots than this per filesystem, but I backup
> > more than 10 filesystems.
> > If I used as many snapshots as you recommend, that would already be 230
> > snapshots for 10 filesystems :)
> 
> (...once again me with my rsync :)
> 
> If you didn't use send/receive, you wouldn't be required to keep a separate
> snapshot trail per filesystem backed up, one trail of snapshots for the entire
> backup server would be enough. Rsync everything to subdirs within one
> subvolume, then do timed or event-based snapshots of it. You only need more
> than one trail if you want different retention policies for different datasets
> (e.g. in my case I have 91 and 31 days).

This is exactly how I used to do backups before btrfs.
I did 

cp -al backup.olddate backup.newdate
rsync -avSH src/ backup.newdate/

You don't even need snapshots or btrfs anymore.
Also, sorry to say, but I have different data retention needs for
different backups. Some need to rotate more quickly than others, but if
you're using rsync, the method I gave above works fine at any rotation
interval you need.

It is almost as efficient as btrfs on space, but as I said, the time
penalty on all those stats for many files was what killed it for me.
If I go back to rsync backups (and I'm really unlikely to), then I'd
also go back to ext4. There would be no point in dealing with the
complexity and fragility of btrfs anymore.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: how to best segment a big block device in resizeable btrfs filesystems?
  2018-07-02 17:34                             ` Marc MERLIN
@ 2018-07-02 18:35                               ` Austin S. Hemmelgarn
  2018-07-02 19:40                                 ` Marc MERLIN
  2018-07-03  4:25                                 ` Andrei Borzenkov
  0 siblings, 2 replies; 65+ messages in thread
From: Austin S. Hemmelgarn @ 2018-07-02 18:35 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: Qu Wenruo, Su Yue, linux-btrfs

On 2018-07-02 13:34, Marc MERLIN wrote:
> On Mon, Jul 02, 2018 at 12:59:02PM -0400, Austin S. Hemmelgarn wrote:
>>> Am I supposed to put LVM thin volumes underneath so that I can share
>>> the same single 10TB raid5?
>>
>> Actually, because of the online resize ability in BTRFS, you don't
>> technically _need_ to use thin provisioning here.  It makes the maintenance
>> a bit easier, but it also adds a much more complicated layer of indirection
>> than just doing regular volumes.
> 
> You're right that I can use btrfs resize, but then I still need an LVM
> device underneath, correct?
> So, if I have 10 backup targets, I need 10 LVM LVs, I give them 10%
> each of the full size available (as a guess), and then I'd have to
> - btrfs resize down one that's bigger than I need
> - LVM shrink the LV
> - LVM grow the other LV
> - LVM resize up the other btrfs
> 
> and I think LVM resize and btrfs resize are not linked so I have to do
> them separately and hope to type the right numbers each time, correct?
> (or is that easier now?)
> 
> I kind of linked the thin provisioning idea because it's hands off,
> which is appealing. Any reason against it?
No, not currently, except that it adds a whole lot more stuff between 
BTRFS and whatever layer is below it.  That increase in what's being 
done adds some overhead (it's noticeable on 7200 RPM consumer SATA 
drives, but not on decent consumer SATA SSD's).

There used to be issues running BTRFS on top of LVM thin targets which 
had zero mode turned off, but AFAIK, all of those problems were fixed 
long ago (before 4.0).
> 
>> You could (in theory) merge the LVM and software RAID5 layers, though that
>> may make handling of the RAID5 layer a bit complicated if you choose to use
>> thin provisioning (for some reason, LVM is unable to do on-line checks and
>> rebuilds of RAID arrays that are acting as thin pool data or metadata).
>   
> Does LVM do built in raid5 now? Is it as good/trustworthy as mdadm
> radi5?
Actually, it uses MD's RAID5 implementation as a back-end.  Same for 
RAID6, and optionally for RAID0, RAID1, and RAID10.

> But yeah, if it's incompatible with thin provisioning, it's not that
> useful.
It's technically not incompatible, just a bit of a pain.  Last time I 
tried to use it, you had to jump through hoops to repair a damaged RAID 
volume that was serving as an underlying volume in a thin pool, and it 
required keeping the thin pool offline for the entire duration of the 
rebuild.
> 
>> Alternatively, you could increase your array size, remove the software RAID
>> layer, and switch to using BTRFS in raid10 mode so that you could eliminate
>> one of the layers, though that would probably reduce the effectiveness of
>> bcache (you might want to get a bigger cache device if you do this).
> 
> Sadly that won't work. I have more data than will fit on raid10
> 
> Thanks for your suggestions though.
> Still need to read up on whether I should do thin provisioning, or not.
If you do go with thin provisioning, I would encourage you to make 
certain to call fstrim on the BTRFS volumes on a semi regular basis so 
that the thin pool doesn't get filled up with old unused blocks, 
preferably when you are 100% certain that there are no ongoing writes on 
them (trimming blocks on BTRFS gets rid of old root trees, so it's a bit 
dangerous to do it while writes are happening).
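
A minimal sketch of that, run from cron or a timer while the backups are
idle (the mount points are only an example):

  # release unused blocks back to the thin pool, one quiet filesystem at a time
  fstrim -v /mnt/backup1
  fstrim -v /mnt/backup2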

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: how to best segment a big block device in resizeable btrfs filesystems?
  2018-07-02 18:35                               ` Austin S. Hemmelgarn
@ 2018-07-02 19:40                                 ` Marc MERLIN
  2018-07-03  4:25                                 ` Andrei Borzenkov
  1 sibling, 0 replies; 65+ messages in thread
From: Marc MERLIN @ 2018-07-02 19:40 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Qu Wenruo, Su Yue, linux-btrfs

On Mon, Jul 02, 2018 at 02:35:19PM -0400, Austin S. Hemmelgarn wrote:
> >I kind of linked the thin provisioning idea because it's hands off,
> >which is appealing. Any reason against it?
> No, not currently, except that it adds a whole lot more stuff between 
> BTRFS and whatever layer is below it.  That increase in what's being 
> done adds some overhead (it's noticeable on 7200 RPM consumer SATA 
> drives, but not on decent consumer SATA SSD's).
> 
> There used to be issues running BTRFS on top of LVM thin targets which 
> had zero mode turned off, but AFAIK, all of those problems were fixed 
> long ago (before 4.0).

I see, thanks for the heads up.

> >Does LVM do built in raid5 now? Is it as good/trustworthy as mdadm
> >radi5?
> Actually, it uses MD's RAID5 implementation as a back-end.  Same for 
> RAID6, and optionally for RAID0, RAID1, and RAID10.
 
Ok, that makes me feel a bit better :)

> >But yeah, if it's incompatible with thin provisioning, it's not that
> >useful.
> It's technically not incompatible, just a bit of a pain.  Last time I 
> tried to use it, you had to jump through hoops to repair a damaged RAID 
> volume that was serving as an underlying volume in a thin pool, and it 
> required keeping the thin pool offline for the entire duration of the 
> rebuild.

Argh, not good :( / thanks for the heads up.

> If you do go with thin provisioning, I would encourage you to make 
> certain to call fstrim on the BTRFS volumes on a semi regular basis so 
> that the thin pool doesn't get filled up with old unused blocks, 

That's a very good point/reminder, thanks for that. I guess it's like
running on an ssd :)

> preferably when you are 100% certain that there are no ongoing writes on 
> them (trimming blocks on BTRFS gets rid of old root trees, so it's a bit 
> dangerous to do it while writes are happening).
 
Argh, that will be harder, but I'll try.

Given what you said, it sounds like I'll still be best off with separate
layers to avoid the rebuild problem you mentioned.
So it'll be
swraid5 / dmcrypt / bcache / lvm dm thin / btrfs

Hopefully that will work well enough.
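
Roughly, building that stack would look something like the following
(device names, sizes and the thin-pool layout are only an illustration,
not a tested recipe):

  mdadm --create /dev/md0 --level=5 --raid-devices=5 /dev/sd[b-f]1
  cryptsetup luksFormat /dev/md0
  cryptsetup open /dev/md0 crypt_backup
  make-bcache -B /dev/mapper/crypt_backup    # attach an SSD cache set separately
  pvcreate /dev/bcache0
  vgcreate vg_backup /dev/bcache0
  lvcreate --type thin-pool -l 95%FREE -n pool0 vg_backup
  lvcreate -V 4T --thinpool vg_backup/pool0 -n backup1
  mkfs.btrfs /dev/vg_backup/backup1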

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-07-02 14:42                       ` Qu Wenruo
  2018-07-02 15:18                         ` how to best segment a big block device in resizeable btrfs filesystems? Marc MERLIN
  2018-07-02 15:19                         ` So, does btrfs check lowmem take days? weeks? Marc MERLIN
@ 2018-07-03  0:31                         ` Chris Murphy
  2018-07-03  4:22                           ` Marc MERLIN
  2 siblings, 1 reply; 65+ messages in thread
From: Chris Murphy @ 2018-07-03  0:31 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Marc MERLIN, Su Yue, Btrfs BTRFS

On Mon, Jul 2, 2018 at 8:42 AM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
> On 2018年07月02日 22:05, Marc MERLIN wrote:
>> On Mon, Jul 02, 2018 at 02:22:20PM +0800, Su Yue wrote:
>>>> Ok, that's 29MB, so it doesn't fit on pastebin:
>>>> http://marc.merlins.org/tmp/dshelf2_inspect.txt
>>>>
>>> Sorry Marc. After offline communication with Qu, both
>>> of us think the filesystem is hard to repair.
>>> The filesystem is too large to debug step by step.
>>> Every time check and debug spent is too expensive.
>>> And it already costs serveral days.
>>>
>>> Sadly, I am afarid that you have to recreate filesystem
>>> and reback up your data. :(
>>>
>>> Sorry again and thanks for you reports and patient.
>>
>> I appreciate your help. Honestly I only wanted to help you find why the
>> tools aren't working. Fixing filesystems by hand (and remotely via Email
>> on top of that), is way too time consuming like you said.
>>
>> Is the btrfs design flawed in a way that repair tools just cannot repair
>> on their own?
>
> For short and for your case, yes, you can consider repair tool just a
> garbage and don't use them at any production system.

So the idea behind journaled file systems is that journal replay
enables mount-time "repair" that's faster than an fsck. Already, Btrfs
use cases with big, but not huge, file systems make btrfs check a
problem: it either runs out of memory or takes too long. So already
it isn't scaling as well as ext4 or XFS in this regard.

So what's the future hold? It seems like the goal is that the problems
must be avoided in the first place rather than to repair them after
the fact.

Are the problems Marc is running into understood well enough that
there can eventually be a fix, maybe even an on-disk format change,
that prevents such problems from happening in the first place?

Or does it make sense for him to be running with btrfs debug or some
subset of btrfs integrity checking mask to try to catch the problems
in the act of them happening?



-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 65+ messages in thread

* RE: how to best segment a big block device in resizeable btrfs filesystems?
  2018-07-02 15:18                         ` how to best segment a big block device in resizeable btrfs filesystems? Marc MERLIN
  2018-07-02 16:59                           ` Austin S. Hemmelgarn
@ 2018-07-03  0:51                           ` Paul Jones
  2018-07-03  4:06                             ` Marc MERLIN
  2018-07-03  1:37                           ` Qu Wenruo
  2 siblings, 1 reply; 65+ messages in thread
From: Paul Jones @ 2018-07-03  0:51 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: linux-btrfs

> -----Original Message-----
> From: linux-btrfs-owner@vger.kernel.org <linux-btrfs-
> owner@vger.kernel.org> On Behalf Of Marc MERLIN
> Sent: Tuesday, 3 July 2018 1:19 AM
> To: Qu Wenruo <quwenruo.btrfs@gmx.com>
> Cc: Su Yue <suy.fnst@cn.fujitsu.com>; linux-btrfs@vger.kernel.org
> Subject: Re: how to best segment a big block device in resizeable btrfs
> filesystems?
> 
> Hi Qu,
> 
> I'll split this part into a new thread:
> 
> > 2) Don't keep unrelated snapshots in one btrfs.
> >    I totally understand that maintain different btrfs would hugely add
> >    maintenance pressure, but as explains, all snapshots share one
> >    fragile extent tree.
> 
> Yes, I understand that this is what I should do given what you explained.
> My main problem is knowing how to segment things so I don't end up with
> filesystems that are full while others are almost empty :)
> 
> Am I supposed to put LVM thin volumes underneath so that I can share the
> same single 10TB raid5?
> 
> If I do this, I would have
> software raid 5 < dmcrypt < bcache < lvm < btrfs That's a lot of layers, and
> that's also starting to make me nervous :)

You could combine bcache and lvm if you are happy to use dm-cache instead (which lvm uses).
I use it myself (but without thin provisioning) and it works well.


> 
> Is there any other way that does not involve me creating smaller block
> devices for multiple btrfs filesystems and hope that they are the right size
> because I won't be able to change it later?
> 
> Thanks,
> Marc
> --
> "A mouse is a device used to point at the xterm you want to type in" - A.S.R.
> Microsoft is to operating systems ....
>                                       .... what McDonalds is to gourmet cooking
> Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the
> body of a message to majordomo@vger.kernel.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: how to best segment a big block device in resizeable btrfs filesystems?
  2018-07-02 15:18                         ` how to best segment a big block device in resizeable btrfs filesystems? Marc MERLIN
  2018-07-02 16:59                           ` Austin S. Hemmelgarn
  2018-07-03  0:51                           ` Paul Jones
@ 2018-07-03  1:37                           ` Qu Wenruo
  2018-07-03  4:15                             ` Marc MERLIN
  2018-07-03  4:23                             ` Andrei Borzenkov
  2 siblings, 2 replies; 65+ messages in thread
From: Qu Wenruo @ 2018-07-03  1:37 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: Su Yue, linux-btrfs



On 2018年07月02日 23:18, Marc MERLIN wrote:
> Hi Qu,
> 
> I'll split this part into a new thread:
> 
>> 2) Don't keep unrelated snapshots in one btrfs.
>>    I totally understand that maintain different btrfs would hugely add
>>    maintenance pressure, but as explains, all snapshots share one
>>    fragile extent tree.
> 
> Yes, I understand that this is what I should do given what you
> explained.
> My main problem is knowing how to segment things so I don't end up with
> filesystems that are full while others are almost empty :)
> 
> Am I supposed to put LVM thin volumes underneath so that I can share
> the same single 10TB raid5?
> 
> If I do this, I would have
> software raid 5 < dmcrypt < bcache < lvm < btrfs
> That's a lot of layers, and that's also starting to make me nervous :)

If you could keep the number of snapshots minimal (fewer than 10) for
each btrfs (and the number of send sources below 5), one big btrfs
may work in that case.

BTW, IMHO bcache is not really helping for a backup system, which is
more write-oriented.

Thanks,
Qu

> 
> Is there any other way that does not involve me creating smaller block
> devices for multiple btrfs filesystems and hope that they are the right
> size because I won't be able to change it later?
> 
> Thanks,
> Marc
> 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: how to best segment a big block device in resizeable btrfs filesystems?
  2018-07-03  0:51                           ` Paul Jones
@ 2018-07-03  4:06                             ` Marc MERLIN
  2018-07-03  4:26                               ` Paul Jones
  0 siblings, 1 reply; 65+ messages in thread
From: Marc MERLIN @ 2018-07-03  4:06 UTC (permalink / raw)
  To: Paul Jones; +Cc: linux-btrfs

On Tue, Jul 03, 2018 at 12:51:30AM +0000, Paul Jones wrote:
> You could combine bcache and lvm if you are happy to use dm-cache instead (which lvm uses).
> I use it myself (but without thin provisioning) and it works well.

Interesting point. So, I used to use lvm and then lvm2 many years ago, until
I got tired of its performance, especially as soon as I took even a
single snapshot.
But that was a long time ago now; just saying that I'm a bit rusty on LVM
itself.

That being said, if I have
raid5
dm-cache
dm-crypt
dm-thin

That's still 4 block layers under btrfs.
Am I any better off using dm-cache instead of bcache? My understanding is
that it only replaces one block layer with another, and one codebase with
another.

Mmmh, a bit of reading shows that dm-cache is now used as lvmcache, which
might change things, or not.
I'll admit that setting up and maintaining bcache is a bit of a pain; I only
used it at the time because it seemed more ready then, but we're a few years
later now.

So, what do you recommend nowadays, assuming you've used both?
(given that it's literally going to take days to recreate my array, I'd
rather do it once and the right way the first time :) )

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: how to best segment a big block device in resizeable btrfs filesystems?
  2018-07-03  1:37                           ` Qu Wenruo
@ 2018-07-03  4:15                             ` Marc MERLIN
  2018-07-03  9:55                               ` Paul Jones
  2018-07-03  4:23                             ` Andrei Borzenkov
  1 sibling, 1 reply; 65+ messages in thread
From: Marc MERLIN @ 2018-07-03  4:15 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Su Yue, linux-btrfs

On Tue, Jul 03, 2018 at 09:37:47AM +0800, Qu Wenruo wrote:
> > If I do this, I would have
> > software raid 5 < dmcrypt < bcache < lvm < btrfs
> > That's a lot of layers, and that's also starting to make me nervous :)
> 
> If you could keep the number of snapshots to minimal (less than 10) for
> each btrfs (and the number of send source is less than 5), one big btrfs
> may work in that case.
 
Well, we kind of discussed this already. If btrfs falls over when you reach
100 snapshots or so, and it sure seems to in my case, I won't be much better
off.
Having btrfs check --repair fail because 32GB of RAM is not enough, and it's
unable to use swap, is a big deal in my case. You also confirmed that btrfs
check lowmem does not scale to filesystems like mine, so this translates
into "if regular btrfs check repair can't fit in 32GB, I am completely out
of luck if anything happens to the filesystem"

You're correct that I could tweak my backups and snapshot rotation to get
from 250 or so down to 100, but it seems that I'll just be hoping to avoid
the problem by being just under the limit, until I'm not, and it'll
be too late to do anything about it the next time I'm in trouble, putting me
right back in the same spot I'm in now.
Is all this fair to say, or did I misunderstand?

> BTW, IMHO the bcache is not really helping for backup system, which is
> more write oriented.

That's a good point. So, what I didn't explain is that I still have some old
filesystems that do get backed up with rsync instead of btrfs send (going
into the same filesystem, but not the same subvolume).
Because rsync is so painfully slow when it needs to scan both sides before
it'll even start doing any work, bcache helps there.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-07-03  0:31                         ` Chris Murphy
@ 2018-07-03  4:22                           ` Marc MERLIN
  2018-07-03  8:34                             ` Su Yue
  2018-07-03  8:50                             ` Qu Wenruo
  0 siblings, 2 replies; 65+ messages in thread
From: Marc MERLIN @ 2018-07-03  4:22 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Qu Wenruo, Su Yue, Btrfs BTRFS

On Mon, Jul 02, 2018 at 06:31:43PM -0600, Chris Murphy wrote:
> So the idea behind journaled file systems is that journal replay
> enabled mount time "repair" that's faster than an fsck. Already Btrfs
> use cases with big, but not huge, file systems makes btrfs check a
> problem. Either running out of memory or it takes too long. So already
> it isn't scaling as well as ext4 or XFS in this regard.
> 
> So what's the future hold? It seems like the goal is that the problems
> must be avoided in the first place rather than to repair them after
> the fact.
> 
> Are the problem's Marc is running into understood well enough that
> there can eventually be a fix, maybe even an on-disk format change,
> that prevents such problems from happening in the first place?
> 
> Or does it make sense for him to be running with btrfs debug or some
> subset of btrfs integrity checking mask to try to catch the problems
> in the act of them happening?

Those are all good questions.
To be fair, I cannot claim that btrfs was at fault for whatever filesystem
damage I ended up with. It's very possible that it happened due to a flaky
SATA card that kicked drives off the bus when it shouldn't have.
Sure in theory a journaling filesystem can recover from unexpected power
loss and drives dropping off at bad times, but I'm going to guess that
btrfs' complexity also means that it has data structures (extent tree?) that
need to be updated completely "or else".

I'm obviously ok with a filesystem check being necessary to recover in cases
like this, afterall I still occasionally have to run e2fsck on ext4 too, but
I'm a lot less thrilled with the btrfs situation where basically the repair
tools can either completely crash your kernel, or take days and then either
get stuck in an infinite loop or hit an algorithm that can't scale if you
have too many hardlinks/snapshots.

It sounds like there may not be a fix to this problem with the filesystem's
design, outside of "do not get there, or else".
It would even be useful for btrfs tools to start computing heuristics and
output warnings like "you have more than 100 snapshots on this filesystem,
this is not recommended, please read http://url/"
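
For illustration, such a heuristic could be as simple as the following
(the threshold, path and URL are made up):

  mnt=/mnt/btrfs_pool
  count=$(btrfs subvolume list -s "$mnt" | wc -l)
  if [ "$count" -gt 100 ]; then
      echo "WARNING: $count snapshots on $mnt; check/repair may not scale, see http://url/" >&2
  fi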

Qu, Su, does that sound both reasonable and doable?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: how to best segment a big block device in resizeable btrfs filesystems?
  2018-07-03  1:37                           ` Qu Wenruo
  2018-07-03  4:15                             ` Marc MERLIN
@ 2018-07-03  4:23                             ` Andrei Borzenkov
  1 sibling, 0 replies; 65+ messages in thread
From: Andrei Borzenkov @ 2018-07-03  4:23 UTC (permalink / raw)
  To: Qu Wenruo, Marc MERLIN; +Cc: Su Yue, linux-btrfs

03.07.2018 04:37, Qu Wenruo wrote:
> 
> BTW, IMHO the bcache is not really helping for backup system, which is
> more write oriented.
> 

There is a new writecache target which may help in this case.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: how to best segment a big block device in resizeable btrfs filesystems?
  2018-07-02 18:35                               ` Austin S. Hemmelgarn
  2018-07-02 19:40                                 ` Marc MERLIN
@ 2018-07-03  4:25                                 ` Andrei Borzenkov
  2018-07-03  7:15                                   ` Duncan
  1 sibling, 1 reply; 65+ messages in thread
From: Andrei Borzenkov @ 2018-07-03  4:25 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, Marc MERLIN; +Cc: Qu Wenruo, Su Yue, linux-btrfs

02.07.2018 21:35, Austin S. Hemmelgarn wrote:
> them (trimming blocks on BTRFS gets rid of old root trees, so it's a bit
> dangerous to do it while writes are happening).

Could you please elaborate? Do you mean btrfs can trim data before new
writes are actually committed to disk?

^ permalink raw reply	[flat|nested] 65+ messages in thread

* RE: how to best segment a big block device in resizeable btrfs filesystems?
  2018-07-03  4:06                             ` Marc MERLIN
@ 2018-07-03  4:26                               ` Paul Jones
  2018-07-03  5:42                                 ` Marc MERLIN
  0 siblings, 1 reply; 65+ messages in thread
From: Paul Jones @ 2018-07-03  4:26 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: linux-btrfs

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 2449 bytes --]


> -----Original Message-----
> From: Marc MERLIN <marc@merlins.org>
> Sent: Tuesday, 3 July 2018 2:07 PM
> To: Paul Jones <paul@pauljones.id.au>
> Cc: linux-btrfs@vger.kernel.org
> Subject: Re: how to best segment a big block device in resizeable btrfs
> filesystems?
> 
> On Tue, Jul 03, 2018 at 12:51:30AM +0000, Paul Jones wrote:
> > You could combine bcache and lvm if you are happy to use dm-cache
> instead (which lvm uses).
> > I use it myself (but without thin provisioning) and it works well.
> 
> Interesting point. So, I used to use lvm and then lvm2 many years ago until I
> got tired with its performance, especially as asoon as I took even a single
> snapshot.
> But that was a long time ago now, just saying that I'm a bit rusty on LVM
> itself.
> 
> That being said, if I have
> raid5
> dm-cache
> dm-crypt
> dm-thin
> 
> That's still 4 block layers under btrfs.
> Am I any better off using dm-cache instead of bcache, my understanding is
> that it only replaces one block layer with another one and one codebase with
> another.

True, I didn't think of it like that.

> Mmmh, a bit of reading shows that dm-cache is now used as lvmcache, which
> might change things, or not.
> I'll admit that setting up and maintaining bcache is a bit of a pain, I only used it
> at the time because it seemed more ready then, but we're a few years later
> now.
> 
> So, what do you recommend nowadays, assuming you've used both?
> (given that it's literally going to take days to recreate my array, I'd rather do it
> once and the right way the first time :) )

I don't have any experience with this, but since it's the internet let me tell you how I'd do it anyway 😝
raid5
dm-crypt
lvm (using thin provisioning + cache)
btrfs

The cache mode on lvm requires you to set up all your volumes first, then add caching to those volumes last. If you need to modify the volume then you have to remove the cache, make your changes, then re-add the cache. It sounds like a pain, but having the cache separate from the data is quite handy.
Given you are running a backup server I don't think the cache would really do much unless you enable writeback mode. If you can split up your filesystem a bit to the point that btrfs check doesn't OOM that will seriously help performance as well. Rsync might be feasible again.
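
A rough sketch of that add/remove cycle on a plain (non-thin) LV, with all
names and sizes invented:

  # carve the cache pool out of an SSD PV in the same VG, then attach it
  lvcreate -L 100G -n cache0 vg_backup /dev/nvme0n1p1
  lvcreate -L 1G -n cache0_meta vg_backup /dev/nvme0n1p1
  lvconvert --type cache-pool --poolmetadata vg_backup/cache0_meta vg_backup/cache0
  lvconvert --type cache --cachepool vg_backup/cache0 vg_backup/backup1
  # before resizing or otherwise modifying the volume, detach the cache...
  lvconvert --splitcache vg_backup/backup1
  # ...make the changes, then re-attach it with the same lvconvert --type cache step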

Paul.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: how to best segment a big block device in resizeable btrfs filesystems?
  2018-07-03  4:26                               ` Paul Jones
@ 2018-07-03  5:42                                 ` Marc MERLIN
  0 siblings, 0 replies; 65+ messages in thread
From: Marc MERLIN @ 2018-07-03  5:42 UTC (permalink / raw)
  To: Paul Jones; +Cc: linux-btrfs

On Tue, Jul 03, 2018 at 04:26:37AM +0000, Paul Jones wrote:
> I don't have any experience with this, but since it's the internet let me tell you how I'd do it anyway 😝

That's the spirit :)

> raid5
> dm-crypt
> lvm (using thin provisioning + cache)
> btrfs
> 
> The cache mode on lvm requires you to set up all your volumes first, then
> add caching to those volumes last. If you need to modify the volume then
> you have to remove the cache, make your changes, then re-add the cache. It
> sounds like a pain, but having the cache separate from the data is quite
> handy.

I'm ok enough with that.

> Given you are running a backup server I don't think the cache would
> really do much unless you enable writeback mode. If you can split up your
> filesystem a bit to the point that btrfs check doesn't OOM that will
> seriously help performance as well. Rsync might be feasible again.

I'm a bit wary of write caching with the issues I've had. I may do
write-through, but not writeback :)

But caching helps indeed for my older filesystems that are still backed up
via rsync because the source fs is ext4 and not btrfs.

Thanks for the suggestions
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: how to best segment a big block device in resizeable btrfs filesystems?
  2018-07-03  4:25                                 ` Andrei Borzenkov
@ 2018-07-03  7:15                                   ` Duncan
  2018-07-06  4:28                                     ` Andrei Borzenkov
  0 siblings, 1 reply; 65+ messages in thread
From: Duncan @ 2018-07-03  7:15 UTC (permalink / raw)
  To: linux-btrfs

Andrei Borzenkov posted on Tue, 03 Jul 2018 07:25:14 +0300 as excerpted:

> 02.07.2018 21:35, Austin S. Hemmelgarn wrote:
>> them (trimming blocks on BTRFS gets rid of old root trees, so it's a
>> bit dangerous to do it while writes are happening).
> 
> Could you please elaborate? Do you mean btrfs can trim data before new
> writes are actually committed to disk?

No.

But normally old roots aren't rewritten for some time simply due to odds 
(fuller filesystems will of course recycle them sooner), and the btrfs 
mount option usebackuproot (formerly recovery, until the norecovery mount 
option that parallels that of other filesystems was added and this option 
was renamed to avoid confusion) can be used to try an older root if the 
current root is too damaged to successfully mount.

But other than simply by odds not using them again immediately, btrfs has 
no special protection for those old roots, and trim/discard will recover 
them to hardware-unused as it does any other unused space, tho whether it 
simply marks them for later processing or actually processes them 
immediately is up to the individual implementation -- some do it 
immediately, killing all chances at using the backup root because it's 
already zeroed out, some don't.

In the context of the discard mount option, that can mean there's never 
any old roots available ever, as they've already been cleaned up by the 
hardware due to the discard option telling the hardware to do it.

But even not using that mount option, and simply doing the trims 
periodically, as done weekly by for instance the systemd fstrim timer and 
service units, or done manually if you prefer, obviously potentially 
wipes the old roots at that point.  If the system's effectively idle at 
the time, not much risk as the current commit is likely to represent a 
filesystem in full stasis, but if there's lots of writes going on at that 
moment *AND* the system happens to crash at just the wrong time, before 
additional commits have recreated at least a bit of root history, again, 
you'll potentially be left without any old roots for the usebackuproot 
mount option to try to fall back to, should it actually be necessary.
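
(For concreteness, and with a made-up device name, the fallback and the
trim being discussed look roughly like this:)

  # ask btrfs to try one of the older tree roots if the current one is bad
  mount -o ro,usebackuproot /dev/sdX /mnt
  # a manual trim - this is the operation that may discard those old roots
  fstrim -v /mnt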

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-07-03  4:22                           ` Marc MERLIN
@ 2018-07-03  8:34                             ` Su Yue
  2018-07-03 21:34                               ` Chris Murphy
  2018-07-03  8:50                             ` Qu Wenruo
  1 sibling, 1 reply; 65+ messages in thread
From: Su Yue @ 2018-07-03  8:34 UTC (permalink / raw)
  To: Marc MERLIN, Chris Murphy; +Cc: Qu Wenruo, Btrfs BTRFS



On 07/03/2018 12:22 PM, Marc MERLIN wrote:
> On Mon, Jul 02, 2018 at 06:31:43PM -0600, Chris Murphy wrote:
>> So the idea behind journaled file systems is that journal replay
>> enabled mount time "repair" that's faster than an fsck. Already Btrfs
>> use cases with big, but not huge, file systems makes btrfs check a
>> problem. Either running out of memory or it takes too long. So already
>> it isn't scaling as well as ext4 or XFS in this regard.
>>
>> So what's the future hold? It seems like the goal is that the problems
>> must be avoided in the first place rather than to repair them after
>> the fact.
>>
>> Are the problem's Marc is running into understood well enough that
>> there can eventually be a fix, maybe even an on-disk format change,
>> that prevents such problems from happening in the first place?
>>
>> Or does it make sense for him to be running with btrfs debug or some
>> subset of btrfs integrity checking mask to try to catch the problems
>> in the act of them happening?
> 
> Those are all good questions.
> To be fair, I cannot claim that btrfs was at fault for whatever filesystem
> damage I ended up with. It's very possible that it happened due to a flaky
> Sata card that kicked drives off the bus when it shouldn't have.
> Sure in theory a journaling filesystem can recover from unexpected power
> loss and drives dropping off at bad times, but I'm going to guess that
> btrfs' complexity also means that it has data structures (extent tree?) that
> need to be updated completely "or else".
> 
Yes, the extent tree is the hardest part for lowmem mode. I'm quite
confident the tool deals well with file trees (which record metadata
about file and directory names and their relationships).
As for the extent tree, I have little confidence due to its complexity.

> I'm obviously ok with a filesystem check being necessary to recover in cases
> like this, afterall I still occasionally have to run e2fsck on ext4 too, but
> I'm a lot less thrilled with the btrfs situation where basically the repair
> tools can either completely crash your kernel, or take days and then either
> get stuck in an infinite loop or hit an algorithm that can't scale if you
> have too many hardlinks/snapshots.
> 
It's not surprising that real-world filesystems have many snapshots.
Original-mode repair eats a lot of memory, so lowmem mode was created
to save memory at the cost of time. The latter is just not yet robust
enough to handle complex situations.
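
(For reference, the two modes being compared here are selected like this;
the device name is made up, and --repair carries the usual warnings:)

  # original mode: fast, but keeps the whole extent cache in RAM
  btrfs check /dev/mapper/backup
  # lowmem mode: bounded memory use, much slower, still experimental
  btrfs check --mode=lowmem /dev/mapper/backup
  # repair variants of both exist, but are dangerous on a damaged fs
  btrfs check --repair --mode=lowmem /dev/mapper/backup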

> It sounds like there may not be a fix to this problem with the filesystem's
> design, outside of "do not get there, or else".
> It would even be useful for btrfs tools to start computing heuristics and
> output warnings like "you have more than 100 snapshots on this filesystem,
> this is not recommended, please read http://url/"
> 
> Qu, Su, does that sound both reasonable and doable?
> 
> Thanks,
> Marc
> 



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-07-03  4:22                           ` Marc MERLIN
  2018-07-03  8:34                             ` Su Yue
@ 2018-07-03  8:50                             ` Qu Wenruo
  2018-07-03 14:38                               ` Marc MERLIN
  2018-07-03 21:46                               ` Chris Murphy
  1 sibling, 2 replies; 65+ messages in thread
From: Qu Wenruo @ 2018-07-03  8:50 UTC (permalink / raw)
  To: Marc MERLIN, Chris Murphy; +Cc: Su Yue, Btrfs BTRFS



On 2018年07月03日 12:22, Marc MERLIN wrote:
> On Mon, Jul 02, 2018 at 06:31:43PM -0600, Chris Murphy wrote:
>> So the idea behind journaled file systems is that journal replay
>> enabled mount time "repair" that's faster than an fsck. Already Btrfs
>> use cases with big, but not huge, file systems makes btrfs check a
>> problem. Either running out of memory or it takes too long. So already
>> it isn't scaling as well as ext4 or XFS in this regard.
>>
>> So what's the future hold? It seems like the goal is that the problems
>> must be avoided in the first place rather than to repair them after
>> the fact.
>>
>> Are the problem's Marc is running into understood well enough that
>> there can eventually be a fix, maybe even an on-disk format change,
>> that prevents such problems from happening in the first place?
>>
>> Or does it make sense for him to be running with btrfs debug or some
>> subset of btrfs integrity checking mask to try to catch the problems
>> in the act of them happening?
> 
> Those are all good questions.
> To be fair, I cannot claim that btrfs was at fault for whatever filesystem
> damage I ended up with. It's very possible that it happened due to a flaky
> Sata card that kicked drives off the bus when it shouldn't have.

However this still doesn't explain the problem you hit.

In theory (and it is only theory), btrfs transactions are fully atomic,
even for data (with csum and CoW).
So even if a power loss or data corruption happens between transactions,
we should still get back the previous transaction.

There must be something wrong, but due to the size of the fs and the
complexity of the extent tree, I can't tell what.

> Sure in theory a journaling filesystem can recover from unexpected power
> loss and drives dropping off at bad times, but I'm going to guess that
> btrfs' complexity also means that it has data structures (extent tree?) that
> need to be updated completely "or else".

I'm wondering if we have some hidden bug somewhere.
The extent tree is metadata and is protected by mandatory CoW, so it
shouldn't get corrupted unless we have a bug in the already complex
delayed reference code, or some unexpected behavior (flush/FUA failure)
caused by so many layers (dmcrypt + mdraid).

Anyway, if we can't reproduce it in a controlled environment (my VM with
a pretty small and plain fs), it's really hard to locate the bug.

> 
> I'm obviously ok with a filesystem check being necessary to recover in cases
> like this, afterall I still occasionally have to run e2fsck on ext4 too, but
> I'm a lot less thrilled with the btrfs situation where basically the repair
> tools can either completely crash your kernel, or take days and then either
> get stuck in an infinite loop or hit an algorithm that can't scale if you
> have too many hardlinks/snapshots.

Unfortunately, that is the price paid for the super fast snapshot creation.
The tradeoff cannot be easily solved.

(Another way to implement snapshots is like LVM thin provisioning: each
time a snapshot is created, all allocated blocks of the thin LV have to
be iterated, which doesn't scale very well as the fs grows, but makes
the mapping management pretty easy. I think the LVM folks have done some
tricks to improve that performance.)

> 
> It sounds like there may not be a fix to this problem with the filesystem's
> design, outside of "do not get there, or else".
> It would even be useful for btrfs tools to start computing heuristics and
> output warnings like "you have more than 100 snapshots on this filesystem,
> this is not recommended, please read http://url/"

This looks pretty doable, but maybe it's better to add such a warning to
btrfs-progs (to both "subvolume snapshot" and "receive").
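
(In the meantime, a crude version of that warning can be scripted outside
the tools; the threshold and mount point below are just placeholders:)

  count=$(btrfs subvolume list -s /backup | wc -l)
  if [ "$count" -gt 100 ]; then
      echo "warning: $count snapshots on /backup," \
           "btrfs check/repair may not scale well" >&2
  fi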

Thanks,
Qu

> 
> Qu, Su, does that sound both reasonable and doable?
> 
> Thanks,
> Marc
> 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* RE: how to best segment a big block device in resizeable btrfs filesystems?
  2018-07-03  4:15                             ` Marc MERLIN
@ 2018-07-03  9:55                               ` Paul Jones
  2018-07-03 11:29                                 ` Qu Wenruo
  0 siblings, 1 reply; 65+ messages in thread
From: Paul Jones @ 2018-07-03  9:55 UTC (permalink / raw)
  To: Marc MERLIN, Qu Wenruo; +Cc: Su Yue, linux-btrfs

> -----Original Message-----
> From: linux-btrfs-owner@vger.kernel.org <linux-btrfs-
> owner@vger.kernel.org> On Behalf Of Marc MERLIN
> Sent: Tuesday, 3 July 2018 2:16 PM
> To: Qu Wenruo <quwenruo.btrfs@gmx.com>
> Cc: Su Yue <suy.fnst@cn.fujitsu.com>; linux-btrfs@vger.kernel.org
> Subject: Re: how to best segment a big block device in resizeable btrfs
> filesystems?
> 
> On Tue, Jul 03, 2018 at 09:37:47AM +0800, Qu Wenruo wrote:
> > > If I do this, I would have
> > > software raid 5 < dmcrypt < bcache < lvm < btrfs That's a lot of
> > > layers, and that's also starting to make me nervous :)
> >
> > If you could keep the number of snapshots to minimal (less than 10)
> > for each btrfs (and the number of send source is less than 5), one big
> > btrfs may work in that case.
> 
> Well, we kind of discussed this already. If btrfs falls over if you reach
> 100 snapshots or so, and it sure seems to in my case, I won't be much better
> off.
> Having btrfs check --repair fail because 32GB of RAM is not enough, and it's
> unable to use swap, is a big deal in my case. You also confirmed that btrfs
> check lowmem does not scale to filesystems like mine, so this translates into
> "if regular btrfs check repair can't fit in 32GB, I am completely out of luck if
> anything happens to the filesystem"

Just out of curiosity I had a look at my backup filesystem.
vm-server /media/backup # btrfs fi us /media/backup/
Overall:
    Device size:                   5.46TiB
    Device allocated:              3.42TiB
    Device unallocated:            2.04TiB
    Device missing:                  0.00B
    Used:                          1.80TiB
    Free (estimated):              1.83TiB      (min: 1.83TiB)
    Data ratio:                       2.00
    Metadata ratio:                   2.00
    Global reserve:              512.00MiB      (used: 0.00B)

Data,RAID1: Size:1.69TiB, Used:906.26GiB
   /dev/mapper/a-backup--a         1.69TiB
   /dev/mapper/b-backup--b         1.69TiB

Metadata,RAID1: Size:19.00GiB, Used:16.90GiB
   /dev/mapper/a-backup--a        19.00GiB
   /dev/mapper/b-backup--b        19.00GiB

System,RAID1: Size:64.00MiB, Used:336.00KiB
   /dev/mapper/a-backup--a        64.00MiB
   /dev/mapper/b-backup--b        64.00MiB

Unallocated:
   /dev/mapper/a-backup--a         1.02TiB
   /dev/mapper/b-backup--b         1.02TiB

compress=zstd,space_cache=v2
202 snapshots, heavily de-duplicated
551G / 361,000 files in latest snapshot

Btrfs check in normal mode took 12 mins and 11.5G of RAM.
Lowmem mode I stopped after 4 hours; max memory usage was around 3.9G.
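
(In case anyone wants to collect the same numbers, one way to get wall
time and peak memory in one go - the device name here is made up:)

  /usr/bin/time -v btrfs check --mode=lowmem /dev/mapper/backup 2>&1 \
      | tee check-lowmem.log
  # "Elapsed (wall clock) time" and "Maximum resident set size"
  # in the output are the two figures quoted above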

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: how to best segment a big block device in resizeable btrfs filesystems?
  2018-07-03  9:55                               ` Paul Jones
@ 2018-07-03 11:29                                 ` Qu Wenruo
  0 siblings, 0 replies; 65+ messages in thread
From: Qu Wenruo @ 2018-07-03 11:29 UTC (permalink / raw)
  To: Paul Jones, Marc MERLIN; +Cc: Su Yue, linux-btrfs



On 2018年07月03日 17:55, Paul Jones wrote:
>> -----Original Message-----
>> From: linux-btrfs-owner@vger.kernel.org <linux-btrfs-
>> owner@vger.kernel.org> On Behalf Of Marc MERLIN
>> Sent: Tuesday, 3 July 2018 2:16 PM
>> To: Qu Wenruo <quwenruo.btrfs@gmx.com>
>> Cc: Su Yue <suy.fnst@cn.fujitsu.com>; linux-btrfs@vger.kernel.org
>> Subject: Re: how to best segment a big block device in resizeable btrfs
>> filesystems?
>>
>> On Tue, Jul 03, 2018 at 09:37:47AM +0800, Qu Wenruo wrote:
>>>> If I do this, I would have
>>>> software raid 5 < dmcrypt < bcache < lvm < btrfs That's a lot of
>>>> layers, and that's also starting to make me nervous :)
>>>
>>> If you could keep the number of snapshots to minimal (less than 10)
>>> for each btrfs (and the number of send source is less than 5), one big
>>> btrfs may work in that case.
>>
>> Well, we kind of discussed this already. If btrfs falls over if you reach
>> 100 snapshots or so, and it sure seems to in my case, I won't be much better
>> off.
>> Having btrfs check --repair fail because 32GB of RAM is not enough, and it's
>> unable to use swap, is a big deal in my case. You also confirmed that btrfs
>> check lowmem does not scale to filesystems like mine, so this translates into
>> "if regular btrfs check repair can't fit in 32GB, I am completely out of luck if
>> anything happens to the filesystem"
> 
> Just out of curiosity I had a look at my backup filesystem.
> vm-server /media/backup # btrfs fi us /media/backup/
> Overall:
>     Device size:                   5.46TiB
>     Device allocated:              3.42TiB
>     Device unallocated:            2.04TiB
>     Device missing:                  0.00B
>     Used:                          1.80TiB
>     Free (estimated):              1.83TiB      (min: 1.83TiB)
>     Data ratio:                       2.00
>     Metadata ratio:                   2.00
>     Global reserve:              512.00MiB      (used: 0.00B)
> 
> Data,RAID1: Size:1.69TiB, Used:906.26GiB

It doesn't affect how fast check runs at all, unless --check-data-csum
is specified.

And even if --check-data-csum is specified, most reads will still be
sequential, and deduped/reflinked extents won't affect the csum
verification speed.

>    /dev/mapper/a-backup--a         1.69TiB
>    /dev/mapper/b-backup--b         1.69TiB
> 
> Metadata,RAID1: Size:19.00GiB, Used:16.90GiB

This is the main factor contributing to btrfs check time.
Just consider it as the minimal amount of data btrfs check needs to read.

>    /dev/mapper/a-backup--a        19.00GiB
>    /dev/mapper/b-backup--b        19.00GiB
> 
> System,RAID1: Size:64.00MiB, Used:336.00KiB
>    /dev/mapper/a-backup--a        64.00MiB
>    /dev/mapper/b-backup--b        64.00MiB
> 
> Unallocated:
>    /dev/mapper/a-backup--a         1.02TiB
>    /dev/mapper/b-backup--b         1.02TiB
> 
> compress=zstd,space_cache=v2
> 202 snapshots, heavily de-duplicated
> 551G / 361,000 files in latest snapshot

No wonder it's so slow for lowmem mode.

> 
> Btrfs check normal mode took 12 mins and 11.5G ram
> Lowmem mode I stopped after 4 hours, max memory usage was around 3.9G

For lowmem mode, btrfs check will use 25% of your total memory as cache
to speed it up a little (but as you can see, it's still slow).
Maybe we could add an option to control how many bytes lowmem mode is
allowed to use.

Thanks,
Qu

> 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-07-03  8:50                             ` Qu Wenruo
@ 2018-07-03 14:38                               ` Marc MERLIN
  2018-07-03 21:46                               ` Chris Murphy
  1 sibling, 0 replies; 65+ messages in thread
From: Marc MERLIN @ 2018-07-03 14:38 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Chris Murphy, Su Yue, Btrfs BTRFS

On Tue, Jul 03, 2018 at 04:50:48PM +0800, Qu Wenruo wrote:
> > It sounds like there may not be a fix to this problem with the filesystem's
> > design, outside of "do not get there, or else".
> > It would even be useful for btrfs tools to start computing heuristics and
> > output warnings like "you have more than 100 snapshots on this filesystem,
> > this is not recommended, please read http://url/"
> 
> This looks pretty doable, but maybe it's better to add some warning at
> btrfs progs (both "subvolume snapshot" and "receive").

This is what I meant to say, correct.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-07-03  8:34                             ` Su Yue
@ 2018-07-03 21:34                               ` Chris Murphy
  2018-07-03 21:40                                 ` Marc MERLIN
  0 siblings, 1 reply; 65+ messages in thread
From: Chris Murphy @ 2018-07-03 21:34 UTC (permalink / raw)
  To: Su Yue; +Cc: Marc MERLIN, Chris Murphy, Qu Wenruo, Btrfs BTRFS

On Tue, Jul 3, 2018 at 2:34 AM, Su Yue <suy.fnst@cn.fujitsu.com> wrote:

> Yes, extent tree is the hardest part for lowmem mode. I'm quite
> confident the tool can deal well with file trees(which records metadata
> about file and directory name, relationships).
> As for extent tree, I have few confidence due to its complexity.

I have to ask again if there's some metadata integrity mask option Marc
should use to try to catch the cause of the corruption in the first place?

His use case really can't afford either mode of btrfs check. Also,
check is only backward looking; it doesn't show what was happening at
the time. And for big file systems, check quickly stops scaling
anyway.

And now he's modifying his layout to keep the problem from happening
again, which makes it less likely we catch the cause and get it fixed.
I think if he's willing to build a kernel with the integrity checker
enabled, it should be considered, but only if it's likely to reveal why
the problem is happening, even if it can't repair the problem once
it's happened. He's already in that situation, so masked integrity
checking is no worse; at least it gives a chance to improve Btrfs
rather than leaving it a mystery how the filesystem got corrupted.

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-07-03 21:34                               ` Chris Murphy
@ 2018-07-03 21:40                                 ` Marc MERLIN
  2018-07-04  1:37                                   ` Su Yue
  0 siblings, 1 reply; 65+ messages in thread
From: Marc MERLIN @ 2018-07-03 21:40 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Su Yue, Qu Wenruo, Btrfs BTRFS

On Tue, Jul 03, 2018 at 03:34:45PM -0600, Chris Murphy wrote:
> On Tue, Jul 3, 2018 at 2:34 AM, Su Yue <suy.fnst@cn.fujitsu.com> wrote:
> 
> > Yes, extent tree is the hardest part for lowmem mode. I'm quite
> > confident the tool can deal well with file trees(which records metadata
> > about file and directory name, relationships).
> > As for extent tree, I have few confidence due to its complexity.
> 
> I have to ask again if there's some metadata integrity mask opion Marc
> should use to try to catch the corruption cause in the first place?
> 
> His use case really can't afford either mode of btrfs check. And also
> check is only backward looking, it doesn't show what was happening at
> the time. And for big file systems, check rapidly doesn't scale at all
> anyway.
> 
> And now he's modifying his layout to avoid the problem from happening
> again which makes it less likely to catch the cause, and get it fixed.
> I think if he's willing to build a kernel with integrity checker
> enabled, it should be considered but only if it's likely to reveal why
> the problem is happening, even if it can't repair the problem once
> it's happened. He's already in that situation so masked integrity
> checking is no worse, at least it gives a chance to improve Btrfs
> rather than it being a mystery how it got corrupt.

Yeah, I'm fine waiting a few more days with this down and gathering data
if that helps.
But due to the size, a full btrfs image may be a bit larger than we
want, not counting some confidential data in some filenames.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-07-03  8:50                             ` Qu Wenruo
  2018-07-03 14:38                               ` Marc MERLIN
@ 2018-07-03 21:46                               ` Chris Murphy
  2018-07-03 22:00                                 ` Marc MERLIN
  1 sibling, 1 reply; 65+ messages in thread
From: Chris Murphy @ 2018-07-03 21:46 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Marc MERLIN, Chris Murphy, Su Yue, Btrfs BTRFS

On Tue, Jul 3, 2018 at 2:50 AM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
> There must be something wrong, however due to the size of the fs, and
> the complexity of extent tree, I can't tell.

Right, which is why I'm asking if any of the metadata integrity
checker mask options might reveal what's going wrong?

I guess the big issues are:
a. compiling the kernel with CONFIG_BTRFS_FS_CHECK_INTEGRITY=y is necessary
b. it can come with a high resource burden depending on the mask and
where the log is being written (write system logs to a different file
system for sure)
c. the granularity offered in the integrity checker might not be enough.
d. it might take a while after a corruption is introduced before it is
noticed and flagged.

So it might be pointless, no idea.
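
(For completeness, the knobs involved look roughly like this; it's a
sketch only, and any specific print mask bits would have to be looked up
in fs/btrfs/check-integrity.c first:)

  # kernel build
  CONFIG_BTRFS_FS_CHECK_INTEGRITY=y
  # mount with the checker enabled for metadata only
  mount -o check_int /dev/sdX /mnt
  # or include data, and set a print mask for more verbose logging
  mount -o check_int_data,check_int_print_mask=<mask> /dev/sdX /mnt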


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-07-03 21:46                               ` Chris Murphy
@ 2018-07-03 22:00                                 ` Marc MERLIN
  2018-07-03 22:52                                   ` Qu Wenruo
  0 siblings, 1 reply; 65+ messages in thread
From: Marc MERLIN @ 2018-07-03 22:00 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Qu Wenruo, Su Yue, Btrfs BTRFS

On Tue, Jul 03, 2018 at 03:46:59PM -0600, Chris Murphy wrote:
> On Tue, Jul 3, 2018 at 2:50 AM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> >
> >
> > There must be something wrong, however due to the size of the fs, and
> > the complexity of extent tree, I can't tell.
> 
> Right, which is why I'm asking if any of the metadata integrity
> checker mask options might reveal what's going wrong?
> 
> I guess the big issues are:
> a. compile kernel with CONFIG_BTRFS_FS_CHECK_INTEGRITY=y is necessary
> b. it can come with a high resource burden depending on the mask and
> where the log is being written (write system logs to a different file
> system for sure)
> c. the granularity offered in the integrity checker might not be enough.
> d. might take a while before corruptions are injected before
> corruption is noticed and flagged.

Back to where I'm at right now. I'm going to delete this filesystem and
start over very soon. Tomorrow or the day after.
I'm happy to get more data off it if someone wants it for posterity, but
I indeed need to recover soon since being with a dead backup server is
not a good place to be in :)

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-07-03 22:00                                 ` Marc MERLIN
@ 2018-07-03 22:52                                   ` Qu Wenruo
  0 siblings, 0 replies; 65+ messages in thread
From: Qu Wenruo @ 2018-07-03 22:52 UTC (permalink / raw)
  To: Marc MERLIN, Chris Murphy; +Cc: Su Yue, Btrfs BTRFS



On 2018年07月04日 06:00, Marc MERLIN wrote:
> On Tue, Jul 03, 2018 at 03:46:59PM -0600, Chris Murphy wrote:
>> On Tue, Jul 3, 2018 at 2:50 AM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>>
>>>
>>> There must be something wrong, however due to the size of the fs, and
>>> the complexity of extent tree, I can't tell.
>>
>> Right, which is why I'm asking if any of the metadata integrity
>> checker mask options might reveal what's going wrong?
>>
>> I guess the big issues are:
>> a. compile kernel with CONFIG_BTRFS_FS_CHECK_INTEGRITY=y is necessary
>> b. it can come with a high resource burden depending on the mask and
>> where the log is being written (write system logs to a different file
>> system for sure)
>> c. the granularity offered in the integrity checker might not be enough.
>> d. might take a while before corruptions are injected before
>> corruption is noticed and flagged.
> 
> Back to where I'm at right now. I'm going to delete this filesystem and
> start over very soon. Tomorrow or the day after.
> I'm happy to get more data off it if someone wants it for posterity, but
> I indeed need to recover soon since being with a dead backup server is
> not a good place to be in :)

Feel free to recover asap, as the extent tree is really too large for
human to analyse manually.

Thanks,
Qu

> 
> Thanks,
> Marc
> 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-07-03 21:40                                 ` Marc MERLIN
@ 2018-07-04  1:37                                   ` Su Yue
  0 siblings, 0 replies; 65+ messages in thread
From: Su Yue @ 2018-07-04  1:37 UTC (permalink / raw)
  To: Marc MERLIN, Chris Murphy; +Cc: Qu Wenruo, Btrfs BTRFS



On 07/04/2018 05:40 AM, Marc MERLIN wrote:
> On Tue, Jul 03, 2018 at 03:34:45PM -0600, Chris Murphy wrote:
>> On Tue, Jul 3, 2018 at 2:34 AM, Su Yue <suy.fnst@cn.fujitsu.com> wrote:
>>
>>> Yes, extent tree is the hardest part for lowmem mode. I'm quite
>>> confident the tool can deal well with file trees(which records metadata
>>> about file and directory name, relationships).
>>> As for extent tree, I have few confidence due to its complexity.
>>
>> I have to ask again if there's some metadata integrity mask opion Marc
>> should use to try to catch the corruption cause in the first place?
>>
>> His use case really can't afford either mode of btrfs check. And also
>> check is only backward looking, it doesn't show what was happening at
>> the time. And for big file systems, check rapidly doesn't scale at all
>> anyway.
>>
>> And now he's modifying his layout to avoid the problem from happening
>> again which makes it less likely to catch the cause, and get it fixed.
>> I think if he's willing to build a kernel with integrity checker
>> enabled, it should be considered but only if it's likely to reveal why
>> the problem is happening, even if it can't repair the problem once
>> it's happened. He's already in that situation so masked integrity
>> checking is no worse, at least it gives a chance to improve Btrfs
>> rather than it being a mystery how it got corrupt.
> 
> Yeah, I'm fine waiting a few more ays with this down and gather data if
> that helps.
Thanks! I will write a special version which skips checking the wrong
extent items and prints debug logs.
It should also run faster, to help us locate where the check gets stuck.

Su
> But due to the size, a full btrfs image may be a bit larger than we
> want, not counting some confidential data in some filenames.
> 
> Marc
> 



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: how to best segment a big block device in resizeable btrfs filesystems?
  2018-07-03  7:15                                   ` Duncan
@ 2018-07-06  4:28                                     ` Andrei Borzenkov
  2018-07-08  8:05                                       ` Duncan
  0 siblings, 1 reply; 65+ messages in thread
From: Andrei Borzenkov @ 2018-07-06  4:28 UTC (permalink / raw)
  To: Duncan, linux-btrfs

03.07.2018 10:15, Duncan wrote:
> Andrei Borzenkov posted on Tue, 03 Jul 2018 07:25:14 +0300 as excerpted:
> 
>> 02.07.2018 21:35, Austin S. Hemmelgarn wrote:
>>> them (trimming blocks on BTRFS gets rid of old root trees, so it's a
>>> bit dangerous to do it while writes are happening).
>>
>> Could you please elaborate? Do you mean btrfs can trim data before new
>> writes are actually committed to disk?
> 
> No.
> 
> But normally old roots aren't rewritten for some time simply due to odds 
> (fuller filesystems will of course recycle them sooner), and the btrfs 
> mount option usebackuproot (formerly recovery, until the norecovery mount 
> option that parallels that of other filesystems was added and this option 
> was renamed to avoid confusion) can be used to try an older root if the 
> current root is too damaged to successfully mount.
> 
> But other than simply by odds not using them again immediately, btrfs has
> no special protection for those old roots, and trim/discard will recover 
> them to hardware-unused as it does any other unused space, tho whether it 
> simply marks them for later processing or actually processes them 
> immediately is up to the individual implementation -- some do it 
> immediately, killing all chances at using the backup root because it's 
> already zeroed out, some don't.
> 

How is it relevant to "while writes are happening"? Will trimming old
tress immediately after writes have stopped be any different? Why?

> In the context of the discard mount option, that can mean there's never 
> any old roots available ever, as they've already been cleaned up by the 
> hardware due to the discard option telling the hardware to do it.
> 
> But even not using that mount option, and simply doing the trims 
> periodically, as done weekly by for instance the systemd fstrim timer and 
> service units, or done manually if you prefer, obviously potentially 
> wipes the old roots at that point.  If the system's effectively idle at 
> the time, not much risk as the current commit is likely to represent a 
> filesystem in full stasis, but if there's lots of writes going on at that 
> moment *AND* the system happens to crash at just the wrong time, before 
> additional commits have recreated at least a bit of root history, again, 
> you'll potentially be left without any old roots for the usebackuproot 
> mount option to try to fall back to, should it actually be necessary.
> 

Sorry? You are just saying that "previous state can be discarded before
new state is committed", just more verbosely.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: how to best segment a big block device in resizeable btrfs filesystems?
  2018-07-06  4:28                                     ` Andrei Borzenkov
@ 2018-07-08  8:05                                       ` Duncan
  0 siblings, 0 replies; 65+ messages in thread
From: Duncan @ 2018-07-08  8:05 UTC (permalink / raw)
  To: linux-btrfs

Andrei Borzenkov posted on Fri, 06 Jul 2018 07:28:48 +0300 as excerpted:

> 03.07.2018 10:15, Duncan wrote:
>> Andrei Borzenkov posted on Tue, 03 Jul 2018 07:25:14 +0300 as
>> excerpted:
>> 
>>> 02.07.2018 21:35, Austin S. Hemmelgarn wrote:
>>>> them (trimming blocks on BTRFS gets rid of old root trees, so it's a
>>>> bit dangerous to do it while writes are happening).
>>>
>>> Could you please elaborate? Do you mean btrfs can trim data before new
>>> writes are actually committed to disk?
>> 
>> No.
>> 
>> But normally old roots aren't rewritten for some time simply due to
>> odds (fuller filesystems will of course recycle them sooner), and the
>> btrfs mount option usebackuproot (formerly recovery, until the
>> norecovery mount option that parallels that of other filesystems was
>> added and this option was renamed to avoid confusion) can be used to
>> try an older root if the current root is too damaged to successfully
>> mount.

>> But other than simply by odds not using them again immediately, btrfs
>> has
>> no special protection for those old roots, and trim/discard will
>> recover them to hardware-unused as it does any other unused space, tho
>> whether it simply marks them for later processing or actually processes
>> them immediately is up to the individual implementation -- some do it
>> immediately, killing all chances at using the backup root because it's
>> already zeroed out, some don't.
>> 
>> 
> How is it relevant to "while writes are happening"? Will trimming old
> tress immediately after writes have stopped be any different? Why?

Define "while writes are happening" vs. "immediately after writes have 
stopped".  How soon is "immediately", and does the writes stopped 
condition account for data that has reached the device-hardware write 
buffer (so is no longer being transmitted to the device across the bus) 
but not been actually written to media, or not?

On a reasonably quiescent system, multiple empty write cycles are likely 
to have occurred since the last write barrier, and anything in-process is 
likely to have made it to media even if software is missing a write 
barrier it needs (software bug) or the hardware lies about honoring the 
write barrier (hardware bug, allegedly sometimes deliberate on hardware 
willing to gamble with your data that a crash won't happen in a critical 
moment, a somewhat rare occurrence, in ordered to improve normal 
operation performance metrics).

On an IO-maxed system, data and write-barriers are coming down as fast as 
the system can handle them, and write-barriers become critical -- if a 
crash comes after something was supposed to get to media but didn't, 
either because of a missing write barrier or because the hardware/firmware 
lied about the barrier and claimed the data it was supposed to ensure was 
on-media when it wasn't, the btrfs atomic-cow guarantees of consistent 
state at each commit go out the window.

At this point it becomes useful to have a number of previous "guaranteed 
consistent state" roots to fall back on, with the /hope/ being that at 
least /one/ of them is usably consistent.  If all but the last one are 
wiped due to trim...

When the system isn't write-maxed the write will have almost certainly 
made it regardless of whether the barrier is there or not, because 
there's enough idle time to finish the current write before another one 
comes down the pipe, so the last-written root is almost certain to be 
fine regardless of barriers, and the history of past roots doesn't matter 
even if there's a crash.

If "immediately after writes have stopped" is strictly defined as a 
condition when all writes including the btrfs commit updating the current 
root and the superblock pointers to the current root have completed, with 
no new writes coming down the pipe in the mean time that might have 
delayed a critical update if a barrier was missed, then trimming old 
roots in this state should be entirely safe, and the distinction between 
that state and the "while writes are happening" is clear.

But if "immediately after writes have stopped" is less strictly defined, 
then the distinction between that state and "while writes are happening" 
remains blurry at best, and having old roots around to fall back on in 
case a write-barrier was missed (for whatever reason, hardware or 
software) becomes a very good thing.

Of course the fact that trim/discard itself is an instruction written to 
the device in the combined command/data stream complexifies the picture 
substantially.  If those write barriers get missed who knows what state 
the new root is in, and if the old ones got erased...  But again, on a 
mostly idle system, it'll probably all "just work", because the writes 
will likely all make it to media, regardless, because there's not a bunch 
of other writes competing for limited write bandwidth and making ordering 
critical.

>> In the context of the discard mount option, that can mean there's never
>> any old roots available ever, as they've already been cleaned up by the
>> hardware due to the discard option telling the hardware to do it.
>> 
>> But even not using that mount option, and simply doing the trims
>> periodically, as done weekly by for instance the systemd fstrim timer
>> and service units, or done manually if you prefer, obviously
>> potentially wipes the old roots at that point.  If the system's
>> effectively idle at the time, not much risk as the current commit is
>> likely to represent a filesystem in full stasis, but if there's lots of
>> writes going on at that moment *AND* the system happens to crash at
>> just the wrong time, before additional commits have recreated at least
>> a bit of root history, again, you'll potentially be left without any
>> old roots for the usebackuproot mount option to try to fall back to,
>> should it actually be necessary.
>> 
>> 
> Sorry? You are just saying that "previous state can be discarded before
> new state is committed", just more verbosely.

No, it's more that the new state gets committed before the old is trimmed,
but should the new state turn out to be unusable (due to missing write
barriers, etc., which is more of an issue on a write-bottlenecked system),
having a history of old roots/states around to fall back to can be very
useful.
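
(If it ever does come to that, a rough sketch of hunting for and using one
of those older roots - device name assumed, nothing here writes to the
source filesystem:)

  # list older tree roots still present on the device
  btrfs-find-root /dev/sdX
  # copy files out using one of those roots (bytenr from the output above)
  btrfs restore -t <bytenr> /dev/sdX /path/to/recovery/dir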

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 65+ messages in thread

end of thread, other threads:[~2018-07-08  8:07 UTC | newest]

Thread overview: 65+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-06-29  4:27 So, does btrfs check lowmem take days? weeks? Marc MERLIN
2018-06-29  5:07 ` Qu Wenruo
2018-06-29  5:28   ` Marc MERLIN
2018-06-29  5:48     ` Qu Wenruo
2018-06-29  6:06       ` Marc MERLIN
2018-06-29  6:29         ` Qu Wenruo
2018-06-29  6:59           ` Marc MERLIN
2018-06-29  7:09             ` Roman Mamedov
2018-06-29  7:22               ` Marc MERLIN
2018-06-29  7:34                 ` Roman Mamedov
2018-06-29  8:04                 ` Lionel Bouton
2018-06-29 16:24                   ` btrfs send/receive vs rsync Marc MERLIN
2018-06-30  8:18                     ` Duncan
2018-06-29  7:20             ` So, does btrfs check lowmem take days? weeks? Qu Wenruo
2018-06-29  7:28               ` Marc MERLIN
2018-06-29 17:10                 ` Marc MERLIN
2018-06-30  0:04                   ` Chris Murphy
2018-06-30  2:44                   ` Marc MERLIN
2018-06-30 14:49                     ` Qu Wenruo
2018-06-30 21:06                       ` Marc MERLIN
2018-06-29  6:02     ` Su Yue
2018-06-29  6:10       ` Marc MERLIN
2018-06-29  6:32         ` Su Yue
2018-06-29  6:43           ` Marc MERLIN
2018-07-01 23:22             ` Marc MERLIN
2018-07-02  2:02               ` Su Yue
2018-07-02  3:22                 ` Marc MERLIN
2018-07-02  6:22                   ` Su Yue
2018-07-02 14:05                     ` Marc MERLIN
2018-07-02 14:42                       ` Qu Wenruo
2018-07-02 15:18                         ` how to best segment a big block device in resizeable btrfs filesystems? Marc MERLIN
2018-07-02 16:59                           ` Austin S. Hemmelgarn
2018-07-02 17:34                             ` Marc MERLIN
2018-07-02 18:35                               ` Austin S. Hemmelgarn
2018-07-02 19:40                                 ` Marc MERLIN
2018-07-03  4:25                                 ` Andrei Borzenkov
2018-07-03  7:15                                   ` Duncan
2018-07-06  4:28                                     ` Andrei Borzenkov
2018-07-08  8:05                                       ` Duncan
2018-07-03  0:51                           ` Paul Jones
2018-07-03  4:06                             ` Marc MERLIN
2018-07-03  4:26                               ` Paul Jones
2018-07-03  5:42                                 ` Marc MERLIN
2018-07-03  1:37                           ` Qu Wenruo
2018-07-03  4:15                             ` Marc MERLIN
2018-07-03  9:55                               ` Paul Jones
2018-07-03 11:29                                 ` Qu Wenruo
2018-07-03  4:23                             ` Andrei Borzenkov
2018-07-02 15:19                         ` So, does btrfs check lowmem take days? weeks? Marc MERLIN
2018-07-02 17:08                           ` Austin S. Hemmelgarn
2018-07-02 17:33                           ` Roman Mamedov
2018-07-02 17:39                             ` Marc MERLIN
2018-07-03  0:31                         ` Chris Murphy
2018-07-03  4:22                           ` Marc MERLIN
2018-07-03  8:34                             ` Su Yue
2018-07-03 21:34                               ` Chris Murphy
2018-07-03 21:40                                 ` Marc MERLIN
2018-07-04  1:37                                   ` Su Yue
2018-07-03  8:50                             ` Qu Wenruo
2018-07-03 14:38                               ` Marc MERLIN
2018-07-03 21:46                               ` Chris Murphy
2018-07-03 22:00                                 ` Marc MERLIN
2018-07-03 22:52                                   ` Qu Wenruo
2018-06-29  5:35   ` Su Yue
2018-06-29  5:46     ` Marc MERLIN
