* So, does btrfs check lowmem take days? weeks?
@ 2018-06-29  4:27 Marc MERLIN
  2018-06-29  5:07 ` Qu Wenruo
  0 siblings, 1 reply; 72+ messages in thread
From: Marc MERLIN @ 2018-06-29  4:27 UTC (permalink / raw)
  To: linux-btrfs

Regular btrfs check --repair has a nice progress option. It wasn't
perfect, but it showed something.

But then it also takes all your memory quicker than the linux kernel can
defend itself and reliably completely kills my 32GB server quicker than
it can OOM anything.

lowmem repair seems to be going still, but it's been days and -p seems
to do absolutely nothing.

My filesystem is "only" 10TB or so, albeit with a lot of files.

2 things that come to mind
1) can lowmem have some progress working so that I know if I'm looking
at days, weeks, or even months before it will be done?

2) non lowmem is more efficient obviously when it doesn't completely
crash your machine, but could lowmem be given an amount of memory to use
for caching, or maybe use some heuristics based on RAM free so that it's
not so excruciatingly slow?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08


* Re: So, does btrfs check lowmem take days? weeks?
  2018-06-29  4:27 So, does btrfs check lowmem take days? weeks? Marc MERLIN
@ 2018-06-29  5:07 ` Qu Wenruo
  2018-06-29  5:28   ` Marc MERLIN
  2018-06-29  5:35   ` Su Yue
  0 siblings, 2 replies; 72+ messages in thread
From: Qu Wenruo @ 2018-06-29  5:07 UTC (permalink / raw)
  To: Marc MERLIN, linux-btrfs





On 2018-06-29 12:27, Marc MERLIN wrote:
> Regular btrfs check --repair has a nice progress option. It wasn't
> perfect, but it showed something.
> 
> But then it also takes all your memory quicker than the linux kernel can
> defend itself and reliably completely kills my 32GB server quicker than
> it can OOM anything.
> 
> lowmem repair seems to be going still, but it's been days and -p seems
> to do absolutely nothing.

I'm afraid you hit a bug in the lowmem repair code.
By all means, --repair shouldn't really be used unless you're pretty
sure the problem is something btrfs check can handle.

That's also why --repair is still marked as dangerous.
Especially when it's combined with experimental lowmem mode.

> 
> My filesystem is "only" 10TB or so, albeit with a lot of files.

Unless you have tons of snapshots and reflinked (deduped) files, it
shouldn't take so long.

> 
> 2 things that come to mind
> 1) can lowmem have some progress working so that I know if I'm looking
> at days, weeks, or even months before it will be done?

It's hard to estimate, especially when every cross check involves a lot
of disk IO.

But at least, we could add such indicator to show we're doing something.

> 
> 2) non lowmem is more efficient obviously when it doesn't completely
> crash your machine, but could lowmem be given an amount of memory to use
> for caching, or maybe use some heuristics based on RAM free so that it's
> not so excruciatingly slow?

IIRC recent commit has added the ability.
a5ce5d219822 ("btrfs-progs: extent-cache: actually cache extent buffers")

That's already included in btrfs-progs v4.13.2.
So it should be a dead loop which lowmem repair code can't handle.

Thanks,
Qu

> 
> Thanks,
> Marc
> 




* Re: So, does btrfs check lowmem take days? weeks?
  2018-06-29  5:07 ` Qu Wenruo
@ 2018-06-29  5:28   ` Marc MERLIN
  2018-06-29  5:48     ` Qu Wenruo
  2018-06-29  6:02     ` Su Yue
  2018-06-29  5:35   ` Su Yue
  1 sibling, 2 replies; 72+ messages in thread
From: Marc MERLIN @ 2018-06-29  5:28 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Fri, Jun 29, 2018 at 01:07:20PM +0800, Qu Wenruo wrote:
> > lowmem repair seems to be going still, but it's been days and -p seems
> > to do absolutely nothing.
> 
> I'm afraid you hit a bug in the lowmem repair code.
> By all means, --repair shouldn't really be used unless you're pretty
> sure the problem is something btrfs check can handle.
> 
> That's also why --repair is still marked as dangerous.
> Especially when it's combined with experimental lowmem mode.

Understood, but btrfs got corrupted (by itself or not, I don't know)
I cannot mount the filesystem read/write
I cannot btrfs check --repair it since that code will kill my machine
What do I have left?

> > My filesystem is "only" 10TB or so, albeit with a lot of files.
> 
> Unless you have tons of snapshots and reflinked (deduped) files, it
> shouldn't take so long.

I may have a fair amount.
gargamel:~# btrfs check --mode=lowmem --repair -p /dev/mapper/dshelf2 
enabling repair mode
WARNING: low-memory mode repair support is only partial
Checking filesystem on /dev/mapper/dshelf2
UUID: 0f1a0c9f-4e54-4fa7-8736-fd50818ff73d
Fixed 0 roots.
ERROR: extent[84302495744, 69632] referencer count mismatch (root: 21872, owner: 374857, offset: 3407872) wanted: 3, have: 4
Created new chunk [18457780224000 1073741824]
Delete backref in extent [84302495744 69632]
ERROR: extent[84302495744, 69632] referencer count mismatch (root: 22911, owner: 374857, offset: 3407872) wanted: 3, have: 4
Delete backref in extent [84302495744 69632]
ERROR: extent[125712527360, 12214272] referencer count mismatch (root: 21872, owner: 374857, offset: 114540544) wanted: 181, have: 240
Delete backref in extent [125712527360 12214272]
ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 21872, owner: 374857, offset: 126754816) wanted: 68, have: 115
Delete backref in extent [125730848768 5111808]
ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 22911, owner: 374857, offset: 126754816) wanted: 68, have: 115
Delete backref in extent [125730848768 5111808]
ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 21872, owner: 374857, offset: 131866624) wanted: 115, have: 143
Delete backref in extent [125736914944 6037504]
ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 22911, owner: 374857, offset: 131866624) wanted: 115, have: 143
Delete backref in extent [125736914944 6037504]
ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 21872, owner: 374857, offset: 148234240) wanted: 302, have: 431
Delete backref in extent [129952120832 20242432]
ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 22911, owner: 374857, offset: 148234240) wanted: 356, have: 433
Delete backref in extent [129952120832 20242432]
ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 21872, owner: 374857, offset: 180371456) wanted: 161, have: 240
Delete backref in extent [134925357056 11829248]
ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 22911, owner: 374857, offset: 180371456) wanted: 162, have: 240
Delete backref in extent [134925357056 11829248]
ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 21872, owner: 374857, offset: 192200704) wanted: 170, have: 249
Delete backref in extent [147895111680 12345344]
ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 22911, owner: 374857, offset: 192200704) wanted: 172, have: 251
Delete backref in extent [147895111680 12345344]
ERROR: extent[150850146304, 17522688] referencer count mismatch (root: 21872, owner: 374857, offset: 217653248) wanted: 348, have: 418
Delete backref in extent [150850146304 17522688]
ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 22911, owner: 374857, offset: 235175936) wanted: 555, have: 1449
Deleted root 2 item[156909494272, 178, 5476627808561673095]
ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 21872, owner: 374857, offset: 235175936) wanted: 556, have: 1452
Deleted root 2 item[156909494272, 178, 7338474132555182983]
ERROR: file extent[374857 235184128] root 21872 owner 21872 backref lost
Add one extent data backref [156909494272 55320576]
ERROR: file extent[374857 235184128] root 22911 owner 22911 backref lost
Add one extent data backref [156909494272 55320576]

The last two ERROR lines took over a day to get generated, so I'm not sure if it's still working, but just slowly.
For what it's worth non lowmem check used to take 12 to 24H on that filesystem back when it still worked.

> > 2 things that come to mind
> > 1) can lowmem have some progress working so that I know if I'm looking
> > at days, weeks, or even months before it will be done?
> 
> It's hard to estimate, especially when every cross check involves a lot
> of disk IO.
> But at least, we could add such indicator to show we're doing something.

Yes, anything to show that I should still wait is still good :)

> > 2) non lowmem is more efficient obviously when it doesn't completely
> > crash your machine, but could lowmem be given an amount of memory to use
> > for caching, or maybe use some heuristics based on RAM free so that it's
> > not so excruciatingly slow?
> 
> IIRC recent commit has added the ability.
> a5ce5d219822 ("btrfs-progs: extent-cache: actually cache extent buffers")
 
Oh, good.

> That's already included in btrfs-progs v4.13.2.
> So it should be a dead loop which lowmem repair code can't handle.

I see. Is there any reasonably easy way to check on this running process?

Both top and iotop show that it's working, but of course I can't tell if
it's looping, or not.

Then again, maybe it already fixed enough that I can mount my filesystem again.

But back to the main point, it's sad that after so many years, the
repair situation is still so suboptimal, especially when it's apparently
pretty easy for btrfs to get damaged (through its own fault or not, hard
to say).

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08


* Re: So, does btrfs check lowmem take days? weeks?
  2018-06-29  5:07 ` Qu Wenruo
  2018-06-29  5:28   ` Marc MERLIN
@ 2018-06-29  5:35   ` Su Yue
  2018-06-29  5:46     ` Marc MERLIN
  1 sibling, 1 reply; 72+ messages in thread
From: Su Yue @ 2018-06-29  5:35 UTC (permalink / raw)
  To: Qu Wenruo, Marc MERLIN, linux-btrfs



On 06/29/2018 01:07 PM, Qu Wenruo wrote:
> 
> 
> On 2018-06-29 12:27, Marc MERLIN wrote:
>> Regular btrfs check --repair has a nice progress option. It wasn't
>> perfect, but it showed something.
>>
>> But then it also takes all your memory quicker than the linux kernel can
>> defend itself and reliably completely kills my 32GB server quicker than
>> it can OOM anything.
>>
>> lowmem repair seems to be going still, but it's been days and -p seems
>> to do absolutely nothing.
> 
> I'm afraid you hit a bug in the lowmem repair code.
> By all means, --repair shouldn't really be used unless you're pretty
> sure the problem is something btrfs check can handle.
> 
> That's also why --repair is still marked as dangerous.
> Especially when it's combined with experimental lowmem mode.
>
>>
>> My filesystem is "only" 10TB or so, albeit with a lot of files.
> 
> Unless you have tons of snapshots and reflinked (deduped) files, it
> shouldn't take so long.
> 
>>
>> 2 things that come to mind
>> 1) can lowmem have some progress working so that I know if I'm looking
>> at days, weeks, or even months before it will be done?
> 
> It's hard to estimate, especially when every cross check involves a lot
> of disk IO.
> 
> But at least, we could add such indicator to show we're doing something.
Maybe we can account all roots in the root tree first, then report
i/num_roots while checking each tree. So users can see whether the check
is doing something meaningful or is stuck in a silly dead loop.
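
Until something like that exists in check itself, a rough way to at least
get the denominator from outside the tool is something like this (only a
sketch; dump-tree options and output wording vary a bit between btrfs-progs
versions, so treat the grep as an approximation):

  num_roots=$(btrfs inspect-internal dump-tree -t root /dev/mapper/dshelf2 \
              | grep -c 'ROOT_ITEM')
  echo "check will have to walk roughly $num_roots roots"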

Thanks,
Su

>>
>> 2) non lowmem is more efficient obviously when it doesn't completely
>> crash your machine, but could lowmem be given an amount of memory to use
>> for caching, or maybe use some heuristics based on RAM free so that it's
>> not so excruciatingly slow?
> 
> IIRC recent commit has added the ability.
> a5ce5d219822 ("btrfs-progs: extent-cache: actually cache extent buffers")
> 
> That's already included in btrfs-progs v4.13.2.
> So it should be a dead loop which lowmem repair code can't handle.
> 
> Thanks,
> Qu
> 
>>
>> Thanks,
>> Marc
>>
> 




* Re: So, does btrfs check lowmem take days? weeks?
  2018-06-29  5:35   ` Su Yue
@ 2018-06-29  5:46     ` Marc MERLIN
  0 siblings, 0 replies; 72+ messages in thread
From: Marc MERLIN @ 2018-06-29  5:46 UTC (permalink / raw)
  To: Su Yue; +Cc: Qu Wenruo, linux-btrfs

On Fri, Jun 29, 2018 at 01:35:06PM +0800, Su Yue wrote:
> > It's hard to estimate, especially when every cross check involves a lot
> > of disk IO.
> > 
> > But at least, we could add such indicator to show we're doing something.
> Maybe we can account all roots in the root tree first, then report
> i/num_roots while checking each tree. So users can see whether the check
> is doing something meaningful or is stuck in a silly dead loop.

Sounds reasonable.
Do you want to submit something in git master for btrfs-progs, I pull
it, and just my btrfs check again?

In the meantime, how sane does the output I just posted, look?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08


* Re: So, does btrfs check lowmem take days? weeks?
  2018-06-29  5:28   ` Marc MERLIN
@ 2018-06-29  5:48     ` Qu Wenruo
  2018-06-29  6:06       ` Marc MERLIN
  2018-06-29  6:02     ` Su Yue
  1 sibling, 1 reply; 72+ messages in thread
From: Qu Wenruo @ 2018-06-29  5:48 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: linux-btrfs





On 2018-06-29 13:28, Marc MERLIN wrote:
> On Fri, Jun 29, 2018 at 01:07:20PM +0800, Qu Wenruo wrote:
>>> lowmem repair seems to be going still, but it's been days and -p seems
>>> to do absolutely nothing.
>>
>> I'm afraid you hit a bug in the lowmem repair code.
>> By all means, --repair shouldn't really be used unless you're pretty
>> sure the problem is something btrfs check can handle.
>>
>> That's also why --repair is still marked as dangerous.
>> Especially when it's combined with experimental lowmem mode.
> 
> Understood, but btrfs got corrupted (by itself or not, I don't know)
> I cannot mount the filesystem read/write
> I cannot btrfs check --repair it since that code will kill my machine
> What do I have left?

Just normal btrfs check, and post the output.
If normal check eats up all your memory, btrfs check --mode=lowmem.

--repair should be considered the last resort.
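
In other words, the order to try is roughly (the first two are read-only
and won't change anything on disk):

  btrfs check /dev/mapper/dshelf2                  # original mode, read-only
  btrfs check --mode=lowmem /dev/mapper/dshelf2    # low-memory mode, read-only
  btrfs check --repair /dev/mapper/dshelf2         # last resort, writes to disk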

> 
>>> My filesystem is "only" 10TB or so, albeit with a lot of files.
>>
>> Unless you have tons of snapshots and reflinked (deduped) files, it
>> shouldn't take so long.
> 
> I may have a fair amount.
> gargamel:~# btrfs check --mode=lowmem --repair -p /dev/mapper/dshelf2 
> enabling repair mode
> WARNING: low-memory mode repair support is only partial
> Checking filesystem on /dev/mapper/dshelf2
> UUID: 0f1a0c9f-4e54-4fa7-8736-fd50818ff73d
> Fixed 0 roots.
> ERROR: extent[84302495744, 69632] referencer count mismatch (root: 21872, owner: 374857, offset: 3407872) wanted: 3, have: 4
> Created new chunk [18457780224000 1073741824]
> Delete backref in extent [84302495744 69632]
> ERROR: extent[84302495744, 69632] referencer count mismatch (root: 22911, owner: 374857, offset: 3407872) wanted: 3, have: 4
> Delete backref in extent [84302495744 69632]
> ERROR: extent[125712527360, 12214272] referencer count mismatch (root: 21872, owner: 374857, offset: 114540544) wanted: 181, have: 240
> Delete backref in extent [125712527360 12214272]
> ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 21872, owner: 374857, offset: 126754816) wanted: 68, have: 115
> Delete backref in extent [125730848768 5111808]
> ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 22911, owner: 374857, offset: 126754816) wanted: 68, have: 115
> Delete backref in extent [125730848768 5111808]
> ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 21872, owner: 374857, offset: 131866624) wanted: 115, have: 143
> Delete backref in extent [125736914944 6037504]
> ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 22911, owner: 374857, offset: 131866624) wanted: 115, have: 143
> Delete backref in extent [125736914944 6037504]
> ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 21872, owner: 374857, offset: 148234240) wanted: 302, have: 431
> Delete backref in extent [129952120832 20242432]
> ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 22911, owner: 374857, offset: 148234240) wanted: 356, have: 433
> Delete backref in extent [129952120832 20242432]
> ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 21872, owner: 374857, offset: 180371456) wanted: 161, have: 240
> Delete backref in extent [134925357056 11829248]
> ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 22911, owner: 374857, offset: 180371456) wanted: 162, have: 240
> Delete backref in extent [134925357056 11829248]
> ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 21872, owner: 374857, offset: 192200704) wanted: 170, have: 249
> Delete backref in extent [147895111680 12345344]
> ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 22911, owner: 374857, offset: 192200704) wanted: 172, have: 251
> Delete backref in extent [147895111680 12345344]
> ERROR: extent[150850146304, 17522688] referencer count mismatch (root: 21872, owner: 374857, offset: 217653248) wanted: 348, have: 418
> Delete backref in extent [150850146304 17522688]
> ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 22911, owner: 374857, offset: 235175936) wanted: 555, have: 1449
> Deleted root 2 item[156909494272, 178, 5476627808561673095]
> ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 21872, owner: 374857, offset: 235175936) wanted: 556, have: 1452
> Deleted root 2 item[156909494272, 178, 7338474132555182983]
> ERROR: file extent[374857 235184128] root 21872 owner 21872 backref lost
> Add one extent data backref [156909494272 55320576]
> ERROR: file extent[374857 235184128] root 22911 owner 22911 backref lost
> Add one extent data backref [156909494272 55320576]
> 
> The last two ERROR lines took over a day to get generated, so I'm not sure if it's still working, but just slowly.

OK, that explains something.

One extent is referenced hundreds of times, no wonder it will take a long time.

Just one tip here, there are really too many snapshots/reflinked files.
It's highly recommended to keep the number of snapshots to a reasonable
number (lower two digits).
Although btrfs snapshot is super fast, it puts a lot of pressure on its
extent tree, so there is no free lunch here.

> For what it's worth non lowmem check used to take 12 to 24H on that filesystem back when it still worked.
> 
>>> 2 things that come to mind
>>> 1) can lowmem have some progress working so that I know if I'm looking
>>> at days, weeks, or even months before it will be done?
>>
>> It's hard to estimate, especially when every cross check involves a lot
>> of disk IO.
>> But at least, we could add such indicator to show we're doing something.
> 
> Yes, anything to show that I should still wait is still good :)
> 
>>> 2) non lowmem is more efficient obviously when it doesn't completely
>>> crash your machine, but could lowmem be given an amount of memory to use
>>> for caching, or maybe use some heuristics based on RAM free so that it's
>>> not so excruciatingly slow?
>>
>> IIRC recent commit has added the ability.
>> a5ce5d219822 ("btrfs-progs: extent-cache: actually cache extent buffers")
>  
> Oh, good.
> 
>> That's already included in btrfs-progs v4.13.2.
>> So it should be a dead loop which lowmem repair code can't handle.
> 
> I see. Is there any reasonably easy way to check on this running process?

GDB attach would be good.
Interrupt and check the inode number if it's checking fs tree.
Check the extent bytenr number if it's checking extent tree.

But considering how many snapshots there are, it's really hard to determine.
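
For example, something like the following grabs a quick backtrace without
stopping the check for more than a moment (a sketch; it assumes gdb is
installed, and it is far more useful if the btrfs binary has debug symbols):

  gdb -p "$(pidof btrfs)" -batch -ex 'bt' -ex 'detach'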

In this case, the super large extent tree is causing a lot of problems,
maybe it's a good idea to allow btrfs check to skip extent tree check?

> 
> Both top and iotop show that it's working, but of course I can't tell if
> it's looping, or not.
> 
> Then again, maybe it already fixed enough that I can mount my filesystem again.

This needs the initial btrfs check report and the kernel messages showing
how it fails to mount.

> 
> But back to the main point, it's sad that after so many years, the
> repair situation is still so suboptimal, especially when it's apparently
> pretty easy for btrfs to get damaged (through its own fault or not, hard
> to say).

Unfortunately, yes.
Especially the extent tree is pretty fragile and hard to repair.

Thanks,
Qu

> 
> Thanks,
> Marc
> 




* Re: So, does btrfs check lowmem take days? weeks?
  2018-06-29  5:28   ` Marc MERLIN
  2018-06-29  5:48     ` Qu Wenruo
@ 2018-06-29  6:02     ` Su Yue
  2018-06-29  6:10       ` Marc MERLIN
  1 sibling, 1 reply; 72+ messages in thread
From: Su Yue @ 2018-06-29  6:02 UTC (permalink / raw)
  To: Marc MERLIN, Qu Wenruo; +Cc: linux-btrfs



On 06/29/2018 01:28 PM, Marc MERLIN wrote:
> On Fri, Jun 29, 2018 at 01:07:20PM +0800, Qu Wenruo wrote:
>>> lowmem repair seems to be going still, but it's been days and -p seems
>>> to do absolutely nothing.
>>
>> I'm afraid you hit a bug in the lowmem repair code.
>> By all means, --repair shouldn't really be used unless you're pretty
>> sure the problem is something btrfs check can handle.
>>
>> That's also why --repair is still marked as dangerous.
>> Especially when it's combined with experimental lowmem mode.
> 
> Understood, but btrfs got corrupted (by itself or not, I don't know)
> I cannot mount the filesystem read/write
> I cannot btrfs check --repair it since that code will kill my machine
> What do I have left?
> 
>>> My filesystem is "only" 10TB or so, albeit with a lot of files.
>>
>> Unless you have tons of snapshots and reflinked (deduped) files, it
>> shouldn't take so long.
> 
> I may have a fair amount.
> gargamel:~# btrfs check --mode=lowmem --repair -p /dev/mapper/dshelf2
> enabling repair mode
> WARNING: low-memory mode repair support is only partial
> Checking filesystem on /dev/mapper/dshelf2
> UUID: 0f1a0c9f-4e54-4fa7-8736-fd50818ff73d
> Fixed 0 roots.
> ERROR: extent[84302495744, 69632] referencer count mismatch (root: 21872, owner: 374857, offset: 3407872) wanted: 3, have: 4
> Created new chunk [18457780224000 1073741824]
> Delete backref in extent [84302495744 69632]
> ERROR: extent[84302495744, 69632] referencer count mismatch (root: 22911, owner: 374857, offset: 3407872) wanted: 3, have: 4
> Delete backref in extent [84302495744 69632]
> ERROR: extent[125712527360, 12214272] referencer count mismatch (root: 21872, owner: 374857, offset: 114540544) wanted: 181, have: 240
> Delete backref in extent [125712527360 12214272]
> ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 21872, owner: 374857, offset: 126754816) wanted: 68, have: 115
> Delete backref in extent [125730848768 5111808]
> ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 22911, owner: 374857, offset: 126754816) wanted: 68, have: 115
> Delete backref in extent [125730848768 5111808]
> ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 21872, owner: 374857, offset: 131866624) wanted: 115, have: 143
> Delete backref in extent [125736914944 6037504]
> ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 22911, owner: 374857, offset: 131866624) wanted: 115, have: 143
> Delete backref in extent [125736914944 6037504]
> ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 21872, owner: 374857, offset: 148234240) wanted: 302, have: 431
> Delete backref in extent [129952120832 20242432]
> ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 22911, owner: 374857, offset: 148234240) wanted: 356, have: 433
> Delete backref in extent [129952120832 20242432]
> ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 21872, owner: 374857, offset: 180371456) wanted: 161, have: 240
> Delete backref in extent [134925357056 11829248]
> ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 22911, owner: 374857, offset: 180371456) wanted: 162, have: 240
> Delete backref in extent [134925357056 11829248]
> ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 21872, owner: 374857, offset: 192200704) wanted: 170, have: 249
> Delete backref in extent [147895111680 12345344]
> ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 22911, owner: 374857, offset: 192200704) wanted: 172, have: 251
> Delete backref in extent [147895111680 12345344]
> ERROR: extent[150850146304, 17522688] referencer count mismatch (root: 21872, owner: 374857, offset: 217653248) wanted: 348, have: 418
> Delete backref in extent [150850146304 17522688]
> ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 22911, owner: 374857, offset: 235175936) wanted: 555, have: 1449
> Deleted root 2 item[156909494272, 178, 5476627808561673095]
> ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 21872, owner: 374857, offset: 235175936) wanted: 556, have: 1452
> Deleted root 2 item[156909494272, 178, 7338474132555182983]
> ERROR: file extent[374857 235184128] root 21872 owner 21872 backref lost
> Add one extent data backref [156909494272 55320576]
> ERROR: file extent[374857 235184128] root 22911 owner 22911 backref lost
> Add one extent data backref [156909494272 55320576]
> 
My bad.
It's very possibly a bug in the extent part of the lowmem check, which
was reported by Chris too.
The extent check was wrong, so the repair did the wrong things.

I have figured out that the bug is that lowmem check can't deal with shared
tree blocks in a reloc tree. The fix is simple; you can try the following repo:

https://github.com/Damenly/btrfs-progs/tree/tmp1

Please run lowmem check without "--repair" first to be sure whether
your filesystem is fine.
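
Something along these lines should get you a binary from that branch (a
sketch, assuming the usual btrfs-progs build dependencies are installed;
exact configure options may differ on your setup):

  git clone -b tmp1 https://github.com/Damenly/btrfs-progs.git
  cd btrfs-progs
  ./autogen.sh && ./configure --disable-documentation && make
  ./btrfs check --mode=lowmem /dev/mapper/dshelf2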

Though the bug and phenomenon are clear enough, before sending my patch
I have to make a test image. I have spent a week studying btrfs balance,
but it seems a little hard for me.

Thanks,
Su

> The last two ERROR lines took over a day to get generated, so I'm not sure if it's still working, but just slowly.
> For what it's worth non lowmem check used to take 12 to 24H on that filesystem back when it still worked.
> 
>>> 2 things that come to mind
>>> 1) can lowmem have some progress working so that I know if I'm looking
>>> at days, weeks, or even months before it will be done?
>>
>> It's hard to estimate, especially when every cross check involves a lot
>> of disk IO.
>> But at least, we could add such indicator to show we're doing something.
> 
> Yes, anything to show that I should still wait is still good :)
> 
>>> 2) non lowmem is more efficient obviously when it doesn't completely
>>> crash your machine, but could lowmem be given an amount of memory to use
>>> for caching, or maybe use some heuristics based on RAM free so that it's
>>> not so excruciatingly slow?
>>
>> IIRC recent commit has added the ability.
>> a5ce5d219822 ("btrfs-progs: extent-cache: actually cache extent buffers")
>   
> Oh, good.
> 
>> That's already included in btrfs-progs v4.13.2.
>> So it should be a dead loop which lowmem repair code can't handle.
> 
> I see. Is there any reasonably easy way to check on this running process?
> 
> Both top and iotop show that it's working, but of course I can't tell if
> it's looping, or not.
> 
> Then again, maybe it already fixed enough that I can mount my filesystem again.
> 
> But back to the main point, it's sad that after so many years, the
> repair situation is still so suboptimal, especially when it's apparently
> pretty easy for btrfs to get damaged (through its own fault or not, hard
> to say).
> 
> Thanks,
> Marc
> 




* Re: So, does btrfs check lowmem take days? weeks?
  2018-06-29  5:48     ` Qu Wenruo
@ 2018-06-29  6:06       ` Marc MERLIN
  2018-06-29  6:29         ` Qu Wenruo
  0 siblings, 1 reply; 72+ messages in thread
From: Marc MERLIN @ 2018-06-29  6:06 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Fri, Jun 29, 2018 at 01:48:17PM +0800, Qu Wenruo wrote:
> Just normal btrfs check, and post the output.
> If normal check eats up all your memory, btrfs check --mode=lowmem.
 
Does check without --repair eat less RAM?

> --repair should be considered the last resort.

If --repair doesn't work, check is useless to me sadly. I know that for
FS analysis and bug reporting, you want to have the FS without changing
it to something maybe worse, but for my use, if it can't be mounted and
can't be fixed, then it gets deleted which is even worse than check
doing the wrong thing.

> > The last two ERROR lines took over a day to get generated, so I'm not sure if it's still working, but just slowly.
> 
> OK, that explains something.
> 
> One extent is referenced hundreds of times, no wonder it will take a long time.
> 
> Just one tip here, there are really too many snapshots/reflinked files.
> It's highly recommended to keep the number of snapshots to a reasonable
> number (lower two digits).
> Although btrfs snapshot is super fast, it puts a lot of pressure on its
> extent tree, so there is no free lunch here.
 
Agreed, I doubt I have over or much over 100 snapshots though (but I
can't check right now).
Sadly I'm not allowed to mount even read only while check is running:
gargamel:~# mount -o ro /dev/mapper/dshelf2 /mnt/mnt2
mount: /dev/mapper/dshelf2 already mounted or /mnt/mnt2 busy

> > I see. Is there any reasonably easy way to check on this running process?
> 
> GDB attach would be good.
> Interrupt and check the inode number if it's checking fs tree.
> Check the extent bytenr number if it's checking extent tree.
> 
> But considering how many snapshots there are, it's really hard to determine.
> 
> In this case, the super large extent tree is causing a lot of problems,
> maybe it's a good idea to allow btrfs check to skip extent tree check?

I only see --init-extent-tree in the man page, which option did you have
in mind?

> > Then again, maybe it already fixed enough that I can mount my filesystem again.
> 
> This needs the initial btrfs check report and the kernel messages showing
> how it fails to mount.

mount command hangs, kernel does not show anything special outside of disk access hanging.

Jun 23 17:23:26 gargamel kernel: [  341.802696] BTRFS warning (device dm-2): 'recovery' is deprecated, use 'usebackuproot' instead
Jun 23 17:23:26 gargamel kernel: [  341.828743] BTRFS info (device dm-2): trying to use backup root at mount time
Jun 23 17:23:26 gargamel kernel: [  341.850180] BTRFS info (device dm-2): disk space caching is enabled
Jun 23 17:23:26 gargamel kernel: [  341.869014] BTRFS info (device dm-2): has skinny extents
Jun 23 17:23:26 gargamel kernel: [  342.206289] BTRFS info (device dm-2): bdev /dev/mapper/dshelf2 errs: wr 0, rd 0 , flush 0, corrupt 2, gen 0
Jun 23 17:26:26 gargamel kernel: [  521.571392] BTRFS info (device dm-2): enabling ssd optimizations
Jun 23 17:55:58 gargamel kernel: [ 2293.914867] perf: interrupt took too long (2507 > 2500), lowering kernel.perf_event_max_sample_rate to 79750
Jun 23 17:56:22 gargamel kernel: [ 2317.718406] BTRFS info (device dm-2): disk space caching is enabled
Jun 23 17:56:22 gargamel kernel: [ 2317.737277] BTRFS info (device dm-2): has skinny extents
Jun 23 17:56:22 gargamel kernel: [ 2318.069461] BTRFS info (device dm-2): bdev /dev/mapper/dshelf2 errs: wr 0, rd 0 , flush 0, corrupt 2, gen 0
Jun 23 17:59:22 gargamel kernel: [ 2498.256167] BTRFS info (device dm-2): enabling ssd optimizations
Jun 23 18:05:23 gargamel kernel: [ 2859.107057] BTRFS info (device dm-2): disk space caching is enabled
Jun 23 18:05:23 gargamel kernel: [ 2859.125883] BTRFS info (device dm-2): has skinny extents
Jun 23 18:05:24 gargamel kernel: [ 2859.448018] BTRFS info (device dm-2): bdev /dev/mapper/dshelf2 errs: wr 0, rd 0 , flush 0, corrupt 2, gen 0
Jun 23 18:08:23 gargamel kernel: [ 3039.023305] BTRFS info (device dm-2): enabling ssd optimizations
Jun 23 18:13:41 gargamel kernel: [ 3356.626037] perf: interrupt took too long (3143 > 3133), lowering kernel.perf_event_max_sample_rate to 63500
Jun 23 18:17:23 gargamel kernel: [ 3578.937225] Process accounting resumed
Jun 23 18:33:47 gargamel kernel: [ 4563.356252] JFS: nTxBlock = 8192, nTxLock = 65536
Jun 23 18:33:48 gargamel kernel: [ 4563.446715] ntfs: driver 2.1.32 [Flags: R/W MODULE].
Jun 23 18:42:20 gargamel kernel: [ 5075.995254] INFO: task sync:20253 blocked for more than 120 seconds.
Jun 23 18:42:20 gargamel kernel: [ 5076.015729]       Not tainted 4.17.2-amd64-preempt-sysrq-20180817 #1
Jun 23 18:42:20 gargamel kernel: [ 5076.036141] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 23 18:42:20 gargamel kernel: [ 5076.060637] sync            D    0 20253  15327 0x20020080
Jun 23 18:42:20 gargamel kernel: [ 5076.078032] Call Trace:
Jun 23 18:42:20 gargamel kernel: [ 5076.086366]  ? __schedule+0x53e/0x59b
Jun 23 18:42:20 gargamel kernel: [ 5076.098311]  schedule+0x7f/0x98
Jun 23 18:42:20 gargamel kernel: [ 5076.108665]  __rwsem_down_read_failed_common+0x127/0x1a8
Jun 23 18:42:20 gargamel kernel: [ 5076.125565]  ? sync_fs_one_sb+0x20/0x20
Jun 23 18:42:20 gargamel kernel: [ 5076.137982]  ? call_rwsem_down_read_failed+0x14/0x30
Jun 23 18:42:20 gargamel kernel: [ 5076.154081]  call_rwsem_down_read_failed+0x14/0x30
Jun 23 18:42:20 gargamel kernel: [ 5076.169429]  down_read+0x13/0x25
Jun 23 18:42:20 gargamel kernel: [ 5076.180444]  iterate_supers+0x57/0xbe
Jun 23 18:42:20 gargamel kernel: [ 5076.192619]  ksys_sync+0x40/0xa4
Jun 23 18:42:20 gargamel kernel: [ 5076.203192]  __ia32_sys_sync+0xa/0xd
Jun 23 18:42:20 gargamel kernel: [ 5076.214774]  do_fast_syscall_32+0xaf/0xf3
Jun 23 18:42:20 gargamel kernel: [ 5076.227740]  entry_SYSENTER_compat+0x7f/0x91
Jun 23 18:44:21 gargamel kernel: [ 5196.828764] INFO: task sync:20253 blocked for more than 120 seconds.
Jun 23 18:44:21 gargamel kernel: [ 5196.848724]       Not tainted 4.17.2-amd64-preempt-sysrq-20180817 #1
Jun 23 18:44:21 gargamel kernel: [ 5196.868789] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 23 18:44:21 gargamel kernel: [ 5196.893615] sync            D    0 20253  15327 0x20020080

> > But back to the main point, it's sad that after so many years, the
> > repair situation is still so suboptimal, especially when it's apparently
> > pretty easy for btrfs to get damaged (through its own fault or not, hard
> > to say).
> 
> Unfortunately, yes.
> Especially the extent tree is pretty fragile and hard to repair.

So, I don't know the code, but if I may make a suggestion (which maybe
is totally wrong, if so forgive me):
I would love a repair mode that gives me back a fixed
filesystem. I don't really care how much data is lost (although ideally
it would give me a list of files lost), but I want a working filesystem
at the end. I can then decide if there is enough data left on it to
restore what's missing or if I'm better off starting from scratch.

Is that possible at all?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08


* Re: So, does btrfs check lowmem take days? weeks?
  2018-06-29  6:02     ` Su Yue
@ 2018-06-29  6:10       ` Marc MERLIN
  2018-06-29  6:32         ` Su Yue
  0 siblings, 1 reply; 72+ messages in thread
From: Marc MERLIN @ 2018-06-29  6:10 UTC (permalink / raw)
  To: Su Yue; +Cc: Qu Wenruo, linux-btrfs

On Fri, Jun 29, 2018 at 02:02:19PM +0800, Su Yue wrote:
> I have figured out that the bug is that lowmem check can't deal with shared tree blocks
> in a reloc tree. The fix is simple; you can try the following repo:
> 
> https://github.com/Damenly/btrfs-progs/tree/tmp1

Not sure I understand what you meant here.

> Please run lowmem check without "--repair" first to be sure whether
> your filesystem is fine.
 
The filesystem is not fine, it caused btrfs balance to hang, whether
balance actually broke it further or caused the breakage, I can't say.

Then mount hangs, even with recovery, unless I use ro.

This filesystem is trash to me and will require over a week to rebuild
manually if I can't repair it.
Running check without repair for likely several days just to know that
my filesystem is not clean (I already know this) isn't useful :)
Or am I missing something?

> Though the bug and phenomenon are clear enough, before sending my patch
> I have to make a test image. I have spent a week studying btrfs balance,
> but it seems a little hard for me.

thanks for having a look, either way.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08


* Re: So, does btrfs check lowmem take days? weeks?
  2018-06-29  6:06       ` Marc MERLIN
@ 2018-06-29  6:29         ` Qu Wenruo
  2018-06-29  6:59           ` Marc MERLIN
  0 siblings, 1 reply; 72+ messages in thread
From: Qu Wenruo @ 2018-06-29  6:29 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: linux-btrfs





On 2018-06-29 14:06, Marc MERLIN wrote:
> On Fri, Jun 29, 2018 at 01:48:17PM +0800, Qu Wenruo wrote:
>> Just normal btrfs check, and post the output.
>> If normal check eats up all your memory, btrfs check --mode=lowmem.
>  
> Does check without --repair eat less RAM?

Unfortunately, no.

> 
>> --repair should be considered the last resort.
> 
> If --repair doesn't work, check is useless to me sadly.

Not exactly.
Although it's time consuming, I have manually patched several users' filesystems,
which normally ends pretty well.

If it's not a wide-spread problem but some small fatal one, it may be fixed.

> I know that for
> FS analysis and bug reporting, you want to have the FS without changing
> it to something maybe worse, but for my use, if it can't be mounted and
> can't be fixed, then it gets deleted which is even worse than check
> doing the wrong thing.
> 
>>> The last two ERROR lines took over a day to get generated, so I'm not sure if it's still working, but just slowly.
>>
>> OK, that explains something.
>>
>> One extent is referenced hundreds of times, no wonder it will take a long time.
>>
>> Just one tip here, there are really too many snapshots/reflinked files.
>> It's highly recommended to keep the number of snapshots to a reasonable
>> number (lower two digits).
>> Although btrfs snapshot is super fast, it puts a lot of pressure on its
>> extent tree, so there is no free lunch here.
>  
> Agreed, I doubt I have over or much over 100 snapshots though (but I
> can't check right now).
> Sadly I'm not allowed to mount even read only while check is running:
> gargamel:~# mount -o ro /dev/mapper/dshelf2 /mnt/mnt2
> mount: /dev/mapper/dshelf2 already mounted or /mnt/mnt2 busy
> 
>>> I see. Is there any reasonably easy way to check on this running process?
>>
>> GDB attach would be good.
>> Interrupt and check the inode number if it's checking fs tree.
>> Check the extent bytenr number if it's checking extent tree.
>>
>> But considering how many snapshots there are, it's really hard to determine.
>>
>> In this case, the super large extent tree is causing a lot of problems,
>> maybe it's a good idea to allow btrfs check to skip extent tree check?
> 
> I only see --init-extent-tree in the man page, which option did you have
> in mind?

That feature is just in my mind, not even implemented yet.

> 
>>> Then again, maybe it already fixed enough that I can mount my filesystem again.
>>
>> This needs the initial btrfs check report and the kernel messages showing
>> how it fails to mount.
> 
> mount command hangs, kernel does not show anything special outside of disk access hanging.
> 
> Jun 23 17:23:26 gargamel kernel: [  341.802696] BTRFS warning (device dm-2): 'recovery' is deprecated, use 'usebackuproot' instead
> Jun 23 17:23:26 gargamel kernel: [  341.828743] BTRFS info (device dm-2): trying to use backup root at mount time
> Jun 23 17:23:26 gargamel kernel: [  341.850180] BTRFS info (device dm-2): disk space caching is enabled
> Jun 23 17:23:26 gargamel kernel: [  341.869014] BTRFS info (device dm-2): has skinny extents
> Jun 23 17:23:26 gargamel kernel: [  342.206289] BTRFS info (device dm-2): bdev /dev/mapper/dshelf2 errs: wr 0, rd 0 , flush 0, corrupt 2, gen 0
> Jun 23 17:26:26 gargamel kernel: [  521.571392] BTRFS info (device dm-2): enabling ssd optimizations
> Jun 23 17:55:58 gargamel kernel: [ 2293.914867] perf: interrupt took too long (2507 > 2500), lowering kernel.perf_event_max_sample_rate to 79750
> Jun 23 17:56:22 gargamel kernel: [ 2317.718406] BTRFS info (device dm-2): disk space caching is enabled
> Jun 23 17:56:22 gargamel kernel: [ 2317.737277] BTRFS info (device dm-2): has skinny extents
> Jun 23 17:56:22 gargamel kernel: [ 2318.069461] BTRFS info (device dm-2): bdev /dev/mapper/dshelf2 errs: wr 0, rd 0 , flush 0, corrupt 2, gen 0
> Jun 23 17:59:22 gargamel kernel: [ 2498.256167] BTRFS info (device dm-2): enabling ssd optimizations
> Jun 23 18:05:23 gargamel kernel: [ 2859.107057] BTRFS info (device dm-2): disk space caching is enabled
> Jun 23 18:05:23 gargamel kernel: [ 2859.125883] BTRFS info (device dm-2): has skinny extents
> Jun 23 18:05:24 gargamel kernel: [ 2859.448018] BTRFS info (device dm-2): bdev /dev/mapper/dshelf2 errs: wr 0, rd 0 , flush 0, corrupt 2, gen 0

This looks like super block corruption?

What about "btrfs inspect dump-super -fFa /dev/mapper/dshelf2"?

And what about "skip_balance" mount option?

Another problem is, with so many snapshots, balance is also hugely
slowed, thus I'm not 100% sure if it's really a hang.

> Jun 23 18:08:23 gargamel kernel: [ 3039.023305] BTRFS info (device dm-2): enabling ssd optimizations
> Jun 23 18:13:41 gargamel kernel: [ 3356.626037] perf: interrupt took too long (3143 > 3133), lowering kernel.perf_event_max_sample_rate to 63500
> Jun 23 18:17:23 gargamel kernel: [ 3578.937225] Process accounting resumed
> Jun 23 18:33:47 gargamel kernel: [ 4563.356252] JFS: nTxBlock = 8192, nTxLock = 65536
> Jun 23 18:33:48 gargamel kernel: [ 4563.446715] ntfs: driver 2.1.32 [Flags: R/W MODULE].
> Jun 23 18:42:20 gargamel kernel: [ 5075.995254] INFO: task sync:20253 blocked for more than 120 seconds.
> Jun 23 18:42:20 gargamel kernel: [ 5076.015729]       Not tainted 4.17.2-amd64-preempt-sysrq-20180817 #1
> Jun 23 18:42:20 gargamel kernel: [ 5076.036141] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Jun 23 18:42:20 gargamel kernel: [ 5076.060637] sync            D    0 20253  15327 0x20020080
> Jun 23 18:42:20 gargamel kernel: [ 5076.078032] Call Trace:
> Jun 23 18:42:20 gargamel kernel: [ 5076.086366]  ? __schedule+0x53e/0x59b
> Jun 23 18:42:20 gargamel kernel: [ 5076.098311]  schedule+0x7f/0x98
> Jun 23 18:42:20 gargamel kernel: [ 5076.108665]  __rwsem_down_read_failed_common+0x127/0x1a8
> Jun 23 18:42:20 gargamel kernel: [ 5076.125565]  ? sync_fs_one_sb+0x20/0x20
> Jun 23 18:42:20 gargamel kernel: [ 5076.137982]  ? call_rwsem_down_read_failed+0x14/0x30
> Jun 23 18:42:20 gargamel kernel: [ 5076.154081]  call_rwsem_down_read_failed+0x14/0x30
> Jun 23 18:42:20 gargamel kernel: [ 5076.169429]  down_read+0x13/0x25
> Jun 23 18:42:20 gargamel kernel: [ 5076.180444]  iterate_supers+0x57/0xbe
> Jun 23 18:42:20 gargamel kernel: [ 5076.192619]  ksys_sync+0x40/0xa4
> Jun 23 18:42:20 gargamel kernel: [ 5076.203192]  __ia32_sys_sync+0xa/0xd
> Jun 23 18:42:20 gargamel kernel: [ 5076.214774]  do_fast_syscall_32+0xaf/0xf3
> Jun 23 18:42:20 gargamel kernel: [ 5076.227740]  entry_SYSENTER_compat+0x7f/0x91
> Jun 23 18:44:21 gargamel kernel: [ 5196.828764] INFO: task sync:20253 blocked for more than 120 seconds.
> Jun 23 18:44:21 gargamel kernel: [ 5196.848724]       Not tainted 4.17.2-amd64-preempt-sysrq-20180817 #1
> Jun 23 18:44:21 gargamel kernel: [ 5196.868789] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Jun 23 18:44:21 gargamel kernel: [ 5196.893615] sync            D    0 20253  15327 0x20020080
> 
>>> But back to the main point, it's sad that after so many years, the
>>> repair situation is still so suboptimal, especially when it's apparently
>>> pretty easy for btrfs to get damaged (through its own fault or not, hard
>>> to say).
>>
>> Unfortunately, yes.
>> Especially the extent tree is pretty fragile and hard to repair.
> 
> So, I don't know the code, but if I may make a suggestion (which maybe
> is totally wrong, if so forgive me):
> I would love a repair mode that gives me back a fixed
> filesystem. I don't really care how much data is lost (although ideally
> it would give me a list of files lost), but I want a working filesystem
> at the end. I can then decide if there is enough data left on it to
> restore what's missing or if I'm better off starting from scratch.

For that usage, btrfs restore would fit your use case better.
Unfortunately it needs extra disk space and isn't good at restoring
subvolumes/snapshots.
(Although it's much faster than repairing the possibly corrupted extent
tree.)
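
For reference, the basic invocation looks something like this (only a
sketch; the target directory must be on a different filesystem with enough
free space, and btrfs-restore(8) documents the options that also grab
snapshots, xattrs and symlinks):

  mkdir -p /mnt/rescue
  btrfs restore /dev/mapper/dshelf2 /mnt/rescue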

> 
> Is that possible at all?

At least for file recovery (fs tree repair), we have such behavior.

However, the problem you hit (and a lot of users hit) is all about
extent tree repair, which doesn't even get to file recovery.

All the hassle is in the extent tree, and for the extent tree, it's just good
or bad. Any corruption in the extent tree may lead to later bugs.
The only way to avoid extent tree problems is to mount the fs RO.

So, I'm afraid it won't happen for at least the next few years.

Thanks,
Qu

> 
> Thanks,
> Marc
> 




* Re: So, does btrfs check lowmem take days? weeks?
  2018-06-29  6:10       ` Marc MERLIN
@ 2018-06-29  6:32         ` Su Yue
  2018-06-29  6:43           ` Marc MERLIN
  0 siblings, 1 reply; 72+ messages in thread
From: Su Yue @ 2018-06-29  6:32 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: Qu Wenruo, linux-btrfs



On 06/29/2018 02:10 PM, Marc MERLIN wrote:
> On Fri, Jun 29, 2018 at 02:02:19PM +0800, Su Yue wrote:
>> I have figured out that the bug is that lowmem check can't deal with shared tree blocks
>> in a reloc tree. The fix is simple; you can try the following repo:
>>
>> https://github.com/Damenly/btrfs-progs/tree/tmp1
> 
> Not sure I understand what you meant here.
> 
Sorry for my unclear words.
Simply speaking, I suggest you stop the currently running check.
Then clone the branch above, compile the binary, and run
'btrfs check --mode=lowmem $dev'.

>> Please run lowmem check without "--repair" first to be sure whether
>> your filesystem is fine.
>   
> The filesystem is not fine, it caused btrfs balance to hang, whether
> balance actually broke it further or caused the breakage, I can't say.
> 
> Then mount hangs, even with recovery, unless I use ro.
> 
> This filesystem is trash to me and will require over a week to rebuild
> manually if I can't repair it.

I understand your anxiety; a log of check without '--repair' will help
us figure out what's wrong with your filesystem.

Thanks,
Su
> Running check without repair for likely several days just to know that
> my filesystem is not clean (I already know this) isn't useful :)
> Or am I missing something?
> 
>> Though the bug and phenomenon are clear enough, before sending my patch,
>> I have to make a test image. I have spent a week to study btrfs balance
>> but it seems a liitle hard for me.
> 
> thanks for having a look, either way.
> 
> Marc
> 




* Re: So, does btrfs check lowmem take days? weeks?
  2018-06-29  6:32         ` Su Yue
@ 2018-06-29  6:43           ` Marc MERLIN
  2018-07-01 23:22             ` Marc MERLIN
  0 siblings, 1 reply; 72+ messages in thread
From: Marc MERLIN @ 2018-06-29  6:43 UTC (permalink / raw)
  To: Su Yue; +Cc: Qu Wenruo, linux-btrfs

On Fri, Jun 29, 2018 at 02:32:44PM +0800, Su Yue wrote:
> > > https://github.com/Damenly/btrfs-progs/tree/tmp1
> > 
> > Not sure if I undertand that you meant, here.
> > 
> Sorry for my unclear words.
> Simply speaking, I suggest you stop the currently running check.
> Then clone the branch above, compile the binary, and run
> 'btrfs check --mode=lowmem $dev'.
 
I understand, I'll build and try it.

> > This filesystem is trash to me and will require over a week to rebuild
> > manually if I can't repair it.
> 
> I understand your anxiety; a log of check without '--repair' will help
> us figure out what's wrong with your filesystem.

Ok, I'll run your new code without repair and report back. It will
likely take over a day though.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08


* Re: So, does btrfs check lowmem take days? weeks?
  2018-06-29  6:29         ` Qu Wenruo
@ 2018-06-29  6:59           ` Marc MERLIN
  2018-06-29  7:09             ` Roman Mamedov
  2018-06-29  7:20             ` So, does btrfs check lowmem take days? weeks? Qu Wenruo
  0 siblings, 2 replies; 72+ messages in thread
From: Marc MERLIN @ 2018-06-29  6:59 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Fri, Jun 29, 2018 at 02:29:10PM +0800, Qu Wenruo wrote:
> > If --repair doesn't work, check is useless to me sadly.
> 
> Not exactly.
> Although it's time consuming, I have manually patched several users' filesystems,
> which normally ends pretty well.
 
Ok I understand now.

> > Agreed, I doubt I have over or much over 100 snapshots though (but I
> > can't check right now).
> > Sadly I'm not allowed to mount even read only while check is running:
> > gargamel:~# mount -o ro /dev/mapper/dshelf2 /mnt/mnt2
> > mount: /dev/mapper/dshelf2 already mounted or /mnt/mnt2 busy

Ok, so I just checked now, 270 snapshots, but not because I'm crazy,
because I use btrfs send a lot :)

> This looks like super block corruption?
> 
> What about "btrfs inspect dump-super -fFa /dev/mapper/dshelf2"?

Sure, there you go: https://pastebin.com/uF1pHTsg

> And what about "skip_balance" mount option?
 
I have this in my fstab :)

> Another problem is, with so many snapshots, balance is also hugely
> slowed, thus I'm not 100% sure if it's really a hang.

I sent another thread about this last week, balance got hung after 2
days of doing nothing and just moving a single chunk.

Ok, I was able to remount the filesystem read only. I was wrong, I have
270 snapshots:
gargamel:/mnt/mnt# btrfs subvolume list . | grep -c 'path backup/'
74
gargamel:/mnt/mnt# btrfs subvolume list . | grep -c 'path backup-btrfssend/'
196

It's a backup server, I use btrfs send for many machines, and for each btrfs
send I keep history, maybe 10 or so backups. So it adds up in the end.

Is btrfs unable to deal with this well enough?

> For that usage, btrfs restore would fit your use case better.
> Unfortunately it needs extra disk space and isn't good at restoring
> subvolumes/snapshots.
> (Although it's much faster than repairing the possibly corrupted extent
> tree.)

It's a backup server, it only contains data from other machines.
If the filesystem cannot be recovered to a working state, I will need
over a week to restart the many btrfs send commands from many servers.
This is why anything other than --repair is useless to me, I don't need
the data back, it's still on the original machines, I need the
filesystem to work again so that I don't waste a week recreating the
many btrfs send/receive relationships.

> > Is that possible at all?
> 
> At least for file recovery (fs tree repair), we have such behavior.
> 
> However, the problem you hit (and a lot of users hit) is all about
> extent tree repair, which doesn't even get to file recovery.
> 
> All the hassle is in the extent tree, and for the extent tree, it's just good
> or bad. Any corruption in the extent tree may lead to later bugs.
> The only way to avoid extent tree problems is to mount the fs RO.
> 
> So, I'm afraid it won't happen for at least the next few years.

Understood, thanks for answering.

Does the pastebin help and is 270 snapshots ok enough?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08


* Re: So, does btrfs check lowmem take days? weeks?
  2018-06-29  6:59           ` Marc MERLIN
@ 2018-06-29  7:09             ` Roman Mamedov
  2018-06-29  7:22               ` Marc MERLIN
  2018-06-29  7:20             ` So, does btrfs check lowmem take days? weeks? Qu Wenruo
  1 sibling, 1 reply; 72+ messages in thread
From: Roman Mamedov @ 2018-06-29  7:09 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: linux-btrfs

On Thu, 28 Jun 2018 23:59:03 -0700
Marc MERLIN <marc@merlins.org> wrote:

> I don't waste a week recreating the many btrfs send/receive relationships.

Consider not using send/receive, and switching to regular rsync instead.
Send/receive is very limiting and cumbersome, including because of what you
described. And it doesn't gain you much over an incremental rsync. As for
snapshots on the backup server, you can either automate making one as soon as a
backup has finished, or simply make them once/twice a day, during a period
when no backups are ongoing.
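
Something like the following per client, plus a read-only snapshot on the
backup server afterwards, would cover it (a sketch only; it assumes
/backup/client/current is itself a btrfs subvolume, and the paths, hostnames
and retention are of course up to you):

  rsync -aHAX --delete client:/home/ /backup/client/current/
  btrfs subvolume snapshot -r /backup/client/current \
        /backup/client/$(date +%Y%m%d-%H%M)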

-- 
With respect,
Roman


* Re: So, does btrfs check lowmem take days? weeks?
  2018-06-29  6:59           ` Marc MERLIN
  2018-06-29  7:09             ` Roman Mamedov
@ 2018-06-29  7:20             ` Qu Wenruo
  2018-06-29  7:28               ` Marc MERLIN
  1 sibling, 1 reply; 72+ messages in thread
From: Qu Wenruo @ 2018-06-29  7:20 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: linux-btrfs





On 2018-06-29 14:59, Marc MERLIN wrote:
> On Fri, Jun 29, 2018 at 02:29:10PM +0800, Qu Wenruo wrote:
>>> If --repair doesn't work, check is useless to me sadly.
>>
>> Not exactly.
>> Although it's time consuming, I have manually patched several users fs,
>> which normally ends pretty well.
>  
> Ok I understand now.
> 
>>> Agreed, I doubt I have over or much over 100 snapshots though (but I
>>> can't check right now).
>>> Sadly I'm not allowed to mount even read only while check is running:
>>> gargamel:~# mount -o ro /dev/mapper/dshelf2 /mnt/mnt2
>>> mount: /dev/mapper/dshelf2 already mounted or /mnt/mnt2 busy
> 
> Ok, so I just checked now, 270 snapshots, but not because I'm crazy,
> because I use btrfs send a lot :)
> 
>> This looks like super block corruption?
>>
>> What about "btrfs inspect dump-super -fFa /dev/mapper/dshelf2"?
> 
> Sure, there you go: https://pastebin.com/uF1pHTsg
> 
>> And what about "skip_balance" mount option?
>  
> I have this in my fstab :)
> 
>> Another problem is, with so many snapshots, balance is also hugely
>> slowed, thus I'm not 100% sure if it's really a hang.
> 
> I sent another thread about this last week, balance got hung after 2
> days of doing nothing and just moving a single chunk.
> 
> Ok, I was able to remount the filesystem read only. I was wrong, I have
> 270 snapshots:
> gargamel:/mnt/mnt# btrfs subvolume list . | grep -c 'path backup/'
> 74
> gargamel:/mnt/mnt# btrfs subvolume list . | grep -c 'path backup-btrfssend/'
> 196
> 
> It's a backup server, I use btrfs send for many machines, and for each btrfs
> send I keep history, maybe 10 or so backups. So it adds up in the end.
> 
> Is btrfs unable to deal with this well enough?

It depends.
For certain, rare cases, if the only operations on the filesystem are
non-btrfs-specific ones (plain POSIX file operations), then you're fine.
(Maybe you can go to thousands of snapshots before any obvious performance
degradation.)

If certain btrfs specific operations are involved, it's definitely not OK:
1) Balance
2) Quota
3) Btrfs check
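
If you want a quick, read-only idea of how much of that applies here, something
like this should be enough (the mount point below is just a placeholder):

# btrfs subvolume list -s /mnt | wc -l      <- snapshots only
# btrfs subvolume list /mnt | wc -l         <- all subvolumes
# btrfs qgroup show /mnt                    <- errors out if quota is not enabled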

> 
>> If for that usage, btrfs-restore would fit your use case more,
>> Unfortunately it needs extra disk space and isn't good at restoring
>> subvolume/snapshots.
>> (Although it's much faster than repairing the possible corrupted extent
>> tree)
> 
> It's a backup server, it only contains data from other machines.
> If the filesystem cannot be recovered to a working state, I will need
> over a week to restart the many btrfs send commands from many servers.
> This is why anything other than --repair is useless to me: I don't need
> the data back, it's still on the original machines. I need the
> filesystem to work again so that I don't waste a week recreating the
> many btrfs send/receive relationships.

Now I totally understand why you need to repair the fs.

> 
>>> Is that possible at all?
>>
>> At least for file recovery (fs tree repair), we have such behavior.
>>
>> However, the problem you hit (and a lot of users hit) is all about
>> extent tree repair, which doesn't even get to file recovery.
>>
>> All the hassle is in the extent tree, and for the extent tree, it's just
>> good or bad. Any corruption in the extent tree may lead to later bugs.
>> The only way to avoid extent tree problems is to mount the fs RO.
>>
>> So, I'm afraid that won't be possible for at least the next few years.
> 
> Understood, thanks for answering.
> 
> Does the pastebin help, and is 270 snapshots an acceptable number?

The super dump doesn't show anything wrong.

So the problem may be in the super large extent tree.

In this case, a plain check result with Su's patch would help more,
rather than the not-so-interesting super dump.

Thanks,
Qu

> 
> Thanks,
> Marc
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-06-29  7:09             ` Roman Mamedov
@ 2018-06-29  7:22               ` Marc MERLIN
  2018-06-29  7:34                 ` Roman Mamedov
  2018-06-29  8:04                 ` Lionel Bouton
  0 siblings, 2 replies; 72+ messages in thread
From: Marc MERLIN @ 2018-06-29  7:22 UTC (permalink / raw)
  To: Roman Mamedov; +Cc: linux-btrfs

On Fri, Jun 29, 2018 at 12:09:54PM +0500, Roman Mamedov wrote:
> On Thu, 28 Jun 2018 23:59:03 -0700
> Marc MERLIN <marc@merlins.org> wrote:
> 
> > I don't waste a week recreating the many btrfs send/receive relationships.
> 
> Consider not using send/receive, and switching to regular rsync instead.
> Send/receive is very limiting and cumbersome, including because of what you
> described. And it doesn't gain you much over an incremental rsync. As for

Err, sorry but I cannot agree with you here, at all :)

btrfs send/receive is pretty much the only reason I use btrfs. 
rsync takes hours on big filesystems, scanning every single inode on both
sides and then seeing what changed, and only then sends the differences.
It's super inefficient.
btrfs send knows in seconds what needs to be sent, and works on it right
away.
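
For the curious, the core of it is roughly this (a simplified sketch, not my
actual script; the snapshot names, destination host and paths are made up, and
/home is assumed to be a subvolume):

# on the machine being backed up:
btrfs subvolume snapshot -r /home /home/home_ro.new
btrfs send -p /home/home_ro.prev /home/home_ro.new | \
    ssh backupserver btrfs receive /mnt/pool/backup-btrfssend/host1
# once it succeeds, home_ro.new becomes the parent (-p) of the next incremental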

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-06-29  7:20             ` So, does btrfs check lowmem take days? weeks? Qu Wenruo
@ 2018-06-29  7:28               ` Marc MERLIN
  2018-06-29 17:10                 ` Marc MERLIN
  0 siblings, 1 reply; 72+ messages in thread
From: Marc MERLIN @ 2018-06-29  7:28 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Fri, Jun 29, 2018 at 03:20:42PM +0800, Qu Wenruo wrote:
> If certain btrfs specific operations are involved, it's definitely not OK:
> 1) Balance
> 2) Quota
> 3) Btrfs check

Ok, I understand. I'll try to almost never run balance then. My problems did
indeed start because I ran balance and it got stuck for 2 days with 0
progress.
That still seems like a bug though. I'm ok with slow, but being stuck for 2
days with only 270 snapshots or so means either there is a bug, or the
algorithm is so expensive that 270 snapshots can cause it to take days
or weeks to make progress.
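
In the future I'll probably stick to filtered balances so each pass has far
less to do, something along these lines (the mount point is a placeholder):

# only rewrite data/metadata chunks that are less than 20% full
btrfs balance start -dusage=20 -musage=20 /mnt/pool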

> > It's a backup server, it only contains data from other machines.
> > If the filesystem cannot be recovered to a working state, I will need
> > over a week to restart the many btrfs send commands from many servers.
> > This is why anything other than --repair is useless to me: I don't need
> > the data back, it's still on the original machines. I need the
> > filesystem to work again so that I don't waste a week recreating the
> > many btrfs send/receive relationships.
> 
> Now I totally understand why you need to repair the fs.

I also understand that my use case is atypical :)
But I guess this also means that using btrfs for a lot of send/receive
on a backup server is not going to work well unfortunately :-/

Now I'm wondering if I'm the only person even doing this.

> > Does the pastebin help and is 270 snapshots ok enough?
> 
> The super dump doesn't show anything wrong.
> 
> So the problem may be in the super large extent tree.
> 
> In this case, a plain check result with Su's patch would help more,
> rather than the not-so-interesting super dump.

First I tried to mount with skip_balance after the partial repair, and
it hung for a long time:
[445635.716318] BTRFS info (device dm-2): disk space caching is enabled
[445635.736229] BTRFS info (device dm-2): has skinny extents
[445636.101999] BTRFS info (device dm-2): bdev /dev/mapper/dshelf2 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
[445825.053205] BTRFS info (device dm-2): enabling ssd optimizations
[446511.006588] BTRFS info (device dm-2): disk space caching is enabled
[446511.026737] BTRFS info (device dm-2): has skinny extents
[446511.325470] BTRFS info (device dm-2): bdev /dev/mapper/dshelf2 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
[446699.593501] BTRFS info (device dm-2): enabling ssd optimizations
[446964.077045] INFO: task btrfs-transacti:9211 blocked for more than 120 seconds.
[446964.099802]       Not tainted 4.17.2-amd64-preempt-sysrq-20180818 #3
[446964.120004] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

So, I rebooted, and will now run Su's btrfs check without repair and
report back.

Thanks both for your help.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-06-29  7:22               ` Marc MERLIN
@ 2018-06-29  7:34                 ` Roman Mamedov
  2018-06-29  8:04                 ` Lionel Bouton
  1 sibling, 0 replies; 72+ messages in thread
From: Roman Mamedov @ 2018-06-29  7:34 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: linux-btrfs

On Fri, 29 Jun 2018 00:22:10 -0700
Marc MERLIN <marc@merlins.org> wrote:

> On Fri, Jun 29, 2018 at 12:09:54PM +0500, Roman Mamedov wrote:
> > On Thu, 28 Jun 2018 23:59:03 -0700
> > Marc MERLIN <marc@merlins.org> wrote:
> > 
> > > I don't waste a week recreating the many btrfs send/receive relationships.
> > 
> > Consider not using send/receive, and switching to regular rsync instead.
> > Send/receive is very limiting and cumbersome, including because of what you
> > described. And it doesn't gain you much over an incremental rsync. As for
> 
> Err, sorry but I cannot agree with you here, at all :)
> 
> btrfs send/receive is pretty much the only reason I use btrfs. 
> rsync takes hours on big filesystems scanning every single inode on both
> sides and then seeing what changed, and only then sends the differences

I use it for backing up root filesystems of about 20 hosts, and for syncing
large multi-terabyte media collections -- it's fast enough for both.
Admittedly neither of those cases has millions of subdirs or files where
scanning may take a long time. And in the former case it's also all from and
to SSDs. Maybe your use case is different enough that it doesn't work as well.
But perhaps then general day-to-day performance is not great either, so I'd
suggest looking into SSD-based LVM caching; it really works wonders with Btrfs.
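
If you do try it, the setup is only a few commands; a minimal sketch, assuming
both devices are already PVs in the VG, with names and sizes as pure examples:

# data LV on the slow device, cache pool on the SSD, then tie them together
lvcreate -n backups -L 10T vg0 /dev/slow_hdd
lvcreate --type cache-pool -n backups_cache -L 100G vg0 /dev/fast_ssd
lvconvert --type cache --cachepool vg0/backups_cache vg0/backups
mkfs.btrfs /dev/vg0/backups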

-- 
With respect,
Roman

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-06-29  7:22               ` Marc MERLIN
  2018-06-29  7:34                 ` Roman Mamedov
@ 2018-06-29  8:04                 ` Lionel Bouton
  2018-06-29 16:24                   ` btrfs send/receive vs rsync Marc MERLIN
  1 sibling, 1 reply; 72+ messages in thread
From: Lionel Bouton @ 2018-06-29  8:04 UTC (permalink / raw)
  To: Marc MERLIN, Roman Mamedov; +Cc: linux-btrfs

Hi,

On 29/06/2018 09:22, Marc MERLIN wrote:
> On Fri, Jun 29, 2018 at 12:09:54PM +0500, Roman Mamedov wrote:
>> On Thu, 28 Jun 2018 23:59:03 -0700
>> Marc MERLIN <marc@merlins.org> wrote:
>>
>>> I don't waste a week recreating the many btrfs send/receive relationships.
>> Consider not using send/receive, and switching to regular rsync instead.
>> Send/receive is very limiting and cumbersome, including because of what you
>> described. And it doesn't gain you much over an incremental rsync. As for
> Err, sorry but I cannot agree with you here, at all :)
>
> btrfs send/receive is pretty much the only reason I use btrfs. 
> rsync takes hours on big filesystems, scanning every single inode on both
> sides and then seeing what changed, and only then sends the differences.
> It's super inefficient.
> btrfs send knows in seconds what needs to be sent, and works on it right
> away.

I've not yet tried send/receive but I feel the pain of rsyncing millions
of files (I had to use lsyncd to limit the problem to the times when the
origin servers reboot, which is a relatively rare event), so this thread
piqued my attention. Looking at the whole thread, I wonder if you could
get a more manageable setup by splitting the filesystem.

If, instead of using a single BTRFS filesystem, you used LVM volumes
(maybe with thin provisioning and monitoring of the volume group free
space) for each of the servers you back up, with one BTRFS filesystem per
volume, you would have fewer snapshots per filesystem and would isolate
problems in case of corruption. If you eventually decide to start from
scratch again, this might help a lot in your case.
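
A minimal sketch of the idea, with one thin volume per backed-up server (the VG,
names and sizes below are only examples):

# one thin pool, then an over-provisioned thin volume plus filesystem per server
lvcreate --type thin-pool -n backuppool -l 95%FREE vg_backup
lvcreate --thin -V 4T -n host1 vg_backup/backuppool
mkfs.btrfs -L backup_host1 /dev/vg_backup/host1
# watch pool usage so it never fills up behind your back
lvs -o lv_name,data_percent,metadata_percent vg_backup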

Lionel

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: btrfs send/receive vs rsync
  2018-06-29  8:04                 ` Lionel Bouton
@ 2018-06-29 16:24                   ` Marc MERLIN
  2018-06-30  8:18                     ` Duncan
  0 siblings, 1 reply; 72+ messages in thread
From: Marc MERLIN @ 2018-06-29 16:24 UTC (permalink / raw)
  To: Lionel Bouton; +Cc: Roman Mamedov, linux-btrfs

On Fri, Jun 29, 2018 at 10:04:02AM +0200, Lionel Bouton wrote:
> Hi,
> 
> On 29/06/2018 09:22, Marc MERLIN wrote:
> > On Fri, Jun 29, 2018 at 12:09:54PM +0500, Roman Mamedov wrote:
> >> On Thu, 28 Jun 2018 23:59:03 -0700
> >> Marc MERLIN <marc@merlins.org> wrote:
> >>
> >>> I don't waste a week recreating the many btrfs send/receive relationships.
> >> Consider not using send/receive, and switching to regular rsync instead.
> >> Send/receive is very limiting and cumbersome, including because of what you
> >> described. And it doesn't gain you much over an incremental rsync. As for
> > Err, sorry but I cannot agree with you here, at all :)
> >
> > btrfs send/receive is pretty much the only reason I use btrfs. 
> > rsync takes hours on big filesystems, scanning every single inode on both
> > sides and then seeing what changed, and only then sends the differences.
> > It's super inefficient.
> > btrfs send knows in seconds what needs to be sent, and works on it right
> > away.
> 
> I've not yet tried send/receive but I feel the pain of rsyncing millions
> of files (I had to use lsyncd to limit the problem to the times when the
> origin servers reboot, which is a relatively rare event), so this thread
> piqued my attention. Looking at the whole thread, I wonder if you could
> get a more manageable setup by splitting the filesystem.

So, let's be clear. I did backups with rsync for 10+ years. It was slow
and painful. On my laptop, an hourly rsync between 2 drives slowed down
my machine to a crawl while everything was being stat'ed; it took
forever.
Now with btrfs send/receive, it just works, I don't even see it
happening in the background.

Here is a page I wrote about it in 2014:
http://marc.merlins.org/perso/btrfs/2014-03.html#Btrfs-Tips_-Doing-Fast-Incremental-Backups-With-Btrfs-Send-and-Receive

Here is a talk I gave in 2014 too, scroll to the bottom of the page, and
the bottom of the talk outline:
http://marc.merlins.org/perso/btrfs/2014-05.html#My-Btrfs-Talk-at-Linuxcon-JP-2014
and click on 'Btrfs send/receive'

> If, instead of using a single BTRFS filesystem, you used LVM volumes
> (maybe with thin provisioning and monitoring of the volume group free
> space) for each of the servers you back up, with one BTRFS filesystem per
> volume, you would have fewer snapshots per filesystem and would isolate
> problems in case of corruption. If you eventually decide to start from
> scratch again, this might help a lot in your case.

So, I already have problems due to too many block layers:
- raid 5 + ssd
- bcache
- dmcrypt
- btrfs

I get occasional deadlocks due to upper layers sending more data to the
lower layer (bcache) than it can process. I'm a bit wary of adding yet
another layer (LVM), but you're otherwise correct that keeping smaller
btrfs filesystems would help with performance and with containing possible
damage.

Has anyone actually done this? :)

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-06-29  7:28               ` Marc MERLIN
@ 2018-06-29 17:10                 ` Marc MERLIN
  2018-06-30  0:04                   ` Chris Murphy
  2018-06-30  2:44                   ` Marc MERLIN
  0 siblings, 2 replies; 72+ messages in thread
From: Marc MERLIN @ 2018-06-29 17:10 UTC (permalink / raw)
  To: Qu Wenruo, suy.fnst; +Cc: linux-btrfs

On Fri, Jun 29, 2018 at 12:28:31AM -0700, Marc MERLIN wrote:
> So, I rebooted, and will now run Su's btrfs check without repair and
> report back.

As expected, it will likely still take days, here's the start:

gargamel:~# btrfs check --mode=lowmem  -p /dev/mapper/dshelf2  
Checking filesystem on /dev/mapper/dshelf2 
UUID: 0f1a0c9f-4e54-4fa7-8736-fd50818ff73d 
ERROR: extent[84302495744, 69632] referencer count mismatch (root: 21872, owner: 374857, offset: 3407872) wanted: 2, have: 4
ERROR: extent[84302495744, 69632] referencer count mismatch (root: 22911, owner: 374857, offset: 3407872) wanted: 2, have: 4
ERROR: extent[125712527360, 12214272] referencer count mismatch (root: 21872, owner: 374857, offset: 114540544) wanted: 180, have: 240
ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 21872, owner: 374857, offset: 126754816) wanted: 67, have: 115
ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 22911, owner: 374857, offset: 126754816) wanted: 67, have: 115
ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 21872, owner: 374857, offset: 131866624) wanted: 114, have: 143
ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 22911, owner: 374857, offset: 131866624) wanted: 114, have: 143
ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 21872, owner: 374857, offset: 148234240) wanted: 301, have: 431
ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 22911, owner: 374857, offset: 148234240) wanted: 355, have: 433
ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 21872, owner: 374857, offset: 180371456) wanted: 160, have: 240
ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 22911, owner: 374857, offset: 180371456) wanted: 161, have: 240
ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 21872, owner: 374857, offset: 192200704) wanted: 169, have: 249
ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 22911, owner: 374857, offset: 192200704) wanted: 171, have: 251
ERROR: extent[150850146304, 17522688] referencer count mismatch (root: 21872, owner: 374857, offset: 217653248) wanted: 347, have: 418
ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 22911, owner: 374857, offset: 235175936) wanted: 1, have: 1449
ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 21872, owner: 374857, offset: 235175936) wanted: 1, have: 1452

Mmmh, these look similar (but not identical) to the last run earlier in this thread:
ERROR: extent[84302495744, 69632] referencer count mismatch (root: 21872, owner: 374857, offset: 3407872) wanted: 3, have: 4
Created new chunk [18457780224000 1073741824]
Delete backref in extent [84302495744 69632]
ERROR: extent[84302495744, 69632] referencer count mismatch (root: 22911, owner: 374857, offset: 3407872) wanted: 3, have: 4
Delete backref in extent [84302495744 69632]
ERROR: extent[125712527360, 12214272] referencer count mismatch (root: 21872, owner: 374857, offset: 114540544) wanted: 181, have: 240
Delete backref in extent [125712527360 12214272]
ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 21872, owner: 374857, offset: 126754816) wanted: 68, have: 115
Delete backref in extent [125730848768 5111808]
ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 22911, owner: 374857, offset: 126754816) wanted: 68, have: 115
Delete backref in extent [125730848768 5111808]
ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 21872, owner: 374857, offset: 131866624) wanted: 115, have: 143
Delete backref in extent [125736914944 6037504]
ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 22911, owner: 374857, offset: 131866624) wanted: 115, have: 143
Delete backref in extent [125736914944 6037504]
ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 21872, owner: 374857, offset: 148234240) wanted: 302, have: 431
Delete backref in extent [129952120832 20242432]
ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 22911, owner: 374857, offset: 148234240) wanted: 356, have: 433
Delete backref in extent [129952120832 20242432]
ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 21872, owner: 374857, offset: 180371456) wanted: 161, have: 240
Delete backref in extent [134925357056 11829248]
ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 22911, owner: 374857, offset: 180371456) wanted: 162, have: 240
Delete backref in extent [134925357056 11829248]
ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 21872, owner: 374857, offset: 192200704) wanted: 170, have: 249
Delete backref in extent [147895111680 12345344]
ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 22911, owner: 374857, offset: 192200704) wanted: 172, have: 251
Delete backref in extent [147895111680 12345344]
ERROR: extent[150850146304, 17522688] referencer count mismatch (root: 21872, owner: 374857, offset: 217653248) wanted: 348, have: 418
Delete backref in extent [150850146304 17522688]
ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 22911, owner: 374857, offset: 235175936) wanted: 555, have: 1449
Deleted root 2 item[156909494272, 178, 5476627808561673095]
ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 21872, owner: 374857, offset: 235175936) wanted: 556, have: 1452
Deleted root 2 item[156909494272, 178, 7338474132555182983]

I guess the last repair didn't fix things in a way that left them working?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-06-29 17:10                 ` Marc MERLIN
@ 2018-06-30  0:04                   ` Chris Murphy
  2018-06-30  2:44                   ` Marc MERLIN
  1 sibling, 0 replies; 72+ messages in thread
From: Chris Murphy @ 2018-06-30  0:04 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: Qu Wenruo, Su Yue, Btrfs BTRFS

I've got about 1/2 the snapshots and less than 1/10th the data...but
my btrfs check times are much shorter than either: 15 minutes and 65
minutes (lowmem).


[chris@f28s ~]$ sudo btrfs fi us /mnt/first
Overall:
    Device size:        1024.00GiB
    Device allocated:         774.12GiB
    Device unallocated:         249.87GiB
    Device missing:             0.00B
    Used:             760.48GiB
    Free (estimated):         256.95GiB    (min: 132.01GiB)
    Data ratio:                  1.00
    Metadata ratio:              2.00
    Global reserve:         512.00MiB    (used: 0.00B)

Data,single: Size:761.00GiB, Used:753.93GiB
   /dev/mapper/first     761.00GiB

Metadata,DUP: Size:6.50GiB, Used:3.28GiB
   /dev/mapper/first      13.00GiB

System,DUP: Size:64.00MiB, Used:112.00KiB
   /dev/mapper/first     128.00MiB

Unallocated:
   /dev/mapper/first     249.87GiB


146 subvolumes
137 snapshots

total csum bytes: 790549924
total tree bytes: 3519250432
total fs tree bytes: 2546073600
total extent tree bytes: 131350528


Original mode check takes ~15 minutes
Lowmem mode takes ~65 minutes

RAM: 4G
CPU: Intel(R) Pentium(R) CPU  N3700  @ 1.60GHz



Chris Murphy

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-06-29 17:10                 ` Marc MERLIN
  2018-06-30  0:04                   ` Chris Murphy
@ 2018-06-30  2:44                   ` Marc MERLIN
  2018-06-30 14:49                     ` Qu Wenruo
  1 sibling, 1 reply; 72+ messages in thread
From: Marc MERLIN @ 2018-06-30  2:44 UTC (permalink / raw)
  To: Qu Wenruo, suy.fnst; +Cc: linux-btrfs

Well, there goes that. After about 18H:
ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 21872, owner: 374857, offset: 235175936) wanted: 1, have: 1452 
backref.c:466: __add_missing_keys: Assertion `ref->root_id` failed, value 0 
btrfs(+0x3a232)[0x56091704f232] 
btrfs(+0x3ab46)[0x56091704fb46] 
btrfs(+0x3b9f5)[0x5609170509f5] 
btrfs(btrfs_find_all_roots+0x9)[0x560917050a45] 
btrfs(+0x572ff)[0x56091706c2ff] 
btrfs(+0x60b13)[0x560917075b13] 
btrfs(cmd_check+0x2634)[0x56091707d431] 
btrfs(main+0x88)[0x560917027260] 
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1)[0x7f93aa508561] 
btrfs(_start+0x2a)[0x560917026dfa] 
Aborted 

That's https://github.com/Damenly/btrfs-progs.git

Whoops, I didn't use the tmp1 branch, let me try again with that and
report back, although the problem above is still going to be there since
I think the only difference will be this, correct?
https://github.com/Damenly/btrfs-progs/commit/b5851513a12237b3e19a3e71f3ad00b966d25b3a
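
Concretely, this is roughly what I'll run this time, making sure the tmp1
branch is actually checked out before rebuilding (same repo and device as above):

cd /var/local/src/btrfs-progs.sy
git fetch origin && git checkout tmp1
make clean && make
./btrfs check --mode=lowmem -p /dev/mapper/dshelf2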

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: btrfs send/receive vs rsync
  2018-06-29 16:24                   ` btrfs send/receive vs rsync Marc MERLIN
@ 2018-06-30  8:18                     ` Duncan
  0 siblings, 0 replies; 72+ messages in thread
From: Duncan @ 2018-06-30  8:18 UTC (permalink / raw)
  To: linux-btrfs

Marc MERLIN posted on Fri, 29 Jun 2018 09:24:20 -0700 as excerpted:

>> If, instead of using a single BTRFS filesystem, you used LVM volumes
>> (maybe with thin provisioning and monitoring of the volume group free
>> space) for each of the servers you back up, with one BTRFS filesystem per
>> volume, you would have fewer snapshots per filesystem and would isolate
>> problems in case of corruption. If you eventually decide to start from
>> scratch again, this might help a lot in your case.
> 
> So, I already have problems due to too many block layers:
> - raid 5 + ssd
> - bcache
> - dmcrypt
> - btrfs
> 
> I get occasional deadlocks due to upper layers sending more data to the
> lower layer (bcache) than it can process. I'm a bit wary of adding yet
> another layer (LVM), but you're otherwise correct that keeping smaller
> btrfs filesystems would help with performance and with containing possible
> damage.
> 
> Has anyone actually done this? :)

So I definitely use (and advocate!) the split-em-up strategy, and I use 
btrfs, but that's pretty much all the similarity we have.

I'm all ssd, having left spinning rust behind.  My strategy avoids 
unnecessary layers like lvm (tho crypt can arguably be necessary), 
preferring direct on-device (gpt) partitioning for simplicity of 
management and disaster recovery.  And my backup and recovery strategy is 
an equally simple mkfs and full-filesystem-fileset copy to an identically 
sized filesystem, with backups easily bootable/mountable in place of the 
working copy if necessary, and multiple backups so if disaster takes out 
the backup I was writing at the same time as the working copy, I still 
have a backup to fall back to.

So it's different enough I'm not sure how much my experience will help 
you.  But I /can/ say the subdivision is nice, as it means I can keep my 
root filesystem read-only by default for reliability, my most-at-risk log 
filesystem tiny for near-instant scrub/balance/check, and my also at risk 
home small as well, with the big media files being on a different 
filesystem that's mostly read-only, so less at risk and needing less 
frequent backups.  The tiny boot and large updates (distro repo, sources, 
ccache) are also separate, and mounted only for boot maintenance or 
updates.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-06-30  2:44                   ` Marc MERLIN
@ 2018-06-30 14:49                     ` Qu Wenruo
  2018-06-30 21:06                       ` Marc MERLIN
  0 siblings, 1 reply; 72+ messages in thread
From: Qu Wenruo @ 2018-06-30 14:49 UTC (permalink / raw)
  To: Marc MERLIN, suy.fnst; +Cc: linux-btrfs



On 2018年06月30日 10:44, Marc MERLIN wrote:
> Well, there goes that. After about 18H:
> ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 21872, owner: 374857, offset: 235175936) wanted: 1, have: 1452 
> backref.c:466: __add_missing_keys: Assertion `ref->root_id` failed, value 0 
> btrfs(+0x3a232)[0x56091704f232] 
> btrfs(+0x3ab46)[0x56091704fb46] 
> btrfs(+0x3b9f5)[0x5609170509f5] 
> btrfs(btrfs_find_all_roots+0x9)[0x560917050a45] 
> btrfs(+0x572ff)[0x56091706c2ff] 
> btrfs(+0x60b13)[0x560917075b13] 
> btrfs(cmd_check+0x2634)[0x56091707d431] 
> btrfs(main+0x88)[0x560917027260] 
> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1)[0x7f93aa508561] 
> btrfs(_start+0x2a)[0x560917026dfa] 
> Aborted 

I think that's the root cause.
Some invalid extent tree backref or bad tree block is blowing up the backref code.

All previous error messages may be garbage unless you're using Su's
latest branch, as lowmem mode tends to report false alerts on referencer
count mismatches.

But the last abort looks very likely to be the culprit.

Would you try to dump the extent tree?
# btrfs inspect dump-tree -t extent <device> | grep -A50 156909494272

It should help us locate the culprit and hopefully give us a chance to
fix it.

Thanks,
Qu

> 
> That's https://github.com/Damenly/btrfs-progs.git
> 
> Whoops, I didn't use the tmp1 branch, let me try again with that and
> report back, although the problem above is still going to be there since
> I think the only difference will be this, correct?
> https://github.com/Damenly/btrfs-progs/commit/b5851513a12237b3e19a3e71f3ad00b966d25b3a
> 
> Marc
> 

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-06-30 14:49                     ` Qu Wenruo
@ 2018-06-30 21:06                       ` Marc MERLIN
  0 siblings, 0 replies; 72+ messages in thread
From: Marc MERLIN @ 2018-06-30 21:06 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: suy.fnst, linux-btrfs

On Sat, Jun 30, 2018 at 10:49:07PM +0800, Qu Wenruo wrote:
> But the last abort looks very likely to be the culprit.
> 
> Would you try to dump the extent tree?
> # btrfs inspect dump-tree -t extent <device> | grep -A50 156909494272

Sure, there you go:

	item 25 key (156909494272 EXTENT_ITEM 55320576) itemoff 14943 itemsize 24
		refs 19715 gen 31575 flags DATA
	item 26 key (156909494272 EXTENT_DATA_REF 571620086735451015) itemoff 14915 itemsize 28
		extent data backref root 21641 objectid 374857 offset 235175936 count 1452
	item 27 key (156909494272 EXTENT_DATA_REF 1765833482087969671) itemoff 14887 itemsize 28
		extent data backref root 23094 objectid 374857 offset 235175936 count 1442
	item 28 key (156909494272 EXTENT_DATA_REF 1807626434455810951) itemoff 14859 itemsize 28
		extent data backref root 21503 objectid 374857 offset 235175936 count 1454
	item 29 key (156909494272 EXTENT_DATA_REF 1879818091602916231) itemoff 14831 itemsize 28
		extent data backref root 21462 objectid 374857 offset 235175936 count 1454
	item 30 key (156909494272 EXTENT_DATA_REF 3610854505775117191) itemoff 14803 itemsize 28
		extent data backref root 23134 objectid 374857 offset 235175936 count 1442
	item 31 key (156909494272 EXTENT_DATA_REF 3754675454231458695) itemoff 14775 itemsize 28
		extent data backref root 23052 objectid 374857 offset 235175936 count 1442
	item 32 key (156909494272 EXTENT_DATA_REF 5060494667839714183) itemoff 14747 itemsize 28
		extent data backref root 23174 objectid 374857 offset 235175936 count 1440
	item 33 key (156909494272 EXTENT_DATA_REF 5476627808561673095) itemoff 14719 itemsize 28
		extent data backref root 22911 objectid 374857 offset 235175936 count 1
	item 34 key (156909494272 EXTENT_DATA_REF 6378484416458011527) itemoff 14691 itemsize 28
		extent data backref root 23012 objectid 374857 offset 235175936 count 1442
	item 35 key (156909494272 EXTENT_DATA_REF 7338474132555182983) itemoff 14663 itemsize 28
		extent data backref root 21872 objectid 374857 offset 235175936 count 1
	item 36 key (156909494272 EXTENT_DATA_REF 7516565391717970823) itemoff 14635 itemsize 28
		extent data backref root 21826 objectid 374857 offset 235175936 count 1452
	item 37 key (156909494272 SHARED_DATA_REF 14871537025024) itemoff 14631 itemsize 4
		shared data backref count 10
	item 38 key (156909494272 SHARED_DATA_REF 14871617568768) itemoff 14627 itemsize 4
		shared data backref count 73
	item 39 key (156909494272 SHARED_DATA_REF 14871619846144) itemoff 14623 itemsize 4
		shared data backref count 59
	item 40 key (156909494272 SHARED_DATA_REF 14871623270400) itemoff 14619 itemsize 4
		shared data backref count 68
	item 41 key (156909494272 SHARED_DATA_REF 14871623532544) itemoff 14615 itemsize 4
		shared data backref count 70
	item 42 key (156909494272 SHARED_DATA_REF 14871626383360) itemoff 14611 itemsize 4
		shared data backref count 76
	item 43 key (156909494272 SHARED_DATA_REF 14871635132416) itemoff 14607 itemsize 4
		shared data backref count 60
	item 44 key (156909494272 SHARED_DATA_REF 14871649533952) itemoff 14603 itemsize 4
		shared data backref count 79
	item 45 key (156909494272 SHARED_DATA_REF 14871862378496) itemoff 14599 itemsize 4
		shared data backref count 70
	item 46 key (156909494272 SHARED_DATA_REF 14909667098624) itemoff 14595 itemsize 4
		shared data backref count 72
	item 47 key (156909494272 SHARED_DATA_REF 14909669720064) itemoff 14591 itemsize 4
		shared data backref count 58
	item 48 key (156909494272 SHARED_DATA_REF 14909734567936) itemoff 14587 itemsize 4
		shared data backref count 73
	item 49 key (156909494272 SHARED_DATA_REF 14909920477184) itemoff 14583 itemsize 4
		shared data backref count 79
	item 50 key (156909494272 SHARED_DATA_REF 14942279335936) itemoff 14579 itemsize 4
		shared data backref count 79
	item 51 key (156909494272 SHARED_DATA_REF 14942304862208) itemoff 14575 itemsize 4
		shared data backref count 72
	item 52 key (156909494272 SHARED_DATA_REF 14942348378112) itemoff 14571 itemsize 4
		shared data backref count 67
	item 53 key (156909494272 SHARED_DATA_REF 14942366138368) itemoff 14567 itemsize 4
		shared data backref count 51
	item 54 key (156909494272 SHARED_DATA_REF 14942384799744) itemoff 14563 itemsize 4
		shared data backref count 64
	item 55 key (156909494272 SHARED_DATA_REF 14978234613760) itemoff 14559 itemsize 4
		shared data backref count 61
	item 56 key (156909494272 SHARED_DATA_REF 14978246459392) itemoff 14555 itemsize 4
		shared data backref count 56
	item 57 key (156909494272 SHARED_DATA_REF 14978256879616) itemoff 14551 itemsize 4
		shared data backref count 75
	item 58 key (156909494272 SHARED_DATA_REF 15001465749504) itemoff 14547 itemsize 4
		shared data backref count 77
	item 59 key (156909494272 SHARED_DATA_REF 18215010877440) itemoff 14543 itemsize 4
		shared data backref count 79
	item 60 key (156909494272 SHARED_DATA_REF 18215045660672) itemoff 14539 itemsize 4
		shared data backref count 10
	item 61 key (156909494272 SHARED_DATA_REF 18215099023360) itemoff 14535 itemsize 4
		shared data backref count 56
	item 62 key (156909494272 SHARED_DATA_REF 18215114522624) itemoff 14531 itemsize 4
		shared data backref count 70
	item 63 key (156909494272 SHARED_DATA_REF 18215129874432) itemoff 14527 itemsize 4
		shared data backref count 68
	item 64 key (156909494272 SHARED_DATA_REF 18215130267648) itemoff 14523 itemsize 4
		shared data backref count 72
	item 65 key (156909494272 SHARED_DATA_REF 18215136264192) itemoff 14519 itemsize 4
		shared data backref count 64
	item 66 key (156909494272 SHARED_DATA_REF 18215138623488) itemoff 14515 itemsize 4
		shared data backref count 72
	item 67 key (156909494272 SHARED_DATA_REF 18215188414464) itemoff 14511 itemsize 4
		shared data backref count 58
	item 68 key (156909494272 SHARED_DATA_REF 18215188447232) itemoff 14507 itemsize 4
		shared data backref count 74
	item 69 key (156909494272 SHARED_DATA_REF 18215188529152) itemoff 14503 itemsize 4
		shared data backref count 69
	item 70 key (156909494272 SHARED_DATA_REF 18215204896768) itemoff 14499 itemsize 4
		shared data backref count 67
	item 71 key (156909494272 SHARED_DATA_REF 18215228358656) itemoff 14495 itemsize 4
		shared data backref count 68
	item 72 key (156909494272 SHARED_DATA_REF 18215228899328) itemoff 14491 itemsize 4
		shared data backref count 81
	item 73 key (156909494272 SHARED_DATA_REF 18215240892416) itemoff 14487 itemsize 4
		shared data backref count 78
	item 74 key (156909494272 SHARED_DATA_REF 18215244251136) itemoff 14483 itemsize 4
		shared data backref count 58
	item 75 key (156909494272 SHARED_DATA_REF 18215244365824) itemoff 14479 itemsize 4
		shared data backref count 63
	item 76 key (156909494272 SHARED_DATA_REF 18215252770816) itemoff 14475 itemsize 4
		shared data backref count 76
	item 77 key (156909494272 SHARED_DATA_REF 18215264337920) itemoff 14471 itemsize 4
		shared data backref count 76
	item 78 key (156909494272 SHARED_DATA_REF 18215270055936) itemoff 14467 itemsize 4
		shared data backref count 73
	item 79 key (156909494272 SHARED_DATA_REF 18215290601472) itemoff 14463 itemsize 4
		shared data backref count 63
	item 80 key (156909494272 SHARED_DATA_REF 18215290617856) itemoff 14459 itemsize 4
		shared data backref count 54
	item 81 key (156909494272 SHARED_DATA_REF 18244453154816) itemoff 14455 itemsize 4
		shared data backref count 79
	item 82 key (156909494272 SHARED_DATA_REF 18244454383616) itemoff 14451 itemsize 4
		shared data backref count 71
	item 83 key (156909494272 SHARED_DATA_REF 18249494151168) itemoff 14447 itemsize 4
		shared data backref count 79
	item 84 key (156909494272 SHARED_DATA_REF 18249500721152) itemoff 14443 itemsize 4
		shared data backref count 71
	item 85 key (156909494272 SHARED_DATA_REF 18249523789824) itemoff 14439 itemsize 4
		shared data backref count 51
	item 86 key (156909494272 SHARED_DATA_REF 18249586802688) itemoff 14435 itemsize 4
		shared data backref count 68
	item 87 key (156909494272 SHARED_DATA_REF 18249587703808) itemoff 14431 itemsize 4
		shared data backref count 70
	item 88 key (156909494272 SHARED_DATA_REF 18249588178944) itemoff 14427 itemsize 4
		shared data backref count 72
	item 89 key (156909494272 SHARED_DATA_REF 18249591291904) itemoff 14423 itemsize 4
		shared data backref count 67
	item 90 key (156909494272 SHARED_DATA_REF 18249598238720) itemoff 14419 itemsize 4
		shared data backref count 74
	item 91 key (156909494272 SHARED_DATA_REF 18249602285568) itemoff 14415 itemsize 4
		shared data backref count 79
	item 92 key (156909494272 SHARED_DATA_REF 18249611378688) itemoff 14411 itemsize 4
		shared data backref count 65
	item 93 key (156909494272 SHARED_DATA_REF 18249613082624) itemoff 14407 itemsize 4
		shared data backref count 55
	item 94 key (156909494272 SHARED_DATA_REF 18249642229760) itemoff 14403 itemsize 4
		shared data backref count 75
	item 95 key (156909494272 SHARED_DATA_REF 18249643458560) itemoff 14399 itemsize 4
		shared data backref count 68
	item 96 key (156909494272 SHARED_DATA_REF 18250800021504) itemoff 14395 itemsize 4
		shared data backref count 79
	item 97 key (156909494272 SHARED_DATA_REF 18250814963712) itemoff 14391 itemsize 4
		shared data backref count 71
	item 98 key (156909494272 SHARED_DATA_REF 18252047237120) itemoff 14387 itemsize 4
		shared data backref count 55
	item 99 key (156909494272 SHARED_DATA_REF 18252132515840) itemoff 14383 itemsize 4
		shared data backref count 68
	item 100 key (156909494272 SHARED_DATA_REF 18252134236160) itemoff 14379 itemsize 4
		shared data backref count 72
	item 101 key (156909494272 SHARED_DATA_REF 18252274827264) itemoff 14375 itemsize 4
		shared data backref count 68
	item 102 key (156909494272 SHARED_DATA_REF 18252313460736) itemoff 14371 itemsize 4
		shared data backref count 67
	item 103 key (156909494272 SHARED_DATA_REF 18252335906816) itemoff 14367 itemsize 4
		shared data backref count 79
	item 104 key (156909494272 SHARED_DATA_REF 18252336742400) itemoff 14363 itemsize 4
		shared data backref count 74
	item 105 key (156909494272 SHARED_DATA_REF 18254150631424) itemoff 14359 itemsize 4
		shared data backref count 56
	item 106 key (156909494272 SHARED_DATA_REF 18254342537216) itemoff 14355 itemsize 4
		shared data backref count 67
	item 107 key (156909494272 SHARED_DATA_REF 18255671017472) itemoff 14351 itemsize 4
		shared data backref count 72
	item 108 key (156909494272 SHARED_DATA_REF 18255806038016) itemoff 14347 itemsize 4
		shared data backref count 69
	item 109 key (156909494272 SHARED_DATA_REF 18255821996032) itemoff 14343 itemsize 4
		shared data backref count 67
	item 110 key (156909494272 SHARED_DATA_REF 18256006414336) itemoff 14339 itemsize 4
		shared data backref count 79
	item 111 key (156909494272 SHARED_DATA_REF 18256021012480) itemoff 14335 itemsize 4
		shared data backref count 74
	item 112 key (156909494272 SHARED_DATA_REF 18260113752064) itemoff 14331 itemsize 4
		shared data backref count 75
	item 113 key (156909494272 SHARED_DATA_REF 18260113883136) itemoff 14327 itemsize 4
		shared data backref count 65
	item 114 key (156909494272 SHARED_DATA_REF 18260114849792) itemoff 14323 itemsize 4
		shared data backref count 51
	item 115 key (156909494272 SHARED_DATA_REF 18260115013632) itemoff 14319 itemsize 4
		shared data backref count 70
	item 116 key (156909494272 SHARED_DATA_REF 18261625552896) itemoff 14315 itemsize 4
		shared data backref count 75
	item 117 key (156909494272 SHARED_DATA_REF 18261631107072) itemoff 14311 itemsize 4
		shared data backref count 65
	item 118 key (156909494272 SHARED_DATA_REF 18261652078592) itemoff 14307 itemsize 4
		shared data backref count 52
	item 119 key (156909494272 SHARED_DATA_REF 18261658025984) itemoff 14303 itemsize 4
		shared data backref count 70
	item 120 key (156964814848 EXTENT_ITEM 7487488) itemoff 13856 itemsize 447
		refs 2505 gen 31575 flags DATA
		extent data backref root 21826 objectid 374857 offset 290496512 count 192
		extent data backref root 21872 objectid 374857 offset 290496512 count 192
		extent data backref root 23012 objectid 374857 offset 290496512 count 193
		extent data backref root 22911 objectid 374857 offset 290496512 count 192
		extent data backref root 23174 objectid 374857 offset 290496512 count 193
		extent data backref root 23052 objectid 374857 offset 290496512 count 193
		extent data backref root 23134 objectid 374857 offset 290496512 count 193
		extent data backref root 21462 objectid 374857 offset 290496512 count 192
		extent data backref root 21503 objectid 374857 offset 290496512 count 192
		extent data backref root 23094 objectid 374857 offset 290496512 count 193
		extent data backref root 21641 objectid 374857 offset 290496512 count 192
		shared data backref parent 18215389659136 count 55
		shared data backref parent 18215388102656 count 63
		shared data backref parent 18215294795776 count 69
		shared data backref parent 18215244365824 count 7
		shared data backref parent 14978251440128 count 55
		shared data backref parent 14978250768384 count 63
		shared data backref parent 14978248212480 count 69
		shared data backref parent 14978246459392 count 7
	item 121 key (156972302336 EXTENT_ITEM 8192) itemoff 13487 itemsize 369
		refs 13 gen 31575 flags DATA
		extent data backref root 21826 objectid 374857 offset 297984000 count 1
		extent data backref root 21872 objectid 374857 offset 297984000 count 1
		extent data backref root 23012 objectid 374857 offset 297984000 count 1
		extent data backref root 22911 objectid 374857 offset 297984000 count 1
		extent data backref root 23174 objectid 374857 offset 297984000 count 1
		extent data backref root 23052 objectid 374857 offset 297984000 count 1
		extent data backref root 23134 objectid 374857 offset 297984000 count 1
		extent data backref root 21462 objectid 374857 offset 297984000 count 1
		extent data backref root 21503 objectid 374857 offset 297984000 count 1
		extent data backref root 23094 objectid 374857 offset 297984000 count 1
		extent data backref root 21641 objectid 374857 offset 297984000 count 1
		shared data backref parent 18215389659136 count 1
		shared data backref parent 14978251440128 count 1
	item 122 key (156972310528 EXTENT_ITEM 102400) itemoff 13450 itemsize 37
		refs 1 gen 31631 flags DATA
		shared data backref parent 17763118120960 count 1
	item 123 key (156972412928 EXTENT_ITEM 102400) itemoff 13413 itemsize 37
		refs 1 gen 31631 flags DATA
		shared data backref parent 17763118120960 count 1
	item 124 key (156972515328 EXTENT_ITEM 102400) itemoff 13376 itemsize 37
		refs 1 gen 31631 flags DATA
		shared data backref parent 17763118120960 count 1
	item 125 key (156972617728 EXTENT_ITEM 102400) itemoff 13339 itemsize 37
		refs 1 gen 31631 flags DATA
		shared data backref parent 17763118120960 count 1
	item 126 key (156972720128 EXTENT_ITEM 98304) itemoff 13302 itemsize 37
--
	item 30 key (1569094942720 EXTENT_ITEM 24576) itemoff 14678 itemsize 53
		refs 1 gen 97048 flags DATA
		extent data backref root 21462 objectid 374857 offset 90849280 count 1
	item 31 key (1569094967296 EXTENT_ITEM 94208) itemoff 14625 itemsize 53
		refs 1 gen 94313 flags DATA
		extent data backref root 19852 objectid 67985779 offset 0 count 1
	item 32 key (1569095061504 EXTENT_ITEM 299008) itemoff 14572 itemsize 53
		refs 1 gen 136347 flags DATA
		extent data backref root 19852 objectid 129958928 offset 0 count 1
	item 33 key (1569095360512 EXTENT_ITEM 40960) itemoff 14519 itemsize 53
		refs 1 gen 95673 flags DATA
		extent data backref root 19852 objectid 70844817 offset 0 count 1
	item 34 key (1569095475200 EXTENT_ITEM 36864) itemoff 14466 itemsize 53
		refs 1 gen 134400 flags DATA
		extent data backref root 19852 objectid 123134122 offset 0 count 1
	item 35 key (1569095536640 EXTENT_ITEM 16384) itemoff 14413 itemsize 53
		refs 1 gen 134270 flags DATA
		extent data backref root 19852 objectid 122565390 offset 0 count 1
	item 36 key (1569095557120 EXTENT_ITEM 286720) itemoff 14360 itemsize 53
		refs 1 gen 97139 flags DATA
		extent data backref root 19852 objectid 75280458 offset 0 count 1
	item 37 key (1569095843840 EXTENT_ITEM 8192) itemoff 14323 itemsize 37
		refs 1 gen 88571 flags DATA
		shared data backref parent 14909069754368 count 1
	item 38 key (1569095852032 EXTENT_ITEM 122880) itemoff 14270 itemsize 53
		refs 1 gen 76214 flags DATA
		extent data backref root 19852 objectid 35849748 offset 0 count 1
	item 39 key (1569095974912 EXTENT_ITEM 8192) itemoff 14220 itemsize 50
		refs 2 gen 88571 flags DATA
		shared data backref parent 18214784647168 count 1
		shared data backref parent 14909069754368 count 1
	item 40 key (1569095983104 EXTENT_ITEM 8192) itemoff 14170 itemsize 50
		refs 2 gen 88571 flags DATA
		shared data backref parent 18214784647168 count 1
		shared data backref parent 14909069754368 count 1
	item 41 key (1569096114176 EXTENT_ITEM 286720) itemoff 14117 itemsize 53
		refs 1 gen 95205 flags DATA
		extent data backref root 19852 objectid 69436429 offset 0 count 1
	item 42 key (1569096400896 EXTENT_ITEM 122880) itemoff 14064 itemsize 53
		refs 1 gen 92983 flags DATA
		extent data backref root 19852 objectid 66052505 offset 0 count 1
	item 43 key (1569096523776 EXTENT_ITEM 270336) itemoff 14011 itemsize 53
		refs 1 gen 94720 flags DATA
		extent data backref root 19852 objectid 68432863 offset 0 count 1
	item 44 key (1569097105408 EXTENT_ITEM 45056) itemoff 13958 itemsize 53
		refs 1 gen 96865 flags DATA
		extent data backref root 19852 objectid 74357290 offset 0 count 1
	item 45 key (1569097150464 EXTENT_ITEM 8192) itemoff 13905 itemsize 53
		refs 1 gen 97048 flags DATA
		extent data backref root 21462 objectid 374857 offset 99221504 count 1
	item 46 key (1569097158656 EXTENT_ITEM 110592) itemoff 13868 itemsize 37

-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-06-29  6:43           ` Marc MERLIN
@ 2018-07-01 23:22             ` Marc MERLIN
  2018-07-02  2:02               ` Su Yue
  0 siblings, 1 reply; 72+ messages in thread
From: Marc MERLIN @ 2018-07-01 23:22 UTC (permalink / raw)
  To: Su Yue; +Cc: Qu Wenruo, linux-btrfs

On Thu, Jun 28, 2018 at 11:43:54PM -0700, Marc MERLIN wrote:
> On Fri, Jun 29, 2018 at 02:32:44PM +0800, Su Yue wrote:
> > > > https://github.com/Damenly/btrfs-progs/tree/tmp1
> > > 
> > > Not sure if I understand what you meant here.
> > > 
> > Sorry for my unclear words.
> > Simply speaking, I suggest you stop the currently running check.
> > Then, clone the branch above, compile the binary, and run
> > 'btrfs check --mode=lowmem $dev'.
>  
> I understand, I'll build and try it.
> 
> > > This filesystem is trash to me and will require over a week to rebuild
> > > manually if I can't repair it.
> > 
> > Understood your anxiety; a log of check without '--repair' will help
> > us figure out what's wrong with your filesystem.
> 
> Ok, I'll run your new code without repair and report back. It will
> likely take over a day though.

Well, it got stuck for over a day, and then I had to reboot :(

saruman:/var/local/src/btrfs-progs.sy# git remote -v
origin	https://github.com/Damenly/btrfs-progs.git (fetch)
origin	https://github.com/Damenly/btrfs-progs.git (push)
saruman:/var/local/src/btrfs-progs.sy# git branch
  master
* tmp1
saruman:/var/local/src/btrfs-progs.sy# git pull
Already up to date.
saruman:/var/local/src/btrfs-progs.sy# make
Making all in Documentation
make[1]: Nothing to be done for 'all'.

However, it still got stuck here:
gargamel:~# btrfs check --mode=lowmem  -p /dev/mapper/dshelf2   
Checking filesystem on /dev/mapper/dshelf2  
UUID: 0f1a0c9f-4e54-4fa7-8736-fd50818ff73d  
ERROR: extent[84302495744, 69632] referencer count mismatch (root: 21872, owner: 374857, offset: 3407872) wanted: 2, have: 3
ERROR: extent[84302495744, 69632] referencer count mismatch (root: 22911, owner: 374857, offset: 3407872) wanted: 2, have: 4
ERROR: extent[125712527360, 12214272] referencer count mismatch (root: 21872, owner: 374857, offset: 114540544) wanted: 180, have: 181
ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 21872, owner: 374857, offset: 126754816) wanted: 67, have: 68
ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 22911, owner: 374857, offset: 126754816) wanted: 67, have: 115
ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 21872, owner: 374857, offset: 131866624) wanted: 114, have: 115
ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 22911, owner: 374857, offset: 131866624) wanted: 114, have: 143
ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 21872, owner: 374857, offset: 148234240) wanted: 301, have: 302
ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 22911, owner: 374857, offset: 148234240) wanted: 355, have: 433
ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 21872, owner: 374857, offset: 180371456) wanted: 160, have: 161
ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 22911, owner: 374857, offset: 180371456) wanted: 161, have: 240
ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 21872, owner: 374857, offset: 192200704) wanted: 169, have: 170
ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 22911, owner: 374857, offset: 192200704) wanted: 171, have: 251
ERROR: extent[150850146304, 17522688] referencer count mismatch (root: 21872, owner: 374857, offset: 217653248) wanted: 347, have: 348
ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 22911, owner: 374857, offset: 235175936) wanted: 1, have: 1449
ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 21872, owner: 374857, offset: 235175936) wanted: 1, have: 556

What should I try next?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-07-01 23:22             ` Marc MERLIN
@ 2018-07-02  2:02               ` Su Yue
  2018-07-02  3:22                 ` Marc MERLIN
  0 siblings, 1 reply; 72+ messages in thread
From: Su Yue @ 2018-07-02  2:02 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: Qu Wenruo, linux-btrfs



On 07/02/2018 07:22 AM, Marc MERLIN wrote:
> On Thu, Jun 28, 2018 at 11:43:54PM -0700, Marc MERLIN wrote:
>> On Fri, Jun 29, 2018 at 02:32:44PM +0800, Su Yue wrote:
>>>>> https://github.com/Damenly/btrfs-progs/tree/tmp1
>>>>
>>>> Not sure if I understand what you meant here.
>>>>
>>> Sorry for my unclear words.
>>> Simply speaking, I suggest you stop the currently running check.
>>> Then, clone the branch above, compile the binary, and run
>>> 'btrfs check --mode=lowmem $dev'.
>>   
>> I understand, I'll build and try it.
>>
>>>> This filesystem is trash to me and will require over a week to rebuild
>>>> manually if I can't repair it.
>>>
>>> Understood your anxiety; a log of check without '--repair' will help
>>> us figure out what's wrong with your filesystem.
>>
>> Ok, I'll run your new code without repair and report back. It will
>> likely take over a day though.
> 
> Well, it got stuck for over a day, and then I had to reboot :(
> 
> saruman:/var/local/src/btrfs-progs.sy# git remote -v
> origin	https://github.com/Damenly/btrfs-progs.git (fetch)
> origin	https://github.com/Damenly/btrfs-progs.git (push)
> saruman:/var/local/src/btrfs-progs.sy# git branch
>    master
> * tmp1
> saruman:/var/local/src/btrfs-progs.sy# git pull
> Already up to date.
> saruman:/var/local/src/btrfs-progs.sy# make
> Making all in Documentation
> make[1]: Nothing to be done for 'all'.
> 
> However, it still got stuck here:
Thanks, I saw. Some clues found.

Could you try the following dumps? They shouldn't take much time.

#btrfs inspect dump-tree -t 21872 <device> | grep -C 50 "374857 EXTENT_DATA "

#btrfs inspect dump-tree -t 22911 <device> | grep -C 50 "374857 EXTENT_DATA "

Thanks,
Su

> gargamel:~# btrfs check --mode=lowmem  -p /dev/mapper/dshelf2
> Checking filesystem on /dev/mapper/dshelf2
> UUID: 0f1a0c9f-4e54-4fa7-8736-fd50818ff73d
> ERROR: extent[84302495744, 69632] referencer count mismatch (root: 21872, owner: 374857, offset: 3407872) wanted: 2, have: 3
> ERROR: extent[84302495744, 69632] referencer count mismatch (root: 22911, owner: 374857, offset: 3407872) wanted: 2, have: 4
> ERROR: extent[125712527360, 12214272] referencer count mismatch (root: 21872, owner: 374857, offset: 114540544) wanted: 180, have: 181
> ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 21872, owner: 374857, offset: 126754816) wanted: 67, have: 68
> ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 22911, owner: 374857, offset: 126754816) wanted: 67, have: 115
> ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 21872, owner: 374857, offset: 131866624) wanted: 114, have: 115
> ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 22911, owner: 374857, offset: 131866624) wanted: 114, have: 143
> ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 21872, owner: 374857, offset: 148234240) wanted: 301, have: 302
> ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 22911, owner: 374857, offset: 148234240) wanted: 355, have: 433
> ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 21872, owner: 374857, offset: 180371456) wanted: 160, have: 161
> ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 22911, owner: 374857, offset: 180371456) wanted: 161, have: 240
> ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 21872, owner: 374857, offset: 192200704) wanted: 169, have: 170
> ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 22911, owner: 374857, offset: 192200704) wanted: 171, have: 251
> ERROR: extent[150850146304, 17522688] referencer count mismatch (root: 21872, owner: 374857, offset: 217653248) wanted: 347, have: 348
> ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 22911, owner: 374857, offset: 235175936) wanted: 1, have: 1449
> ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 21872, owner: 374857, offset: 235175936) wanted: 1, have: 556
> 
> What should I try next?
> 
> Thanks,
> Marc
> 



^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-07-02  2:02               ` Su Yue
@ 2018-07-02  3:22                 ` Marc MERLIN
  2018-07-02  6:22                   ` Su Yue
  0 siblings, 1 reply; 72+ messages in thread
From: Marc MERLIN @ 2018-07-02  3:22 UTC (permalink / raw)
  To: Su Yue; +Cc: Qu Wenruo, linux-btrfs

On Mon, Jul 02, 2018 at 10:02:33AM +0800, Su Yue wrote:
> Could you try the following dumps? They shouldn't take much time.
> 
> #btrfs inspect dump-tree -t 21872 <device> | grep -C 50 "374857 EXTENT_DATA "
> 
> #btrfs inspect dump-tree -t 22911 <device> | grep -C 50 "374857 EXTENT_DATA "

Ok, that's 29MB, so it doesn't fit on pastebin:
http://marc.merlins.org/tmp/dshelf2_inspect.txt

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-07-02  3:22                 ` Marc MERLIN
@ 2018-07-02  6:22                   ` Su Yue
  2018-07-02 14:05                     ` Marc MERLIN
  0 siblings, 1 reply; 72+ messages in thread
From: Su Yue @ 2018-07-02  6:22 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: Qu Wenruo, linux-btrfs



On 07/02/2018 11:22 AM, Marc MERLIN wrote:
> On Mon, Jul 02, 2018 at 10:02:33AM +0800, Su Yue wrote:
>> Could you try the following dumps? They shouldn't take much time.
>>
>> #btrfs inspect dump-tree -t 21872 <device> | grep -C 50 "374857 EXTENT_DATA "
>>
>> #btrfs inspect dump-tree -t 22911 <device> | grep -C 50 "374857 EXTENT_DATA "
> 
> Ok, that's 29MB, so it doesn't fit on pastebin:
> http://marc.merlins.org/tmp/dshelf2_inspect.txt
> 
Sorry Marc. After offline communication with Qu, both
of us think the filesystem is hard to repair.
The filesystem is too large to debug step by step.
Every round of check and debug is too expensive,
and it has already cost several days.

Sadly, I am afraid that you will have to recreate the filesystem
and back up your data to it again. :(

Sorry again, and thanks for your reports and patience.

Su
> Marc
> 



^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-07-02  6:22                   ` Su Yue
@ 2018-07-02 14:05                     ` Marc MERLIN
  2018-07-02 14:42                       ` Qu Wenruo
  0 siblings, 1 reply; 72+ messages in thread
From: Marc MERLIN @ 2018-07-02 14:05 UTC (permalink / raw)
  To: Su Yue; +Cc: Qu Wenruo, linux-btrfs

On Mon, Jul 02, 2018 at 02:22:20PM +0800, Su Yue wrote:
> > Ok, that's 29MB, so it doesn't fit on pastebin:
> > http://marc.merlins.org/tmp/dshelf2_inspect.txt
> > 
> Sorry Marc. After offline communication with Qu, both
> of us think the filesystem is hard to repair.
> The filesystem is too large to debug step by step.
> Every round of check and debug is too expensive,
> and it has already cost several days.
> 
> Sadly, I am afraid that you will have to recreate the filesystem
> and back up your data to it again. :(
> 
> Sorry again, and thanks for your reports and patience.

I appreciate your help. Honestly I only wanted to help you find why the
tools aren't working. Fixing filesystems by hand (and remotely via Email
on top of that), is way too time consuming like you said.

Is the btrfs design flawed in a way that repair tools just cannot repair
on their own? 
I understand that data can be lost, but I don't understand how the tools
just either keep crashing for me, go in infinite loops, or otherwise
fail to give me back a stable filesystem, even if some data is missing
after that.

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-07-02 14:05                     ` Marc MERLIN
@ 2018-07-02 14:42                       ` Qu Wenruo
  2018-07-02 15:18                         ` how to best segment a big block device in resizeable btrfs filesystems? Marc MERLIN
                                           ` (2 more replies)
  0 siblings, 3 replies; 72+ messages in thread
From: Qu Wenruo @ 2018-07-02 14:42 UTC (permalink / raw)
  To: Marc MERLIN, Su Yue; +Cc: linux-btrfs



On 2018年07月02日 22:05, Marc MERLIN wrote:
> On Mon, Jul 02, 2018 at 02:22:20PM +0800, Su Yue wrote:
>>> Ok, that's 29MB, so it doesn't fit on pastebin:
>>> http://marc.merlins.org/tmp/dshelf2_inspect.txt
>>>
>> Sorry Marc. After offline communication with Qu, both
>> of us think the filesystem is hard to repair.
>> The filesystem is too large to debug step by step.
>> Every round of check and debug is too expensive,
>> and it has already cost several days.
>>
>> Sadly, I am afraid that you will have to recreate the filesystem
>> and back up your data to it again. :(
>>
>> Sorry again, and thanks for your reports and patience.
> 
> I appreciate your help. Honestly I only wanted to help you find why the
> tools aren't working. Fixing filesystems by hand (and remotely via Email
> on top of that), is way too time consuming like you said.
> 
> Is the btrfs design flawed in a way that repair tools just cannot repair
> on their own? 

In short, and for your case: yes, you can consider the repair tools just
garbage and should not use them on any production system.

For the full answer, it depends (but for most real world cases, it's still
flawed).
We have small and crafted images as test cases, which btrfs check can
repair without any problem at all.
But such images are *SMALL* and only have *ONE* type of corruption, which
can't represent real world cases at all.

> I understand that data can be lost, but I don't understand how the tools
> just either keep crashing for me, go in infinite loops, or otherwise
> fail to give me back a stable filesystem, even if some data is missing
> after that.

There are several reasons why the repair tools can't help much here:

1) Too large fs (especially too many snapshots)
   The use case (too many snapshots and shared extents, a lot of extents
   get shared over 1000 times) is in fact a super large challenge for
   lowmem mode check/repair.
   It needs O(n^2) or even O(n^3) to check each backref, which hugely
   slows the progress and makes it hard to locate the real bug.

2) Corruption in extent tree and our objective is to mount RW
   Extent tree is almost useless if we just want to read data.
   But when we do any write, we need it, and if it goes wrong even a
   tiny bit, your fs could be damaged really badly.

   For other corruption, like some fs tree corruption, we could do
   something to discard some corrupted files, but if it's extent tree,
   we either mount RO and grab anything we have, or hope the
   almost-never-working --init-extent-tree can work (that's mostly a
   miracle).

So, I feel very sorry that we can't provide enough help for your case.

But still, we hope to provide some tips for your next build if you still
want to choose btrfs.

1) Don't keep too many snapshots.
   Really, this is the core.
   For send/receive backup, IIRC it only needs the parent subvolume
   to exist; there is no need to keep the whole history of all those
   snapshots (see the sketch below).
   Keeping the number of snapshots minimal greatly improves the
   chances of a successful repair (whether by manual patching or by
   check --repair).
   Normally I would suggest 4 hourly snapshots, 7 daily snapshots, 12
   monthly snapshots.

2) Don't keep unrelated snapshots in one btrfs.
   I totally understand that maintaining different btrfs filesystems would
   hugely add maintenance pressure, but as explained, all snapshots share
   one fragile extent tree.
   If we keep each filesystem's fragile extent tree separate from the
   others, it's less likely that a single extent tree corruption takes
   down the whole fs.
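
As a rough sketch of what 1) means in practice (the subvolume and host
names below are only examples), each send/receive cycle only has to keep
the previous snapshot around:

# take a new read-only snapshot and send it against the previous one
btrfs subvolume snapshot -r /data /data/snap-new
btrfs send -p /data/snap-prev /data/snap-new | ssh backuphost btrfs receive /backup/data
# only the latest parent has to stay on the source side
btrfs subvolume delete /data/snap-prev
mv /data/snap-new /data/snap-prev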

Thanks,
Qu

> 
> Thanks,
> Marc
> 

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: how to best segment a big block device in resizeable btrfs filesystems?
  2018-07-02 14:42                       ` Qu Wenruo
@ 2018-07-02 15:18                         ` Marc MERLIN
  2018-07-02 16:59                           ` Austin S. Hemmelgarn
                                             ` (2 more replies)
  2018-07-02 15:19                         ` So, does btrfs check lowmem take days? weeks? Marc MERLIN
  2018-07-03  0:31                         ` Chris Murphy
  2 siblings, 3 replies; 72+ messages in thread
From: Marc MERLIN @ 2018-07-02 15:18 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Su Yue, linux-btrfs

Hi Qu,

I'll split this part into a new thread:

> 2) Don't keep unrelated snapshots in one btrfs.
>    I totally understand that maintaining different btrfs filesystems would
>    hugely add maintenance pressure, but as explained, all snapshots share
>    one fragile extent tree.

Yes, I understand that this is what I should do given what you
explained.
My main problem is knowing how to segment things so I don't end up with
filesystems that are full while others are almost empty :)

Am I supposed to put LVM thin volumes underneath so that I can share
the same single 10TB raid5?

If I do this, I would have
software raid 5 < dmcrypt < bcache < lvm < btrfs
That's a lot of layers, and that's also starting to make me nervous :)

Is there any other way that does not involve me creating smaller block
devices for multiple btrfs filesystems and hope that they are the right
size because I won't be able to change it later?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-07-02 14:42                       ` Qu Wenruo
  2018-07-02 15:18                         ` how to best segment a big block device in resizeable btrfs filesystems? Marc MERLIN
@ 2018-07-02 15:19                         ` Marc MERLIN
  2018-07-02 17:08                           ` Austin S. Hemmelgarn
  2018-07-02 17:33                           ` Roman Mamedov
  2018-07-03  0:31                         ` Chris Murphy
  2 siblings, 2 replies; 72+ messages in thread
From: Marc MERLIN @ 2018-07-02 15:19 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Su Yue, linux-btrfs

Hi Qu,

thanks for the detailed and honest answer.
A few comments inline.

On Mon, Jul 02, 2018 at 10:42:40PM +0800, Qu Wenruo wrote:
> For the full answer, it depends (but for most real world cases, it's still
> flawed).
> We have small and crafted images as test cases, which btrfs check can
> repair without any problem at all.
> But such images are *SMALL* and only have *ONE* type of corruption, which
> can't represent real world cases at all.
 
right, they're just unittest images, I understand.

> 1) Too large fs (especially too many snapshots)
>    The use case (too many snapshots and shared extents, a lot of extents
>    get shared over 1000 times) is in fact a super large challenge for
>    lowmem mode check/repair.
>    It needs O(n^2) or even O(n^3) to check each backref, which hugely
>    slows the progress and makes it hard to locate the real bug.
 
So, the non lowmem version would work better, but it's a problem if it
doesn't fit in RAM.
I've always considered it a grave bug that btrfs check repair can use so
much kernel memory that it will crash the entire system. This should not
be possible.
While it won't help me here, can btrfs check be improved not to suck all
the kernel memory, and ideally even allow using swap space if the RAM is
not enough?

Is btrfs check regular mode still being maintained? I think it's still
better than lowmem, correct?

> 2) Corruption in extent tree and our objective is to mount RW
>    Extent tree is almost useless if we just want to read data.
>    But when we do any write, we need it, and if it goes wrong even a
>    tiny bit, your fs could be damaged really badly.
> 
>    For other corruption, like some fs tree corruption, we could do
>    something to discard some corrupted files, but if it's extent tree,
>    we either mount RO and grab anything we have, or hope the
>    almost-never-working --init-extent-tree can work (that's mostly a
>    miracle).
 
I understand that it's the weak point of btrfs, thanks for explaining.

> 1) Don't keep too many snapshots.
>    Really, this is the core.
>    For send/receive backup, IIRC it only needs the parent subvolume
>    to exist; there is no need to keep the whole history of all those
>    snapshots.

You are correct on history. The reason I keep history is because I may
want to recover a file from last week or 2 weeks ago after I finally
notice that it's gone. 
I have terabytes of space on the backup server, so it's easier to keep
history there than on the client which may not have enough space to keep
a month's worth of history.
As you know, back when we did tape backups, we also kept history of at
least several weeks (usually several months, but that's too much for
btrfs snapshots).

>    Keeping the number of snapshots minimal greatly improves the
>    chances of a successful repair (whether by manual patching or by
>    check --repair).
>    Normally I would suggest 4 hourly snapshots, 7 daily snapshots, 12
>    monthly snapshots.

I actually have fewer snapshots than this per filesystem, but I backup
more than 10 filesystems.
If I used as many snapshots as you recommend, that would already be 230
snapshots for 10 filesystems :)

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: how to best segment a big block device in resizeable btrfs filesystems?
  2018-07-02 15:18                         ` how to best segment a big block device in resizeable btrfs filesystems? Marc MERLIN
@ 2018-07-02 16:59                           ` Austin S. Hemmelgarn
  2018-07-02 17:34                             ` Marc MERLIN
  2018-07-03  0:51                           ` Paul Jones
  2018-07-03  1:37                           ` Qu Wenruo
  2 siblings, 1 reply; 72+ messages in thread
From: Austin S. Hemmelgarn @ 2018-07-02 16:59 UTC (permalink / raw)
  To: Marc MERLIN, Qu Wenruo; +Cc: Su Yue, linux-btrfs

On 2018-07-02 11:18, Marc MERLIN wrote:
> Hi Qu,
> 
> I'll split this part into a new thread:
> 
>> 2) Don't keep unrelated snapshots in one btrfs.
>>     I totally understand that maintaining different btrfs filesystems would
>>     hugely add maintenance pressure, but as explained, all snapshots share
>>     one fragile extent tree.
> 
> Yes, I understand that this is what I should do given what you
> explained.
> My main problem is knowing how to segment things so I don't end up with
> filesystems that are full while others are almost empty :)
> 
> Am I supposed to put LVM thin volumes underneath so that I can share
> the same single 10TB raid5?
Actually, because of the online resize ability in BTRFS, you don't 
technically _need_ to use thin provisioning here.  It makes the 
maintenance a bit easier, but it also adds a much more complicated layer 
of indirection than just doing regular volumes.
> 
> If I do this, I would have
> software raid 5 < dmcrypt < bcache < lvm < btrfs
> That's a lot of layers, and that's also starting to make me nervous :)
> 
> Is there any other way that does not involve me creating smaller block
> devices for multiple btrfs filesystems and hope that they are the right
> size because I won't be able to change it later?
You could (in theory) merge the LVM and software RAID5 layers, though 
that may make handling of the RAID5 layer a bit complicated if you 
choose to use thin provisioning (for some reason, LVM is unable to do 
on-line checks and rebuilds of RAID arrays that are acting as thin pool 
data or metadata).

Alternatively, you could increase your array size, remove the software 
RAID layer, and switch to using BTRFS in raid10 mode so that you could 
eliminate one of the layers, though that would probably reduce the 
effectiveness of bcache (you might want to get a bigger cache device if 
you do this).

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-07-02 15:19                         ` So, does btrfs check lowmem take days? weeks? Marc MERLIN
@ 2018-07-02 17:08                           ` Austin S. Hemmelgarn
  2018-07-02 17:33                           ` Roman Mamedov
  1 sibling, 0 replies; 72+ messages in thread
From: Austin S. Hemmelgarn @ 2018-07-02 17:08 UTC (permalink / raw)
  To: Marc MERLIN, Qu Wenruo; +Cc: Su Yue, linux-btrfs

On 2018-07-02 11:19, Marc MERLIN wrote:
> Hi Qu,
> 
> thanks for the detailed and honest answer.
> A few comments inline.
> 
> On Mon, Jul 02, 2018 at 10:42:40PM +0800, Qu Wenruo wrote:
>> For the full answer, it depends (but for most real world cases, it's still
>> flawed).
>> We have small and crafted images as test cases, which btrfs check can
>> repair without any problem at all.
>> But such images are *SMALL* and only have *ONE* type of corruption, which
>> can't represent real world cases at all.
>   
> right, they're just unittest images, I understand.
> 
>> 1) Too large fs (especially too many snapshots)
>>     The use case (too many snapshots and shared extents, a lot of extents
>>     get shared over 1000 times) is in fact a super large challenge for
>>     lowmem mode check/repair.
>>     It needs O(n^2) or even O(n^3) to check each backref, which hugely
>>     slows the progress and makes it hard to locate the real bug.
>   
> So, the non lowmem version would work better, but it's a problem if it
> doesn't fit in RAM.
> I've always considered it a grave bug that btrfs check repair can use so
> much kernel memory that it will crash the entire system. This should not
> be possible.
> While it won't help me here, can btrfs check be improved not to suck all
> the kernel memory, and ideally even allow using swap space if the RAM is
> not enough?
> 
> Is btrfs check regular mode still being maintained? I think it's still
> better than lowmem, correct?
> 
>> 2) Corruption in extent tree and our objective is to mount RW
>>     Extent tree is almost useless if we just want to read data.
>>     But when we do any write, we need it, and if it goes wrong even a
>>     tiny bit, your fs could be damaged really badly.
>>
>>     For other corruption, like some fs tree corruption, we could do
>>     something to discard some corrupted files, but if it's extent tree,
>>     we either mount RO and grab anything we have, or hope the
>>     almost-never-working --init-extent-tree can work (that's mostly a
>>     miracle).
>   
> I understand that it's the weak point of btrfs, thanks for explaining.
> 
>> 1) Don't keep too many snapshots.
>>     Really, this is the core.
>>     For send/receive backup, IIRC it only needs the parent subvolume
>>     to exist; there is no need to keep the whole history of all those
>>     snapshots.
> 
> You are correct on history. The reason I keep history is because I may
> want to recover a file from last week or 2 weeks ago after I finally
> notice that it's gone.
> I have terabytes of space on the backup server, so it's easier to keep
> history there than on the client which may not have enough space to keep
> a month's worth of history.
> As you know, back when we did tape backups, we also kept history of at
> least several weeks (usually several months, but that's too much for
> btrfs snapshots).
Bit of a case-study here, but it may be of interest.  We do something 
kind of similar where I work for our internal file servers.  We've got 
daily snapshots of the whole server kept on the server itself for 7 days 
(we usually see less than 5% of the total amount of data in changes on 
weekdays, and essentially 0 on weekends, so the snapshots rarely take up 
more than about 25% of the size of the live data), and then we 
additionally do daily backups which we retain for 6 months.  I've 
written up a short (albeit rather system-specific) script for recovering 
old versions of a file that first scans the snapshots, and then pulls it 
out of the backups if it's not there.  I've found this works remarkably 
well for our use case (almost all the data on the file server follows a 
WORM access pattern with most of the files being between 100kB and 100MB 
in size).
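
Conceptually, the recovery helper is little more than this (the paths and
messages here are invented, not our actual script):

#!/bin/sh
# look for the requested file in the on-server snapshots first
f="$1"
for snap in /srv/files/.snapshots/*; do
    if [ -e "$snap/$f" ]; then
        echo "found: $snap/$f"
        exit 0
    fi
done
# otherwise it has to come out of the 6-month Amanda backups
echo "$f is not in any snapshot; restore it from Amanda instead"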

We actually did try moving it all over to BTRFS for a while before we 
finally ended up with the setup we currently have, but aside from the 
whole issue with massive numbers of snapshots, we found that for us at 
least, Amanda actually outperforms BTRFS send/receive for everything 
except full backups and uses less storage space (though that last bit is 
largely because we use really aggressive compression).


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-07-02 15:19                         ` So, does btrfs check lowmem take days? weeks? Marc MERLIN
  2018-07-02 17:08                           ` Austin S. Hemmelgarn
@ 2018-07-02 17:33                           ` Roman Mamedov
  2018-07-02 17:39                             ` Marc MERLIN
  1 sibling, 1 reply; 72+ messages in thread
From: Roman Mamedov @ 2018-07-02 17:33 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: linux-btrfs

On Mon, 2 Jul 2018 08:19:03 -0700
Marc MERLIN <marc@merlins.org> wrote:

> I actually have fewer snapshots than this per filesystem, but I backup
> more than 10 filesystems.
> If I used as many snapshots as you recommend, that would already be 230
> snapshots for 10 filesystems :)

(...once again me with my rsync :)

If you didn't use send/receive, you wouldn't be required to keep a separate
snapshot trail per filesystem backed up, one trail of snapshots for the entire
backup server would be enough. Rsync everything to subdirs within one
subvolume, then do timed or event-based snapshots of it. You only need more
than one trail if you want different retention policies for different datasets
(e.g. in my case I have 91 and 31 days).
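
Roughly, and assuming the target directory is itself a subvolume (host and
path names below are invented), it is just:

rsync -aHAX --delete client1:/home/ /backup/pool/client1/
rsync -aHAX --delete client2:/home/ /backup/pool/client2/
btrfs subvolume snapshot -r /backup/pool /backup/snapshots/$(date +%Y-%m-%d)
# expired snapshots are later removed with btrfs subvolume delete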

-- 
With respect,
Roman

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: how to best segment a big block device in resizeable btrfs filesystems?
  2018-07-02 16:59                           ` Austin S. Hemmelgarn
@ 2018-07-02 17:34                             ` Marc MERLIN
  2018-07-02 18:35                               ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 72+ messages in thread
From: Marc MERLIN @ 2018-07-02 17:34 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Qu Wenruo, Su Yue, linux-btrfs

On Mon, Jul 02, 2018 at 12:59:02PM -0400, Austin S. Hemmelgarn wrote:
> > Am I supposed to put LVM thin volumes underneath so that I can share
> > the same single 10TB raid5?
>
> Actually, because of the online resize ability in BTRFS, you don't
> technically _need_ to use thin provisioning here.  It makes the maintenance
> a bit easier, but it also adds a much more complicated layer of indirection
> than just doing regular volumes.

You're right that I can use btrfs resize, but then I still need an LVM
device underneath, correct?
So, if I have 10 backup targets, I need 10 LVM LVs, I give them 10%
each of the full size available (as a guess), and then I'd have to 
- btrfs resize down one that's bigger than I need
- LVM shrink the LV
- LVM grow the other LV
- LVM resize up the other btrfs

and I think LVM resize and btrfs resize are not linked so I have to do
them separately and hope to type the right numbers each time, correct?
(or is that easier now?)
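
Just so I know what I'd be signing up for, I assume the dance would look
roughly like this (names and sizes made up), shrinking the fs before the LV
and growing the LV before the fs:

btrfs filesystem resize -510g /mnt/backup1   # shrink the fs a bit more than the LV
lvreduce -L -500G vg0/backup1
btrfs filesystem resize max /mnt/backup1     # grow the fs back to fill the smaller LV
lvextend -L +500G vg0/backup2
btrfs filesystem resize max /mnt/backup2     # give the freed space to the other fs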

I kind of liked the thin provisioning idea because it's hands off,
which is appealing. Any reason against it?

> You could (in theory) merge the LVM and software RAID5 layers, though that
> may make handling of the RAID5 layer a bit complicated if you choose to use
> thin provisioning (for some reason, LVM is unable to do on-line checks and
> rebuilds of RAID arrays that are acting as thin pool data or metadata).
 
Does LVM do built in raid5 now? Is it as good/trustworthy as mdadm
raid5?
But yeah, if it's incompatible with thin provisioning, it's not that
useful.

> Alternatively, you could increase your array size, remove the software RAID
> layer, and switch to using BTRFS in raid10 mode so that you could eliminate
> one of the layers, though that would probably reduce the effectiveness of
> bcache (you might want to get a bigger cache device if you do this).

Sadly that won't work. I have more data than will fit on raid10

Thanks for your suggestions though.
Still need to read up on whether I should do thin provisioning, or not.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-07-02 17:33                           ` Roman Mamedov
@ 2018-07-02 17:39                             ` Marc MERLIN
  0 siblings, 0 replies; 72+ messages in thread
From: Marc MERLIN @ 2018-07-02 17:39 UTC (permalink / raw)
  To: Roman Mamedov; +Cc: linux-btrfs

On Mon, Jul 02, 2018 at 10:33:09PM +0500, Roman Mamedov wrote:
> On Mon, 2 Jul 2018 08:19:03 -0700
> Marc MERLIN <marc@merlins.org> wrote:
> 
> > I actually have fewer snapshots than this per filesystem, but I backup
> > more than 10 filesystems.
> > If I used as many snapshots as you recommend, that would already be 230
> > snapshots for 10 filesystems :)
> 
> (...once again me with my rsync :)
> 
> If you didn't use send/receive, you wouldn't be required to keep a separate
> snapshot trail per filesystem backed up, one trail of snapshots for the entire
> backup server would be enough. Rsync everything to subdirs within one
> subvolume, then do timed or event-based snapshots of it. You only need more
> than one trail if you want different retention policies for different datasets
> (e.g. in my case I have 91 and 31 days).

This is exactly how I used to do backups before btrfs.
I did 

cp -al backup.olddate backup.newdate
rsync -avSH src/ backup.newdate/

You don't even need snapshots or btrfs anymore.
Also, sorry to say, but I have different data retention needs for
different backups. Some need to rotate more quickly than others, but if
you're using rsync, the method I gave above works fine at any rotation
interval you need.

It is almost as efficient as btrfs on space, but as I said, the time
penalty on all those stats for many files was what killed it for me.
If I go back to rsync backups (and I'm really unlikely to), then I'd
also go back to ext4. There would be no point in dealing with the
complexity and fragility of btrfs anymore.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: how to best segment a big block device in resizeable btrfs filesystems?
  2018-07-02 17:34                             ` Marc MERLIN
@ 2018-07-02 18:35                               ` Austin S. Hemmelgarn
  2018-07-02 19:40                                 ` Marc MERLIN
  2018-07-03  4:25                                 ` Andrei Borzenkov
  0 siblings, 2 replies; 72+ messages in thread
From: Austin S. Hemmelgarn @ 2018-07-02 18:35 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: Qu Wenruo, Su Yue, linux-btrfs

On 2018-07-02 13:34, Marc MERLIN wrote:
> On Mon, Jul 02, 2018 at 12:59:02PM -0400, Austin S. Hemmelgarn wrote:
>>> Am I supposed to put LVM thin volumes underneath so that I can share
>>> the same single 10TB raid5?
>>
>> Actually, because of the online resize ability in BTRFS, you don't
>> technically _need_ to use thin provisioning here.  It makes the maintenance
>> a bit easier, but it also adds a much more complicated layer of indirection
>> than just doing regular volumes.
> 
> You're right that I can use btrfs resize, but then I still need an LVM
> device underneath, correct?
> So, if I have 10 backup targets, I need 10 LVM LVs, I give them 10%
> each of the full size available (as a guess), and then I'd have to
> - btrfs resize down one that's bigger than I need
> - LVM shrink the LV
> - LVM grow the other LV
> - LVM resize up the other btrfs
> 
> and I think LVM resize and btrfs resize are not linked so I have to do
> them separately and hope to type the right numbers each time, correct?
> (or is that easier now?)
> 
> I kind of liked the thin provisioning idea because it's hands off,
> which is appealing. Any reason against it?
No, not currently, except that it adds a whole lot more stuff between 
BTRFS and whatever layer is below it.  That increase in what's being 
done adds some overhead (it's noticeable on 7200 RPM consumer SATA 
drives, but not on decent consumer SATA SSD's).

There used to be issues running BTRFS on top of LVM thin targets which 
had zero mode turned off, but AFAIK, all of those problems were fixed 
long ago (before 4.0).
> 
>> You could (in theory) merge the LVM and software RAID5 layers, though that
>> may make handling of the RAID5 layer a bit complicated if you choose to use
>> thin provisioning (for some reason, LVM is unable to do on-line checks and
>> rebuilds of RAID arrays that are acting as thin pool data or metadata).
>   
> Does LVM do built in raid5 now? Is it as good/trustworthy as mdadm
> >raid5?
Actually, it uses MD's RAID5 implementation as a back-end.  Same for 
RAID6, and optionally for RAID0, RAID1, and RAID10.

> But yeah, if it's incompatible with thin provisioning, it's not that
> useful.
It's technically not incompatible, just a bit of a pain.  Last time I 
tried to use it, you had to jump through hoops to repair a damaged RAID 
volume that was serving as an underlying volume in a thin pool, and it 
required keeping the thin pool offline for the entire duration of the 
rebuild.
> 
>> Alternatively, you could increase your array size, remove the software RAID
>> layer, and switch to using BTRFS in raid10 mode so that you could eliminate
>> one of the layers, though that would probably reduce the effectiveness of
>> bcache (you might want to get a bigger cache device if you do this).
> 
> Sadly that won't work. I have more data than will fit on raid10
> 
> Thanks for your suggestions though.
> Still need to read up on whether I should do thin provisioning, or not.
If you do go with thin provisioning, I would encourage you to make 
certain to call fstrim on the BTRFS volumes on a semi regular basis so 
that the thin pool doesn't get filled up with old unused blocks, 
preferably when you are 100% certain that there are no ongoing writes on 
them (trimming blocks on BTRFS gets rid of old root trees, so it's a bit 
dangerous to do it while writes are happening).
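
For example (mount points hypothetical), a weekly job run at a quiet hour
is usually enough, or systemd's fstrim.timer if you prefer that:

#!/bin/sh
# /etc/cron.weekly/fstrim-thin: return space freed by BTRFS to the thin pool
for m in /mnt/backup1 /mnt/backup2; do
    fstrim -v "$m"
done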

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: how to best segment a big block device in resizeable btrfs filesystems?
  2018-07-02 18:35                               ` Austin S. Hemmelgarn
@ 2018-07-02 19:40                                 ` Marc MERLIN
  2018-07-03  4:25                                 ` Andrei Borzenkov
  1 sibling, 0 replies; 72+ messages in thread
From: Marc MERLIN @ 2018-07-02 19:40 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Qu Wenruo, Su Yue, linux-btrfs

On Mon, Jul 02, 2018 at 02:35:19PM -0400, Austin S. Hemmelgarn wrote:
> >I kind of liked the thin provisioning idea because it's hands off,
> >which is appealing. Any reason against it?
> No, not currently, except that it adds a whole lot more stuff between 
> BTRFS and whatever layer is below it.  That increase in what's being 
> done adds some overhead (it's noticeable on 7200 RPM consumer SATA 
> drives, but not on decent consumer SATA SSD's).
> 
> There used to be issues running BTRFS on top of LVM thin targets which 
> had zero mode turned off, but AFAIK, all of those problems were fixed 
> long ago (before 4.0).

I see, thanks for the heads up.

> >Does LVM do built in raid5 now? Is it as good/trustworthy as mdadm
> >>raid5?
> Actually, it uses MD's RAID5 implementation as a back-end.  Same for 
> RAID6, and optionally for RAID0, RAID1, and RAID10.
 
Ok, that makes me feel a bit better :)

> >But yeah, if it's incompatible with thin provisioning, it's not that
> >useful.
> It's technically not incompatible, just a bit of a pain.  Last time I 
> tried to use it, you had to jump through hoops to repair a damaged RAID 
> volume that was serving as an underlying volume in a thin pool, and it 
> required keeping the thin pool offline for the entire duration of the 
> rebuild.

Argh, not good :( / thanks for the heads up.

> If you do go with thin provisioning, I would encourage you to make 
> certain to call fstrim on the BTRFS volumes on a semi regular basis so 
> that the thin pool doesn't get filled up with old unused blocks, 

That's a very good point/reminder, thanks for that. I guess it's like
running on an ssd :)

> preferably when you are 100% certain that there are no ongoing writes on 
> them (trimming blocks on BTRFS gets rid of old root trees, so it's a bit 
> dangerous to do it while writes are happening).
 
Argh, that will be harder, but I'll try.

Given what you said, it sounds like I'll still be best off with separate
layers to avoid the rebuild problem you mentioned.
So it'll be
swraid5 / dmcrypt / bcache / lvm dm thin / btrfs

Hopefully that will work well enough.
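
From what I've read so far, I expect the setup to look roughly like the
below (device names and sizes are placeholders from memory, not my final
commands):

mdadm --create /dev/md0 --level=5 --raid-devices=5 /dev/sd[b-f]1
cryptsetup luksFormat /dev/md0
cryptsetup open /dev/md0 dshelf2crypt
make-bcache -C /dev/nvme0n1p1               # SSD cache device
make-bcache -B /dev/mapper/dshelf2crypt     # backing device
echo <cset-uuid> > /sys/block/bcache0/bcache/attach   # uuid from bcache-super-show
pvcreate /dev/bcache0
vgcreate vgback /dev/bcache0
lvcreate --type thin-pool -l 90%FREE -n pool0 vgback
lvcreate -V 2T --thinpool vgback/pool0 -n backup1     # one thin LV per backup target
mkfs.btrfs /dev/vgback/backup1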

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-07-02 14:42                       ` Qu Wenruo
  2018-07-02 15:18                         ` how to best segment a big block device in resizeable btrfs filesystems? Marc MERLIN
  2018-07-02 15:19                         ` So, does btrfs check lowmem take days? weeks? Marc MERLIN
@ 2018-07-03  0:31                         ` Chris Murphy
  2018-07-03  4:22                           ` Marc MERLIN
  2 siblings, 1 reply; 72+ messages in thread
From: Chris Murphy @ 2018-07-03  0:31 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Marc MERLIN, Su Yue, Btrfs BTRFS

On Mon, Jul 2, 2018 at 8:42 AM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
> On 2018年07月02日 22:05, Marc MERLIN wrote:
>> On Mon, Jul 02, 2018 at 02:22:20PM +0800, Su Yue wrote:
>>>> Ok, that's 29MB, so it doesn't fit on pastebin:
>>>> http://marc.merlins.org/tmp/dshelf2_inspect.txt
>>>>
>>> Sorry Marc. After offline communication with Qu, both
>>> of us think the filesystem is hard to repair.
>>> The filesystem is too large to debug step by step.
>>> Every round of check and debug is too expensive,
>>> and it has already cost several days.
>>>
>>> Sadly, I am afraid that you will have to recreate the filesystem
>>> and back up your data to it again. :(
>>>
>>> Sorry again, and thanks for your reports and patience.
>>
>> I appreciate your help. Honestly I only wanted to help you find why the
>> tools aren't working. Fixing filesystems by hand (and remotely via Email
>> on top of that), is way too time consuming like you said.
>>
>> Is the btrfs design flawed in a way that repair tools just cannot repair
>> on their own?
>
> In short, and for your case: yes, you can consider the repair tools just
> garbage and should not use them on any production system.

So the idea behind journaled file systems is that journal replay
enabled mount time "repair" that's faster than an fsck. Already Btrfs
use cases with big, but not huge, file systems makes btrfs check a
problem. Either running out of memory or it takes too long. So already
it isn't scaling as well as ext4 or XFS in this regard.

So what's the future hold? It seems like the goal is that the problems
must be avoided in the first place rather than to repair them after
the fact.

Are the problem's Marc is running into understood well enough that
there can eventually be a fix, maybe even an on-disk format change,
that prevents such problems from happening in the first place?

Or does it make sense for him to be running with btrfs debug or some
subset of btrfs integrity checking mask to try to catch the problems
in the act of them happening?



-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 72+ messages in thread

* RE: how to best segment a big block device in resizeable btrfs filesystems?
  2018-07-02 15:18                         ` how to best segment a big block device in resizeable btrfs filesystems? Marc MERLIN
  2018-07-02 16:59                           ` Austin S. Hemmelgarn
@ 2018-07-03  0:51                           ` Paul Jones
  2018-07-03  4:06                             ` Marc MERLIN
  2018-07-03  1:37                           ` Qu Wenruo
  2 siblings, 1 reply; 72+ messages in thread
From: Paul Jones @ 2018-07-03  0:51 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: linux-btrfs

> -----Original Message-----
> From: linux-btrfs-owner@vger.kernel.org <linux-btrfs-
> owner@vger.kernel.org> On Behalf Of Marc MERLIN
> Sent: Tuesday, 3 July 2018 1:19 AM
> To: Qu Wenruo <quwenruo.btrfs@gmx.com>
> Cc: Su Yue <suy.fnst@cn.fujitsu.com>; linux-btrfs@vger.kernel.org
> Subject: Re: how to best segment a big block device in resizeable btrfs
> filesystems?
> 
> Hi Qu,
> 
> I'll split this part into a new thread:
> 
> > 2) Don't keep unrelated snapshots in one btrfs.
> >    I totally understand that maintaining different btrfs filesystems would
> >    hugely add maintenance pressure, but as explained, all snapshots share
> >    one fragile extent tree.
> 
> Yes, I understand that this is what I should do given what you explained.
> My main problem is knowing how to segment things so I don't end up with
> filesystems that are full while others are almost empty :)
> 
> Am I supposed to put LVM thin volumes underneath so that I can share the
> same single 10TB raid5?
> 
> If I do this, I would have
> software raid 5 < dmcrypt < bcache < lvm < btrfs That's a lot of layers, and
> that's also starting to make me nervous :)

You could combine bcache and lvm if you are happy to use dm-cache instead (which lvm uses).
I use it myself (but without thin provisioning) and it works well.


> 
> Is there any other way that does not involve me creating smaller block
> devices for multiple btrfs filesystems and hope that they are the right size
> because I won't be able to change it later?
> 
> Thanks,
> Marc
> --
> "A mouse is a device used to point at the xterm you want to type in" - A.S.R.
> Microsoft is to operating systems ....
>                                       .... what McDonalds is to gourmet cooking
> Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the
> body of a message to majordomo@vger.kernel.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: how to best segment a big block device in resizeable btrfs filesystems?
  2018-07-02 15:18                         ` how to best segment a big block device in resizeable btrfs filesystems? Marc MERLIN
  2018-07-02 16:59                           ` Austin S. Hemmelgarn
  2018-07-03  0:51                           ` Paul Jones
@ 2018-07-03  1:37                           ` Qu Wenruo
  2018-07-03  4:15                             ` Marc MERLIN
  2018-07-03  4:23                             ` Andrei Borzenkov
  2 siblings, 2 replies; 72+ messages in thread
From: Qu Wenruo @ 2018-07-03  1:37 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: Su Yue, linux-btrfs



On 2018年07月02日 23:18, Marc MERLIN wrote:
> Hi Qu,
> 
> I'll split this part into a new thread:
> 
>> 2) Don't keep unrelated snapshots in one btrfs.
>>    I totally understand that maintaining different btrfs filesystems would
>>    hugely add maintenance pressure, but as explained, all snapshots share
>>    one fragile extent tree.
> 
> Yes, I understand that this is what I should do given what you
> explained.
> My main problem is knowing how to segment things so I don't end up with
> filesystems that are full while others are almost empty :)
> 
> Am I supposed to put LVM thin volumes underneath so that I can share
> the same single 10TB raid5?
> 
> If I do this, I would have
> software raid 5 < dmcrypt < bcache < lvm < btrfs
> That's a lot of layers, and that's also starting to make me nervous :)

If you could keep the number of snapshots to a minimum (less than 10) for
each btrfs (and the number of send sources to less than 5), one big btrfs
may work in that case.

BTW, IMHO bcache is not really helping for a backup system, which is
more write-oriented.

Thanks,
Qu

> 
> Is there any other way that does not involve me creating smaller block
> devices for multiple btrfs filesystems and hope that they are the right
> size because I won't be able to change it later?
> 
> Thanks,
> Marc
> 

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: how to best segment a big block device in resizeable btrfs filesystems?
  2018-07-03  0:51                           ` Paul Jones
@ 2018-07-03  4:06                             ` Marc MERLIN
  2018-07-03  4:26                               ` Paul Jones
  0 siblings, 1 reply; 72+ messages in thread
From: Marc MERLIN @ 2018-07-03  4:06 UTC (permalink / raw)
  To: Paul Jones; +Cc: linux-btrfs

On Tue, Jul 03, 2018 at 12:51:30AM +0000, Paul Jones wrote:
> You could combine bcache and lvm if you are happy to use dm-cache instead (which lvm uses).
> I use it myself (but without thin provisioning) and it works well.

Interesting point. So, I used to use lvm and then lvm2 many years ago until
I got tired of its performance, especially as soon as I took even a
single snapshot.
But that was a long time ago now, just saying that I'm a bit rusty on LVM
itself.

That being said, if I have
raid5
dm-cache
dm-crypt
dm-thin

That's still 4 block layers under btrfs.
Am I any better off using dm-cache instead of bcache, my understanding is
that it only replaces one block layer with another one and one codebase with
another.

Mmmh, a bit of reading shows that dm-cache is now used as lvmcache, which
might change things, or not.
I'll admit that setting up and maintaining bcache is a bit of a pain, I only
used it at the time because it seemed more ready then, but we're a few years
later now.

So, what do you recommend nowadays, assuming you've used both?
(given that it's literally going to take days to recreate my array, I'd
rather do it once and the right way the first time :) )

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: how to best segment a big block device in resizeable btrfs filesystems?
  2018-07-03  1:37                           ` Qu Wenruo
@ 2018-07-03  4:15                             ` Marc MERLIN
  2018-07-03  9:55                               ` Paul Jones
  2018-07-03  4:23                             ` Andrei Borzenkov
  1 sibling, 1 reply; 72+ messages in thread
From: Marc MERLIN @ 2018-07-03  4:15 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Su Yue, linux-btrfs

On Tue, Jul 03, 2018 at 09:37:47AM +0800, Qu Wenruo wrote:
> > If I do this, I would have
> > software raid 5 < dmcrypt < bcache < lvm < btrfs
> > That's a lot of layers, and that's also starting to make me nervous :)
> 
> If you could keep the number of snapshots to a minimum (less than 10) for
> each btrfs (and the number of send sources to less than 5), one big btrfs
> may work in that case.
 
Well, we kind of discussed this already. If btrfs falls over when you reach
100 snapshots or so, and it sure seems to in my case, I won't be much better
off.
Having btrfs check --repair fail because 32GB of RAM is not enough, and it's
unable to use swap, is a big deal in my case. You also confirmed that btrfs
check lowmem does not scale to filesystems like mine, so this translates
into "if regular btrfs check repair can't fit in 32GB, I am completely out
of luck if anything happens to the filesystem"

You're correct that I could tweak my backups and snapshot rotation to get
from 250 or so down to 100, but it seems that I'll just be hoping to avoid
the problem by staying just under the limit, until I'm not, and it'll
be too late to do anything the next time I'm in trouble, putting me
right back in the same spot I'm in now.
Is all this fair to say, or did I misunderstand?

> BTW, IMHO bcache is not really helping for a backup system, which is
> more write-oriented.

That's a good point. So, what I didn't explain is that I still have some old
filesystem that do get backed up with rsync instead of btrfs send (going
into the same filesystem, but not same subvolume).
Because rsync is so painfully slow when it needs to scan both sides before
it'll even start doing any work, bcache helps there.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-07-03  0:31                         ` Chris Murphy
@ 2018-07-03  4:22                           ` Marc MERLIN
  2018-07-03  8:34                             ` Su Yue
  2018-07-03  8:50                             ` Qu Wenruo
  0 siblings, 2 replies; 72+ messages in thread
From: Marc MERLIN @ 2018-07-03  4:22 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Qu Wenruo, Su Yue, Btrfs BTRFS

On Mon, Jul 02, 2018 at 06:31:43PM -0600, Chris Murphy wrote:
> So the idea behind journaled file systems is that journal replay
> enabled mount time "repair" that's faster than an fsck. Already Btrfs
> use cases with big, but not huge, file systems makes btrfs check a
> problem. Either running out of memory or it takes too long. So already
> it isn't scaling as well as ext4 or XFS in this regard.
> 
> So what's the future hold? It seems like the goal is that the problems
> must be avoided in the first place rather than to repair them after
> the fact.
> 
> Are the problem's Marc is running into understood well enough that
> there can eventually be a fix, maybe even an on-disk format change,
> that prevents such problems from happening in the first place?
> 
> Or does it make sense for him to be running with btrfs debug or some
> subset of btrfs integrity checking mask to try to catch the problems
> in the act of them happening?

Those are all good questions.
To be fair, I cannot claim that btrfs was at fault for whatever filesystem
damage I ended up with. It's very possible that it happened due to a flaky
SATA card that kicked drives off the bus when it shouldn't have.
Sure in theory a journaling filesystem can recover from unexpected power
loss and drives dropping off at bad times, but I'm going to guess that
btrfs' complexity also means that it has data structures (extent tree?) that
need to be updated completely "or else".

I'm obviously ok with a filesystem check being necessary to recover in cases
like this, afterall I still occasionally have to run e2fsck on ext4 too, but
I'm a lot less thrilled with the btrfs situation where basically the repair
tools can either completely crash your kernel, or take days and then either
get stuck in an infinite loop or hit an algorithm that can't scale if you
have too many hardlinks/snapshots.

It sounds like there may not be a fix to this problem with the filesystem's
design, outside of "do not get there, or else".
It would even be useful for btrfs tools to start computing heuristics and
output warnings like "you have more than 100 snapshots on this filesystem,
this is not recommended, please read http://url/"
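
In the meantime, a crude version of that warning is easy enough to script
outside the tools (threshold and mount point below are arbitrary):

count=$(btrfs subvolume list -s /mnt/dshelf2 | wc -l)
if [ "$count" -gt 100 ]; then
    echo "WARNING: $count snapshots on /mnt/dshelf2, check/repair may not cope"
fi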

Qu, Su, does that sound both reasonable and doable?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: how to best segment a big block device in resizeable btrfs filesystems?
  2018-07-03  1:37                           ` Qu Wenruo
  2018-07-03  4:15                             ` Marc MERLIN
@ 2018-07-03  4:23                             ` Andrei Borzenkov
  1 sibling, 0 replies; 72+ messages in thread
From: Andrei Borzenkov @ 2018-07-03  4:23 UTC (permalink / raw)
  To: Qu Wenruo, Marc MERLIN; +Cc: Su Yue, linux-btrfs

03.07.2018 04:37, Qu Wenruo пишет:
> 
> BTW, IMHO bcache is not really helping for a backup system, which is
> more write-oriented.
> 

There is new writecache target which may help in this case.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: how to best segment a big block device in resizeable btrfs filesystems?
  2018-07-02 18:35                               ` Austin S. Hemmelgarn
  2018-07-02 19:40                                 ` Marc MERLIN
@ 2018-07-03  4:25                                 ` Andrei Borzenkov
  2018-07-03  7:15                                   ` Duncan
  1 sibling, 1 reply; 72+ messages in thread
From: Andrei Borzenkov @ 2018-07-03  4:25 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, Marc MERLIN; +Cc: Qu Wenruo, Su Yue, linux-btrfs

02.07.2018 21:35, Austin S. Hemmelgarn пишет:
> them (trimming blocks on BTRFS gets rid of old root trees, so it's a bit
> dangerous to do it while writes are happening).

Could you please elaborate? Do you mean btrfs can trim data before new
writes are actually committed to disk?

^ permalink raw reply	[flat|nested] 72+ messages in thread

* RE: how to best segment a big block device in resizeable btrfs filesystems?
  2018-07-03  4:06                             ` Marc MERLIN
@ 2018-07-03  4:26                               ` Paul Jones
  2018-07-03  5:42                                 ` Marc MERLIN
  0 siblings, 1 reply; 72+ messages in thread
From: Paul Jones @ 2018-07-03  4:26 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: linux-btrfs

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 2449 bytes --]


> -----Original Message-----
> From: Marc MERLIN <marc@merlins.org>
> Sent: Tuesday, 3 July 2018 2:07 PM
> To: Paul Jones <paul@pauljones.id.au>
> Cc: linux-btrfs@vger.kernel.org
> Subject: Re: how to best segment a big block device in resizeable btrfs
> filesystems?
> 
> On Tue, Jul 03, 2018 at 12:51:30AM +0000, Paul Jones wrote:
> > You could combine bcache and lvm if you are happy to use dm-cache
> instead (which lvm uses).
> > I use it myself (but without thin provisioning) and it works well.
> 
> Interesting point. So, I used to use lvm and then lvm2 many years ago until I
> got tired of its performance, especially as soon as I took even a single
> snapshot.
> But that was a long time ago now, just saying that I'm a bit rusty on LVM
> itself.
> 
> That being said, if I have
> raid5
> dm-cache
> dm-crypt
> dm-thin
> 
> That's still 4 block layers under btrfs.
> Am I any better off using dm-cache instead of bcache, my understanding is
> that it only replaces one block layer with another one and one codebase with
> another.

True, I didn't think of it like that.

> Mmmh, a bit of reading shows that dm-cache is now used as lvmcache, which
> might change things, or not.
> I'll admit that setting up and maintaining bcache is a bit of a pain, I only used it
> at the time because it seemed more ready then, but we're a few years later
> now.
> 
> So, what do you recommend nowadays, assuming you've used both?
> (given that it's literally going to take days to recreate my array, I'd rather do it
> once and the right way the first time :) )

I don't have any experience with this, but since it's the internet let me tell you how I'd do it anyway 😝
raid5
dm-crypt
lvm (using thin provisioning + cache)
btrfs

The cache mode on lvm requires you to set up all your volumes first, then add caching to those volumes last. If you need to modify the volume then you have to remove the cache, make your changes, then re-add the cache. It sounds like a pain, but having the cache separate from the data is quite handy.
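
For reference, the add/remove cycle is roughly this (VG, LV and device names made up):

lvcreate --type cache-pool -L 100G -n cpool vg0 /dev/fast_ssd
lvconvert --type cache --cachepool vg0/cpool vg0/backup1
# later, before resizing or otherwise reshaping the volume:
lvconvert --splitcache vg0/backup1
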
Given you are running a backup server I don't think the cache would really do much unless you enable writeback mode. If you can split up your filesystem a bit to the point that btrfs check doesn't OOM that will seriously help performance as well. Rsync might be feasible again.

Paul.


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: how to best segment a big block device in resizeable btrfs filesystems?
  2018-07-03  4:26                               ` Paul Jones
@ 2018-07-03  5:42                                 ` Marc MERLIN
  0 siblings, 0 replies; 72+ messages in thread
From: Marc MERLIN @ 2018-07-03  5:42 UTC (permalink / raw)
  To: Paul Jones; +Cc: linux-btrfs

On Tue, Jul 03, 2018 at 04:26:37AM +0000, Paul Jones wrote:
> I don't have any experience with this, but since it's the internet let me tell you how I'd do it anyway 😝

That's the spirit :)

> raid5
> dm-crypt
> lvm (using thin provisioning + cache)
> btrfs
> 
> The cache mode on lvm requires you to set up all your volumes first, then
> add caching to those volumes last. If you need to modify the volume then
> you have to remove the cache, make your changes, then re-add the cache. It
> sounds like a pain, but having the cache separate from the data is quite
> handy.

I'm ok enough with that.

> Given you are running a backup server I don't think the cache would
> really do much unless you enable writeback mode. If you can split up your
> filesystem a bit to the point that btrfs check doesn't OOM that will
> seriously help performance as well. Rsync might be feasible again.

I'm a bit wary of write caching with the issues I've had. I may do
write-through, but not writeback :)

But caching helps indeed for my older filesystems that are still backed up
via rsync because the source fs is ext4 and not btrfs.

Thanks for the suggestions
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: how to best segment a big block device in resizeable btrfs filesystems?
  2018-07-03  4:25                                 ` Andrei Borzenkov
@ 2018-07-03  7:15                                   ` Duncan
  2018-07-06  4:28                                     ` Andrei Borzenkov
  0 siblings, 1 reply; 72+ messages in thread
From: Duncan @ 2018-07-03  7:15 UTC (permalink / raw)
  To: linux-btrfs

Andrei Borzenkov posted on Tue, 03 Jul 2018 07:25:14 +0300 as excerpted:

> 02.07.2018 21:35, Austin S. Hemmelgarn пишет:
>> them (trimming blocks on BTRFS gets rid of old root trees, so it's a
>> bit dangerous to do it while writes are happening).
> 
> Could you please elaborate? Do you mean btrfs can trim data before new
> writes are actually committed to disk?

No.

But normally old roots aren't rewritten for some time simply due to odds 
(fuller filesystems will of course recycle them sooner), and the btrfs 
mount option usebackuproot (formerly recovery, until the norecovery mount 
option that parallels that of other filesystems was added and this option 
was renamed to avoid confusion) can be used to try an older root if the 
current root is too damaged to successfully mount.

But other than simply by odds not using them again immediately, btrfs has 
no special protection for those old roots, and trim/discard will recover 
them to hardware-unused as it does any other unused space, tho whether it 
simply marks them for later processing or actually processes them 
immediately is up to the individual implementation -- some do it 
immediately, killing all chances at using the backup root because it's 
already zeroed out, some don't.

In the context of the discard mount option, that can mean there's never 
any old roots available ever, as they've already been cleaned up by the 
hardware due to the discard option telling the hardware to do it.

But even not using that mount option, and simply doing the trims 
periodically, as done weekly by for instance the systemd fstrim timer and 
service units, or done manually if you prefer, obviously potentially 
wipes the old roots at that point.  If the system's effectively idle at 
the time, not much risk as the current commit is likely to represent a 
filesystem in full stasis, but if there's lots of writes going on at that 
moment *AND* the system happens to crash at just the wrong time, before 
additional commits have recreated at least a bit of root history, again, 
you'll potentially be left without any old roots for the usebackuproot 
mount option to try to fall back to, should it actually be necessary.
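
To put that concretely (a rough sketch, with the device path as a 
placeholder): if the current root turns out to be too damaged to mount, the 
fallback attempt would look like

# mount -o ro,usebackuproot <dev> /mnt

and that only has a chance of working if an older root still survives on 
media -- that is, if it hasn't already been wiped by the discard mount 
option, or by a manual/scheduled

# fstrim -v /mnt

that happened to run at just the wrong moment.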

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-07-03  4:22                           ` Marc MERLIN
@ 2018-07-03  8:34                             ` Su Yue
  2018-07-03 21:34                               ` Chris Murphy
  2018-07-03  8:50                             ` Qu Wenruo
  1 sibling, 1 reply; 72+ messages in thread
From: Su Yue @ 2018-07-03  8:34 UTC (permalink / raw)
  To: Marc MERLIN, Chris Murphy; +Cc: Qu Wenruo, Btrfs BTRFS



On 07/03/2018 12:22 PM, Marc MERLIN wrote:
> On Mon, Jul 02, 2018 at 06:31:43PM -0600, Chris Murphy wrote:
>> So the idea behind journaled file systems is that journal replay
>> enabled mount time "repair" that's faster than an fsck. Already Btrfs
>> use cases with big, but not huge, file systems makes btrfs check a
>> problem. Either running out of memory or it takes too long. So already
>> it isn't scaling as well as ext4 or XFS in this regard.
>>
>> So what's the future hold? It seems like the goal is that the problems
>> must be avoided in the first place rather than to repair them after
>> the fact.
>>
>> Are the problem's Marc is running into understood well enough that
>> there can eventually be a fix, maybe even an on-disk format change,
>> that prevents such problems from happening in the first place?
>>
>> Or does it make sense for him to be running with btrfs debug or some
>> subset of btrfs integrity checking mask to try to catch the problems
>> in the act of them happening?
> 
> Those are all good questions.
> To be fair, I cannot claim that btrfs was at fault for whatever filesystem
> damage I ended up with. It's very possible that it happened due to a flaky
> Sata card that kicked drives off the bus when it shouldn't have.
> Sure in theory a journaling filesystem can recover from unexpected power
> loss and drives dropping off at bad times, but I'm going to guess that
> btrfs' complexity also means that it has data structures (extent tree?) that
> need to be updated completely "or else".
> 
Yes, the extent tree is the hardest part for lowmem mode. I'm quite
confident the tool can deal well with file trees (which record metadata
about file and directory names and relationships).
As for the extent tree, I have little confidence due to its complexity.

> I'm obviously ok with a filesystem check being necessary to recover in cases
> like this, afterall I still occasionally have to run e2fsck on ext4 too, but
> I'm a lot less thrilled with the btrfs situation where basically the repair
> tools can either completely crash your kernel, or take days and then either
> get stuck in an infinite loop or hit an algorithm that can't scale if you
> have too many hardlinks/snapshots.
> 
It's not surprising that real-world filesystems have many snapshots.
Original mode repair eats a lot of memory, so lowmem mode was created to
save memory at the cost of time. The latter is just not robust enough to
handle complex situations.

> It sounds like there may not be a fix to this problem with the filesystem's
> design, outside of "do not get there, or else".
> It would even be useful for btrfs tools to start computing heuristics and
> output warnings like "you have more than 100 snapshots on this filesystem,
> this is not recommended, please read http://url/"
> 
> Qu, Su, does that sound both reasonable and doable?
> 
> Thanks,
> Marc
> 



^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-07-03  4:22                           ` Marc MERLIN
  2018-07-03  8:34                             ` Su Yue
@ 2018-07-03  8:50                             ` Qu Wenruo
  2018-07-03 14:38                               ` Marc MERLIN
  2018-07-03 21:46                               ` Chris Murphy
  1 sibling, 2 replies; 72+ messages in thread
From: Qu Wenruo @ 2018-07-03  8:50 UTC (permalink / raw)
  To: Marc MERLIN, Chris Murphy; +Cc: Su Yue, Btrfs BTRFS



On 2018年07月03日 12:22, Marc MERLIN wrote:
> On Mon, Jul 02, 2018 at 06:31:43PM -0600, Chris Murphy wrote:
>> So the idea behind journaled file systems is that journal replay
>> enabled mount time "repair" that's faster than an fsck. Already Btrfs
>> use cases with big, but not huge, file systems makes btrfs check a
>> problem. Either running out of memory or it takes too long. So already
>> it isn't scaling as well as ext4 or XFS in this regard.
>>
>> So what's the future hold? It seems like the goal is that the problems
>> must be avoided in the first place rather than to repair them after
>> the fact.
>>
>> Are the problem's Marc is running into understood well enough that
>> there can eventually be a fix, maybe even an on-disk format change,
>> that prevents such problems from happening in the first place?
>>
>> Or does it make sense for him to be running with btrfs debug or some
>> subset of btrfs integrity checking mask to try to catch the problems
>> in the act of them happening?
> 
> Those are all good questions.
> To be fair, I cannot claim that btrfs was at fault for whatever filesystem
> damage I ended up with. It's very possible that it happened due to a flaky
> Sata card that kicked drives off the bus when it shouldn't have.

However this still doesn't explain the problem you hit.

In theory (well, it's only theory), btrfs is fully atomic for its
transactions, even for its data (with csum and CoW).
So even if a power loss or data corruption happens between transactions,
we should still get the previous transaction.

There must be something wrong, however due to the size of the fs, and
the complexity of extent tree, I can't tell.

> Sure in theory a journaling filesystem can recover from unexpected power
> loss and drives dropping off at bad times, but I'm going to guess that
> btrfs' complexity also means that it has data structures (extent tree?) that
> need to be updated completely "or else".

I'm wondering if we have some hidden bug somewhere.
The extent tree is metadata and is protected by mandatory CoW, so it
shouldn't be corrupted unless we have a bug in the already complex
delayed reference code, or some unexpected behavior (flush/fua failure)
due to so many layers (dmcrypt + mdraid).

Anyway, if we can't reproduce it in a controlled environment (my VM with
pretty small and plain fs), it's really hard to locate the bug.

> 
> I'm obviously ok with a filesystem check being necessary to recover in cases
> like this, afterall I still occasionally have to run e2fsck on ext4 too, but
> I'm a lot less thrilled with the btrfs situation where basically the repair
> tools can either completely crash your kernel, or take days and then either
> get stuck in an infinite loop or hit an algorithm that can't scale if you
> have too many hardlinks/snapshots.

Unfortunately, that's the price paid for the super fast snapshot creation.
The tradeoff cannot be easily solved.

(Another way to implement snapshots is like LVM thin provisioning: each
time a snapshot is created you need to iterate over all allocated blocks
of the thin LV, which can't scale very well as the fs grows, but makes
the mapping management pretty easy. I think the LVM guys have done some
tricks to improve the performance.)

> 
> It sounds like there may not be a fix to this problem with the filesystem's
> design, outside of "do not get there, or else".
> It would even be useful for btrfs tools to start computing heuristics and
> output warnings like "you have more than 100 snapshots on this filesystem,
> this is not recommended, please read http://url/"

This looks pretty doable, but maybe it's better to add such a warning to
btrfs-progs (both "subvolume snapshot" and "receive").
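
(Meanwhile, a rough way to see how close an existing filesystem already is
to such a threshold -- the mount point here is just a placeholder:

# btrfs subvolume list -s /mnt | wc -l

-s limits the listing to snapshot subvolumes only.)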

Thanks,
Qu

> 
> Qu, Su, does that sound both reasonable and doable?
> 
> Thanks,
> Marc
> 

^ permalink raw reply	[flat|nested] 72+ messages in thread

* RE: how to best segment a big block device in resizeable btrfs filesystems?
  2018-07-03  4:15                             ` Marc MERLIN
@ 2018-07-03  9:55                               ` Paul Jones
  2018-07-03 11:29                                 ` Qu Wenruo
  0 siblings, 1 reply; 72+ messages in thread
From: Paul Jones @ 2018-07-03  9:55 UTC (permalink / raw)
  To: Marc MERLIN, Qu Wenruo; +Cc: Su Yue, linux-btrfs

> -----Original Message-----
> From: linux-btrfs-owner@vger.kernel.org <linux-btrfs-
> owner@vger.kernel.org> On Behalf Of Marc MERLIN
> Sent: Tuesday, 3 July 2018 2:16 PM
> To: Qu Wenruo <quwenruo.btrfs@gmx.com>
> Cc: Su Yue <suy.fnst@cn.fujitsu.com>; linux-btrfs@vger.kernel.org
> Subject: Re: how to best segment a big block device in resizeable btrfs
> filesystems?
> 
> On Tue, Jul 03, 2018 at 09:37:47AM +0800, Qu Wenruo wrote:
> > > If I do this, I would have
> > > software raid 5 < dmcrypt < bcache < lvm < btrfs That's a lot of
> > > layers, and that's also starting to make me nervous :)
> >
> > If you could keep the number of snapshots to minimal (less than 10)
> > for each btrfs (and the number of send source is less than 5), one big
> > btrfs may work in that case.
> 
> Well, we kind of discussed this already. If btrfs falls over if you reach
> 100 snapshots or so, and it sure seems to in my case, I won't be much better
> off.
> Having btrfs check --repair fail because 32GB of RAM is not enough, and it's
> unable to use swap, is a big deal in my case. You also confirmed that btrfs
> check lowmem does not scale to filesystems like mine, so this translates into
> "if regular btrfs check repair can't fit in 32GB, I am completely out of luck if
> anything happens to the filesystem"

Just out of curiosity I had a look at my backup filesystem.
vm-server /media/backup # btrfs fi us /media/backup/
Overall:
    Device size:                   5.46TiB
    Device allocated:              3.42TiB
    Device unallocated:            2.04TiB
    Device missing:                  0.00B
    Used:                          1.80TiB
    Free (estimated):              1.83TiB      (min: 1.83TiB)
    Data ratio:                       2.00
    Metadata ratio:                   2.00
    Global reserve:              512.00MiB      (used: 0.00B)

Data,RAID1: Size:1.69TiB, Used:906.26GiB
   /dev/mapper/a-backup--a         1.69TiB
   /dev/mapper/b-backup--b         1.69TiB

Metadata,RAID1: Size:19.00GiB, Used:16.90GiB
   /dev/mapper/a-backup--a        19.00GiB
   /dev/mapper/b-backup--b        19.00GiB

System,RAID1: Size:64.00MiB, Used:336.00KiB
   /dev/mapper/a-backup--a        64.00MiB
   /dev/mapper/b-backup--b        64.00MiB

Unallocated:
   /dev/mapper/a-backup--a         1.02TiB
   /dev/mapper/b-backup--b         1.02TiB

compress=zstd,space_cache=v2
202 snapshots, heavily de-duplicated
551G / 361,000 files in latest snapshot

Btrfs check normal mode took 12 mins and 11.5G ram
Lowmem mode I stopped after 4 hours, max memory usage was around 3.9G

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: how to best segment a big block device in resizeable btrfs filesystems?
  2018-07-03  9:55                               ` Paul Jones
@ 2018-07-03 11:29                                 ` Qu Wenruo
  0 siblings, 0 replies; 72+ messages in thread
From: Qu Wenruo @ 2018-07-03 11:29 UTC (permalink / raw)
  To: Paul Jones, Marc MERLIN; +Cc: Su Yue, linux-btrfs



On 2018年07月03日 17:55, Paul Jones wrote:
>> -----Original Message-----
>> From: linux-btrfs-owner@vger.kernel.org <linux-btrfs-
>> owner@vger.kernel.org> On Behalf Of Marc MERLIN
>> Sent: Tuesday, 3 July 2018 2:16 PM
>> To: Qu Wenruo <quwenruo.btrfs@gmx.com>
>> Cc: Su Yue <suy.fnst@cn.fujitsu.com>; linux-btrfs@vger.kernel.org
>> Subject: Re: how to best segment a big block device in resizeable btrfs
>> filesystems?
>>
>> On Tue, Jul 03, 2018 at 09:37:47AM +0800, Qu Wenruo wrote:
>>>> If I do this, I would have
>>>> software raid 5 < dmcrypt < bcache < lvm < btrfs That's a lot of
>>>> layers, and that's also starting to make me nervous :)
>>>
>>> If you could keep the number of snapshots to minimal (less than 10)
>>> for each btrfs (and the number of send source is less than 5), one big
>>> btrfs may work in that case.
>>
>> Well, we kind of discussed this already. If btrfs falls over if you reach
>> 100 snapshots or so, and it sure seems to in my case, I won't be much better
>> off.
>> Having btrfs check --repair fail because 32GB of RAM is not enough, and it's
>> unable to use swap, is a big deal in my case. You also confirmed that btrfs
>> check lowmem does not scale to filesystems like mine, so this translates into
>> "if regular btrfs check repair can't fit in 32GB, I am completely out of luck if
>> anything happens to the filesystem"
> 
> Just out of curiosity I had a look at my backup filesystem.
> vm-server /media/backup # btrfs fi us /media/backup/
> Overall:
>     Device size:                   5.46TiB
>     Device allocated:              3.42TiB
>     Device unallocated:            2.04TiB
>     Device missing:                  0.00B
>     Used:                          1.80TiB
>     Free (estimated):              1.83TiB      (min: 1.83TiB)
>     Data ratio:                       2.00
>     Metadata ratio:                   2.00
>     Global reserve:              512.00MiB      (used: 0.00B)
> 
> Data,RAID1: Size:1.69TiB, Used:906.26GiB

It doesn't affect how fast check runs at all, unless --check-data-csum
is specified.

And even when --check-data-csum is specified, most reads will still be
sequential, and deduped/reflinked extents won't affect the csum
verification speed.

>    /dev/mapper/a-backup--a         1.69TiB
>    /dev/mapper/b-backup--b         1.69TiB
> 
> Metadata,RAID1: Size:19.00GiB, Used:16.90GiB

This is the main factor contributing to btrfs check time.
Just consider it as the minimal amount of data btrfs check needs to read.

>    /dev/mapper/a-backup--a        19.00GiB
>    /dev/mapper/b-backup--b        19.00GiB
> 
> System,RAID1: Size:64.00MiB, Used:336.00KiB
>    /dev/mapper/a-backup--a        64.00MiB
>    /dev/mapper/b-backup--b        64.00MiB
> 
> Unallocated:
>    /dev/mapper/a-backup--a         1.02TiB
>    /dev/mapper/b-backup--b         1.02TiB
> 
> compress=zstd,space_cache=v2
> 202 snapshots, heavily de-duplicated
> 551G / 361,000 files in latest snapshot

No wonder it's so slow for lowmem mode.

> 
> Btrfs check normal mode took 12 mins and 11.5G ram
> Lowmem mode I stopped after 4 hours, max memory usage was around 3.9G

For lowmem, btrfs check will use 25% of your total memory as cache to
speed it up a little (but as you can see, it's still slow).
Maybe we could add an option to control how many bytes of cache lowmem
mode is allowed to use.

Thanks,
Qu

> 

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-07-03  8:50                             ` Qu Wenruo
@ 2018-07-03 14:38                               ` Marc MERLIN
  2018-07-03 21:46                               ` Chris Murphy
  1 sibling, 0 replies; 72+ messages in thread
From: Marc MERLIN @ 2018-07-03 14:38 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Chris Murphy, Su Yue, Btrfs BTRFS

On Tue, Jul 03, 2018 at 04:50:48PM +0800, Qu Wenruo wrote:
> > It sounds like there may not be a fix to this problem with the filesystem's
> > design, outside of "do not get there, or else".
> > It would even be useful for btrfs tools to start computing heuristics and
> > output warnings like "you have more than 100 snapshots on this filesystem,
> > this is not recommended, please read http://url/"
> 
> This looks pretty doable, but maybe it's better to add some warning at
> btrfs progs (both "subvolume snapshot" and "receive").

This is what I meant to say, correct.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-07-03  8:34                             ` Su Yue
@ 2018-07-03 21:34                               ` Chris Murphy
  2018-07-03 21:40                                 ` Marc MERLIN
  0 siblings, 1 reply; 72+ messages in thread
From: Chris Murphy @ 2018-07-03 21:34 UTC (permalink / raw)
  To: Su Yue; +Cc: Marc MERLIN, Chris Murphy, Qu Wenruo, Btrfs BTRFS

On Tue, Jul 3, 2018 at 2:34 AM, Su Yue <suy.fnst@cn.fujitsu.com> wrote:

> Yes, the extent tree is the hardest part for lowmem mode. I'm quite
> confident the tool can deal well with file trees (which record metadata
> about file and directory names and relationships).
> As for the extent tree, I have little confidence due to its complexity.

I have to ask again if there's some metadata integrity mask option Marc
should use to try to catch the corruption cause in the first place?

His use case really can't afford either mode of btrfs check. And also
check is only backward looking, it doesn't show what was happening at
the time. And for big file systems, check rapidly doesn't scale at all
anyway.

And now he's modifying his layout to avoid the problem from happening
again which makes it less likely to catch the cause, and get it fixed.
I think if he's willing to build a kernel with integrity checker
enabled, it should be considered but only if it's likely to reveal why
the problem is happening, even if it can't repair the problem once
it's happened. He's already in that situation so masked integrity
checking is no worse, at least it gives a chance to improve Btrfs
rather than it being a mystery how it got corrupt.

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-07-03 21:34                               ` Chris Murphy
@ 2018-07-03 21:40                                 ` Marc MERLIN
  2018-07-04  1:37                                   ` Su Yue
  0 siblings, 1 reply; 72+ messages in thread
From: Marc MERLIN @ 2018-07-03 21:40 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Su Yue, Qu Wenruo, Btrfs BTRFS

On Tue, Jul 03, 2018 at 03:34:45PM -0600, Chris Murphy wrote:
> On Tue, Jul 3, 2018 at 2:34 AM, Su Yue <suy.fnst@cn.fujitsu.com> wrote:
> 
> > Yes, the extent tree is the hardest part for lowmem mode. I'm quite
> > confident the tool can deal well with file trees (which record metadata
> > about file and directory names and relationships).
> > As for the extent tree, I have little confidence due to its complexity.
> 
> I have to ask again if there's some metadata integrity mask option Marc
> should use to try to catch the corruption cause in the first place?
> 
> His use case really can't afford either mode of btrfs check. And also
> check is only backward looking, it doesn't show what was happening at
> the time. And for big file systems, check rapidly doesn't scale at all
> anyway.
> 
> And now he's modifying his layout to avoid the problem from happening
> again which makes it less likely to catch the cause, and get it fixed.
> I think if he's willing to build a kernel with integrity checker
> enabled, it should be considered but only if it's likely to reveal why
> the problem is happening, even if it can't repair the problem once
> it's happened. He's already in that situation so masked integrity
> checking is no worse, at least it gives a chance to improve Btrfs
> rather than it being a mystery how it got corrupt.

Yeah, I'm fine waiting a few more days with this down and gathering data if
that helps.
But due to the size, a full btrfs image may be a bit larger than we
want, not counting some confidential data in some filenames.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-07-03  8:50                             ` Qu Wenruo
  2018-07-03 14:38                               ` Marc MERLIN
@ 2018-07-03 21:46                               ` Chris Murphy
  2018-07-03 22:00                                 ` Marc MERLIN
  1 sibling, 1 reply; 72+ messages in thread
From: Chris Murphy @ 2018-07-03 21:46 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Marc MERLIN, Chris Murphy, Su Yue, Btrfs BTRFS

On Tue, Jul 3, 2018 at 2:50 AM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
> There must be something wrong, however due to the size of the fs, and
> the complexity of extent tree, I can't tell.

Right, which is why I'm asking if any of the metadata integrity
checker mask options might reveal what's going wrong?

I guess the big issues are:
a. compile kernel with CONFIG_BTRFS_FS_CHECK_INTEGRITY=y is necessary
b. it can come with a high resource burden depending on the mask and
where the log is being written (write system logs to a different file
system for sure)
c. the granularity offered in the integrity checker might not be enough.
d. it might take a while after corruption is injected before it is
noticed and flagged.

So it might be pointless, no idea.
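
(For reference, and hedging a bit since I'm going from memory: with
CONFIG_BTRFS_FS_CHECK_INTEGRITY=y the checker is enabled per mount via the
check_int family of options, roughly

# mount -o check_int,check_int_print_mask=<mask> <dev> /mnt        # metadata only
# mount -o check_int_data,check_int_print_mask=<mask> <dev> /mnt   # metadata + data, much heavier

where <mask> is a placeholder; the individual print mask bits are defined
in fs/btrfs/check-integrity.c.)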


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-07-03 21:46                               ` Chris Murphy
@ 2018-07-03 22:00                                 ` Marc MERLIN
  2018-07-03 22:52                                   ` Qu Wenruo
  0 siblings, 1 reply; 72+ messages in thread
From: Marc MERLIN @ 2018-07-03 22:00 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Qu Wenruo, Su Yue, Btrfs BTRFS

On Tue, Jul 03, 2018 at 03:46:59PM -0600, Chris Murphy wrote:
> On Tue, Jul 3, 2018 at 2:50 AM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> >
> >
> > There must be something wrong, however due to the size of the fs, and
> > the complexity of extent tree, I can't tell.
> 
> Right, which is why I'm asking if any of the metadata integrity
> checker mask options might reveal what's going wrong?
> 
> I guess the big issues are:
> a. compile kernel with CONFIG_BTRFS_FS_CHECK_INTEGRITY=y is necessary
> b. it can come with a high resource burden depending on the mask and
> where the log is being written (write system logs to a different file
> system for sure)
> c. the granularity offered in the integrity checker might not be enough.
> d. it might take a while after corruption is injected before it is
> noticed and flagged.

Back to where I'm at right now. I'm going to delete this filesystem and
start over very soon. Tomorrow or the day after.
I'm happy to get more data off it if someone wants it for posterity, but
I indeed need to recover soon since being with a dead backup server is
not a good place to be in :)

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-07-03 22:00                                 ` Marc MERLIN
@ 2018-07-03 22:52                                   ` Qu Wenruo
  0 siblings, 0 replies; 72+ messages in thread
From: Qu Wenruo @ 2018-07-03 22:52 UTC (permalink / raw)
  To: Marc MERLIN, Chris Murphy; +Cc: Su Yue, Btrfs BTRFS



On 2018年07月04日 06:00, Marc MERLIN wrote:
> On Tue, Jul 03, 2018 at 03:46:59PM -0600, Chris Murphy wrote:
>> On Tue, Jul 3, 2018 at 2:50 AM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>>
>>>
>>> There must be something wrong, however due to the size of the fs, and
>>> the complexity of extent tree, I can't tell.
>>
>> Right, which is why I'm asking if any of the metadata integrity
>> checker mask options might reveal what's going wrong?
>>
>> I guess the big issues are:
>> a. compile kernel with CONFIG_BTRFS_FS_CHECK_INTEGRITY=y is necessary
>> b. it can come with a high resource burden depending on the mask and
>> where the log is being written (write system logs to a different file
>> system for sure)
>> c. the granularity offered in the integrity checker might not be enough.
>> d. might take a while before corruptions are injected before
>> corruption is noticed and flagged.
> 
> Back to where I'm at right now. I'm going to delete this filesystem and
> start over very soon. Tomorrow or the day after.
> I'm happy to get more data off it if someone wants it for posterity, but
> I indeed need to recover soon since being with a dead backup server is
> not a good place to be in :)

Feel free to recover asap, as the extent tree is really too large for
human to analyse manually.

Thanks,
Qu

> 
> Thanks,
> Marc
> 

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-07-03 21:40                                 ` Marc MERLIN
@ 2018-07-04  1:37                                   ` Su Yue
  0 siblings, 0 replies; 72+ messages in thread
From: Su Yue @ 2018-07-04  1:37 UTC (permalink / raw)
  To: Marc MERLIN, Chris Murphy; +Cc: Qu Wenruo, Btrfs BTRFS



On 07/04/2018 05:40 AM, Marc MERLIN wrote:
> On Tue, Jul 03, 2018 at 03:34:45PM -0600, Chris Murphy wrote:
>> On Tue, Jul 3, 2018 at 2:34 AM, Su Yue <suy.fnst@cn.fujitsu.com> wrote:
>>
>>> Yes, the extent tree is the hardest part for lowmem mode. I'm quite
>>> confident the tool can deal well with file trees (which record metadata
>>> about file and directory names and relationships).
>>> As for the extent tree, I have little confidence due to its complexity.
>>
>> I have to ask again if there's some metadata integrity mask option Marc
>> should use to try to catch the corruption cause in the first place?
>>
>> His use case really can't afford either mode of btrfs check. And also
>> check is only backward looking, it doesn't show what was happening at
>> the time. And for big file systems, check rapidly doesn't scale at all
>> anyway.
>>
>> And now he's modifying his layout to avoid the problem from happening
>> again which makes it less likely to catch the cause, and get it fixed.
>> I think if he's willing to build a kernel with integrity checker
>> enabled, it should be considered but only if it's likely to reveal why
>> the problem is happening, even if it can't repair the problem once
>> it's happened. He's already in that situation so masked integrity
>> checking is no worse, at least it gives a chance to improve Btrfs
>> rather than it being a mystery how it got corrupt.
> 
> Yeah, I'm fine waiting a few more days with this down and gathering data if
> that helps.
Thanks! I will write a special version which skips checking the wrong
extent items and prints a debug log.
It should run faster and help us locate where the check gets stuck.

Su
> But due to the size, a full btrfs image may be a bit larger than we
> want, not counting some confidential data in some filenames.
> 
> Marc
> 



^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: how to best segment a big block device in resizeable btrfs filesystems?
  2018-07-03  7:15                                   ` Duncan
@ 2018-07-06  4:28                                     ` Andrei Borzenkov
  2018-07-08  8:05                                       ` Duncan
  0 siblings, 1 reply; 72+ messages in thread
From: Andrei Borzenkov @ 2018-07-06  4:28 UTC (permalink / raw)
  To: Duncan, linux-btrfs

03.07.2018 10:15, Duncan пишет:
> Andrei Borzenkov posted on Tue, 03 Jul 2018 07:25:14 +0300 as excerpted:
> 
>> 02.07.2018 21:35, Austin S. Hemmelgarn пишет:
>>> them (trimming blocks on BTRFS gets rid of old root trees, so it's a
>>> bit dangerous to do it while writes are happening).
>>
>> Could you please elaborate? Do you mean btrfs can trim data before new
>> writes are actually committed to disk?
> 
> No.
> 
> But normally old roots aren't rewritten for some time simply due to odds 
> (fuller filesystems will of course recycle them sooner), and the btrfs 
> mount option usebackuproot (formerly recovery, until the norecovery mount 
> option that parallels that of other filesystems was added and this option 
> was renamed to avoid confusion) can be used to try an older root if the 
> current root is too damaged to successfully mount.
> 
> But other than simply by odds not using them again immediately, btrfs has
> no special protection for those old roots, and trim/discard will recover 
> them to hardware-unused as it does any other unused space, tho whether it 
> simply marks them for later processing or actually processes them 
> immediately is up to the individual implementation -- some do it 
> immediately, killing all chances at using the backup root because it's 
> already zeroed out, some don't.
> 

How is it relevant to "while writes are happening"? Will trimming old
trees immediately after writes have stopped be any different? Why?

> In the context of the discard mount option, that can mean there's never 
> any old roots available ever, as they've already been cleaned up by the 
> hardware due to the discard option telling the hardware to do it.
> 
> But even not using that mount option, and simply doing the trims 
> periodically, as done weekly by for instance the systemd fstrim timer and 
> service units, or done manually if you prefer, obviously potentially 
> wipes the old roots at that point.  If the system's effectively idle at 
> the time, not much risk as the current commit is likely to represent a 
> filesystem in full stasis, but if there's lots of writes going on at that 
> moment *AND* the system happens to crash at just the wrong time, before 
> additional commits have recreated at least a bit of root history, again, 
> you'll potentially be left without any old roots for the usebackuproot 
> mount option to try to fall back to, should it actually be necessary.
> 

Sorry? You are just saying that "previous state can be discarded before
new state is committed", just more verbosely.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: how to best segment a big block device in resizeable btrfs filesystems?
  2018-07-06  4:28                                     ` Andrei Borzenkov
@ 2018-07-08  8:05                                       ` Duncan
  0 siblings, 0 replies; 72+ messages in thread
From: Duncan @ 2018-07-08  8:05 UTC (permalink / raw)
  To: linux-btrfs

Andrei Borzenkov posted on Fri, 06 Jul 2018 07:28:48 +0300 as excerpted:

> 03.07.2018 10:15, Duncan пишет:
>> Andrei Borzenkov posted on Tue, 03 Jul 2018 07:25:14 +0300 as
>> excerpted:
>> 
>>> 02.07.2018 21:35, Austin S. Hemmelgarn пишет:
>>>> them (trimming blocks on BTRFS gets rid of old root trees, so it's a
>>>> bit dangerous to do it while writes are happening).
>>>
>>> Could you please elaborate? Do you mean btrfs can trim data before new
>>> writes are actually committed to disk?
>> 
>> No.
>> 
>> But normally old roots aren't rewritten for some time simply due to
>> odds (fuller filesystems will of course recycle them sooner), and the
>> btrfs mount option usebackuproot (formerly recovery, until the
>> norecovery mount option that parallels that of other filesystems was
>> added and this option was renamed to avoid confusion) can be used to
>> try an older root if the current root is too damaged to successfully
>> mount.

>> But other than simply by odds not using them again immediately, btrfs
>> has
>> no special protection for those old roots, and trim/discard will
>> recover them to hardware-unused as it does any other unused space, tho
>> whether it simply marks them for later processing or actually processes
>> them immediately is up to the individual implementation -- some do it
>> immediately, killing all chances at using the backup root because it's
>> already zeroed out, some don't.
>> 
>> 
> How is it relevant to "while writes are happening"? Will trimming old
> trees immediately after writes have stopped be any different? Why?

Define "while writes are happening" vs. "immediately after writes have 
stopped".  How soon is "immediately", and does the writes stopped 
condition account for data that has reached the device-hardware write 
buffer (so is no longer being transmitted to the device across the bus) 
but not been actually written to media, or not?

On a reasonably quiescent system, multiple empty write cycles are likely 
to have occurred since the last write barrier, and anything in-process is 
likely to have made it to media even if software is missing a write 
barrier it needs (software bug) or the hardware lies about honoring the 
write barrier (hardware bug, allegedly sometimes deliberate on hardware 
willing to gamble with your data that a crash won't happen in a critical 
moment, a somewhat rare occurrence, in ordered to improve normal 
operation performance metrics).

On an IO-maxed system, data and write-barriers are coming down as fast as 
the system can handle them, and write-barriers become critical.  If the 
system crashes after something was supposed to get to media but didn't -- 
either because of a missing write barrier, or because the hardware/firmware 
lied about the barrier and claimed the data it was supposed to ensure was 
on-media when it wasn't -- the btrfs atomic-cow guarantees of consistent 
state at each commit go out the window.

At this point it becomes useful to have a number of previous "guaranteed 
consistent state" roots to fall back on, with the /hope/ being that at 
least /one/ of them is usably consistent.  If all but the last one are 
wiped due to trim...

When the system isn't write-maxed the write will have almost certainly 
made it regardless of whether the barrier is there or not, because 
there's enough idle time to finish the current write before another one 
comes down the pipe, so the last-written root is almost certain to be 
fine regardless of barriers, and the history of past roots doesn't matter 
even if there's a crash.

If "immediately after writes have stopped" is strictly defined as a 
condition when all writes including the btrfs commit updating the current 
root and the superblock pointers to the current root have completed, with 
no new writes coming down the pipe in the mean time that might have 
delayed a critical update if a barrier was missed, then trimming old 
roots in this state should be entirely safe, and the distinction between 
that state and the "while writes are happening" is clear.

But if "immediately after writes have stopped" is less strictly defined, 
then the distinction between that state and "while writes are happening" 
remains blurry at best, and having old roots around to fall back on in 
case a write-barrier was missed (for whatever reason, hardware or 
software) becomes a very good thing.

Of course the fact that trim/discard itself is an instruction written to 
the device in the combined command/data stream complexifies the picture 
substantially.  If those write barriers get missed who knows what state 
the new root is in, and if the old ones got erased...  But again, on a 
mostly idle system, it'll probably all "just work", because the writes 
will likely all make it to media, regardless, because there's not a bunch 
of other writes competing for limited write bandwidth and making ordering 
critical.

>> In the context of the discard mount option, that can mean there's never
>> any old roots available ever, as they've already been cleaned up by the
>> hardware due to the discard option telling the hardware to do it.
>> 
>> But even not using that mount option, and simply doing the trims
>> periodically, as done weekly by for instance the systemd fstrim timer
>> and service units, or done manually if you prefer, obviously
>> potentially wipes the old roots at that point.  If the system's
>> effectively idle at the time, not much risk as the current commit is
>> likely to represent a filesystem in full stasis, but if there's lots of
>> writes going on at that moment *AND* the system happens to crash at
>> just the wrong time, before additional commits have recreated at least
>> a bit of root history, again, you'll potentially be left without any
>> old roots for the usebackuproot mount option to try to fall back to,
>> should it actually be necessary.
>> 
>> 
> Sorry? You are just saying that "previous state can be discarded before
> new state is committed", just more verbosely.

No, it's more the new state gets committed before the old is trimmed, but 
should it turn out to be unusable (due to missing write barriers, etc, 
which is more of an issue on a write-bottlenecked system), having a 
history of old roots/states around to fall back to can be very useful.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
       [not found]                         ` <58b36f04-3094-7de0-8d5e-e06e280aac00@cn.fujitsu.com>
@ 2018-07-11  1:08                           ` Su Yue
  0 siblings, 0 replies; 72+ messages in thread
From: Su Yue @ 2018-07-11  1:08 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: Su Yue, quwenruo.btrfs, linux-btrfs



On 07/10/2018 06:53 PM, Su Yue wrote:
> 
> 
> On 07/10/2018 12:10 PM, Marc MERLIN wrote:
>> On Tue, Jul 10, 2018 at 08:56:15AM +0800, Su Yue wrote:
>>>> I'm just not clear if my FS is still damaged and btrfsck was just 
>>>> hacked to
>>>> ignore the damage it can't deal with, or whether it was able to repair
>>>> things to a consistent state.
>>>> The fact that I can mount read/write with no errors seems like a 
>>>> good sign.
>>>>
>>> Yes, a good sign. Since the extent tree is fixed, the errors left are in
>>> other trees. The worst result I can see is that writes of some files
>>> will report IO errors. This is the cost of RW.
>>
>> Ok, so we agreed that btrfs scrub won't find this, so ultimately I
>> should run normal btrfsck --repair without the special block skip code
>> you added?
>>
> Yes. Here is the normal btrfsck which skips the extent tree to save time.
> And I fixed a bug which was mentioned in another mail by Qu.
> I had no time to add progress reporting to the fs trees check though.
> https://github.com/Damenly/btrfs-progs/tree/tmp1
> 
> It may take a long time to fix the unresolved errors.
> #./btrfsck -e 2 --mode=lowmem --repair $dev
> '-e' means to skip the extent tree.
> Here is the mail. Running the above command should resolve the errors.
If no other errors occur, your FS will be good.

Please do not run the repair from the master branch, please :(.
It will ruin all the things we did in recent days.

Thanks,
Su
> Thanks
> Su
> 
>> Since I can mount the filesystem read/write though, I can probably
>> delete a lot of snapshots to help the next fsck to run.
>> I assume the number of snapshots also affects the amount of memory taken
>> by regular fsck, so maybe if I delete enough of them regular fsck
>> --repair will work again?
>>
>> Thanks,
>> Marc
>>



^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-07-10  4:55                           ` Qu Wenruo
@ 2018-07-10 10:44                             ` Su Yue
  0 siblings, 0 replies; 72+ messages in thread
From: Su Yue @ 2018-07-10 10:44 UTC (permalink / raw)
  To: Qu Wenruo, Marc MERLIN; +Cc: Su Yue, linux-btrfs



On 07/10/2018 12:55 PM, Qu Wenruo wrote:
> 
> 
> On 2018年07月10日 11:50, Marc MERLIN wrote:
>> On Tue, Jul 10, 2018 at 09:34:36AM +0800, Qu Wenruo wrote:
>>>>>>> Ok, this is where I am now:
>>>>>>> WARNING: debug: end of checking extent item[18457780273152 169 1]
>>>>>>> type: 176 offset: 2
>>>>>>> checking extent items [18457780273152/18457780273152]
>>>>>>> ERROR: errors found in extent allocation tree or chunk allocation
>>>>>>> checking fs roots
>>>>>>> ERROR: root 17592 EXTENT_DATA[25937109 4096] gap exists, expected:
>>>>>>> EXTENT_DATA[25937109 4033]
>>>
>>> The expected end is not even aligned to sectorsize.
>>>
>>> I think there is something wrong.
>>> Dump tree on this INODE would definitely help in this case.
>>>
>>> Marc, would you please try dump using the following command?
>>>
>>> # btrfs ins dump-tree -t 17592 <dev> | grep -C 40 25937109
>>   
>> Sure, there you go:
>> gargamel:~# btrfs ins dump-tree -t 17592 /dev/mapper/dshelf2  | grep -C 40 25937109
> [snip]
>> 	item 30 key (25937109 INODE_ITEM 0) itemoff 13611 itemsize 160
>> 		generation 137680 transid 137680 size 85312 nbytes 85953
>> 		block group 0 mode 100644 links 1 uid 500 gid 500 rdev 0
>> 		sequence 253 flags 0x0(none)
>> 		atime 1529023177.0 (2018-06-14 17:39:37)
>> 		ctime 1529023181.625870411 (2018-06-14 17:39:41)
>> 		mtime 1528885147.0 (2018-06-13 03:19:07)
>> 		otime 1529023159.138139719 (2018-06-14 17:39:19)
>> 	item 31 key (25937109 INODE_REF 14354867) itemoff 13559 itemsize 52
>> 		index 33627 namelen 42 name: thumb1024_112_DiveB-1_Oslob_Whaleshark.jpg
>> 	item 32 key (25937109 EXTENT_DATA 0) itemoff 11563 itemsize 1996
>> 		generation 137680 type 0 (inline)
>> 		inline extent data size 1975 ram_bytes 4033 compression 2 (lzo)
>> 	item 33 key (25937109 EXTENT_DATA 4033) itemoff 11510 itemsize 53
>> 		generation 143349 type 1 (regular)
>> 		extent data disk byte 0 nr 0
>> 		extent data offset 0 nr 63 ram 63
>> 		extent compression 0 (none)
> 
> OK this seems to be caused by btrfs check --repair.
> (According to the generation difference).

Yes, this bug is due to old kernel behavior.
I fixed it in the new version.

Thanks,
Su
> 
> So at least no data loss is caused in terms of on-disk data.
> 
> However I'm not sure if the kernel can handle it.
> Please try to read the file with caution, and see if the kernel handles it.
> (I assume that for the latest kernel, the tree-checker would detect it and
> refuse to read it.)
> 
> This needs some fix in btrfs check.
> 
> Thanks,
> Qu
> 
>> 	item 34 key (25937109 EXTENT_DATA 4096) itemoff 11457 itemsize 53
>> 		generation 137680 type 1 (regular)
>> 		extent data disk byte 1286516736 nr 4096
>> 		extent data offset 0 nr 4096 ram 4096
>> 		extent compression 0 (none)
>> 	item 35 key (25937109 EXTENT_DATA 8192) itemoff 11404 itemsize 53
>> 		generation 137680 type 1 (regular)
>> 		extent data disk byte 1286520832 nr 8192
>> 		extent data offset 0 nr 12288 ram 12288
>> 		extent compression 2 (lzo)
>> 	item 36 key (25937109 EXTENT_DATA 20480) itemoff 11351 itemsize 53
>> 		generation 137680 type 1 (regular)
>> 		extent data disk byte 4199424000 nr 65536
>> 		extent data offset 0 nr 65536 ram 65536
>> 		extent compression 0 (none)
> 
> 



^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-07-10  3:50                         ` Marc MERLIN
@ 2018-07-10  4:55                           ` Qu Wenruo
  2018-07-10 10:44                             ` Su Yue
  0 siblings, 1 reply; 72+ messages in thread
From: Qu Wenruo @ 2018-07-10  4:55 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: Su Yue, Su Yue, linux-btrfs



On 2018年07月10日 11:50, Marc MERLIN wrote:
> On Tue, Jul 10, 2018 at 09:34:36AM +0800, Qu Wenruo wrote:
>>>>>> Ok, this is where I am now:
>>>>>> WARNING: debug: end of checking extent item[18457780273152 169 1]
>>>>>> type: 176 offset: 2
>>>>>> checking extent items [18457780273152/18457780273152]
>>>>>> ERROR: errors found in extent allocation tree or chunk allocation
>>>>>> checking fs roots
>>>>>> ERROR: root 17592 EXTENT_DATA[25937109 4096] gap exists, expected:
>>>>>> EXTENT_DATA[25937109 4033]
>>
>> The expected end is not even aligned to sectorsize.
>>
>> I think there is something wrong.
>> Dump tree on this INODE would definitely help in this case.
>>
>> Marc, would you please try dump using the following command?
>>
>> # btrfs ins dump-tree -t 17592 <dev> | grep -C 40 25937109
>  
> Sure, there you go:
> gargamel:~# btrfs ins dump-tree -t 17592 /dev/mapper/dshelf2  | grep -C 40 25937109
[snip]
> 	item 30 key (25937109 INODE_ITEM 0) itemoff 13611 itemsize 160
> 		generation 137680 transid 137680 size 85312 nbytes 85953
> 		block group 0 mode 100644 links 1 uid 500 gid 500 rdev 0
> 		sequence 253 flags 0x0(none)
> 		atime 1529023177.0 (2018-06-14 17:39:37)
> 		ctime 1529023181.625870411 (2018-06-14 17:39:41)
> 		mtime 1528885147.0 (2018-06-13 03:19:07)
> 		otime 1529023159.138139719 (2018-06-14 17:39:19)
> 	item 31 key (25937109 INODE_REF 14354867) itemoff 13559 itemsize 52
> 		index 33627 namelen 42 name: thumb1024_112_DiveB-1_Oslob_Whaleshark.jpg
> 	item 32 key (25937109 EXTENT_DATA 0) itemoff 11563 itemsize 1996
> 		generation 137680 type 0 (inline)
> 		inline extent data size 1975 ram_bytes 4033 compression 2 (lzo)
> 	item 33 key (25937109 EXTENT_DATA 4033) itemoff 11510 itemsize 53
> 		generation 143349 type 1 (regular)
> 		extent data disk byte 0 nr 0
> 		extent data offset 0 nr 63 ram 63
> 		extent compression 0 (none)

OK this seems to be caused by btrfs check --repair.
(According to the generation difference).

So at least no data loss is caused in terms of on-disk data.

However I'm not sure if the kernel can handle it.
Please try to read the file with caution, and see if the kernel handles it.
(I assume that for the latest kernel, the tree-checker would detect it and
refuse to read it.)

This needs some fix in btrfs check.

Thanks,
Qu

> 	item 34 key (25937109 EXTENT_DATA 4096) itemoff 11457 itemsize 53
> 		generation 137680 type 1 (regular)
> 		extent data disk byte 1286516736 nr 4096
> 		extent data offset 0 nr 4096 ram 4096
> 		extent compression 0 (none)
> 	item 35 key (25937109 EXTENT_DATA 8192) itemoff 11404 itemsize 53
> 		generation 137680 type 1 (regular)
> 		extent data disk byte 1286520832 nr 8192
> 		extent data offset 0 nr 12288 ram 12288
> 		extent compression 2 (lzo)
> 	item 36 key (25937109 EXTENT_DATA 20480) itemoff 11351 itemsize 53
> 		generation 137680 type 1 (regular)
> 		extent data disk byte 4199424000 nr 65536
> 		extent data offset 0 nr 65536 ram 65536
> 		extent compression 0 (none)

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
       [not found]                 ` <faba0923-8d1f-5270-ba03-ce9cc484e08a@gmx.com>
@ 2018-07-10  4:00                   ` Marc MERLIN
  0 siblings, 0 replies; 72+ messages in thread
From: Marc MERLIN @ 2018-07-10  4:00 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs; +Cc: Su Yue, Su Yue

To fill in for the spectators on the list :)
Su gave me a modified version of btrfsck lowmem that was able to clean
most of my filesystem.
It's not a general case solution since it had some hardcoding specific
to my filesystem problems, but still a great success.
Email quoted below, along with responses to Qu

On Tue, Jul 10, 2018 at 09:09:33AM +0800, Qu Wenruo wrote:
> 
> 
> On 2018年07月10日 01:48, Marc MERLIN wrote:
> > Success!
> > Well done Su, this is a huge improvement to the lowmem code. It went from days to less than 3 hours.
> 
> Awesome work!
> 
> > I'll paste the logs below.
> > 
> > Questions:
> > 1) I assume I first need to delete a lot of snapshots. What is the limit in your opinion?
> > 100? 150? other?
> 
> My personal recommendation is just 20. Not 150, not even 100.
 
I see. Then, I may be forced to recreate multiple filesystems anyway.
I have about 25 btrfs send/receive relationships and I have around 10
historical snapshots for each.

In the future, can't we segment extents/snapshots per subvolume, making
subvolumes mini filesystems within the bigger filesystem?

> But snapshot deletion will take time (and it's delayed, you won't know
> if something wrong happened just after "btrfs subv delete") and even
> requires a healthy extent tree.
> If all the extent tree errors are just false alerts, that should not be a big
> problem at all.
> 
> > 
> > 2) my filesystem is somewhat misbalanced. Which balance options do you think are safe to use?
> 
> I would recommend manually checking the extent tree for BLOCK_GROUP_ITEM,
> which will tell you how big a block group is and how much space is used,
> and gives you an idea of which block groups can be relocated.
> Then use vrange= to specify the exact block group to relocate.
> 
> One example would be:
> 
> # btrfs ins dump-tree -t extent <dev> | grep -A1 BLOCK_GROUP_ITEM |\
>   tee block_group_dump
> 
> Then the output contains:
> 	item 1 key (13631488 BLOCK_GROUP_ITEM 8388608) itemoff 16206 itemsize 24
> 		block group used 262144 chunk_objectid 256 flags DATA
> 
> The "13631488" is the bytenr of the block group.
> The "8388608" is the length of the block group.
> The "262144" is the used bytes of the block group.
> 
> The less used space, the higher the priority it should get for relocation
> (and the faster it is to relocate).
> You could write a small script to do it, or there should be some tool to
> do the calculation for you.
 
I usually use something simpler:
Label: 'btrfs_boot'  uuid: e4c1daa8-9c39-4a59-b0a9-86297d397f3b
	Total devices 1 FS bytes used 30.19GiB
	devid    1 size 79.93GiB used 78.01GiB path /dev/mapper/cryptroot

This is bad: I have 30GB of data, but 78 out of 80GB is already allocated.
That suggests a balance is needed, correct?
If so, I always struggle as to what value I should give to dusage and
musage...
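
(A common conservative starting point, with the exact percentages being
nothing more than examples:

# btrfs balance start -dusage=10 /
# btrfs balance start -dusage=20 -musage=20 /

i.e. only relocate block groups that are at most 10/20% used, which is
cheap, and raise the numbers only if not enough space gets freed.)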

> And only relocate one block group each time, to avoid possible problem.
> 
> Last but not least, it's highly recommended to do the relocation
> only after unused snapshots are completely deleted.
> (Or it would be super super slow to relocate)

Thank you for the advice. Hopefully this helps someone else too, and
maybe someone can write some reallocation helper tool if I don't have the
time to do it myself.
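
As a starting point, something like this rough, untested sketch (the awk
field positions assume the dump-tree output format quoted above) would
list block groups sorted from emptiest to fullest:

# btrfs ins dump-tree -t extent <dev> | grep -A1 BLOCK_GROUP_ITEM | \
  awk '/BLOCK_GROUP_ITEM/ { start = $4; sub(/\(/, "", start);
                            len = $6; sub(/\)/, "", len) }
       /block group used/ { printf "%3d%% used  vrange=%s..%.0f\n",
                            $4 * 100 / len, start, start + len }' | \
  sort -n

and then each candidate could be relocated one at a time with something like

# btrfs balance start -dvrange=<start>..<end> /mnt

(-dvrange applies to data block groups; metadata block groups would need
-mvrange instead).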

> > 3) Should I start a scrub now (takes about 1 day) or anything else to
> > check that the filesystem is hopefully not damaged anymore?
> 
> I would normally recommend to use btrfs check, but neither mode really
> works here.
> And scrub only checks csum, doesn't check the internal cross reference
> (like content of extent tree).
> 
> Maybe Su could skip the whole extent tree check and let lowmem check
> the fs trees only; with --check-data-csum it should do a better job than
> scrub.

I will wait to hear back from Su, but I think the current situation is
that I still have some problems on my FS, they are just
1) not important enough to block mount rw (now it works again)
2) currently ignored by the modified btrfsck I have, but would cause
problems if I used real btrfsck.

Correct?

> > 
> > 4) should btrfs check reset the corrupt counter?
> > bdev /dev/mapper/dshelf2 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
> > for now, should I reset it manually?
> 
> It could be pretty easy to implement if not already implemented.

Seems like it's not, given that Su's btrfsck --repair ran to completion
and I still have corrupt set to '2' :)

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-07-10  1:34                       ` Qu Wenruo
@ 2018-07-10  3:50                         ` Marc MERLIN
  2018-07-10  4:55                           ` Qu Wenruo
  0 siblings, 1 reply; 72+ messages in thread
From: Marc MERLIN @ 2018-07-10  3:50 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Su Yue, Su Yue, linux-btrfs

On Tue, Jul 10, 2018 at 09:34:36AM +0800, Qu Wenruo wrote:
> >>>> Ok, this is where I am now:
> >>>> WARNING: debug: end of checking extent item[18457780273152 169 1]
> >>>> type: 176 offset: 2
> >>>> checking extent items [18457780273152/18457780273152]
> >>>> ERROR: errors found in extent allocation tree or chunk allocation
> >>>> checking fs roots
> >>>> ERROR: root 17592 EXTENT_DATA[25937109 4096] gap exists, expected:
> >>>> EXTENT_DATA[25937109 4033]
> 
> The expected end is not even aligned to sectorsize.
> 
> I think there is something wrong.
> Dump tree on this INODE would definitely help in this case.
> 
> Marc, would you please try dump using the following command?
> 
> # btrfs ins dump-tree -t 17592 <dev> | grep -C 40 25937109
 
Sure, there you go:
gargamel:~# btrfs ins dump-tree -t 17592 /dev/mapper/dshelf2  | grep -C 40 25937109
		extent data disk byte 3259370151936 nr 114688
		extent data offset 0 nr 131072 ram 131072
		extent compression 1 (zlib)
	item 144 key (2009526 EXTENT_DATA 1179648) itemoff 7931 itemsize 53
		generation 18462 type 1 (regular)
		extent data disk byte 3259370266624 nr 118784
		extent data offset 0 nr 131072 ram 131072
		extent compression 1 (zlib)
	item 145 key (2009526 EXTENT_DATA 1310720) itemoff 7878 itemsize 53
		generation 18462 type 1 (regular)
		extent data disk byte 3259370385408 nr 118784
		extent data offset 0 nr 131072 ram 131072
		extent compression 1 (zlib)
	item 146 key (2009526 EXTENT_DATA 1441792) itemoff 7825 itemsize 53
		generation 18462 type 1 (regular)
		extent data disk byte 3259370504192 nr 118784
		extent data offset 0 nr 131072 ram 131072
		extent compression 1 (zlib)
	item 147 key (2009526 EXTENT_DATA 1572864) itemoff 7772 itemsize 53
		generation 18462 type 1 (regular)
		extent data disk byte 3259370622976 nr 114688
		extent data offset 0 nr 131072 ram 131072
		extent compression 1 (zlib)
	item 148 key (2009526 EXTENT_DATA 1703936) itemoff 7719 itemsize 53
		generation 18462 type 1 (regular)
		extent data disk byte 3259370737664 nr 118784
		extent data offset 0 nr 131072 ram 131072
		extent compression 1 (zlib)
	item 149 key (2009526 EXTENT_DATA 1835008) itemoff 7666 itemsize 53
		generation 18462 type 1 (regular)
		extent data disk byte 3259370856448 nr 118784
		extent data offset 0 nr 131072 ram 131072
		extent compression 1 (zlib)
	item 150 key (2009526 EXTENT_DATA 1966080) itemoff 7613 itemsize 53
		generation 18462 type 1 (regular)
		extent data disk byte 3259370975232 nr 118784
		extent data offset 0 nr 131072 ram 131072
		extent compression 1 (zlib)
	item 151 key (2009526 EXTENT_DATA 2097152) itemoff 7560 itemsize 53
		generation 18462 type 1 (regular)
		extent data disk byte 3259371094016 nr 114688
		extent data offset 0 nr 131072 ram 131072
		extent compression 1 (zlib)
	item 152 key (2009526 EXTENT_DATA 2228224) itemoff 7507 itemsize 53
		generation 18462 type 1 (regular)
		extent data disk byte 3259371208704 nr 114688
		extent data offset 0 nr 131072 ram 131072
		extent compression 1 (zlib)
	item 153 key (2009526 EXTENT_DATA 2359296) itemoff 7454 itemsize 53
		generation 18462 type 1 (regular)
		extent data disk byte 3259371323392 nr 110592
		extent data offset 0 nr 131072 ram 131072
		extent compression 1 (zlib)
	item 154 key (2009526 EXTENT_DATA 2490368) itemoff 7401 itemsize 53
		generation 18462 type 1 (regular)
		extent data disk byte 3259371433984 nr 114688
		extent data offset 0 nr 131072 ram 131072
		extent compression 1 (zlib)
	item 155 key (2009526 EXTENT_DATA 2621440) itemoff 7348 itemsize 53
		generation 18462 type 1 (regular)
		extent data disk byte 3259371548672 nr 110592
		extent data offset 0 nr 131072 ram 131072
		extent compression 1 (zlib)
	item 156 key (2009526 EXTENT_DATA 2752512) itemoff 7295 itemsize 53
		generation 18462 type 1 (regular)
		extent data disk byte 3259371659264 nr 114688
		extent data offset 0 nr 131072 ram 131072
		extent compression 1 (zlib)
	item 157 key (2009526 EXTENT_DATA 2883584) itemoff 7242 itemsize 53
		generation 18462 type 1 (regular)
		extent data disk byte 3259371773952 nr 106496
		extent data offset 0 nr 131072 ram 131072
		extent compression 1 (zlib)
	item 158 key (2009526 EXTENT_DATA 3014656) itemoff 7189 itemsize 53
		generation 18462 type 1 (regular)
		extent data disk byte 3259371880448 nr 114688
		extent data offset 0 nr 131072 ram 131072
		extent compression 1 (zlib)
	item 159 key (2009526 EXTENT_DATA 3145728) itemoff 7136 itemsize 53
		generation 18462 type 1 (regular)
		extent data disk byte 3259371995136 nr 114688
--
		location key (14379106 INODE_ITEM 0) type FILE
		transid 22300 data_len 0 name_len 35
		name: thumb1024_712_20150620_EDC_Day3.jpg
	item 133 key (14354867 DIR_ITEM 729404427) itemoff 6716 itemsize 109
		location key (14379951 INODE_ITEM 0) type FILE
		transid 22301 data_len 0 name_len 79
		name: thumb1024_AllBest-Dive7-1_Dos_Amigos_Pequena-128_Dive7-1_Dos_Amigos_Pequena.jpg
	item 134 key (14354867 DIR_ITEM 729583157) itemoff 6639 itemsize 77
		location key (17112358 INODE_ITEM 0) type FILE
		transid 32826 data_len 0 name_len 47
		name: thumb1024_151_20180126_Sydney_Australia_Day.jpg
	item 135 key (14354867 DIR_ITEM 729620534) itemoff 6565 itemsize 74
		location key (17112383 INODE_ITEM 0) type FILE
		transid 32826 data_len 0 name_len 44
		name: thumb1024_185_20180127_Powerhouse_Museum.jpg
	item 136 key (14354867 DIR_ITEM 729673586) itemoff 6487 itemsize 78
		location key (15382518 INODE_ITEM 0) type FILE
		transid 22518 data_len 0 name_len 48
		name: thumb1024_144_20170209_Sapporo_Snow_Festival.jpg
	item 137 key (14354867 DIR_ITEM 729690560) itemoff 6420 itemsize 67
		location key (14375605 INODE_ITEM 0) type FILE
		transid 22299 data_len 0 name_len 37
		name: thumb1024_4114_Thu_Penguin_Dinner.jpg
	item 138 key (14354867 DIR_ITEM 729891652) itemoff 6341 itemsize 79
		location key (16747032 INODE_ITEM 0) type FILE
		transid 30141 data_len 0 name_len 49
		name: thumb1024_161_20180106_Tignes_Val_Disere_Day1.jpg
	item 139 key (14354867 DIR_ITEM 730070272) itemoff 6276 itemsize 65
		location key (16884158 INODE_ITEM 0) type FILE
		transid 30467 data_len 0 name_len 35
		name: thumb1024_118_20180117_Lausanne.jpg
	item 140 key (14354867 DIR_ITEM 730123776) itemoff 6198 itemsize 78
		location key (14366570 INODE_ITEM 0) type FILE
		transid 22294 data_len 0 name_len 48
		name: thumb1024_140_20120629_Glenwood_Springs_Day2.jpg
	item 141 key (14354867 DIR_ITEM 730385272) itemoff 6141 itemsize 57
		location key (15324008 INODE_ITEM 0) type FILE
		transid 22507 data_len 0 name_len 27
		name: thumb1024_623_BRC_After.jpg
	item 142 key (14354867 DIR_ITEM 730586073) itemoff 6069 itemsize 72
		location key (25937109 INODE_ITEM 0) type FILE
		transid 137680 data_len 0 name_len 42
		name: thumb1024_112_DiveB-1_Oslob_Whaleshark.jpg
	item 143 key (14354867 DIR_ITEM 730655025) itemoff 5998 itemsize 71
		location key (14360632 INODE_ITEM 0) type FILE
		transid 22291 data_len 0 name_len 41
		name: 2_Marativa_Island-100_Marativa_Island.jpg
	item 144 key (14354867 DIR_ITEM 731081108) itemoff 5929 itemsize 69
		location key (14370078 INODE_ITEM 0) type FILE
		transid 22295 data_len 0 name_len 39
		name: thumb1024_2139_NaturalHistoryMuseum.jpg
	item 145 key (14354867 DIR_ITEM 731116607) itemoff 5857 itemsize 72
		location key (15024560 INODE_ITEM 0) type FILE
		transid 22427 data_len 0 name_len 42
		name: thumb1024_131_20160420_Day01_Singapore.jpg
	item 146 key (14354867 DIR_ITEM 731261277) itemoff 5777 itemsize 80
		location key (15080260 INODE_ITEM 0) type FILE
		transid 22471 data_len 0 name_len 50
		name: thumb1024_166_20160723_Day12_Okayama_Hiroshima.jpg
	item 147 key (14354867 DIR_ITEM 731272028) itemoff 5702 itemsize 75
		location key (14371298 INODE_ITEM 0) type FILE
		transid 22296 data_len 0 name_len 45
		name: thumb1024_243_20141110_Day17_Wulai_Sanxia.jpg
	item 148 key (14354867 DIR_ITEM 731674484) itemoff 5636 itemsize 66
		location key (15951111 INODE_ITEM 0) type FILE
		transid 27634 data_len 0 name_len 36
		name: thumb1024_836_20171105_Hong_Kong.jpg
	item 149 key (14354867 DIR_ITEM 731720973) itemoff 5559 itemsize 77
		location key (15003569 INODE_ITEM 0) type FILE
		transid 22421 data_len 0 name_len 47
		name: thumb1024_238_20160120_Whistler_Heli_Skiing.jpg
	item 150 key (14354867 DIR_ITEM 731839840) itemoff 5494 itemsize 65
		location key (14377852 INODE_ITEM 0) type FILE
		transid 22300 data_len 0 name_len 35
		name: thumb1024_565_Glide_Down_Grouse.jpg
	item 151 key (14354867 DIR_ITEM 731977726) itemoff 5402 itemsize 92
		location key (14364355 INODE_ITEM 0) type FILE
		transid 22293 data_len 0 name_len 62
		name: thumb1024_117_20110812_ComputerHistoryMuseum_Dad_Genevieve.jpg
	item 152 key (14354867 DIR_ITEM 731982080) itemoff 5330 itemsize 72
		location key (15401745 INODE_ITEM 0) type FILE
--
		location key (25937099 INODE_ITEM 0) type FILE
		transid 137680 data_len 0 name_len 46
		name: thumb1024_111_20180529_Presidential_Museum.jpg
	item 47 key (14354867 DIR_INDEX 33609) itemoff 12751 itemsize 67
		location key (25937100 INODE_ITEM 0) type FILE
		transid 137680 data_len 0 name_len 37
		name: thumb1024_111_20180612_Cebu_Bohol.jpg
	item 48 key (14354867 DIR_INDEX 33611) itemoff 12680 itemsize 71
		location key (25937101 INODE_ITEM 0) type FILE
		transid 137680 data_len 0 name_len 41
		name: thumb1024_111_Dive1-3_Wallstreet_West.jpg
	item 49 key (14354867 DIR_INDEX 33613) itemoff 12615 itemsize 65
		location key (25937102 INODE_ITEM 0) type FILE
		transid 137680 data_len 0 name_len 35
		name: thumb1024_111_Dive7-1N_Thalatta.jpg
	item 50 key (14354867 DIR_INDEX 33615) itemoff 12549 itemsize 66
		location key (25937103 INODE_ITEM 0) type FILE
		transid 137680 data_len 0 name_len 36
		name: thumb1024_111_DiveC-1_Lighthouse.jpg
	item 51 key (14354867 DIR_INDEX 33617) itemoff 12475 itemsize 74
		location key (25937104 INODE_ITEM 0) type FILE
		transid 137680 data_len 0 name_len 44
		name: thumb1024_1126_20180518_EDC_Vegas_People.jpg
	item 52 key (14354867 DIR_INDEX 33619) itemoff 12401 itemsize 74
		location key (25937105 INODE_ITEM 0) type FILE
		transid 137680 data_len 0 name_len 44
		name: thumb1024_1127_20180518_EDC_Vegas_People.jpg
	item 53 key (14354867 DIR_INDEX 33621) itemoff 12338 itemsize 63
		location key (25937106 INODE_ITEM 0) type FILE
		transid 137680 data_len 0 name_len 33
		name: thumb1024_112_20180525_Manila.jpg
	item 54 key (14354867 DIR_INDEX 33623) itemoff 12271 itemsize 67
		location key (25937107 INODE_ITEM 0) type FILE
		transid 137680 data_len 0 name_len 37
		name: thumb1024_112_20180612_Cebu_Bohol.jpg
	item 55 key (14354867 DIR_INDEX 33625) itemoff 12205 itemsize 66
		location key (25937108 INODE_ITEM 0) type FILE
		transid 137680 data_len 0 name_len 36
		name: thumb1024_112_Dive3-3_Black_Rock.jpg
	item 56 key (14354867 DIR_INDEX 33627) itemoff 12133 itemsize 72
		location key (25937109 INODE_ITEM 0) type FILE
		transid 137680 data_len 0 name_len 42
		name: thumb1024_112_DiveB-1_Oslob_Whaleshark.jpg
	item 57 key (14354867 DIR_INDEX 33629) itemoff 12059 itemsize 74
		location key (25937110 INODE_ITEM 0) type FILE
		transid 137680 data_len 0 name_len 44
		name: thumb1024_1130_20180518_EDC_Vegas_People.jpg
	item 58 key (14354867 DIR_INDEX 33631) itemoff 11985 itemsize 74
		location key (25937111 INODE_ITEM 0) type FILE
		transid 137680 data_len 0 name_len 44
		name: thumb1024_1134_20180518_EDC_Vegas_People.jpg
	item 59 key (14354867 DIR_INDEX 33633) itemoff 11911 itemsize 74
		location key (25937112 INODE_ITEM 0) type FILE
		transid 137680 data_len 0 name_len 44
		name: thumb1024_1135_20180518_EDC_Vegas_People.jpg
	item 60 key (14354867 DIR_INDEX 33635) itemoff 11837 itemsize 74
		location key (25937113 INODE_ITEM 0) type FILE
		transid 137680 data_len 0 name_len 44
		name: thumb1024_1136_20180518_EDC_Vegas_People.jpg
	item 61 key (14354867 DIR_INDEX 33637) itemoff 11775 itemsize 62
		location key (25937114 INODE_ITEM 0) type FILE
		transid 137680 data_len 0 name_len 32
		name: thumb1024_113_20180519_Vegas.jpg
	item 62 key (14354867 DIR_INDEX 33639) itemoff 11712 itemsize 63
		location key (25937115 INODE_ITEM 0) type FILE
		transid 137680 data_len 0 name_len 33
		name: thumb1024_113_20180524_Manila.jpg
	item 63 key (14354867 DIR_INDEX 33641) itemoff 11649 itemsize 63
		location key (25937116 INODE_ITEM 0) type FILE
		transid 137680 data_len 0 name_len 33
		name: thumb1024_113_20180525_Manila.jpg
	item 64 key (14354867 DIR_INDEX 33643) itemoff 11574 itemsize 75
		location key (25937117 INODE_ITEM 0) type FILE
		transid 137680 data_len 0 name_len 45
		name: thumb1024_113_20180525_Manila_Mind_Museum.jpg
	item 65 key (14354867 DIR_INDEX 33645) itemoff 11505 itemsize 69
		location key (25937118 INODE_ITEM 0) type FILE
		transid 137680 data_len 0 name_len 39
		name: thumb1024_113_20180528_Taal_Volcano.jpg
	item 66 key (14354867 DIR_INDEX 33647) itemoff 11418 itemsize 87
		location key (25937119 INODE_ITEM 0) type FILE
--
	item 22 key (25937106 EXTENT_DATA 0) itemoff 14343 itemsize 53
		generation 137680 type 1 (regular)
		extent data disk byte 42296578048 nr 163840
		extent data offset 0 nr 163840 ram 163840
		extent compression 0 (none)
	item 23 key (25937107 INODE_ITEM 0) itemoff 14183 itemsize 160
		generation 137680 transid 137680 size 217631 nbytes 221184
		block group 0 mode 100644 links 1 uid 500 gid 500 rdev 0
		sequence 976 flags 0x0(none)
		atime 1529023119.0 (2018-06-14 17:38:39)
		ctime 1529023119.806610768 (2018-06-14 17:38:39)
		mtime 1528945892.0 (2018-06-13 20:11:32)
		otime 1529023119.802610815 (2018-06-14 17:38:39)
	item 24 key (25937107 INODE_REF 14354867) itemoff 14136 itemsize 47
		index 33623 namelen 37 name: thumb1024_112_20180612_Cebu_Bohol.jpg
	item 25 key (25937107 EXTENT_DATA 0) itemoff 14083 itemsize 53
		generation 137680 type 1 (regular)
		extent data disk byte 42297749504 nr 221184
		extent data offset 0 nr 221184 ram 221184
		extent compression 0 (none)
	item 26 key (25937108 INODE_ITEM 0) itemoff 13923 itemsize 160
		generation 137680 transid 137680 size 202071 nbytes 204800
		block group 0 mode 100644 links 1 uid 500 gid 500 rdev 0
		sequence 371 flags 0x8(NOCOMPRESS)
		atime 1529023149.0 (2018-06-14 17:39:09)
		ctime 1529023158.154151505 (2018-06-14 17:39:18)
		mtime 1528885173.0 (2018-06-13 03:19:33)
		otime 1529023119.806610768 (2018-06-14 17:38:39)
	item 27 key (25937108 INODE_REF 14354867) itemoff 13877 itemsize 46
		index 33625 namelen 36 name: thumb1024_112_Dive3-3_Black_Rock.jpg
	item 28 key (25937108 EXTENT_DATA 0) itemoff 13824 itemsize 53
		generation 137680 type 1 (regular)
		extent data disk byte 1286512640 nr 4096
		extent data offset 0 nr 4096 ram 4096
		extent compression 0 (none)
	item 29 key (25937108 EXTENT_DATA 4096) itemoff 13771 itemsize 53
		generation 137680 type 1 (regular)
		extent data disk byte 42297970688 nr 200704
		extent data offset 0 nr 200704 ram 200704
		extent compression 0 (none)
	item 30 key (25937109 INODE_ITEM 0) itemoff 13611 itemsize 160
		generation 137680 transid 137680 size 85312 nbytes 85953
		block group 0 mode 100644 links 1 uid 500 gid 500 rdev 0
		sequence 253 flags 0x0(none)
		atime 1529023177.0 (2018-06-14 17:39:37)
		ctime 1529023181.625870411 (2018-06-14 17:39:41)
		mtime 1528885147.0 (2018-06-13 03:19:07)
		otime 1529023159.138139719 (2018-06-14 17:39:19)
	item 31 key (25937109 INODE_REF 14354867) itemoff 13559 itemsize 52
		index 33627 namelen 42 name: thumb1024_112_DiveB-1_Oslob_Whaleshark.jpg
	item 32 key (25937109 EXTENT_DATA 0) itemoff 11563 itemsize 1996
		generation 137680 type 0 (inline)
		inline extent data size 1975 ram_bytes 4033 compression 2 (lzo)
	item 33 key (25937109 EXTENT_DATA 4033) itemoff 11510 itemsize 53
		generation 143349 type 1 (regular)
		extent data disk byte 0 nr 0
		extent data offset 0 nr 63 ram 63
		extent compression 0 (none)
	item 34 key (25937109 EXTENT_DATA 4096) itemoff 11457 itemsize 53
		generation 137680 type 1 (regular)
		extent data disk byte 1286516736 nr 4096
		extent data offset 0 nr 4096 ram 4096
		extent compression 0 (none)
	item 35 key (25937109 EXTENT_DATA 8192) itemoff 11404 itemsize 53
		generation 137680 type 1 (regular)
		extent data disk byte 1286520832 nr 8192
		extent data offset 0 nr 12288 ram 12288
		extent compression 2 (lzo)
	item 36 key (25937109 EXTENT_DATA 20480) itemoff 11351 itemsize 53
		generation 137680 type 1 (regular)
		extent data disk byte 4199424000 nr 65536
		extent data offset 0 nr 65536 ram 65536
		extent compression 0 (none)
	item 37 key (25937110 INODE_ITEM 0) itemoff 11191 itemsize 160
		generation 137680 transid 137680 size 135515 nbytes 139264
		block group 0 mode 100644 links 1 uid 500 gid 500 rdev 0
		sequence 145 flags 0x0(none)
		atime 1529023187.0 (2018-06-14 17:39:47)
		ctime 1529023190.13769960 (2018-06-14 17:39:50)
		mtime 1527198648.0 (2018-05-24 14:50:48)
		otime 1529023182.589858866 (2018-06-14 17:39:42)
	item 38 key (25937110 INODE_REF 14354867) itemoff 11137 itemsize 54
		index 33629 namelen 44 name: thumb1024_1130_20180518_EDC_Vegas_People.jpg
	item 39 key (25937110 EXTENT_DATA 0) itemoff 11084 itemsize 53
		generation 137680 type 1 (regular)
		extent data disk byte 34841370624 nr 139264
		extent data offset 0 nr 139264 ram 139264
		extent compression 0 (none)
	item 40 key (25937111 INODE_ITEM 0) itemoff 10924 itemsize 160
		generation 137680 transid 137680 size 151659 nbytes 155648
		block group 0 mode 100644 links 1 uid 500 gid 500 rdev 0
		sequence 479 flags 0x0(none)
		atime 1529023192.0 (2018-06-14 17:39:52)
		ctime 1529023194.61721484 (2018-06-14 17:39:54)
		mtime 1527198649.0 (2018-05-24 14:50:49)
		otime 1529023190.625762632 (2018-06-14 17:39:50)
	item 41 key (25937111 INODE_REF 14354867) itemoff 10870 itemsize 54
		index 33631 namelen 44 name: thumb1024_1134_20180518_EDC_Vegas_People.jpg
	item 42 key (25937111 EXTENT_DATA 0) itemoff 10817 itemsize 53
		generation 137680 type 1 (regular)
		extent data disk byte 37451919360 nr 155648
		extent data offset 0 nr 155648 ram 155648
		extent compression 0 (none)
	item 43 key (25937112 INODE_ITEM 0) itemoff 10657 itemsize 160
		generation 137680 transid 137680 size 81989 nbytes 81920
		block group 0 mode 100644 links 1 uid 500 gid 500 rdev 0
		sequence 265 flags 0x8(NOCOMPRESS)
		atime 1529023198.0 (2018-06-14 17:39:58)
		ctime 1529023198.929663188 (2018-06-14 17:39:58)
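
A narrower filter over the same dump isolates just the suspect inode's items; a minimal sketch reusing the command above (the -A 4 context window is an assumption, sized to cover the extent item bodies in the layout shown):

# btrfs ins dump-tree -t 17592 /dev/mapper/dshelf2 | grep -A 4 'key (25937109 '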

-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
       [not found]                   ` <20180707172114.bfc26eoahullffgg@merlins.org>
@ 2018-07-10  1:37                     ` Su Yue
  2018-07-10  1:34                       ` Qu Wenruo
  0 siblings, 1 reply; 72+ messages in thread
From: Su Yue @ 2018-07-10  1:37 UTC (permalink / raw)
  To: Marc MERLIN, Su Yue; +Cc: linux-btrfs, Qu Wenruo

[CC to linux-btrfs]

Here is the log of the wrong extent data.
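
For a quick overview of how many roots and inodes are affected before wading through the full log, the ERROR lines can be collapsed; a minimal sketch, assuming the check output was captured to a file (check.log below is only an example name):

$ grep 'gap exists' check.log | grep -o 'root [0-9]* EXTENT_DATA\[[0-9]*' | sort -u

Each surviving line names one (root, inode) pair that reported a gap.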

On 07/08/2018 01:21 AM, Marc MERLIN wrote:
> On Fri, Jul 06, 2018 at 10:56:36AM -0700, Marc MERLIN wrote:
>> On Fri, Jul 06, 2018 at 09:05:23AM -0700, Marc MERLIN wrote:
>>> Ok, this is where I am now:
>>> WARNING: debug: end of checking extent item[18457780273152 169 1] type: 176 offset: 2
>>> checking extent items [18457780273152/18457780273152]
>>> ERROR: errors found in extent allocation tree or chunk allocation
>>> checking fs roots
>>> ERROR: root 17592 EXTENT_DATA[25937109 4096] gap exists, expected: EXTENT_DATA[25937109 4033]
>>> ERROR: root 17592 EXTENT_DATA[25937109 8192] gap exists, expected: EXTENT_DATA[25937109 8129]
>>> ERROR: root 17592 EXTENT_DATA[25937109 20480] gap exists, expected: EXTENT_DATA[25937109 20417]
>>> ERROR: root 17592 EXTENT_DATA[25937493 4096] gap exists, expected: EXTENT_DATA[25937493 3349]
>>> ERROR: root 17592 EXTENT_DATA[25937493 8192] gap exists, expected: EXTENT_DATA[25937493 7445]
>>> ERROR: root 17592 EXTENT_DATA[25937493 12288] gap exists, expected: EXTENT_DATA[25937493 11541]
>>> ERROR: root 17592 EXTENT_DATA[25941335 4096] gap exists, expected: EXTENT_DATA[25941335 4091]
>>> ERROR: root 17592 EXTENT_DATA[25941335 8192] gap exists, expected: EXTENT_DATA[25941335 8187]
>>> ERROR: root 17592 EXTENT_DATA[25942002 4096] gap exists, expected: EXTENT_DATA[25942002 4093]
>>> ERROR: root 17592 EXTENT_DATA[25942790 4096] gap exists, expected: EXTENT_DATA[25942790 4094]
>>> ERROR: root 17592 EXTENT_DATA[25945819 4096] gap exists, expected: EXTENT_DATA[25945819 4093]
>>> ERROR: root 17592 EXTENT_DATA[26064834 4096] gap exists, expected: EXTENT_DATA[26064834 129]
>>> ERROR: root 17592 EXTENT_DATA[26064834 135168] gap exists, expected: EXTENT_DATA[26064834 131201]
>>> ERROR: root 17592 EXTENT_DATA[26064834 266240] gap exists, expected: EXTENT_DATA[26064834 262273]
>>> ERROR: root 17592 EXTENT_DATA[26064834 397312] gap exists, expected: EXTENT_DATA[26064834 393345]
>>> ERROR: root 17592 EXTENT_DATA[26064834 528384] gap exists, expected: EXTENT_DATA[26064834 524417]
>>> ERROR: root 17592 EXTENT_DATA[26064834 659456] gap exists, expected: EXTENT_DATA[26064834 655489]
>>> ERROR: root 17592 EXTENT_DATA[26064834 790528] gap exists, expected: EXTENT_DATA[26064834 786561]
>>> ERROR: root 17592 EXTENT_DATA[26064834 921600] gap exists, expected: EXTENT_DATA[26064834 917633]
>>> ERROR: root 17592 EXTENT_DATA[26064834 929792] gap exists, expected: EXTENT_DATA[26064834 925825]
>>> ERROR: root 17592 EXTENT_DATA[26064834 1224704] gap exists, expected: EXTENT_DATA[26064834 1220737]
>>>
>>> I'm not sure how long it's been stuck on that line. I'll watch it today.
>>
>> Ok, it's been stuck there for 2H.
> 
> Well, it's now the next day and it's finished running:
> 
> checking extent items [18457780273152/18457780273152]
> ERROR: errors found in extent allocation tree or chunk allocation
> checking fs roots
> ERROR: root 17592 EXTENT_DATA[25937109 4096] gap exists, expected: EXTENT_DATA[25937109 4033]
> ERROR: root 17592 EXTENT_DATA[25937109 8192] gap exists, expected: EXTENT_DATA[25937109 8129]
> ERROR: root 17592 EXTENT_DATA[25937109 20480] gap exists, expected: EXTENT_DATA[25937109 20417]
> ERROR: root 17592 EXTENT_DATA[25937493 4096] gap exists, expected: EXTENT_DATA[25937493 3349]
> ERROR: root 17592 EXTENT_DATA[25937493 8192] gap exists, expected: EXTENT_DATA[25937493 7445]
> ERROR: root 17592 EXTENT_DATA[25937493 12288] gap exists, expected: EXTENT_DATA[25937493 11541]
> ERROR: root 17592 EXTENT_DATA[25941335 4096] gap exists, expected: EXTENT_DATA[25941335 4091]
> ERROR: root 17592 EXTENT_DATA[25941335 8192] gap exists, expected: EXTENT_DATA[25941335 8187]
> ERROR: root 17592 EXTENT_DATA[25942002 4096] gap exists, expected: EXTENT_DATA[25942002 4093]
> ERROR: root 17592 EXTENT_DATA[25942790 4096] gap exists, expected: EXTENT_DATA[25942790 4094]
> ERROR: root 17592 EXTENT_DATA[25945819 4096] gap exists, expected: EXTENT_DATA[25945819 4093]
> ERROR: root 17592 EXTENT_DATA[26064834 4096] gap exists, expected: EXTENT_DATA[26064834 129]
> ERROR: root 17592 EXTENT_DATA[26064834 135168] gap exists, expected: EXTENT_DATA[26064834 131201]
> ERROR: root 17592 EXTENT_DATA[26064834 266240] gap exists, expected: EXTENT_DATA[26064834 262273]
> ERROR: root 17592 EXTENT_DATA[26064834 397312] gap exists, expected: EXTENT_DATA[26064834 393345]
> ERROR: root 17592 EXTENT_DATA[26064834 528384] gap exists, expected: EXTENT_DATA[26064834 524417]
> ERROR: root 17592 EXTENT_DATA[26064834 659456] gap exists, expected: EXTENT_DATA[26064834 655489]
> ERROR: root 17592 EXTENT_DATA[26064834 790528] gap exists, expected: EXTENT_DATA[26064834 786561]
> ERROR: root 17592 EXTENT_DATA[26064834 921600] gap exists, expected: EXTENT_DATA[26064834 917633]
> ERROR: root 17592 EXTENT_DATA[26064834 929792] gap exists, expected: EXTENT_DATA[26064834 925825]
> ERROR: root 17592 EXTENT_DATA[26064834 1224704] gap exists, expected: EXTENT_DATA[26064834 1220737]
> ERROR: root 21322 EXTENT_DATA[25320803 4096] gap exists, expected: EXTENT_DATA[25320803 56]
> ERROR: root 21322 EXTENT_DATA[25320803 143360] gap exists, expected: EXTENT_DATA[25320803 139320]
> ERROR: root 21322 EXTENT_DATA[25320803 151552] gap exists, expected: EXTENT_DATA[25320803 147512]
> ERROR: root 21322 EXTENT_DATA[25320803 290816] gap exists, expected: EXTENT_DATA[25320803 286776]
> ERROR: root 21322 EXTENT_DATA[25320803 294912] gap exists, expected: EXTENT_DATA[25320803 290872]
> ERROR: root 21322 EXTENT_DATA[25320803 2949120] gap exists, expected: EXTENT_DATA[25320803 2945080]
> ERROR: root 21322 EXTENT_DATA[25320803 2953216] gap exists, expected: EXTENT_DATA[25320803 2949176]
> ERROR: root 21322 EXTENT_DATA[25320803 5836800] gap exists, expected: EXTENT_DATA[25320803 5832760]
> ERROR: root 22870 EXTENT_DATA[26062114 4096] gap exists, expected: EXTENT_DATA[26062114 89]
> ERROR: root 22870 EXTENT_DATA[26062114 16384] gap exists, expected: EXTENT_DATA[26062114 12377]
> ERROR: root 22870 EXTENT_DATA[26062114 20480] gap exists, expected: EXTENT_DATA[26062114 16473]
> (many lines skipped)
> ERROR: root 23124 EXTENT_DATA[26064190 390852608] gap exists, expected: EXTENT_DATA[26064190 390848601]
> ERROR: root 23124 EXTENT_DATA[26064190 390983680] gap exists, expected: EXTENT_DATA[26064190 390979673]
> ERROR: root 23124 EXTENT_DATA[26064190 391114752] gap exists, expected: EXTENT_DATA[26064190 391110745]
> ERROR: root 23124 EXTENT_DATA[26064190 391245824] gap exists, expected: EXTENT_DATA[26064190 391241817]
> ERROR: root 23124 EXTENT_DATA[26064190 391376896] gap exists, expected: EXTENT_DATA[26064190 391372889]
> ERROR: root 23124 EXTENT_DATA[26064190 391507968] gap exists, expected: EXTENT_DATA[26064190 391503961]
> ERROR: root 23124 EXTENT_DATA[26064190 391639040] gap exists, expected: EXTENT_DATA[26064190 391635033]
> ERROR: root 23124 EXTENT_DATA[26064190 391770112] gap exists, expected: EXTENT_DATA[26064190 391766105]
> ERROR: root 23124 EXTENT_DATA[26064190 391901184] gap exists, expected: EXTENT_DATA[26064190 391897177]
> ERROR: root 23124 EXTENT_DATA[26064190 392032256] gap exists, expected: EXTENT_DATA[26064190 392028249]
> ERROR: root 23124 EXTENT_DATA[26064190 392163328] gap exists, expected: EXTENT_DATA[26064190 392159321]
> ERROR: root 23124 EXTENT_DATA[26064190 392294400] gap exists, expected: EXTENT_DATA[26064190 392290393]
> ERROR: root 23124 EXTENT_DATA[26064190 392425472] gap exists, expected: EXTENT_DATA[26064190 392421465]
> ERROR: root 23124 EXTENT_DATA[26064190 392556544] gap exists, expected: EXTENT_DATA[26064190 392552537]
> ERROR: root 23124 EXTENT_DATA[26064190 392687616] gap exists, expected: EXTENT_DATA[26064190 392683609]
> ERROR: root 23124 EXTENT_DATA[26064190 392818688] gap exists, expected: EXTENT_DATA[26064190 392814681]
> ERROR: root 23124 EXTENT_DATA[26064190 392949760] gap exists, expected: EXTENT_DATA[26064190 392945753]
> ERROR: root 23186 EXTENT_DATA[26064834 4096] gap exists, expected: EXTENT_DATA[26064834 129]
> ERROR: root 23186 EXTENT_DATA[26064834 135168] gap exists, expected: EXTENT_DATA[26064834 131201]
> ERROR: root 23186 EXTENT_DATA[26064834 266240] gap exists, expected: EXTENT_DATA[26064834 262273]
> ERROR: root 23186 EXTENT_DATA[26064834 397312] gap exists, expected: EXTENT_DATA[26064834 393345]
> ERROR: root 23186 EXTENT_DATA[26064834 528384] gap exists, expected: EXTENT_DATA[26064834 524417]
> ERROR: root 23186 EXTENT_DATA[26064834 659456] gap exists, expected: EXTENT_DATA[26064834 655489]
> ERROR: root 23186 EXTENT_DATA[26064834 790528] gap exists, expected: EXTENT_DATA[26064834 786561]
> ERROR: root 23186 EXTENT_DATA[26064834 921600] gap exists, expected: EXTENT_DATA[26064834 917633]
> ERROR: root 23186 EXTENT_DATA[26064834 929792] gap exists, expected: EXTENT_DATA[26064834 925825]
> ERROR: root 23186 EXTENT_DATA[26064834 1224704] gap exists, expected: EXTENT_DATA[26064834 1220737]
> ERROR: errors found in fs roots
> cache and super generation don't match, space cache will be invalidated
> found 13697056956416 bytes used, error(s) found
> total csum bytes: 0
> total tree bytes: 10282598400
> total fs tree bytes: 0
> total extent tree bytes: 10282598400
> btree space waste bytes: 2742975592
> file data blocks allocated: 0
>   referenced 0
> 
> 
> What do I do next?
> 
> Thanks,
> Marc
> 



^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: So, does btrfs check lowmem take days? weeks?
  2018-07-10  1:37                     ` Su Yue
@ 2018-07-10  1:34                       ` Qu Wenruo
  2018-07-10  3:50                         ` Marc MERLIN
  0 siblings, 1 reply; 72+ messages in thread
From: Qu Wenruo @ 2018-07-10  1:34 UTC (permalink / raw)
  To: Su Yue, Marc MERLIN, Su Yue; +Cc: linux-btrfs



On 2018年07月10日 09:37, Su Yue wrote:
> [CC to linux-btrfs]
> 
> Here is the log of the wrong extent data.
> 
> On 07/08/2018 01:21 AM, Marc MERLIN wrote:
>> On Fri, Jul 06, 2018 at 10:56:36AM -0700, Marc MERLIN wrote:
>>> On Fri, Jul 06, 2018 at 09:05:23AM -0700, Marc MERLIN wrote:
>>>> Ok, this is where I am now:
>>>> WARNING: debug: end of checking extent item[18457780273152 169 1]
>>>> type: 176 offset: 2
>>>> checking extent items [18457780273152/18457780273152]
>>>> ERROR: errors found in extent allocation tree or chunk allocation
>>>> checking fs roots
>>>> ERROR: root 17592 EXTENT_DATA[25937109 4096] gap exists, expected:
>>>> EXTENT_DATA[25937109 4033]

The expected end is not even aligned to the sectorsize.

I think something is wrong here.
Dumping the tree for this inode would definitely help in this case.

Marc, would you please try a dump with the following command?

# btrfs ins dump-tree -t 17592 <dev> | grep -C 40 25937109
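
The misalignment itself is easy to confirm by hand; a quick sanity check with shell arithmetic, assuming the usual 4096-byte sectorsize (which the 4096/8192/20480 offsets in the report suggest):

$ echo $((4033 % 4096))
4033

A regular extent boundary should land on a multiple of the sectorsize, so an expected end of 4033 is what makes the report look suspicious.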

Thanks,
Qu

>>>> ERROR: root 17592 EXTENT_DATA[25937109 8192] gap exists, expected:
>>>> EXTENT_DATA[25937109 8129]
>>>> ERROR: root 17592 EXTENT_DATA[25937109 20480] gap exists, expected:
>>>> EXTENT_DATA[25937109 20417]
>>>> ERROR: root 17592 EXTENT_DATA[25937493 4096] gap exists, expected:
>>>> EXTENT_DATA[25937493 3349]
>>>> ERROR: root 17592 EXTENT_DATA[25937493 8192] gap exists, expected:
>>>> EXTENT_DATA[25937493 7445]
>>>> ERROR: root 17592 EXTENT_DATA[25937493 12288] gap exists, expected:
>>>> EXTENT_DATA[25937493 11541]
>>>> ERROR: root 17592 EXTENT_DATA[25941335 4096] gap exists, expected:
>>>> EXTENT_DATA[25941335 4091]
>>>> ERROR: root 17592 EXTENT_DATA[25941335 8192] gap exists, expected:
>>>> EXTENT_DATA[25941335 8187]
>>>> ERROR: root 17592 EXTENT_DATA[25942002 4096] gap exists, expected:
>>>> EXTENT_DATA[25942002 4093]
>>>> ERROR: root 17592 EXTENT_DATA[25942790 4096] gap exists, expected:
>>>> EXTENT_DATA[25942790 4094]
>>>> ERROR: root 17592 EXTENT_DATA[25945819 4096] gap exists, expected:
>>>> EXTENT_DATA[25945819 4093]
>>>> ERROR: root 17592 EXTENT_DATA[26064834 4096] gap exists, expected:
>>>> EXTENT_DATA[26064834 129]
>>>> ERROR: root 17592 EXTENT_DATA[26064834 135168] gap exists, expected:
>>>> EXTENT_DATA[26064834 131201]
>>>> ERROR: root 17592 EXTENT_DATA[26064834 266240] gap exists, expected:
>>>> EXTENT_DATA[26064834 262273]
>>>> ERROR: root 17592 EXTENT_DATA[26064834 397312] gap exists, expected:
>>>> EXTENT_DATA[26064834 393345]
>>>> ERROR: root 17592 EXTENT_DATA[26064834 528384] gap exists, expected:
>>>> EXTENT_DATA[26064834 524417]
>>>> ERROR: root 17592 EXTENT_DATA[26064834 659456] gap exists, expected:
>>>> EXTENT_DATA[26064834 655489]
>>>> ERROR: root 17592 EXTENT_DATA[26064834 790528] gap exists, expected:
>>>> EXTENT_DATA[26064834 786561]
>>>> ERROR: root 17592 EXTENT_DATA[26064834 921600] gap exists, expected:
>>>> EXTENT_DATA[26064834 917633]
>>>> ERROR: root 17592 EXTENT_DATA[26064834 929792] gap exists, expected:
>>>> EXTENT_DATA[26064834 925825]
>>>> ERROR: root 17592 EXTENT_DATA[26064834 1224704] gap exists,
>>>> expected: EXTENT_DATA[26064834 1220737]
>>>>
>>>> I'm not sure how long it's been stuck on that line. I'll watch it
>>>> today.
>>>
>>> Ok, it's been stuck there for 2H.
>>
>> Well, it's now the next day and it's finished running:
>>
>> checking extent items [18457780273152/18457780273152]
>> ERROR: errors found in extent allocation tree or chunk allocation
>> checking fs roots
>> ERROR: root 17592 EXTENT_DATA[25937109 4096] gap exists, expected:
>> EXTENT_DATA[25937109 4033]
>> ERROR: root 17592 EXTENT_DATA[25937109 8192] gap exists, expected:
>> EXTENT_DATA[25937109 8129]
>> ERROR: root 17592 EXTENT_DATA[25937109 20480] gap exists, expected:
>> EXTENT_DATA[25937109 20417]
>> ERROR: root 17592 EXTENT_DATA[25937493 4096] gap exists, expected:
>> EXTENT_DATA[25937493 3349]
>> ERROR: root 17592 EXTENT_DATA[25937493 8192] gap exists, expected:
>> EXTENT_DATA[25937493 7445]
>> ERROR: root 17592 EXTENT_DATA[25937493 12288] gap exists, expected:
>> EXTENT_DATA[25937493 11541]
>> ERROR: root 17592 EXTENT_DATA[25941335 4096] gap exists, expected:
>> EXTENT_DATA[25941335 4091]
>> ERROR: root 17592 EXTENT_DATA[25941335 8192] gap exists, expected:
>> EXTENT_DATA[25941335 8187]
>> ERROR: root 17592 EXTENT_DATA[25942002 4096] gap exists, expected:
>> EXTENT_DATA[25942002 4093]
>> ERROR: root 17592 EXTENT_DATA[25942790 4096] gap exists, expected:
>> EXTENT_DATA[25942790 4094]
>> ERROR: root 17592 EXTENT_DATA[25945819 4096] gap exists, expected:
>> EXTENT_DATA[25945819 4093]
>> ERROR: root 17592 EXTENT_DATA[26064834 4096] gap exists, expected:
>> EXTENT_DATA[26064834 129]
>> ERROR: root 17592 EXTENT_DATA[26064834 135168] gap exists, expected:
>> EXTENT_DATA[26064834 131201]
>> ERROR: root 17592 EXTENT_DATA[26064834 266240] gap exists, expected:
>> EXTENT_DATA[26064834 262273]
>> ERROR: root 17592 EXTENT_DATA[26064834 397312] gap exists, expected:
>> EXTENT_DATA[26064834 393345]
>> ERROR: root 17592 EXTENT_DATA[26064834 528384] gap exists, expected:
>> EXTENT_DATA[26064834 524417]
>> ERROR: root 17592 EXTENT_DATA[26064834 659456] gap exists, expected:
>> EXTENT_DATA[26064834 655489]
>> ERROR: root 17592 EXTENT_DATA[26064834 790528] gap exists, expected:
>> EXTENT_DATA[26064834 786561]
>> ERROR: root 17592 EXTENT_DATA[26064834 921600] gap exists, expected:
>> EXTENT_DATA[26064834 917633]
>> ERROR: root 17592 EXTENT_DATA[26064834 929792] gap exists, expected:
>> EXTENT_DATA[26064834 925825]
>> ERROR: root 17592 EXTENT_DATA[26064834 1224704] gap exists, expected:
>> EXTENT_DATA[26064834 1220737]
>> ERROR: root 21322 EXTENT_DATA[25320803 4096] gap exists, expected:
>> EXTENT_DATA[25320803 56]
>> ERROR: root 21322 EXTENT_DATA[25320803 143360] gap exists, expected:
>> EXTENT_DATA[25320803 139320]
>> ERROR: root 21322 EXTENT_DATA[25320803 151552] gap exists, expected:
>> EXTENT_DATA[25320803 147512]
>> ERROR: root 21322 EXTENT_DATA[25320803 290816] gap exists, expected:
>> EXTENT_DATA[25320803 286776]
>> ERROR: root 21322 EXTENT_DATA[25320803 294912] gap exists, expected:
>> EXTENT_DATA[25320803 290872]
>> ERROR: root 21322 EXTENT_DATA[25320803 2949120] gap exists, expected:
>> EXTENT_DATA[25320803 2945080]
>> ERROR: root 21322 EXTENT_DATA[25320803 2953216] gap exists, expected:
>> EXTENT_DATA[25320803 2949176]
>> ERROR: root 21322 EXTENT_DATA[25320803 5836800] gap exists, expected:
>> EXTENT_DATA[25320803 5832760]
>> ERROR: root 22870 EXTENT_DATA[26062114 4096] gap exists, expected:
>> EXTENT_DATA[26062114 89]
>> ERROR: root 22870 EXTENT_DATA[26062114 16384] gap exists, expected:
>> EXTENT_DATA[26062114 12377]
>> ERROR: root 22870 EXTENT_DATA[26062114 20480] gap exists, expected:
>> EXTENT_DATA[26062114 16473]
>> (many lines skipped)
>> ERROR: root 23124 EXTENT_DATA[26064190 390852608] gap exists,
>> expected: EXTENT_DATA[26064190 390848601]
>> ERROR: root 23124 EXTENT_DATA[26064190 390983680] gap exists,
>> expected: EXTENT_DATA[26064190 390979673]
>> ERROR: root 23124 EXTENT_DATA[26064190 391114752] gap exists,
>> expected: EXTENT_DATA[26064190 391110745]
>> ERROR: root 23124 EXTENT_DATA[26064190 391245824] gap exists,
>> expected: EXTENT_DATA[26064190 391241817]
>> ERROR: root 23124 EXTENT_DATA[26064190 391376896] gap exists,
>> expected: EXTENT_DATA[26064190 391372889]
>> ERROR: root 23124 EXTENT_DATA[26064190 391507968] gap exists,
>> expected: EXTENT_DATA[26064190 391503961]
>> ERROR: root 23124 EXTENT_DATA[26064190 391639040] gap exists,
>> expected: EXTENT_DATA[26064190 391635033]
>> ERROR: root 23124 EXTENT_DATA[26064190 391770112] gap exists,
>> expected: EXTENT_DATA[26064190 391766105]
>> ERROR: root 23124 EXTENT_DATA[26064190 391901184] gap exists,
>> expected: EXTENT_DATA[26064190 391897177]
>> ERROR: root 23124 EXTENT_DATA[26064190 392032256] gap exists,
>> expected: EXTENT_DATA[26064190 392028249]
>> ERROR: root 23124 EXTENT_DATA[26064190 392163328] gap exists,
>> expected: EXTENT_DATA[26064190 392159321]
>> ERROR: root 23124 EXTENT_DATA[26064190 392294400] gap exists,
>> expected: EXTENT_DATA[26064190 392290393]
>> ERROR: root 23124 EXTENT_DATA[26064190 392425472] gap exists,
>> expected: EXTENT_DATA[26064190 392421465]
>> ERROR: root 23124 EXTENT_DATA[26064190 392556544] gap exists,
>> expected: EXTENT_DATA[26064190 392552537]
>> ERROR: root 23124 EXTENT_DATA[26064190 392687616] gap exists,
>> expected: EXTENT_DATA[26064190 392683609]
>> ERROR: root 23124 EXTENT_DATA[26064190 392818688] gap exists,
>> expected: EXTENT_DATA[26064190 392814681]
>> ERROR: root 23124 EXTENT_DATA[26064190 392949760] gap exists,
>> expected: EXTENT_DATA[26064190 392945753]
>> ERROR: root 23186 EXTENT_DATA[26064834 4096] gap exists, expected:
>> EXTENT_DATA[26064834 129]
>> ERROR: root 23186 EXTENT_DATA[26064834 135168] gap exists, expected:
>> EXTENT_DATA[26064834 131201]
>> ERROR: root 23186 EXTENT_DATA[26064834 266240] gap exists, expected:
>> EXTENT_DATA[26064834 262273]
>> ERROR: root 23186 EXTENT_DATA[26064834 397312] gap exists, expected:
>> EXTENT_DATA[26064834 393345]
>> ERROR: root 23186 EXTENT_DATA[26064834 528384] gap exists, expected:
>> EXTENT_DATA[26064834 524417]
>> ERROR: root 23186 EXTENT_DATA[26064834 659456] gap exists, expected:
>> EXTENT_DATA[26064834 655489]
>> ERROR: root 23186 EXTENT_DATA[26064834 790528] gap exists, expected:
>> EXTENT_DATA[26064834 786561]
>> ERROR: root 23186 EXTENT_DATA[26064834 921600] gap exists, expected:
>> EXTENT_DATA[26064834 917633]
>> ERROR: root 23186 EXTENT_DATA[26064834 929792] gap exists, expected:
>> EXTENT_DATA[26064834 925825]
>> ERROR: root 23186 EXTENT_DATA[26064834 1224704] gap exists, expected:
>> EXTENT_DATA[26064834 1220737]
>> ERROR: errors found in fs roots
>> cache and super generation don't match, space cache will be invalidated
>> found 13697056956416 bytes used, error(s) found
>> total csum bytes: 0
>> total tree bytes: 10282598400
>> total fs tree bytes: 0
>> total extent tree bytes: 10282598400
>> btree space waste bytes: 2742975592
>> file data blocks allocated: 0
>>   referenced 0
>>
>>
>> What do I do next?
>>
>> Thanks,
>> Marc
>>
> 
> 
> -- 
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 72+ messages in thread

end of thread, other threads:[~2018-07-11  1:03 UTC | newest]

Thread overview: 72+ messages
2018-06-29  4:27 So, does btrfs check lowmem take days? weeks? Marc MERLIN
2018-06-29  5:07 ` Qu Wenruo
2018-06-29  5:28   ` Marc MERLIN
2018-06-29  5:48     ` Qu Wenruo
2018-06-29  6:06       ` Marc MERLIN
2018-06-29  6:29         ` Qu Wenruo
2018-06-29  6:59           ` Marc MERLIN
2018-06-29  7:09             ` Roman Mamedov
2018-06-29  7:22               ` Marc MERLIN
2018-06-29  7:34                 ` Roman Mamedov
2018-06-29  8:04                 ` Lionel Bouton
2018-06-29 16:24                   ` btrfs send/receive vs rsync Marc MERLIN
2018-06-30  8:18                     ` Duncan
2018-06-29  7:20             ` So, does btrfs check lowmem take days? weeks? Qu Wenruo
2018-06-29  7:28               ` Marc MERLIN
2018-06-29 17:10                 ` Marc MERLIN
2018-06-30  0:04                   ` Chris Murphy
2018-06-30  2:44                   ` Marc MERLIN
2018-06-30 14:49                     ` Qu Wenruo
2018-06-30 21:06                       ` Marc MERLIN
2018-06-29  6:02     ` Su Yue
2018-06-29  6:10       ` Marc MERLIN
2018-06-29  6:32         ` Su Yue
2018-06-29  6:43           ` Marc MERLIN
2018-07-01 23:22             ` Marc MERLIN
2018-07-02  2:02               ` Su Yue
2018-07-02  3:22                 ` Marc MERLIN
2018-07-02  6:22                   ` Su Yue
2018-07-02 14:05                     ` Marc MERLIN
2018-07-02 14:42                       ` Qu Wenruo
2018-07-02 15:18                         ` how to best segment a big block device in resizeable btrfs filesystems? Marc MERLIN
2018-07-02 16:59                           ` Austin S. Hemmelgarn
2018-07-02 17:34                             ` Marc MERLIN
2018-07-02 18:35                               ` Austin S. Hemmelgarn
2018-07-02 19:40                                 ` Marc MERLIN
2018-07-03  4:25                                 ` Andrei Borzenkov
2018-07-03  7:15                                   ` Duncan
2018-07-06  4:28                                     ` Andrei Borzenkov
2018-07-08  8:05                                       ` Duncan
2018-07-03  0:51                           ` Paul Jones
2018-07-03  4:06                             ` Marc MERLIN
2018-07-03  4:26                               ` Paul Jones
2018-07-03  5:42                                 ` Marc MERLIN
2018-07-03  1:37                           ` Qu Wenruo
2018-07-03  4:15                             ` Marc MERLIN
2018-07-03  9:55                               ` Paul Jones
2018-07-03 11:29                                 ` Qu Wenruo
2018-07-03  4:23                             ` Andrei Borzenkov
2018-07-02 15:19                         ` So, does btrfs check lowmem take days? weeks? Marc MERLIN
2018-07-02 17:08                           ` Austin S. Hemmelgarn
2018-07-02 17:33                           ` Roman Mamedov
2018-07-02 17:39                             ` Marc MERLIN
2018-07-03  0:31                         ` Chris Murphy
2018-07-03  4:22                           ` Marc MERLIN
2018-07-03  8:34                             ` Su Yue
2018-07-03 21:34                               ` Chris Murphy
2018-07-03 21:40                                 ` Marc MERLIN
2018-07-04  1:37                                   ` Su Yue
2018-07-03  8:50                             ` Qu Wenruo
2018-07-03 14:38                               ` Marc MERLIN
2018-07-03 21:46                               ` Chris Murphy
2018-07-03 22:00                                 ` Marc MERLIN
2018-07-03 22:52                                   ` Qu Wenruo
2018-06-29  5:35   ` Su Yue
2018-06-29  5:46     ` Marc MERLIN
     [not found] <94caf6c5-77e1-3da0-d026-a29edb08d410@cn.fujitsu.com>
     [not found] ` <CAKhhfD6svMo=28_UX=ZjRRmF6zNadd3H+8vVZKGX4zjqVr-giw@mail.gmail.com>
     [not found]   ` <3a83cb3c-de2b-e803-f07e-31f7de0ee25f@cn.fujitsu.com>
     [not found]     ` <b1b2d361-eb1a-f172-45d3-409abd131d2b@cn.fujitsu.com>
     [not found]       ` <20180705153023.GA30566@merlins.org>
     [not found]         ` <trinity-d028b6bd-31d9-41c0-a091-47bcb810cdc3-1530808069711@msvc-mesg-gmx023>
     [not found]           ` <20180705165049.t56dvqpz7ljjan5c@merlins.org>
     [not found]             ` <trinity-79578bdf-a849-4342-a082-f2b882f2251e-1530810500266@msvc-mesg-gmx024>
     [not found]               ` <20180706160523.kxwxjzwneseaamnt@merlins.org>
     [not found]                 ` <20180706175636.53ebp7drifiqu5b7@merlins.org>
     [not found]                   ` <20180707172114.bfc26eoahullffgg@merlins.org>
2018-07-10  1:37                     ` Su Yue
2018-07-10  1:34                       ` Qu Wenruo
2018-07-10  3:50                         ` Marc MERLIN
2018-07-10  4:55                           ` Qu Wenruo
2018-07-10 10:44                             ` Su Yue
     [not found] <f9bc21d6-fdc3-ca3a-793f-6fe574c7b8c6@cn.fujitsu.com>
     [not found] ` <20180709031054.qfg4x5yzcl4rao2k@merlins.org>
     [not found]   ` <20180709031501.iutlokfvodtkkfhe@merlins.org>
     [not found]     ` <17cc0cc1-b64d-4daa-18b5-bb2da3736ea1@cn.fujitsu.com>
     [not found]       ` <20180709034058.wjavwjdyixx6smbw@merlins.org>
     [not found]         ` <29302c14-e277-2c69-ac08-c4722c2b18aa@cn.fujitsu.com>
     [not found]           ` <20180709155306.zr3p2kolnanvkpny@merlins.org>
     [not found]             ` <trinity-4aae1c42-a85e-4c73-a30e-8b0d0be05e86-1531152875875@msvc-mesg-gmx023>
     [not found]               ` <20180709174818.wq2d4awmgasxgwad@merlins.org>
     [not found]                 ` <faba0923-8d1f-5270-ba03-ce9cc484e08a@gmx.com>
2018-07-10  4:00                   ` Marc MERLIN
     [not found]                 ` <trinity-4546309e-d603-4d29-885a-e76da594f792-1531159860064@msvc-mesg-gmx021>
     [not found]                   ` <20180709222218.GP9859@merlins.org>
     [not found]                     ` <440b7d12-3504-8b4f-5aa4-b1f39f549730@cn.fujitsu.com>
     [not found]                       ` <20180710041037.4ynitx3flubtwtvc@merlins.org>
     [not found]                         ` <58b36f04-3094-7de0-8d5e-e06e280aac00@cn.fujitsu.com>
2018-07-11  1:08                           ` Su Yue
