* BUG: btrfs send: Kernel's memory usage rises until OOM kernel panic after sending ~37GiB
From: Atemu @ 2019-10-26 17:46 UTC
To: linux-btrfs

Hi linux-btrfs,

after btrfs sending ~37GiB of a snapshot of one of my subvolumes, btrfs
send stalls (`pv`, which I'm piping it through, no longer reports any
significant throughput) and shortly after, the kernel's memory usage
starts to rise until it runs OOM and panics.

Here's the tail of the dmesg I saved before such a kernel panic:

https://gist.githubusercontent.com/Atemu/3af591b9fa02efee10303ccaac3b4a85/raw/f27c0c911f4a9839a6e59ed494ff5066c7754e07/btrfs%2520send%2520OOM%2520log

(FYI, I cancelled the first btrfs send in this example; that's not part
of nor required for this bug.)

And here's a picture of the screen after the kernel panic:

https://photos.app.goo.gl/cEj5TA9B5V8eRXsy9

(This was recorded a while back, but I am able to reproduce the same bug
on archlinux-2019.10.01-x86_64.iso.)

The snapshot holds ~3.8TiB of data that has been compressed (ZSTD:3) and
heavily deduplicated down to ~1.9TiB. For deduplication I used `bedup
dedup` and `duperemove -x -r -h -A -b 32K --skip-zeroes
--dedupe-options=same,fiemap,noblock`, and IIRC it was mostly done
around the time 4.19 and 4.20 were recent.

The inode that btrfs reports as corrupt towards the end of the dmesg is
a 37GiB 7z archive (the size correlates) and can be read without errors
on a live system where the bug hasn't been triggered yet. Since it
happens to be a 7z archive, I can even confirm its integrity with `7z t`.
A scrub and `btrfs check --check-data-csum` don't detect any errors
either.

Please tell me what other information I could provide that might be
useful/necessary for squashing this bug.

Atemu

PS: I could spin up a VM with device-mapper snapshots of the drives;
destructive troubleshooting is possible if needed.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: BUG: btrfs send: Kernel's memory usage rises until OOM kernel panic after sending ~37GiB
From: Qu Wenruo @ 2019-10-27 0:50 UTC
To: Atemu, linux-btrfs

On 2019-10-27 01:46, Atemu wrote:
> Hi linux-btrfs,
>
> after btrfs sending ~37GiB of a snapshot of one of my subvolumes,
> btrfs send stalls (`pv`, which I'm piping it through, no longer
> reports any significant throughput) and shortly after, the kernel's
> memory usage starts to rise until it runs OOM and panics.
>
> Here's the tail of the dmesg I saved before such a kernel panic:
>
> https://gist.githubusercontent.com/Atemu/3af591b9fa02efee10303ccaac3b4a85/raw/f27c0c911f4a9839a6e59ed494ff5066c7754e07/btrfs%2520send%2520OOM%2520log
>
> (FYI, I cancelled the first btrfs send in this example; that's not
> part of nor required for this bug.)
>
> And here's a picture of the screen after the kernel panic:
>
> https://photos.app.goo.gl/cEj5TA9B5V8eRXsy9
>
> (This was recorded a while back, but I am able to reproduce the same
> bug on archlinux-2019.10.01-x86_64.iso.)
>
> The snapshot holds ~3.8TiB of data that has been compressed (ZSTD:3)
> and heavily deduplicated down to ~1.9TiB.

That's the problem.

Deduped files cause heavy overhead for backref walking, and send has to
do backref walking, so you see the problem...

I'm very interested in how heavily deduped the file is.

If it's just all-0 pages, hole punching is more effective than dedupe,
and causes no backref overhead.

Thanks,
Qu

> For deduplication I used `bedup dedup` and `duperemove -x -r -h -A -b
> 32K --skip-zeroes --dedupe-options=same,fiemap,noblock`, and IIRC it
> was mostly done around the time 4.19 and 4.20 were recent.
>
> The inode that btrfs reports as corrupt towards the end of the dmesg
> is a 37GiB 7z archive (the size correlates) and can be read without
> errors on a live system where the bug hasn't been triggered yet.
> Since it happens to be a 7z archive, I can even confirm its integrity
> with `7z t`.
> A scrub and `btrfs check --check-data-csum` don't detect any errors
> either.
>
> Please tell me what other information I could provide that might be
> useful/necessary for squashing this bug.
>
> Atemu
>
> PS: I could spin up a VM with device-mapper snapshots of the drives;
> destructive troubleshooting is possible if needed.
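The hole punching suggested here can be done from userspace with
util-linux's fallocate(1). A minimal, self-contained sketch (the file
name and sizes are placeholders; the filesystem must support hole
punching):

```shell
# Create a 1 MiB demo file of non-zero data (stand-in for a real image).
dd if=/dev/urandom of=disk.img bs=4096 count=256 2>/dev/null

# Replace the first 64 KiB with a hole: the file's logical size is
# unchanged, but the blocks are deallocated and the range reads back
# as zeroes.
fallocate --punch-hole --offset 0 --length $((64 * 1024)) disk.img

# Apparent size stays the same; only the allocated block count drops.
stat --format 'size=%s blocks=%b' disk.img
```

Unlike a deduplicated extent, a punched hole has no extent (and thus no
backref) at all, which is why it adds no backref-walking overhead.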
* Re: BUG: btrfs send: Kernel's memory usage rises until OOM kernel panic after sending ~37GiB
From: Atemu @ 2019-10-27 10:33 UTC
To: Qu Wenruo
Cc: linux-btrfs

> That's the problem.
>
> Deduped files cause heavy overhead for backref walking, and send has
> to do backref walking, so you see the problem...

Interesting!
But should it really be able to make btrfs send use up >15GiB of RAM and
cause a kernel panic because of that? The btrfs doesn't even have that
much metadata on-disk in total.

> I'm very interested in how heavily deduped the file is.

So am I; how could I get my hands on that information?

Are that particular file's extents what causes btrfs send's memory usage
to spiral out of control?

> If it's just all-0 pages, hole punching is more effective than dedupe,
> and causes no backref overhead.

I did punch holes into the disk images I have stored on it by mounting
and fstrim'ing them, and the duperemove command I used has a flag that
ignores all-0 pages (those get compressed down to next to nothing
anyway), but it's likely that I ran duperemove once or twice before I
knew about that flag.

Is there a way to find such extents that could cause the backref walk to
overload?

Thanks,
Atemu
* Re: BUG: btrfs send: Kernel's memory usage rises until OOM kernel panic after sending ~37GiB
From: Qu Wenruo @ 2019-10-27 11:34 UTC
To: Atemu
Cc: linux-btrfs

On 2019-10-27 18:33, Atemu wrote:
>> That's the problem.
>>
>> Deduped files cause heavy overhead for backref walking, and send has
>> to do backref walking, so you see the problem...
>
> Interesting!
> But should it really be able to make btrfs send use up >15GiB of RAM
> and cause a kernel panic because of that? The btrfs doesn't even have
> that much metadata on-disk in total.

This depends on how shared one file extent is.

If one file extent is shared 10,000 times in one subvolume, and you have
1,000 snapshots of that subvolume, it will really go crazy.

>> I'm very interested in how heavily deduped the file is.
>
> So am I; how could I get my hands on that information?
>
> Are that particular file's extents what causes btrfs send's memory
> usage to spiral out of control?

I can't say for 100% sure. We need more info on that.

An extent tree dump can provide a per-subvolume view of how shared an
extent is. But as I mentioned, snapshots are another catalyst for this
problem.

>> If it's just all-0 pages, hole punching is more effective than
>> dedupe, and causes no backref overhead.
>
> I did punch holes into the disk images I have stored on it by
> mounting and fstrim'ing them

That's trim (or discard), not hole punching.

Normally hole punching is done with fallocate(FALLOC_FL_PUNCH_HOLE). Not
sure if duperemove does that too.

> and the duperemove command I used has a flag that ignores all-0 pages
> (those get compressed down to next to nothing anyway), but it's
> likely that I ran duperemove once or twice before I knew about that
> flag.
>
> Is there a way to find such extents that could cause the backref walk
> to overload?

It's really hard to determine; you could try the following command:

# btrfs ins dump-tree -t extent --bfs /dev/nvme/btrfs |\
  grep "(.*_ITEM.*)" | awk '{print $4" "$5" "$6" size "$10}'

Then see which key shows up most often, and its size.

If a key's objectid (the first value) shows up multiple times, it's a
heavily shared extent.

Then search for that objectid in the full extent tree dump to find out
how it's shared.

You can see it's already complex...

Thanks,
Qu
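Ranking the "most shown" keys from that one-liner can be done
mechanically with sort/uniq. A sketch building on Qu's command above
(the device path is an example):

```shell
# Rank dump-tree keys by frequency. Reads "(objectid TYPE size) size N"
# lines (the output format of the awk one-liner above) on stdin and
# prints the 20 most frequent keys, most frequent first.
rank_keys() {
  sort | uniq -c | sort -rn | head -n 20
}

# Usage against a real filesystem (sample device path):
#   btrfs ins dump-tree -t extent --bfs /dev/nvme/btrfs \
#     | grep "(.*_ITEM.*)" \
#     | awk '{print $4" "$5" "$6" size "$10}' \
#     | rank_keys
```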
* Re: BUG: btrfs send: Kernel's memory usage rises until OOM kernel panic after sending ~37GiB
From: Atemu @ 2019-10-27 12:55 UTC
To: Qu Wenruo
Cc: linux-btrfs

> This depends on how shared one file extent is.

But shouldn't it catch that and cancel the btrfs send before it panics
the kernel due to its memory usage?

> If one file extent is shared 10,000 times in one subvolume, and you
> have 1,000 snapshots of that subvolume, it will really go crazy.
>
> But as I mentioned, snapshots are another catalyst for this problem.

I only have two snapshots of the subvolume, but some of the extents
might very well be shared many, many times.

> I can't say for 100% sure. We need more info on that.

Sure.

> That's trim (or discard), not hole punching.

I didn't mean discarding the btrfs to the underlying storage; I meant
mounting the filesystems in the image files sitting inside the btrfs
through a loop device and running fstrim on them.
The loop device should punch holes into the underlying image files when
it receives a discard, right?

> Normally hole punching is done with fallocate(FALLOC_FL_PUNCH_HOLE).
> Not sure if duperemove does that too.

Duperemove doesn't punch holes AFAIK; it can only ignore the 0 pages,
not dedupe them.

> An extent tree dump can provide a per-subvolume view of how shared an
> extent is.
>
> It's really hard to determine; you could try the following command:
>
> # btrfs ins dump-tree -t extent --bfs /dev/nvme/btrfs |\
>   grep "(.*_ITEM.*)" | awk '{print $4" "$5" "$6" size "$10}'
>
> [...]

Thanks, I'll try that out when I can unmount the btrfs.

> You can see it's already complex...

That's not an issue, I'm fluent in bash ;)

- Atemu
* Re: BUG: btrfs send: Kernel's memory usage rises until OOM kernel panic after sending ~37GiB
From: Qu Wenruo @ 2019-10-27 13:43 UTC
To: Atemu
Cc: linux-btrfs

On 2019-10-27 20:55, Atemu wrote:
>> This depends on how shared one file extent is.
>
> But shouldn't it catch that and cancel the btrfs send before it
> panics the kernel due to its memory usage?

Backref walking is quite tricky in btrfs; we don't really have a good
way to detect whether it's a good idea or not, until we crash...

But at least we have some plan to fix it, hopefully sooner rather than
later.

>> That's trim (or discard), not hole punching.
>
> I didn't mean discarding the btrfs to the underlying storage; I meant
> mounting the filesystems in the image files sitting inside the btrfs
> through a loop device and running fstrim on them.
> The loop device should punch holes into the underlying image files
> when it receives a discard, right?

That's correct; that will punch holes for *unused* space.

But still, all-0 extents are considered used, thus that won't really
help.

Since duperemove has already skipped all-0 extents, it shouldn't be a
big problem, I guess?

Thanks,
Qu
* Re: BUG: btrfs send: Kernel's memory usage rises until OOM kernel panic after sending ~37GiB
From: Atemu @ 2019-10-27 15:19 UTC
To: Qu Wenruo
Cc: linux-btrfs

> Backref walking is quite tricky in btrfs; we don't really have a good
> way to detect whether it's a good idea or not, until we crash...

I see...

> But at least we have some plan to fix it, hopefully sooner rather
> than later.

That's good to hear.

> That's correct; that will punch holes for *unused* space.
> But still, all-0 extents are considered used, thus that won't really
> help.

Ahh, that's what you meant; yeah, it won't get those.
But the thing is, most all-0 pages should occur in the unused space of
disk images; there shouldn't be much else that stores so many zeros.

> Since duperemove has already skipped all-0 extents, it shouldn't be a
> big problem, I guess?

As I said, I might've run it once or twice without the flag, but I
don't fully remember anymore.

-Atemu
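For image files that still hold long zero runs as allocated extents,
util-linux's fallocate(1) can find and punch them after the fact. A
self-contained sketch (file name and sizes are placeholders; this only
works on filesystems that support hole punching, and it deallocates
zero ranges rather than deduplicating anything):

```shell
# Build a demo "image": 256 KiB of zeroes followed by some real data.
dd if=/dev/zero of=image.img bs=4096 count=64 2>/dev/null
printf 'payload' >> image.img

# Detect zero-filled ranges and punch holes over them; file contents
# and logical size are unchanged, only the allocation shrinks.
fallocate --dig-holes image.img
```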
* Re: BUG: btrfs send: Kernel's memory usage rises until OOM kernel panic after sending ~37GiB
From: Atemu @ 2019-10-27 15:19 UTC
To: Qu Wenruo
Cc: linux-btrfs

> It's really hard to determine; you could try the following command:
>
> # btrfs ins dump-tree -t extent --bfs /dev/nvme/btrfs |\
>   grep "(.*_ITEM.*)" | awk '{print $4" "$5" "$6" size "$10}'
>
> Then see which key shows up most often, and its size.
>
> If a key's objectid (the first value) shows up multiple times, it's a
> heavily shared extent.
>
> Then search for that objectid in the full extent tree dump to find
> out how it's shared.

I analyzed it a bit differently, but this should be the information we
wanted:

https://gist.github.com/Atemu/206c44cd46474458c083721e49d84a42

Yeah...

Is there any way to "unshare" these worst cases without having to btrfs
defragment everything?

I also uploaded the (compressed) extent tree dump if you want to take a
look yourself (205MB, expires in 7 days):

https://send.firefox.com/download/a729c57a94fcd89e/#w51BjzRmGnCg2qKNs39UNw

-Atemu
* Re: BUG: btrfs send: Kernel's memory usage rises until OOM kernel panic after sending ~37GiB
From: Qu Wenruo @ 2019-10-27 23:16 UTC
To: Atemu
Cc: linux-btrfs

On 2019-10-27 23:19, Atemu wrote:
> I analyzed it a bit differently, but this should be the information
> we wanted:
>
> https://gist.github.com/Atemu/206c44cd46474458c083721e49d84a42
>
> Yeah...

Holy s***...

Almost every line means 30~1000 refs, and there are over 2000 lines.

No wonder it eats up all memory.

> Is there any way to "unshare" these worst cases without having to
> btrfs defragment everything?

Btrfs defrag should do that, but at the cost of hugely increased space
usage.

BTW, have you verified the content of those extents?
Is it all zeros? If so, just find a tool to punch holes in all those
files and you should be OK to go.

Otherwise, I can't see any reason why a data extent could be shared so
many times.

Thanks,
Qu

> I also uploaded the (compressed) extent tree dump if you want to take
> a look yourself (205MB, expires in 7 days):
>
> https://send.firefox.com/download/a729c57a94fcd89e/#w51BjzRmGnCg2qKNs39UNw
>
> -Atemu
* Re: BUG: btrfs send: Kernel's memory usage rises until OOM kernel panic after sending ~37GiB
From: Atemu @ 2019-10-28 12:26 UTC
To: Qu Wenruo
Cc: linux-btrfs

>> Is there any way to "unshare" these worst cases without having to
>> btrfs defragment everything?
>
> Btrfs defrag should do that, but at the cost of hugely increased
> space usage.

Yeah, that's why I was asking for a way to do it without btrfs defrag:
somehow have only those extents split up and the references in the
inodes updated.

> BTW, have you verified the content of those extents?
> Is it all zeros? If so, just find a tool to punch holes in all those
> files and you should be OK to go.

How can I get the content of those objectids and find out which inodes
reference them?
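One way to do this: the objectid of an EXTENT_ITEM is a logical byte
address, so `btrfs inspect-internal logical-resolve` can map it back to
the files that reference it, and a stretch of such a file can then be
checked for zeroes with cmp. A sketch (the address, mount point, and
offsets below are placeholders):

```shell
# List the files that reference a given extent-tree objectid (a logical
# byte address) -- run against the mounted filesystem:
#   btrfs inspect-internal logical-resolve 13631488 /mnt/btrfs

# Check whether a byte range of a file is entirely zero (exit 0 = yes).
is_zero_range() { # args: file offset length
  dd if="$1" bs=1 skip="$2" count="$3" 2>/dev/null \
    | cmp -s -n "$3" - /dev/zero
}
```

If a heavily shared extent turns out to be all zeroes, punching holes
over those ranges removes its backrefs entirely.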
* Re: BUG: btrfs send: Kernel's memory usage rises until OOM kernel panic after sending ~37GiB
From: Filipe Manana @ 2019-10-28 11:30 UTC
To: Atemu
Cc: Qu Wenruo, linux-btrfs

On Sun, Oct 27, 2019 at 4:51 PM Atemu <atemu.main@gmail.com> wrote:
> [...]
>
> I analyzed it a bit differently, but this should be the information
> we wanted:
>
> https://gist.github.com/Atemu/206c44cd46474458c083721e49d84a42

That's quite a lot of extents shared many times.

That indeed slows backreference walking, and therefore send, which uses
it. While the slowdown is known, I wasn't aware of the memory
consumption; from your logs it's not clear exactly where it comes from,
something to be looked at. There's also a significant number of data
checksum errors.

I think in the meanwhile send can just skip backreference walking and
attempt to clone whenever the number of backreferences for an inode
exceeds some limit, in which case it would fall back to writes instead
of cloning.

I'll look into it. Thanks for the report (and Qu for telling how to get
the backreference counts).

> Yeah...
>
> Is there any way to "unshare" these worst cases without having to
> btrfs defragment everything?
>
> I also uploaded the (compressed) extent tree dump if you want to take
> a look yourself (205MB, expires in 7 days):
>
> https://send.firefox.com/download/a729c57a94fcd89e/#w51BjzRmGnCg2qKNs39UNw
>
> -Atemu

--
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”
* Re: BUG: btrfs send: Kernel's memory usage rises until OOM kernel panic after sending ~37GiB
From: Qu Wenruo @ 2019-10-28 12:36 UTC
To: fdmanana, Atemu
Cc: linux-btrfs

On 2019-10-28 19:30, Filipe Manana wrote:
> [...]
>
> That's quite a lot of extents shared many times.
> That indeed slows backreference walking, and therefore send, which
> uses it. While the slowdown is known, I wasn't aware of the memory
> consumption; from your logs it's not clear exactly where it comes
> from, something to be looked at. There's also a significant number of
> data checksum errors.
>
> I think in the meanwhile send can just skip backreference walking and
> attempt to clone whenever the number of backreferences for an inode
> exceeds some limit, in which case it would fall back to writes
> instead of cloning.

A long time ago I had a proposal to record sent extents in an rbtree;
then, instead of doing the full backref walk, walk that rbtree instead.

That should still be way faster than a full backref walk, and still have
a good enough hit rate.

(And of course, if it fails, fall back to a regular write.)

Thanks,
Qu

> I'll look into it. Thanks for the report (and Qu for telling how to
> get the backreference counts).
* Re: BUG: btrfs send: Kernel's memory usage rises until OOM kernel panic after sending ~37GiB
From: Filipe Manana @ 2019-10-28 12:43 UTC
To: Qu Wenruo
Cc: Atemu, linux-btrfs

On Mon, Oct 28, 2019 at 12:36 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> [...]
>
> A long time ago I had a proposal to record sent extents in an rbtree;
> then, instead of doing the full backref walk, walk that rbtree
> instead.
> That should still be way faster than a full backref walk, and still
> have a good enough hit rate.

The problem with that is that it can use a lot of memory. We can have
thousands of extents, tens of thousands, etc.
Sure, one can limit such a cache to store up to some limit N, cache only
the last N extents found (or some other policy), etc., but then either
hits become so rare that it's nearly worthless, or it's way too complex.
Until the general backref walking speedups and caching are done (and
honestly I don't know the state of that, since the person who was
working on it is no longer working on btrfs), a simple solution would be
better IMO.

Thanks.

> (And of course, if it fails, fall back to a regular write.)

--
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”
* Re: BUG: btrfs send: Kernel's memory usage rises until OOM kernel panic after sending ~37GiB
From: Martin Raiber @ 2019-10-28 14:58 UTC
Cc: linux-btrfs

On 28.10.2019 13:43, Filipe Manana wrote:
> [...]
>
> The problem with that is that it can use a lot of memory. We can have
> thousands of extents, tens of thousands, etc.
> Sure, one can limit such a cache to store up to some limit N, cache
> only the last N extents found (or some other policy), etc., but then
> either hits become so rare that it's nearly worthless, or it's way
> too complex.
> Until the general backref walking speedups and caching are done (and
> honestly I don't know the state of that, since the person who was
> working on it is no longer working on btrfs), a simple solution would
> be better IMO.

Yeah, some short-term plan to mitigate this would be appreciated. I am
running with this patch Qu Wenruo posted a while back:

https://patchwork.kernel.org/patch/9245287/

Some flag/switch/setting or limit for backref walking so this patch
isn't needed would be appreciated. Without it, btrfs send is just too
slow once I have a few reflinks and snapshots. I haven't had a kernel
panic, though.

The problem is finding extents to reflink in the clone sources, correct?
My naive solution would be to create a (temporary) cache of (logical
extent) -> (ino, offset) per send clone source, then look up every
extent in that cache. Maybe add a bloom filter as well (that should
filter out most negatives). In some cases, iterating over all extents in
the clone sources prior to the send operation would be faster than doing
the backref walks during send.

As an optimization it could be made persistent and incrementally created
from the parent snapshot's cache. EXTENT_SAME would invalidate it, or it
would need to update it.
* Re: BUG: btrfs send: Kernel's memory usage rises until OOM kernel panic after sending ~37GiB
From: Atemu @ 2019-10-28 12:44 UTC
To: fdmanana
Cc: Qu Wenruo, linux-btrfs

> That's quite a lot of extents shared many times.
> That indeed slows backreference walking, and therefore send, which
> uses it. While the slowdown is known, I wasn't aware of the memory
> consumption; from your logs it's not clear

Is there anything else I could monitor to find out?

> exactly where it comes from, something to be looked at. There's also
> a significant number of data checksum errors.

As I said, those seem to be false; the file is intact (it happens to be
a 7z archive) and scrubs before triggering the bug don't report anything
either.

It could be related to running OOM, or it could be its own bug.

> I think in the meanwhile send can just skip backreference walking and
> attempt to clone whenever the number of backreferences for an inode
> exceeds some limit, in which case it would fall back to writes
> instead of cloning.

Wouldn't it be better to make that dynamic, in case it's run under
low-memory conditions?

> I'll look into it. Thanks for the report (and Qu for telling how to
> get the backreference counts).

Thanks to you both!

-Atemu
* Re: BUG: btrfs send: Kernel's memory usage rises until OOM kernel panic after sending ~37GiB
  2019-10-28 12:44 ` Atemu
@ 2019-10-28 13:01   ` Filipe Manana
  2019-10-28 13:44     ` Atemu
  0 siblings, 1 reply; 18+ messages in thread
From: Filipe Manana @ 2019-10-28 13:01 UTC (permalink / raw)
To: Atemu; +Cc: Qu Wenruo, linux-btrfs

On Mon, Oct 28, 2019 at 12:44 PM Atemu <atemu.main@gmail.com> wrote:
>
> > That's quite a lot of extents shared many times.
> > That indeed slows backreference walking and therefore send which uses it.
> > While the slowdown is known, the memory consumption I wasn't aware of,
> > but from your logs, it's not clear
>
> Is there anything else I could monitor to find out?

You can run 'slabtop' while doing the send operation. That might be
enough. It's very likely the backreference walking code, due to huge
ulists (kmalloc-N slab), lots of btrfs_prelim_ref structures
(btrfs_prelim_ref slab), etc.

> > where it comes exactly from, something to be looked at. There's also a
> > significant number of data checksum errors.
>
> As I said, those seem to be false; the file is intact (it happens to
> be a 7z archive) and scrubs before triggering the bug don't report
> anything either.
>
> Could be related to running OOM or its own bug.

Yes, it's likely a different bug. I don't think it's related either.

> > I think in the meanwhile send can just skip backreference walking and
> > attempt to clone whenever the number of
> > backreferences for an inode exceeds some limit, in which case it would
> > fall back to writes instead of cloning.
>
> Wouldn't it be better to make it dynamic in case it's run under low
> memory conditions?

Ideally, yes. But that's a lot harder to do for several reasons and in
the end might not be worth it.

Thanks.

> > I'll look into it, thanks for the report (and Qu for telling how to
> > get the backreference counts).
>
> Thanks to you both!
> -Atemu

--
Filipe David Manana,

"Whether you think you can, or you think you can't — you're right."

^ permalink raw reply	[flat|nested] 18+ messages in thread
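The slabtop monitoring Filipe suggests can also be done by sampling /proc/slabinfo directly, which makes it easy to diff usage before and during the send. A minimal sketch of parsing that format; the cache names and object counts in the sample are illustrative, not from the reporter's system.

```python
def parse_slabinfo(text):
    """Parse /proc/slabinfo-style output into {cache_name: bytes_in_use}.

    Memory per cache is approximated as active_objs * objsize, taken
    from the first numeric columns of each data line."""
    usage = {}
    for line in text.splitlines():
        # Skip the version banner and the column-header comment line.
        if line.startswith(("slabinfo", "#")) or not line.strip():
            continue
        fields = line.split()
        name, active_objs, objsize = fields[0], int(fields[1]), int(fields[3])
        usage[name] = active_objs * objsize
    return usage

# A trimmed sample in /proc/slabinfo format (numbers are made up):
sample = """\
slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : ...
kmalloc-32        12345678 12345678     32  128    1 : tunables 0 0 0 : slabdata 96451 96451 0
btrfs_prelim_ref    204800   204800     88   46    1 : tunables 0 0 0 : slabdata 4452 4452 0
"""

usage = parse_slabinfo(sample)
print(usage["kmalloc-32"] // (1 << 20), "MiB")  # 376 MiB for kmalloc-32 alone
```

On a real system one would read `open("/proc/slabinfo").read()` (needs root) in a loop during the send and watch which caches grow without bound.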
* Re: BUG: btrfs send: Kernel's memory usage rises until OOM kernel panic after sending ~37GiB
  2019-10-28 13:01 ` Filipe Manana
@ 2019-10-28 13:44   ` Atemu
  2019-10-31 13:55     ` Atemu
  0 siblings, 1 reply; 18+ messages in thread
From: Atemu @ 2019-10-28 13:44 UTC (permalink / raw)
To: fdmanana; +Cc: Qu Wenruo, linux-btrfs

> You can run 'slabtop' while doing the send operation.
> That might be enough.
>
> It's very likely the backreference walking code, due to huge ulists
> (kmalloc-N slab), lots of btrfs_prelim_ref structures
> (btrfs_prelim_ref slab), etc.

I actually did run slabtop once but couldn't remember the exact name
of the top entry, so I didn't mention it. Now that you've listed the
candidates though, I'm pretty sure it was kmalloc-N. N was probably
64, but I'm not sure about that.

> Yes, it's likely a different bug. I don't think it's related either.

I have only seen these warnings after the bug triggered though;
reading the file under normal conditions doesn't produce them. What
would be the best way to get more information on how btrfs comes to
the conclusion that this file is corrupt?

> Ideally yes. But that's a lot harder to do for several reasons and in
> the end might not be worth it.

I see, thanks!

-Atemu

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: BUG: btrfs send: Kernel's memory usage rises until OOM kernel panic after sending ~37GiB
  2019-10-28 13:44 ` Atemu
@ 2019-10-31 13:55   ` Atemu
  0 siblings, 0 replies; 18+ messages in thread
From: Atemu @ 2019-10-31 13:55 UTC (permalink / raw)
To: fdmanana; +Cc: Qu Wenruo, linux-btrfs

> kmalloc-N. N was probably 64 but that I'm not sure about.

Correction: it's kmalloc-32.

-Atemu

^ permalink raw reply	[flat|nested] 18+ messages in thread
end of thread, other threads:[~2019-10-31 13:56 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-26 17:46 BUG: btrfs send: Kernel's memory usage rises until OOM kernel panic after sending ~37GiB Atemu
2019-10-27  0:50 ` Qu Wenruo
2019-10-27 10:33   ` Atemu
2019-10-27 11:34     ` Qu Wenruo
2019-10-27 12:55       ` Atemu
2019-10-27 13:43         ` Qu Wenruo
2019-10-27 15:19           ` Atemu
2019-10-27 15:19           ` Atemu
2019-10-27 23:16             ` Qu Wenruo
2019-10-28 12:26               ` Atemu
2019-10-28 11:30 ` Filipe Manana
2019-10-28 12:36   ` Qu Wenruo
2019-10-28 12:43     ` Filipe Manana
2019-10-28 14:58       ` Martin Raiber
2019-10-28 12:44     ` Atemu
2019-10-28 13:01       ` Filipe Manana
2019-10-28 13:44         ` Atemu
2019-10-31 13:55           ` Atemu