* BUG: btrfs send: Kernel's memory usage rises until OOM kernel panic after sending ~37GiB
From: Atemu @ 2019-10-26 17:46 UTC
To: linux-btrfs

Hi linux-btrfs,

after btrfs sending ~37GiB of a snapshot of one of my subvolumes, btrfs
send stalls (`pv`, which I'm piping it through, no longer reports any
significant throughput) and shortly after, the kernel's memory usage
starts to rise until it runs OOM and panics.

Here's the tail of the dmesg I saved before such a kernel panic:

https://gist.githubusercontent.com/Atemu/3af591b9fa02efee10303ccaac3b4a85/raw/f27c0c911f4a9839a6e59ed494ff5066c7754e07/btrfs%2520send%2520OOM%2520log

(FYI, I cancelled the first btrfs send in this example; that's not part
of nor required for this bug.)

And here's a picture of the screen after the kernel panic:

https://photos.app.goo.gl/cEj5TA9B5V8eRXsy9

(This was recorded a while back, but I am able to reproduce the same bug
on archlinux-2019.10.01-x86_64.iso.)

The snapshot holds ~3.8TiB of data that has been compressed (ZSTD:3) and
heavily deduplicated down to ~1.9TiB. For deduplication I used `bedup
dedup` and `duperemove -x -r -h -A -b 32K --skip-zeroes
--dedupe-options=same,fiemap,noblock`, and IIRC it was mostly done
around the time 4.19 and 4.20 were recent.

The inode that btrfs reports as corrupt towards the end of the dmesg is
a 37GiB 7z archive (the size correlates) and can be read without errors
on a live system where the bug hasn't been triggered yet. Since it
happens to be a 7z archive, I can even confirm its integrity with `7z t`.
A scrub and `btrfs check --check-data-csum` don't detect any errors
either.

Please tell me what other information I could provide that might be
useful/necessary for squashing this bug.

Atemu

PS: I could spin up a VM with device-mapper snapshots of the drives;
destructive troubleshooting is possible if needed.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: BUG: btrfs send: Kernel's memory usage rises until OOM kernel panic after sending ~37GiB
From: Qu Wenruo @ 2019-10-27 0:50 UTC
To: Atemu, linux-btrfs

On 2019-10-27 01:46, Atemu wrote:
> Hi linux-btrfs,
>
> after btrfs sending ~37GiB of a snapshot of one of my subvolumes,
> btrfs send stalls (`pv`, which I'm piping it through, no longer
> reports any significant throughput) and shortly after, the kernel's
> memory usage starts to rise until it runs OOM and panics.
>
> Here's the tail of the dmesg I saved before such a kernel panic:
>
> https://gist.githubusercontent.com/Atemu/3af591b9fa02efee10303ccaac3b4a85/raw/f27c0c911f4a9839a6e59ed494ff5066c7754e07/btrfs%2520send%2520OOM%2520log
>
> (FYI, I cancelled the first btrfs send in this example; that's not
> part of nor required for this bug.)
>
> And here's a picture of the screen after the kernel panic:
>
> https://photos.app.goo.gl/cEj5TA9B5V8eRXsy9
>
> (This was recorded a while back, but I am able to reproduce the same
> bug on archlinux-2019.10.01-x86_64.iso.)
>
> The snapshot holds ~3.8TiB of data that has been compressed (ZSTD:3)
> and heavily deduplicated down to ~1.9TiB.

That's the problem.

Deduped files cause heavy overhead for backref walking, and send has to
do backref walking, so you see the problem...

I'm very interested in how heavily deduped the file is.

If it's just all-0 pages, hole punching is more effective than dedupe,
and causes no backref overhead.

Thanks,
Qu

> For deduplication I used `bedup dedup` and `duperemove -x -r -h -A -b
> 32K --skip-zeroes --dedupe-options=same,fiemap,noblock`, and IIRC it
> was mostly done around the time 4.19 and 4.20 were recent.
>
> The inode that btrfs reports as corrupt towards the end of the dmesg
> is a 37GiB 7z archive (the size correlates) and can be read without
> errors on a live system where the bug hasn't been triggered yet.
> Since it happens to be a 7z archive, I can even confirm its integrity
> with `7z t`.
> A scrub and `btrfs check --check-data-csum` don't detect any errors
> either.
>
> Please tell me what other information I could provide that might be
> useful/necessary for squashing this bug.
>
> Atemu
>
> PS: I could spin up a VM with device-mapper snapshots of the drives;
> destructive troubleshooting is possible if needed.
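The hole punching suggested here can be done from userspace with
util-linux's fallocate(1). A minimal, self-contained sketch (the file
name and sizes are placeholders; the filesystem must support hole
punching):

```shell
# Create a 1 MiB demo file of non-zero data (stand-in for a real image).
dd if=/dev/urandom of=disk.img bs=4096 count=256 2>/dev/null

# Replace the first 64 KiB with a hole: the file's logical size is
# unchanged, but the blocks are deallocated and the range reads back
# as zeroes.
fallocate --punch-hole --offset 0 --length $((64 * 1024)) disk.img

# Apparent size stays the same; only the allocated block count drops.
stat --format 'size=%s blocks=%b' disk.img
```

Unlike a deduplicated extent, a punched hole has no extent (and thus no
backref) at all, which is why it adds no backref-walking overhead.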
* Re: BUG: btrfs send: Kernel's memory usage rises until OOM kernel panic after sending ~37GiB
From: Atemu @ 2019-10-27 10:33 UTC
To: Qu Wenruo
Cc: linux-btrfs

> That's the problem.
>
> Deduped files cause heavy overhead for backref walking, and send has
> to do backref walking, so you see the problem...

Interesting!
But should it really be able to make btrfs send use up >15GiB of RAM and
cause a kernel panic because of that? The btrfs doesn't even have that
much metadata on-disk in total.

> I'm very interested in how heavily deduped the file is.

So am I; how could I get my hands on that information?

Are that particular file's extents what causes btrfs send's memory usage
to spiral out of control?

> If it's just all-0 pages, hole punching is more effective than dedupe,
> and causes no backref overhead.

I did punch holes into the disk images I have stored on it by mounting
and fstrim'ing them, and the duperemove command I used has a flag that
ignores all-0 pages (those get compressed down to next to nothing
anyway), but it's likely that I ran duperemove once or twice before I
knew about that flag.

Is there a way to find such extents that could cause the backref walk to
overload?

Thanks,
Atemu
* Re: BUG: btrfs send: Kernel's memory usage rises until OOM kernel panic after sending ~37GiB
From: Qu Wenruo @ 2019-10-27 11:34 UTC
To: Atemu
Cc: linux-btrfs

On 2019-10-27 18:33, Atemu wrote:
>> That's the problem.
>>
>> Deduped files cause heavy overhead for backref walking, and send has
>> to do backref walking, so you see the problem...
>
> Interesting!
> But should it really be able to make btrfs send use up >15GiB of RAM
> and cause a kernel panic because of that? The btrfs doesn't even have
> that much metadata on-disk in total.

This depends on how shared one file extent is.

If one file extent is shared 10,000 times in one subvolume, and you have
1,000 snapshots of that subvolume, it will really go crazy.

>> I'm very interested in how heavily deduped the file is.
>
> So am I; how could I get my hands on that information?
>
> Are that particular file's extents what causes btrfs send's memory
> usage to spiral out of control?

I can't say for 100% sure. We need more info on that.

An extent tree dump can provide a per-subvolume view of how shared an
extent is. But as I mentioned, snapshots are another catalyst for this
problem.

>> If it's just all-0 pages, hole punching is more effective than
>> dedupe, and causes no backref overhead.
>
> I did punch holes into the disk images I have stored on it by
> mounting and fstrim'ing them

That's trim (or discard), not hole punching.

Normally hole punching is done with fallocate(FALLOC_FL_PUNCH_HOLE). Not
sure if duperemove does that too.

> and the duperemove command I used has a flag that ignores all-0 pages
> (those get compressed down to next to nothing anyway), but it's
> likely that I ran duperemove once or twice before I knew about that
> flag.
>
> Is there a way to find such extents that could cause the backref walk
> to overload?

It's really hard to determine; you could try the following command:

# btrfs ins dump-tree -t extent --bfs /dev/nvme/btrfs |\
  grep "(.*_ITEM.*)" | awk '{print $4" "$5" "$6" size "$10}'

Then see which key shows up most often, and its size.

If a key's objectid (the first value) shows up multiple times, it's a
heavily shared extent.

Then search for that objectid in the full extent tree dump to find out
how it's shared.

You can see it's already complex...

Thanks,
Qu
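Ranking the "most shown" keys from that one-liner can be done
mechanically with sort/uniq. A sketch building on Qu's command above
(the device path is an example):

```shell
# Rank dump-tree keys by frequency. Reads "(objectid TYPE size) size N"
# lines (the output format of the awk one-liner above) on stdin and
# prints the 20 most frequent keys, most frequent first.
rank_keys() {
  sort | uniq -c | sort -rn | head -n 20
}

# Usage against a real filesystem (sample device path):
#   btrfs ins dump-tree -t extent --bfs /dev/nvme/btrfs \
#     | grep "(.*_ITEM.*)" \
#     | awk '{print $4" "$5" "$6" size "$10}' \
#     | rank_keys
```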
* Re: BUG: btrfs send: Kernel's memory usage rises until OOM kernel panic after sending ~37GiB
From: Atemu @ 2019-10-27 12:55 UTC
To: Qu Wenruo
Cc: linux-btrfs

> This depends on how shared one file extent is.

But shouldn't it catch that and cancel the btrfs send before it panics
the kernel due to its memory usage?

> If one file extent is shared 10,000 times in one subvolume, and you
> have 1,000 snapshots of that subvolume, it will really go crazy.
>
> But as I mentioned, snapshots are another catalyst for this problem.

I only have two snapshots of the subvolume, but some of the extents
might very well be shared many, many times.

> I can't say for 100% sure. We need more info on that.

Sure.

> That's trim (or discard), not hole punching.

I didn't mean discarding the btrfs to the underlying storage; I meant
mounting the filesystems in the image files sitting inside the btrfs
through a loop device and running fstrim on them.
The loop device should punch holes into the underlying image files when
it receives a discard, right?

> Normally hole punching is done with fallocate(FALLOC_FL_PUNCH_HOLE).
> Not sure if duperemove does that too.

Duperemove doesn't punch holes AFAIK; it can only ignore the 0 pages,
not dedupe them.

> An extent tree dump can provide a per-subvolume view of how shared an
> extent is.
>
> It's really hard to determine; you could try the following command:
>
> # btrfs ins dump-tree -t extent --bfs /dev/nvme/btrfs |\
>   grep "(.*_ITEM.*)" | awk '{print $4" "$5" "$6" size "$10}'
>
> [...]

Thanks, I'll try that out when I can unmount the btrfs.

> You can see it's already complex...

That's not an issue, I'm fluent in bash ;)

- Atemu
* Re: BUG: btrfs send: Kernel's memory usage rises until OOM kernel panic after sending ~37GiB
From: Qu Wenruo @ 2019-10-27 13:43 UTC
To: Atemu
Cc: linux-btrfs

On 2019-10-27 20:55, Atemu wrote:
>> This depends on how shared one file extent is.
>
> But shouldn't it catch that and cancel the btrfs send before it
> panics the kernel due to its memory usage?

Backref walking is quite tricky in btrfs; we don't really have a good
way to detect whether it's a good idea or not, until we crash...

But at least we have some plan to fix it, hopefully sooner rather than
later.

>> That's trim (or discard), not hole punching.
>
> I didn't mean discarding the btrfs to the underlying storage; I meant
> mounting the filesystems in the image files sitting inside the btrfs
> through a loop device and running fstrim on them.
> The loop device should punch holes into the underlying image files
> when it receives a discard, right?

That's correct; that will punch holes for *unused* space.

But still, all-0 extents are considered used, thus that won't really
help.

Since duperemove has already skipped all-0 extents, it shouldn't be a
big problem, I guess?

Thanks,
Qu
* Re: BUG: btrfs send: Kernel's memory usage rises until OOM kernel panic after sending ~37GiB
From: Atemu @ 2019-10-27 15:19 UTC
To: Qu Wenruo
Cc: linux-btrfs

> Backref walking is quite tricky in btrfs; we don't really have a good
> way to detect whether it's a good idea or not, until we crash...

I see...

> But at least we have some plan to fix it, hopefully sooner rather
> than later.

That's good to hear.

> That's correct; that will punch holes for *unused* space.
> But still, all-0 extents are considered used, thus that won't really
> help.

Ahh, that's what you meant; yeah, it won't get those.
But the thing is, most all-0 pages should occur in the unused space of
disk images; there shouldn't be much else that stores so many zeros.

> Since duperemove has already skipped all-0 extents, it shouldn't be a
> big problem, I guess?

As I said, I might've run it once or twice without the flag, but I
don't fully remember anymore.

-Atemu
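For image files that still hold long zero runs as allocated extents,
util-linux's fallocate(1) can find and punch them after the fact. A
self-contained sketch (file name and sizes are placeholders; this only
works on filesystems that support hole punching, and it deallocates
zero ranges rather than deduplicating anything):

```shell
# Build a demo "image": 256 KiB of zeroes followed by some real data.
dd if=/dev/zero of=image.img bs=4096 count=64 2>/dev/null
printf 'payload' >> image.img

# Detect zero-filled ranges and punch holes over them; file contents
# and logical size are unchanged, only the allocation shrinks.
fallocate --dig-holes image.img
```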
* Re: BUG: btrfs send: Kernel's memory usage rises until OOM kernel panic after sending ~37GiB
From: Atemu @ 2019-10-27 15:19 UTC
To: Qu Wenruo
Cc: linux-btrfs

> It's really hard to determine; you could try the following command:
>
> # btrfs ins dump-tree -t extent --bfs /dev/nvme/btrfs |\
>   grep "(.*_ITEM.*)" | awk '{print $4" "$5" "$6" size "$10}'
>
> Then see which key shows up most often, and its size.
>
> If a key's objectid (the first value) shows up multiple times, it's a
> heavily shared extent.
>
> Then search for that objectid in the full extent tree dump to find
> out how it's shared.

I analyzed it a bit differently, but this should be the information we
wanted:

https://gist.github.com/Atemu/206c44cd46474458c083721e49d84a42

Yeah...

Is there any way to "unshare" these worst cases without having to btrfs
defragment everything?

I also uploaded the (compressed) extent tree dump if you want to take a
look yourself (205MB, expires in 7 days):

https://send.firefox.com/download/a729c57a94fcd89e/#w51BjzRmGnCg2qKNs39UNw

-Atemu
* Re: BUG: btrfs send: Kernel's memory usage rises until OOM kernel panic after sending ~37GiB
From: Qu Wenruo @ 2019-10-27 23:16 UTC
To: Atemu
Cc: linux-btrfs

On 2019-10-27 23:19, Atemu wrote:
> I analyzed it a bit differently, but this should be the information
> we wanted:
>
> https://gist.github.com/Atemu/206c44cd46474458c083721e49d84a42
>
> Yeah...

Holy s***...

Almost every line means 30~1000 refs, and there are over 2000 lines.

No wonder it eats up all memory.

> Is there any way to "unshare" these worst cases without having to
> btrfs defragment everything?

Btrfs defrag should do that, but at the cost of hugely increased space
usage.

BTW, have you verified the content of those extents?
Is it all zeros? If so, just find a tool to punch holes in all those
files and you should be OK to go.

Otherwise, I can't see any reason why a data extent could be shared so
many times.

Thanks,
Qu

> I also uploaded the (compressed) extent tree dump if you want to take
> a look yourself (205MB, expires in 7 days):
>
> https://send.firefox.com/download/a729c57a94fcd89e/#w51BjzRmGnCg2qKNs39UNw
>
> -Atemu
* Re: BUG: btrfs send: Kernel's memory usage rises until OOM kernel panic after sending ~37GiB
From: Atemu @ 2019-10-28 12:26 UTC
To: Qu Wenruo
Cc: linux-btrfs

>> Is there any way to "unshare" these worst cases without having to
>> btrfs defragment everything?
>
> Btrfs defrag should do that, but at the cost of hugely increased
> space usage.

Yeah, that's why I was asking for a way to do it without btrfs defrag:
somehow have only those extents split up and the references in the
inodes updated.

> BTW, have you verified the content of those extents?
> Is it all zeros? If so, just find a tool to punch holes in all those
> files and you should be OK to go.

How can I get the content of those objectids and find out which inodes
reference them?
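One way to do this: the objectid of an EXTENT_ITEM is a logical byte
address, so `btrfs inspect-internal logical-resolve` can map it back to
the files that reference it, and a stretch of such a file can then be
checked for zeroes with cmp. A sketch (the address, mount point, and
offsets below are placeholders):

```shell
# List the files that reference a given extent-tree objectid (a logical
# byte address) -- run against the mounted filesystem:
#   btrfs inspect-internal logical-resolve 13631488 /mnt/btrfs

# Check whether a byte range of a file is entirely zero (exit 0 = yes).
is_zero_range() { # args: file offset length
  dd if="$1" bs=1 skip="$2" count="$3" 2>/dev/null \
    | cmp -s -n "$3" - /dev/zero
}
```

If a heavily shared extent turns out to be all zeroes, punching holes
over those ranges removes its backrefs entirely.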
* Re: BUG: btrfs send: Kernel's memory usage rises until OOM kernel panic after sending ~37GiB
From: Filipe Manana @ 2019-10-28 11:30 UTC
To: Atemu
Cc: Qu Wenruo, linux-btrfs

On Sun, Oct 27, 2019 at 4:51 PM Atemu <atemu.main@gmail.com> wrote:
> [...]
>
> I analyzed it a bit differently, but this should be the information
> we wanted:
>
> https://gist.github.com/Atemu/206c44cd46474458c083721e49d84a42

That's quite a lot of extents shared many times.

That indeed slows backreference walking, and therefore send, which uses
it. While the slowdown is known, I wasn't aware of the memory
consumption; from your logs it's not clear exactly where it comes from,
something to be looked at. There's also a significant number of data
checksum errors.

I think in the meanwhile send can just skip backreference walking and
attempt to clone whenever the number of backreferences for an inode
exceeds some limit, in which case it would fall back to writes instead
of cloning.

I'll look into it. Thanks for the report (and Qu for telling how to get
the backreference counts).

> Yeah...
>
> Is there any way to "unshare" these worst cases without having to
> btrfs defragment everything?
>
> I also uploaded the (compressed) extent tree dump if you want to take
> a look yourself (205MB, expires in 7 days):
>
> https://send.firefox.com/download/a729c57a94fcd89e/#w51BjzRmGnCg2qKNs39UNw
>
> -Atemu

--
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”
* Re: BUG: btrfs send: Kernel's memory usage rises until OOM kernel panic after sending ~37GiB
From: Qu Wenruo @ 2019-10-28 12:36 UTC
To: fdmanana, Atemu
Cc: linux-btrfs

On 2019-10-28 19:30, Filipe Manana wrote:
> [...]
>
> That's quite a lot of extents shared many times.
> That indeed slows backreference walking, and therefore send, which
> uses it. While the slowdown is known, I wasn't aware of the memory
> consumption; from your logs it's not clear exactly where it comes
> from, something to be looked at. There's also a significant number of
> data checksum errors.
>
> I think in the meanwhile send can just skip backreference walking and
> attempt to clone whenever the number of backreferences for an inode
> exceeds some limit, in which case it would fall back to writes
> instead of cloning.

A long time ago I had a proposal to record sent extents in an rbtree;
then, instead of doing the full backref walk, walk that rbtree instead.

That should still be way faster than a full backref walk, and still have
a good enough hit rate.

(And of course, if it fails, fall back to a regular write.)

Thanks,
Qu

> I'll look into it. Thanks for the report (and Qu for telling how to
> get the backreference counts).
* Re: BUG: btrfs send: Kernel's memory usage rises until OOM kernel panic after sending ~37GiB
From: Filipe Manana @ 2019-10-28 12:43 UTC
To: Qu Wenruo
Cc: Atemu, linux-btrfs

On Mon, Oct 28, 2019 at 12:36 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> [...]
>
> A long time ago I had a proposal to record sent extents in an rbtree;
> then, instead of doing the full backref walk, walk that rbtree
> instead.
> That should still be way faster than a full backref walk, and still
> have a good enough hit rate.

The problem with that is that it can use a lot of memory. We can have
thousands of extents, tens of thousands, etc.
Sure, one can limit such a cache to store up to some limit N, cache only
the last N extents found (or some other policy), etc., but then either
hits become so rare that it's nearly worthless, or it's way too complex.
Until the general backref walking speedups and caching are done (and
honestly I don't know the state of that, since the person who was
working on it is no longer working on btrfs), a simple solution would be
better IMO.

Thanks.

> (And of course, if it fails, fall back to a regular write.)

--
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”
* Re: BUG: btrfs send: Kernel's memory usage rises until OOM kernel panic after sending ~37GiB
From: Martin Raiber @ 2019-10-28 14:58 UTC
Cc: linux-btrfs

On 28.10.2019 13:43, Filipe Manana wrote:
> [...]
>
> The problem with that is that it can use a lot of memory. We can have
> thousands of extents, tens of thousands, etc.
> Sure, one can limit such a cache to store up to some limit N, cache
> only the last N extents found (or some other policy), etc., but then
> either hits become so rare that it's nearly worthless, or it's way
> too complex.
> Until the general backref walking speedups and caching are done (and
> honestly I don't know the state of that, since the person who was
> working on it is no longer working on btrfs), a simple solution would
> be better IMO.

Yeah, some short-term plan to mitigate this would be appreciated. I am
running with this patch Qu Wenruo posted a while back:

https://patchwork.kernel.org/patch/9245287/

Some flag/switch/setting or limit for backref walking so this patch
isn't needed would be appreciated. Without it, btrfs send is just too
slow once I have a few reflinks and snapshots. I haven't had a kernel
panic, though.

The problem is finding extents to reflink in the clone sources, correct?
My naive solution would be to create a (temporary) cache of (logical
extent) -> (ino, offset) per send clone source, then look up every
extent in that cache. Maybe add a bloom filter as well (that should
filter out most negatives). In some cases, iterating over all extents in
the clone sources prior to the send operation would be faster than doing
the backref walks during send.

As an optimization it could be made persistent and incrementally created
from the parent snapshot's cache. EXTENT_SAME would invalidate it, or it
would need to update it.
* Re: BUG: btrfs send: Kernel's memory usage rises until OOM kernel panic after sending ~37GiB
From: Atemu @ 2019-10-28 12:44 UTC
To: fdmanana
Cc: Qu Wenruo, linux-btrfs

> That's quite a lot of extents shared many times.
> That indeed slows backreference walking, and therefore send, which
> uses it. While the slowdown is known, I wasn't aware of the memory
> consumption; from your logs it's not clear

Is there anything else I could monitor to find out?

> exactly where it comes from, something to be looked at. There's also
> a significant number of data checksum errors.

As I said, those seem to be false; the file is intact (it happens to be
a 7z archive) and scrubs before triggering the bug don't report anything
either.

It could be related to running OOM, or it could be its own bug.

> I think in the meanwhile send can just skip backreference walking and
> attempt to clone whenever the number of backreferences for an inode
> exceeds some limit, in which case it would fall back to writes
> instead of cloning.

Wouldn't it be better to make that dynamic, in case it's run under
low-memory conditions?

> I'll look into it. Thanks for the report (and Qu for telling how to
> get the backreference counts).

Thanks to you both!

-Atemu
* Re: BUG: btrfs send: Kernel's memory usage rises until OOM kernel panic after sending ~37GiB
  2019-10-28 12:44 ` Atemu
@ 2019-10-28 13:01   ` Filipe Manana
  2019-10-28 13:44     ` Atemu
  0 siblings, 1 reply; 18+ messages in thread
From: Filipe Manana @ 2019-10-28 13:01 UTC (permalink / raw)
To: Atemu; +Cc: Qu Wenruo, linux-btrfs

On Mon, Oct 28, 2019 at 12:44 PM Atemu <atemu.main@gmail.com> wrote:
>
> > That's quite a lot of extents shared many times.
> > That indeed slows backreference walking and therefore send which uses it.
> > While the slowdown is known, the memory consumption I wasn't aware of,
> > but from your logs, it's not clear
>
> Is there anything else I could monitor to find out?

You can run 'slabtop' while doing the send operation. That might be
enough. It's very likely the backreference walking code, due to huge
ulists (kmalloc-N slab), lots of btrfs_prelim_ref structures
(btrfs_prelim_ref slab), etc.

> > where it comes exactly from, something to be looked at. There's also a
> > significant number of data checksum errors.
>
> As I said, those seem to be false; the file is intact (it happens to
> be a 7z archive) and scrubs before triggering the bug don't report
> anything either.
>
> Could be related to running OOM or its own bug.

Yes, it's likely a different bug. I don't think it's related either.

> > I think in the meanwhile send can just skip backreference walking and
> > attempt to clone whenever the number of
> > backreferences for an inode exceeds some limit, in which case it would
> > fall back to writes instead of cloning.
>
> Wouldn't it be better to make it dynamic in case it's run under low
> memory conditions?

Ideally, yes. But that's a lot harder to do for several reasons and in
the end might not be worth it.

Thanks.

> > I'll look into it, thanks for the report (and Qu for telling how to
> > get the backreference counts).
>
> Thanks to you both!
> -Atemu

--
Filipe David Manana,

"Whether you think you can, or you think you can't — you're right."

^ permalink raw reply	[flat|nested] 18+ messages in thread
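The slabtop monitoring Filipe suggests can also be done by sampling /proc/slabinfo directly, which makes it easy to diff usage before and during the send. A minimal sketch of parsing that format; the cache names and object counts in the sample are illustrative, not from the reporter's system.

```python
def parse_slabinfo(text):
    """Parse /proc/slabinfo-style output into {cache_name: bytes_in_use}.

    Memory per cache is approximated as active_objs * objsize, taken
    from the first numeric columns of each data line."""
    usage = {}
    for line in text.splitlines():
        # Skip the version banner and the column-header comment line.
        if line.startswith(("slabinfo", "#")) or not line.strip():
            continue
        fields = line.split()
        name, active_objs, objsize = fields[0], int(fields[1]), int(fields[3])
        usage[name] = active_objs * objsize
    return usage

# A trimmed sample in /proc/slabinfo format (numbers are made up):
sample = """\
slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : ...
kmalloc-32        12345678 12345678     32  128    1 : tunables 0 0 0 : slabdata 96451 96451 0
btrfs_prelim_ref    204800   204800     88   46    1 : tunables 0 0 0 : slabdata 4452 4452 0
"""

usage = parse_slabinfo(sample)
print(usage["kmalloc-32"] // (1 << 20), "MiB")  # 376 MiB for kmalloc-32 alone
```

On a real system one would read `open("/proc/slabinfo").read()` (needs root) in a loop during the send and watch which caches grow without bound.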
* Re: BUG: btrfs send: Kernel's memory usage rises until OOM kernel panic after sending ~37GiB
  2019-10-28 13:01 ` Filipe Manana
@ 2019-10-28 13:44   ` Atemu
  2019-10-31 13:55     ` Atemu
  0 siblings, 1 reply; 18+ messages in thread
From: Atemu @ 2019-10-28 13:44 UTC (permalink / raw)
To: fdmanana; +Cc: Qu Wenruo, linux-btrfs

> You can run 'slabtop' while doing the send operation.
> That might be enough.
>
> It's very likely the backreference walking code, due to huge ulists
> (kmalloc-N slab), lots of btrfs_prelim_ref structures
> (btrfs_prelim_ref slab), etc.

I actually did run slabtop once but couldn't remember the exact name
of the top entry, so I didn't mention it. Now that you've listed the
candidates though, I'm pretty sure it was kmalloc-N. N was probably
64, but I'm not sure about that.

> Yes, it's likely a different bug. I don't think it's related either.

I have only seen these warnings after the bug triggered though;
reading the file under normal conditions doesn't produce them. What
would be the best way to get more information on how btrfs comes to
the conclusion that this file is corrupt?

> Ideally yes. But that's a lot harder to do for several reasons and in
> the end might not be worth it.

I see, thanks!

-Atemu

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: BUG: btrfs send: Kernel's memory usage rises until OOM kernel panic after sending ~37GiB
  2019-10-28 13:44 ` Atemu
@ 2019-10-31 13:55   ` Atemu
  0 siblings, 0 replies; 18+ messages in thread
From: Atemu @ 2019-10-31 13:55 UTC (permalink / raw)
To: fdmanana; +Cc: Qu Wenruo, linux-btrfs

> kmalloc-N. N was probably 64 but that I'm not sure about.

Correction: it's kmalloc-32.

-Atemu

^ permalink raw reply	[flat|nested] 18+ messages in thread
end of thread, other threads:[~2019-10-31 13:56 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-26 17:46 BUG: btrfs send: Kernel's memory usage rises until OOM kernel panic after sending ~37GiB Atemu
2019-10-27  0:50 ` Qu Wenruo
2019-10-27 10:33   ` Atemu
2019-10-27 11:34     ` Qu Wenruo
2019-10-27 12:55       ` Atemu
2019-10-27 13:43         ` Qu Wenruo
2019-10-27 15:19           ` Atemu
2019-10-27 15:19           ` Atemu
2019-10-27 23:16             ` Qu Wenruo
2019-10-28 12:26               ` Atemu
2019-10-28 11:30 ` Filipe Manana
2019-10-28 12:36   ` Qu Wenruo
2019-10-28 12:43     ` Filipe Manana
2019-10-28 14:58       ` Martin Raiber
2019-10-28 12:44     ` Atemu
2019-10-28 13:01       ` Filipe Manana
2019-10-28 13:44         ` Atemu
2019-10-31 13:55           ` Atemu