* Re: mount btrfs takes 30 minutes, btrfs check runs out of memory [not found] <CAJ3TwYQXqUZiKhYc5rciTmvGX1RLkHnkQb5SSYAJ7AD+kbudag@mail.gmail.com> @ 2015-07-31 2:34 ` Qu Wenruo 2015-07-31 4:10 ` John Ettedgui [not found] ` <CAJ3TwYRN+1tJY+paz=qZT0_XP=r9CcTKbBgX_kZRFOWj8vSK=w@mail.gmail.com> 0 siblings, 2 replies; 54+ messages in thread From: Qu Wenruo @ 2015-07-31 2:34 UTC (permalink / raw) To: John Ettedgui, linux-btrfs, georgi-georgiev-btrfs John Ettedgui wrote on 2015/07/29 18:55 +0000: > Hello, > I have the same issue and would like to add myself to this thread. > My btrfs partition is about 10tb on top of lvm2 and has been taking about a minute to mount in the past few months. > >> Qu Wenruo <quwenruo <at> cn.fujitsu.com> writes: >> >> Quite common, especial when it grows large. >> But it would be much better to use ftrace to show which btrfs operation >> takes the most time. > > > I have got a trace file running this command: > trace-cmd record -e btrfs mount <PARTITION> > > Since it is fairly big for an email I have gzipped it. > > Thanks! > John > Hi John, Thanks for the trace output. But it seems that, your root partition is also btrfs, causing a lot of btrfs trace from your systemd journal. Would you mind re-collecting the ftrace without such logging system caused btrfs trace? BTW, although I'm not quite familiar with ftrace, would you please consider collect ftrace with function_graph tracer? That would help a lot to find which takes the most time. But it may trace too much things and maybe hard to read. Thanks, Qu ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: mount btrfs takes 30 minutes, btrfs check runs out of memory 2015-07-31 2:34 ` mount btrfs takes 30 minutes, btrfs check runs out of memory Qu Wenruo @ 2015-07-31 4:10 ` John Ettedgui 2015-08-02 5:44 ` Georgi Georgiev [not found] ` <CAJ3TwYRN+1tJY+paz=qZT0_XP=r9CcTKbBgX_kZRFOWj8vSK=w@mail.gmail.com> 1 sibling, 1 reply; 54+ messages in thread From: John Ettedgui @ 2015-07-31 4:10 UTC (permalink / raw) Cc: "linux-btrfs@vger.kernel.org", georgi-georgiev-btrfs On Thu, Jul 30, 2015 at 7:34 PM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote: > > > Hi John, > Thanks for the trace output. You are welcome, thank you for looking at it! > > But it seems that, your root partition is also btrfs, causing a lot of btrfs > trace from your systemd journal. > Oh yes sorry about that. I actually have 3 partition in btrfs, the problematic one being the only big one. > Would you mind re-collecting the ftrace without such logging system caused > btrfs trace? Sure, how would I do that? This is my first time using ftrace. > > BTW, although I'm not quite familiar with ftrace, would you please consider > collect ftrace with function_graph tracer? Sure, how would I do that one as well? (I'll look these up in the meantime, I just want to make sure to not give you something not useful again). > That would help a lot to find which takes the most time. > But it may trace too much things and maybe hard to read. > > Thanks, > Qu Great, thank you! ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: mount btrfs takes 30 minutes, btrfs check runs out of memory 2015-07-31 4:10 ` John Ettedgui @ 2015-08-02 5:44 ` Georgi Georgiev 0 siblings, 0 replies; 54+ messages in thread From: Georgi Georgiev @ 2015-08-02 5:44 UTC (permalink / raw) To: John Ettedgui; +Cc: "linux-btrfs@vger.kernel.org" Quoting John Ettedgui at 2015-07-30-21:10:27(-0700): > On Thu, Jul 30, 2015 at 7:34 PM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote: > > > > Hi John, > > Thanks for the trace output. > You are welcome, thank you for looking at it! > > > > But it seems that, your root partition is also btrfs, causing a lot of btrfs > > trace from your systemd journal. > > > Oh yes sorry about that. > I actually have 3 partition in btrfs, the problematic one being the > only big one. > > Would you mind re-collecting the ftrace without such logging system caused > > btrfs trace? > Sure, how would I do that? > This is my first time using ftrace. > > > > BTW, although I'm not quite familiar with ftrace, would you please consider > > collect ftrace with function_graph tracer? > Sure, how would I do that one as well? You can use set_ftrace_pid to trace only a single process (for example, the mount command). There is a sample script I found in the ftrace documentation that goes something like this:

# First disable tracing, to clear the trace buffer
echo nop > current_tracer
echo 0 > tracing_on
echo 0 > tracing_enabled
# Then re-enable it after setting the filters
echo $$ > set_ftrace_pid
echo '*btrfs*' > set_ftrace_filter
echo function_graph > current_tracer
echo 1 > tracing_enabled
echo 1 > tracing_on
# And finally *exec* the command to trace:
exec mount ....

I tried it, but the logs were way too large, and I was still fiddling with the trace_options to set. If someone has good advice, we can try it again -- Georgi ^ permalink raw reply [flat|nested] 54+ messages in thread
[parent not found: <CAJ3TwYRN+1tJY+paz=qZT0_XP=r9CcTKbBgX_kZRFOWj8vSK=w@mail.gmail.com>]
* Re: mount btrfs takes 30 minutes, btrfs check runs out of memory [not found] ` <CAJ3TwYRN+1tJY+paz=qZT0_XP=r9CcTKbBgX_kZRFOWj8vSK=w@mail.gmail.com> @ 2015-07-31 4:52 ` Qu Wenruo [not found] ` <CAJ3TwYR5g-JhjmGnZUXqLXc7qV1_=AN5_6sj54JQODbtgG9Aag@mail.gmail.com> 0 siblings, 1 reply; 54+ messages in thread From: Qu Wenruo @ 2015-07-31 4:52 UTC (permalink / raw) To: John Ettedgui, btrfs John Ettedgui wrote on 2015/07/30 21:09 -0700: > On Thu, Jul 30, 2015 at 7:34 PM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote: >> >> >> Hi John, >> Thanks for the trace output. > You are welcome, thank you for looking at it! >> >> But it seems that, your root partition is also btrfs, causing a lot of btrfs >> trace from your systemd journal. >> > Oh yes sorry about that. > I actually have 3 partition in btrfs, the problematic one being the > only big one. >> Would you mind re-collecting the ftrace without such logging system caused >> btrfs trace? > Sure, how would I do that? > This is my first time using ftrace. I'm not familiar with ftrace either, but your trace is good enough already, the only thing needed is to avoid using btrfs as root partition(at least /var/). My personal recommendation is to use a liveCD or rescue media to do the trace dump. Other recommendation is to enable all btrfs trace point, and it seems that you have already done it while collecting the trace. >> >> BTW, although I'm not quite familiar with ftrace, would you please consider >> collect ftrace with function_graph tracer? > Sure, how would I do that one as well? > (I'll look these up in the meantime, I just want to make sure to not > give you something not useful again). This LWN article should help you, as I'm not so familiar with it either. https://lwn.net/Articles/370423/ <The function_graph tracer> paragraph. And the graph_function is btrfs_mount. Thanks, Qu >> That would help a lot to find which takes the most time. >> But it may trace too much things and maybe hard to read. 
>> >> Thanks, >> Qu > > Great, thank you! > John > ^ permalink raw reply [flat|nested] 54+ messages in thread
[parent not found: <CAJ3TwYR5g-JhjmGnZUXqLXc7qV1_=AN5_6sj54JQODbtgG9Aag@mail.gmail.com>]
* Re: mount btrfs takes 30 minutes, btrfs check runs out of memory [not found] ` <CAJ3TwYR5g-JhjmGnZUXqLXc7qV1_=AN5_6sj54JQODbtgG9Aag@mail.gmail.com> @ 2015-07-31 5:40 ` Qu Wenruo 2015-07-31 5:45 ` John Ettedgui 0 siblings, 1 reply; 54+ messages in thread From: Qu Wenruo @ 2015-07-31 5:40 UTC (permalink / raw) To: John Ettedgui; +Cc: btrfs John Ettedgui wrote on 2015/07/30 22:15 -0700: > On Thu, Jul 30, 2015 at 9:52 PM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote: >> I'm not familiar with ftrace either, but your trace is good enough already, >> the only thing needed is to avoid using btrfs as root partition(at least >> /var/). > I've stopped all journaling services for now, I hope that's enough/ >> >> My personal recommendation is to use a liveCD or rescue media to do the >> trace dump. >> > If not I'll have to do that, but this computer has no CD drive. It seems that you're using Chromium while doing the dump. :) If no CD drive, I'll recommend to use Archlinux installation iso to make a bootable USB stick and do the dump. (just download and dd would do the trick) As its kernel and tools is much newer than most distribution. It's better to provide two trace. One is the function tracer one, with "btrfs:*" as set_event. The other is the function_graph one. with "btrfs_mount" as set_graph_function. Thanks for your patient to help improving btrfs. Although I may not be able to check the trace until next Monday... :( Thanks, Qu >> Other recommendation is to enable all btrfs trace point, and it seems that >> you have already done it while collecting the trace. >>>> >> This LWN article should help you, as I'm not so familiar with it either. >> >> https://lwn.net/Articles/370423/ >> <The function_graph tracer> paragraph. >> >> And the graph_function is btrfs_mount. > That actually helped a lot! 
> I've been trying to get it working since I sent the previous email, > but never realized I needed the supply the function and that's > probably it never worked (or used too much space before crashing) >> >> Thanks, >> Qu >> > I hope this is better. > > John > ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: mount btrfs takes 30 minutes, btrfs check runs out of memory 2015-07-31 5:40 ` Qu Wenruo @ 2015-07-31 5:45 ` John Ettedgui 2015-08-01 4:35 ` John Ettedgui 0 siblings, 1 reply; 54+ messages in thread From: John Ettedgui @ 2015-07-31 5:45 UTC (permalink / raw) To: Qu Wenruo; +Cc: btrfs On Thu, Jul 30, 2015 at 10:40 PM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote: > > > > It seems that you're using Chromium while doing the dump. :) > Ooops I did not think that would be an issue :/ > If no CD drive, I'll recommend to use Archlinux installation iso to make a > bootable USB stick and do the dump. > (just download and dd would do the trick) > As its kernel and tools is much newer than most distribution. Sure that's the distribution I use anyway. :) I should have some usb stick somewhere to try it. > > It's better to provide two trace. > One is the function tracer one, with "btrfs:*" as set_event. > The other is the function_graph one. with "btrfs_mount" as > set_graph_function. > Oh I see, I will try that. > Thanks for your patient to help improving btrfs. Well, thank you for helping me out here! > Although I may not be able to check the trace until next Monday... :( > Oh that's fine, I can live with a one extra minute reboot for a few more days. > Thanks, > Qu > > Thanks! John ^ permalink raw reply [flat|nested] 54+ messages in thread
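[Editor's note: the "download and dd" step Qu mentions would look roughly like this; the ISO filename and target device are placeholders, and of= must be the whole USB device, not a partition on it:]

```shell
# DANGER: dd overwrites the of= device completely; check it with lsblk first.
dd if=archlinux-x86_64.iso of=/dev/sdX bs=4M
sync   # flush everything before unplugging the stick
```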
* Re: mount btrfs takes 30 minutes, btrfs check runs out of memory 2015-07-31 5:45 ` John Ettedgui @ 2015-08-01 4:35 ` John Ettedgui 2015-08-01 10:05 ` Russell Coker 2015-08-04 1:39 ` Qu Wenruo 0 siblings, 2 replies; 54+ messages in thread From: John Ettedgui @ 2015-08-01 4:35 UTC (permalink / raw) To: Qu Wenruo; +Cc: btrfs On Thu, Jul 30, 2015 at 10:45 PM, John Ettedgui <john.ettedgui@gmail.com> wrote: > On Thu, Jul 30, 2015 at 10:40 PM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote: >> >> It seems that you're using Chromium while doing the dump. :) >> If no CD drive, I'll recommend to use Archlinux installation iso to make a >> bootable USB stick and do the dump. >> (just download and dd would do the trick) >> As its kernel and tools is much newer than most distribution. So I did not have any usb sticks large enough for this task (only 4Gb) so I restarted into emergency runlevel with only / mounted and as ro, I hope that'll do. >> >> It's better to provide two trace. >> One is the function tracer one, with "btrfs:*" as set_event. >> The other is the function_graph one. with "btrfs_mount" as >> set_graph_function. So I got 2 new traces, and I am hoping that these are what you meant, but I am still not sure. Here are the commands I used in case...:

trace-cmd record -o trace-function_graph.dat -p function_graph -g btrfs_mount mount MountPoint

and

trace-cmd record -o trace-function_graph.dat -p function -l 'btrfs_*' mount MountPoint

(using -e btrfs only led to a crash but -l 'btrfs_*' passed, though I am sure they have different purposes.. I hope that's the correct one) The first one was so big, 2Gb, I had to use xz to compress it and host it somewhere else, the ML would most likely not take it. The other one is quite small but I hosted it in the same place.... 
Thanks, John ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: mount btrfs takes 30 minutes, btrfs check runs out of memory 2015-08-01 4:35 ` John Ettedgui @ 2015-08-01 10:05 ` Russell Coker 2015-08-04 1:39 ` Qu Wenruo 1 sibling, 0 replies; 54+ messages in thread From: Russell Coker @ 2015-08-01 10:05 UTC (permalink / raw) To: John Ettedgui, btrfs On Sat, 1 Aug 2015 02:35:39 PM John Ettedgui wrote: > >> It seems that you're using Chromium while doing the dump. :) > >> If no CD drive, I'll recommend to use Archlinux installation iso to make > >> a bootable USB stick and do the dump. > >> (just download and dd would do the trick) > >> As its kernel and tools is much newer than most distribution. > > So I did not have any usb sticks large enough for this task (only 4Gb) > so I restarted into emergency runlevel with only / mounted and as ro, > I hope that'll do. The Debian/Jessie Netinst image is about 120M and allows you to launch a shell. If you want a newer kernel you could rebuild the Debian Netinst yourself. Also a basic text-only Linux installation takes a lot less than 4G of storage. I have a couple of 1G USB sticks with Debian installed that I use to fix things. -- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/ ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: mount btrfs takes 30 minutes, btrfs check runs out of memory 2015-08-01 4:35 ` John Ettedgui 2015-08-01 10:05 ` Russell Coker @ 2015-08-04 1:39 ` Qu Wenruo 2015-08-04 1:55 ` John Ettedgui 1 sibling, 1 reply; 54+ messages in thread From: Qu Wenruo @ 2015-08-04 1:39 UTC (permalink / raw) To: John Ettedgui; +Cc: btrfs John Ettedgui wrote on 2015/07/31 21:35 -0700: > On Thu, Jul 30, 2015 at 10:45 PM, John Ettedgui <john.ettedgui@gmail.com> wrote: >> On Thu, Jul 30, 2015 at 10:40 PM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote: >>> >>> It seems that you're using Chromium while doing the dump. :) >>> If no CD drive, I'll recommend to use Archlinux installation iso to make a >>> bootable USB stick and do the dump. >>> (just download and dd would do the trick) >>> As its kernel and tools is much newer than most distribution. > So I did not have any usb sticks large enough for this task (only 4Gb) > so I restarted into emergency runlevel with only / mounted and as ro, > I hope that'll do. >>> >>> It's better to provide two trace. >>> One is the function tracer one, with "btrfs:*" as set_event. >>> The other is the function_graph one. with "btrfs_mount" as >>> set_graph_function. > So I got 2 new traces, and I am hoping that these are what you meant, > but I am still not sure. > Here are the commands I used in case...: > > trace-cmd record -o > trace-function_graph.dat -p function_graph -g btrfs_mount mount MountPoint > > and > > trace-function_graph.dat -p function -l 'btrfs_*' mount MountPoint > (using -e btrfs only lead to a crash but -l 'btrfs_*' passed, though I > am sure they have different purposes.. I hope that's the correct one) > > The first one was so big, 2Gb, I had to use xz to compress it and host > it somewhere else, the ML would most likely not take it. > The other one is quite small but I hosted it in the same place.... 
> Here are the links: > https://mega.nz/#!8tgTjKyK!XJnWH05bsv9sJ3nANIxKsdkL20RePPS4cKgWSxit0eQ > https://mega.nz/#!xopkVA6L!z9xjo3us1Nv6wdOs05jNZdhNbiAP5yeLdneEp0huUzI > > I hope that was it this time! Oh, you were using trace-cmd, that's why the data is so huge. I was originally hoping you just copy the trace file, which is human readable and not so huge. But that's OK anyway. I'll try to analyse it to find a clue if possible. Thanks, Qu > Thanks, > John > ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: mount btrfs takes 30 minutes, btrfs check runs out of memory 2015-08-04 1:39 ` Qu Wenruo @ 2015-08-04 1:55 ` John Ettedgui 2015-08-04 2:31 ` John Ettedgui 2015-08-04 3:01 ` Qu Wenruo 0 siblings, 2 replies; 54+ messages in thread From: John Ettedgui @ 2015-08-04 1:55 UTC (permalink / raw) To: Qu Wenruo; +Cc: btrfs On Mon, Aug 3, 2015 at 6:39 PM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote: > > Oh, you were using trace-cmd, that's why the data is so huge. Oh, I thought it was just automating the work for me, but without any sort of impact. > > I was originally hoping you just copy the trace file, which is human > readable and not so huge. If you mean something like the output of trace-cmd report, it was actually bigger than the dat files (about twice the size) that's why I shared the dats instead. If you want the reports instead I'll gladly share them. > > But that's OK anyway. > > I'll try to analyse it to find a clue if possible. > > Thanks, > Qu Great thank you! By the way, I just thought of a few things to mention. This btrfs partition is an ext4 converted partition, and I hit the same behavior as these guys under heavy load: http://www.spinics.net/lists/linux-btrfs/msg44660.html http://www.spinics.net/lists/linux-btrfs/msg44191.html I don't think it's related to the crash, but maybe to the conversion? Thanks Qu! John ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: mount btrfs takes 30 minutes, btrfs check runs out of memory 2015-08-04 1:55 ` John Ettedgui @ 2015-08-04 2:31 ` John Ettedgui 2015-08-04 3:01 ` Qu Wenruo 1 sibling, 0 replies; 54+ messages in thread From: John Ettedgui @ 2015-08-04 2:31 UTC (permalink / raw) To: Qu Wenruo; +Cc: btrfs On Mon, Aug 3, 2015 at 6:55 PM, John Ettedgui <john.ettedgui@gmail.com> wrote: > On Mon, Aug 3, 2015 at 6:39 PM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote: >> >> Oh, you were using trace-cmd, that's why the data is so huge. > Oh, I thought it was just automating the work for me, but without any > sort of impact. >> >> I was originally hoping you just copy the trace file, which is human >> readable and not so huge. > If you mean something like the ouput of trace-cmd report, it was > actually bigger than the dat files (about twice the size) that's why I > shared the dats instead. > If you want the reports instead I'll gladly share them. In case it helps here are the reports instead of the dats: https://mega.co.nz/#!FwpwHQyL!m0dQHSfQSNGzw9yUwJ6l0eb7Mzta0pOSAf1JHDZ1zfo https://mega.co.nz/#!B1JgXLxZ!oI1bm0RyhqFbkCWnT95GNKohGozmvqxgJDSUtVdo77s I guess once compressed the size difference is meaningless Thanks, John ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: mount btrfs takes 30 minutes, btrfs check runs out of memory 2015-08-04 1:55 ` John Ettedgui 2015-08-04 2:31 ` John Ettedgui @ 2015-08-04 3:01 ` Qu Wenruo 2015-08-04 4:58 ` John Ettedgui 2015-08-04 14:38 ` Chris Murphy 1 sibling, 2 replies; 54+ messages in thread From: Qu Wenruo @ 2015-08-04 3:01 UTC (permalink / raw) To: John Ettedgui; +Cc: btrfs John Ettedgui wrote on 2015/08/03 18:55 -0700: > On Mon, Aug 3, 2015 at 6:39 PM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote: >> >> Oh, you were using trace-cmd, that's why the data is so huge. > Oh, I thought it was just automating the work for me, but without any > sort of impact. >> >> I was originally hoping you just copy the trace file, which is human >> readable and not so huge. > If you mean something like the ouput of trace-cmd report, it was > actually bigger than the dat files (about twice the size) that's why I > shared the dats instead. > If you want the reports instead I'll gladly share them. Nop, not the report, but /sys/kernel/debug/tracing/trace. But that needs some manual operation, like set event and graph functions. >> >> But that's OK anyway. >> >> I'll try to analyse it to find a clue if possible. >> >> Thanks, >> Qu > Great thank you! > > By the way, I just thought of a few things to mention. > This btrfs partition is an ext4 converted partition, and I hit the > same behavior as these guys under heavy load: > http://www.spinics.net/lists/linux-btrfs/msg44660.html > http://www.spinics.net/lists/linux-btrfs/msg44191.html > I don't think it's related to the crash, but maybe to the conversion? Oh, converted... That's too bad. :( [[What's wrong with convert]] Although btrfs is flex enough in theory to fit itself into the free space of ext* and works fine, But in practice, ext* is too fragmental in the standard of btrfs, not to mention it also enables mixed-blockgroup. 
[[Recommendations]] I'd recommend to delete the ext*_img subvolume and rebalance all chunks in the fs if you're stick to the converted filesystem. Although the best practice is staying away from such converted fs, either using pure, newly created btrfs, or convert back to ext* before any balance. [[But before that, just try something]] But you have already provided some interesting facts. As the filesystem is high fragmented, I'd like to recommend to do some little test: (BTW I assume you don't use some special mount options) To test if it's the space cache causing the mount speed drop. 1) clear page cache # echo 3 > /proc/sys/vm/drop_caches 2) Do a normal mount Just as what you do as usual, with your normal mount options Record the mount time 3) umount it. 4) clear page cache # echo 3 > /proc/sys/vm/drop_caches 5) mount it with "clear_cache" mount option It may takes sometime to clear the existing cache. It's just used to clear space cache. Don't compare mount time! 6) umount it 7) clear page cache # echo 3 > /proc/sys/vm/drop_caches 8) mount with "nospace_cache" mount option To see if there is obvious mount time change. Hopes that's the space cache thing causing the slow mount. But don't expect it too much anyway, it's just one personal guess. After the test, I'd recommend to follow the [[Recommendations]] if you just want a stable filesystem. Thanks, Qu > > Thanks Qu! > John > ^ permalink raw reply [flat|nested] 54+ messages in thread
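[Editor's note: Qu's eight numbered steps above can be run back to back with a small script like this sketch; DEV and MNT are placeholders, and only the first and last timings are comparable (the clear_cache mount does extra work by design):]

```shell
#!/bin/bash
# Back-to-back version of the space-cache mount-time test; run as root.
DEV=/dev/sdXn          # placeholder: the slow btrfs device
MNT=/mnt/data          # placeholder: its mount point

timed_mount() {                               # $1 = extra mount options ("" = none)
    echo 3 > /proc/sys/vm/drop_caches         # steps 1/4/7: drop the page cache
    local start=$(date +%s)
    mount ${1:+-o "$1"} "$DEV" "$MNT"
    echo "mount with '${1:-default options}' took $(( $(date +%s) - start ))s"
    umount "$MNT"
}

timed_mount ""              # step 2: normal mount, record the time
timed_mount clear_cache     # step 5: clears/rebuilds the space cache
timed_mount nospace_cache   # step 8: mount without the space cache
```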
* Re: mount btrfs takes 30 minutes, btrfs check runs out of memory 2015-08-04 3:01 ` Qu Wenruo @ 2015-08-04 4:58 ` John Ettedgui 2015-08-04 6:47 ` Duncan 2015-08-04 11:28 ` Austin S Hemmelgarn 2015-08-04 14:38 ` Chris Murphy 1 sibling, 2 replies; 54+ messages in thread From: John Ettedgui @ 2015-08-04 4:58 UTC (permalink / raw) To: Qu Wenruo; +Cc: btrfs On Mon, Aug 3, 2015 at 8:01 PM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote: > Oh, converted... > That's too bad. :( > > [[What's wrong with convert]] > Although btrfs is flex enough in theory to fit itself into the free space of > ext* and works fine, > But in practice, ext* is too fragmental in the standard of btrfs, not to > mention it also enables mixed-blockgroup. > Oh oh :/ > > [[Recommendations]] > I'd recommend to delete the ext*_img subvolume and rebalance all chunks in > the fs if you're stick to the converted filesystem. > Already done (well the rebalance crashed towards the end both time with the read only error, but someone on #btrfs looked at my partition stats and said it was probably good enough) > Although the best practice is staying away from such converted fs, either > using pure, newly created btrfs, or convert back to ext* before any balance. > Unfortunately I don't have enough hard drive space to do a clean btrfs, so my only way to use btrfs for that partition was a conversion. > [[But before that, just try something]] > But you have already provided some interesting facts. As the filesystem is > high fragmented, I'd like to recommend to do some little test: > (BTW I assume you don't use some special mount options) Current mount options in fstab: btrfs defaults,noatime,compress=lzo,space_cache,autodefrag 0 0 It's the same as my other btrfs partitions, apart from the fact that they are on a SSD and way smaller. > To test if it's the space cache causing the mount speed drop. 
> > 1) clear page cache > # echo 3 > /proc/sys/vm/drop_caches > 2) Do a normal mount > Just as what you do as usual, with your normal mount options > Record the mount time 0.01s user 0.42s system 0% cpu 1:01.70 total > 3) umount it. not asked but might as well: 0.00s user 0.65s system 1% cpu 35.536 total > 4) clear page cache > # echo 3 > /proc/sys/vm/drop_caches > 5) mount it with "clear_cache" mount option > It may takes sometime to clear the existing cache. > It's just used to clear space cache. > Don't compare mount time! Yes I know it's supposed to be slower :) although... it was pretty much the same actually: 0.01s user 0.44s system 0% cpu 1:02.07 total > 6) umount it > 7) clear page cache > # echo 3 > /proc/sys/vm/drop_caches Is it ok if that value never changed since 1) ? > 8) mount with "nospace_cache" mount option > To see if there is obvious mount time change. > 0.00s user 0.44s system 0% cpu 1:01.86 total > Hopes that's the space cache thing causing the slow mount. > But don't expect it too much anyway, it's just one personal guess. > Unfortunately it is about the same :/ > After the test, I'd recommend to follow the [[Recommendations]] if you just > want a stable filesystem. > I am already within these recommendations I think. Thanks! ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: mount btrfs takes 30 minutes, btrfs check runs out of memory 2015-08-04 4:58 ` John Ettedgui @ 2015-08-04 6:47 ` Duncan 2015-08-04 11:28 ` Austin S Hemmelgarn 1 sibling, 0 replies; 54+ messages in thread From: Duncan @ 2015-08-04 6:47 UTC (permalink / raw) To: linux-btrfs John Ettedgui posted on Mon, 03 Aug 2015 21:58:09 -0700 as excerpted: > Current mount options in fstab: > defaults,noatime,compress=lzo,space_cache,autodefrag 0 0 Just a few hints for a tidier fstab. Feel free to ignore if you don't care, as the practical difference in mount options is nil. =:^) 1) You should be able to delete that space_cache option. Btrfs has defaulted to space_cache since at least 3.0 I think and probably way before that, and even when it wasn't the absolute default, you only had to enable it once, to have it on after that unless you turned it off again. I know I've never specifically added space_cache to my mount options, yet /proc/mounts always has said it was there, and I've been on btrfs solidly since kernel 3.5 era, with tests before that (tho I do think I had to turn it on once, after which it stayed on for that filesystem, back in my earliest tests, which would have been late kernel 2.6 era). 2) Similarly you can omit defaults, since that's only a field placeholder in case you don't have any other options in that field. As soon as you have your first non-default option holding the place of that field, you can omit defaults, since that's exactly what they are, defaults, regardless of whether the kernel is told to use them or not. So all you really need there is noatime,compress=lzo,autodefrag. FWIW, that's what I use as my normal mount options, too. 3) Actually, assuming you're running a half-way modern util-linux (which you should be if you're not on an old enterprise distro), you can omit the trailing 0 0 as well, since those fields are now optional and default to 0 if they aren't there. See the fstab(5) manpage for more on the last two. 
-- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 54+ messages in thread
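[Editor's note: Duncan's three hints boil down to an fstab entry like the following; the UUID and mount point are placeholders:]

```
# <file system>                             <mount point>  <type>  <options>
UUID=abcd1234-0000-0000-0000-000000000000   /mnt/data      btrfs   noatime,compress=lzo,autodefrag
```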
* Re: mount btrfs takes 30 minutes, btrfs check runs out of memory 2015-08-04 4:58 ` John Ettedgui 2015-08-04 6:47 ` Duncan @ 2015-08-04 11:28 ` Austin S Hemmelgarn 2015-08-04 17:36 ` John Ettedgui 1 sibling, 1 reply; 54+ messages in thread From: Austin S Hemmelgarn @ 2015-08-04 11:28 UTC (permalink / raw) To: John Ettedgui, Qu Wenruo; +Cc: btrfs [-- Attachment #1: Type: text/plain, Size: 2013 bytes --] On 2015-08-04 00:58, John Ettedgui wrote: > On Mon, Aug 3, 2015 at 8:01 PM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote: >> Although the best practice is staying away from such converted fs, either >> using pure, newly created btrfs, or convert back to ext* before any balance. >> > Unfortunately I don't have enough hard drive space to do a clean > btrfs, so my only way to use btrfs for that partition was a > conversion. If you could get your hands on a decent sized flash drive (32G or more), you could do an incremental conversion offline. The steps would look something like this: 1. Boot the system into a LiveCD or something similar that doesn't need to run from your regular root partition (SystemRescueCD would be my personal recommendation, although if you go that way, make sure to boot the alternative kernel, as it's a lot newer then the standard ones). 2. Plug in the flash drive, format it as BTRFS. 3. Mount both your old partition and the flash drive somewhere. 4. Start copying files from the old partition to the flash drive. 5. When you hit ENOSPC on the flash drive, unmount the old partition, shrink it down to the minimum size possible, and create a new partition in the free space produced by doing so. 6. Add the new partition to the BTRFS filesystem on the flash drive. 7. Repeat steps 4-6 until you have copied everything. 8. Wipe the old partition, and add it to the BTRFS filesystem. 9. Run a full balance on the new BTRFS filesystem. 10. 
Delete the partition from step 5 that is closest to the old partition (via btrfs device delete), then resize the old partition to fill the space that the deleted partition took up. 11. Repeat steps 9-10 until the only remaining partitions in the new BTRFS filesystem are the old one and the flash drive. 12. Delete the flash drive from the BTRFS filesystem. This takes some time and coordination, but it does work reliably as long as you are careful (I've done it before on multiple systems). [-- Attachment #2: S/MIME Cryptographic Signature --] [-- Type: application/pkcs7-signature, Size: 3019 bytes --] ^ permalink raw reply [flat|nested] 54+ messages in thread
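[Editor's note: the copy/shrink/add loop in Austin's steps 4-7 has roughly this shape. This is an illustration, not a tested migration script: every device name and path is a placeholder, rsync stands in for the unspecified "copying files" step (it resumes cleanly after ENOSPC), and the interactive shrink step is left as a comment because it is partition-table-specific:]

```shell
#!/bin/bash
OLD_DEV=/dev/sdX1      # placeholder: the original ext4 partition
OLD=/mnt/old           # where the old filesystem is mounted (read-only)
NEW=/mnt/new           # the growing btrfs filesystem, starting on the flash drive

until rsync -aH "$OLD"/ "$NEW"/; do           # step 4: copy until ENOSPC
    umount "$OLD"
    # Step 5 (manual): shrink the ext4 fs and its partition with
    # resize2fs + fdisk/parted, then create /dev/sdXn in the freed space.
    btrfs device add /dev/sdXn "$NEW"         # step 6: grow the target fs
    mount -o ro "$OLD_DEV" "$OLD"
done                                          # step 7: repeat until done

btrfs balance start "$NEW"                    # step 9: full balance
```

The `until` loop encodes the "repeat steps 4-6" instruction: rsync exits non-zero when it hits ENOSPC, which triggers another shrink/add round.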
* Re: mount btrfs takes 30 minutes, btrfs check runs out of memory 2015-08-04 11:28 ` Austin S Hemmelgarn @ 2015-08-04 17:36 ` John Ettedgui 2015-08-05 11:30 ` Austin S Hemmelgarn 0 siblings, 1 reply; 54+ messages in thread From: John Ettedgui @ 2015-08-04 17:36 UTC (permalink / raw) To: Austin S Hemmelgarn; +Cc: Qu Wenruo, btrfs On Tue, Aug 4, 2015 at 4:28 AM, Austin S Hemmelgarn <ahferroin7@gmail.com> wrote: > On 2015-08-04 00:58, John Ettedgui wrote: >> >> On Mon, Aug 3, 2015 at 8:01 PM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote: >>> >>> Although the best practice is staying away from such converted fs, either >>> using pure, newly created btrfs, or convert back to ext* before any >>> balance. >>> >> Unfortunately I don't have enough hard drive space to do a clean >> btrfs, so my only way to use btrfs for that partition was a >> conversion. > > If you could get your hands on a decent sized flash drive (32G or more), you > could do an incremental conversion offline. The steps would look something > like this: > > 1. Boot the system into a LiveCD or something similar that doesn't need to > run from your regular root partition (SystemRescueCD would be my personal > recommendation, although if you go that way, make sure to boot the > alternative kernel, as it's a lot newer then the standard ones). > 2. Plug in the flash drive, format it as BTRFS. > 3. Mount both your old partition and the flash drive somewhere. > 4. Start copying files from the old partition to the flash drive. > 5. When you hit ENOSPC on the flash drive, unmount the old partition, shrink > it down to the minimum size possible, and create a new partition in the free > space produced by doing so. > 6. Add the new partition to the BTRFS filesystem on the flash drive. > 7. Repeat steps 4-6 until you have copied everything. > 8. Wipe the old partition, and add it to the BTRFS filesystem. > 9. Run a full balance on the new BTRFS filesystem. > 10. 
Delete the partition from step 5 that is closest to the old partition > (via btrfs device delete), then resize the old partition to fill the space > that the deleted partition took up. > 11. Repeat steps 9-10 until the only remaining partitions in the new BTRFS > filesystem are the old one and the flash drive. > 12. Delete the flash drive from the BTRFS filesystem. > > This takes some time and coordination, but it does work reliably as long as > you are careful (I've done it before on multiple systems). > > I suppose I could do that even without the flash as I have some free space anyway, but moving Tbs of data with Gbs of free space will take days, plus the repartitioning. It'd probably be easier to start with a 1Tb drive or something. Is this currently my best bet as conversion is not as good as I thought? I believe my other 2 partitions also come from conversion, though I may have rebuilt them later from scratch. Thank you! John ^ permalink raw reply [flat|nested] 54+ messages in thread
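Austin's 12-step procedure can be sketched as a dry-run shell script. Everything here is illustrative: the device names (/dev/sdc1 for the flash drive, /dev/sdb1 for the old partition, /dev/sdb2 for the partition carved out in step 5) and mount points are placeholders, and with DRY_RUN=1 (the default) every command is only printed, never executed:

```shell
#!/bin/sh
# Dry-run sketch of the incremental conversion above; all device
# names are placeholders for your actual layout.
DRY_RUN=${DRY_RUN:-1}
OLD_PART=/dev/sdb1   # existing ext4 partition (placeholder)
FLASH=/dev/sdc1      # helper flash drive (placeholder)
MNT_NEW=/mnt/new
MNT_OLD=/mnt/old

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Steps 2-3: format the flash drive, mount both filesystems.
run mkfs.btrfs -f "$FLASH"
run mount "$FLASH" "$MNT_NEW"
run mount -o ro "$OLD_PART" "$MNT_OLD"

# Steps 4-7: copy until ENOSPC, then shrink the old fs, carve a new
# partition from the freed space, and add it to the new fs. The
# shrink/repartition steps are tool-specific, so they stay manual.
run cp -a "$MNT_OLD/some-dir" "$MNT_NEW/"   # repeat per directory
run btrfs device add /dev/sdb2 "$MNT_NEW"   # partition from step 5

# Steps 9-12: rebalance, then migrate off the helper devices.
run btrfs balance start "$MNT_NEW"
run btrfs device delete /dev/sdb2 "$MNT_NEW"
run btrfs device delete "$FLASH" "$MNT_NEW"
```

As the thread notes, automating this for real is risky, which is why the dry-run default is deliberate; recovering from a mistake mid-migration is much easier when each step is run by hand.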
* Re: mount btrfs takes 30 minutes, btrfs check runs out of memory 2015-08-04 17:36 ` John Ettedgui @ 2015-08-05 11:30 ` Austin S Hemmelgarn 2015-08-13 22:38 ` Vincent Olivier [not found] ` <CAJ3TwYSW+SvbBrh1u_x+c3HTRx03qSR6BoH5cj_VzCXxZYv6EA@mail.gmail.com> 0 siblings, 2 replies; 54+ messages in thread From: Austin S Hemmelgarn @ 2015-08-05 11:30 UTC (permalink / raw) To: John Ettedgui; +Cc: Qu Wenruo, btrfs [-- Attachment #1: Type: text/plain, Size: 2982 bytes --] On 2015-08-04 13:36, John Ettedgui wrote: > On Tue, Aug 4, 2015 at 4:28 AM, Austin S Hemmelgarn > <ahferroin7@gmail.com> wrote: >> On 2015-08-04 00:58, John Ettedgui wrote: >>> >>> On Mon, Aug 3, 2015 at 8:01 PM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote: >>>> >>>> Although the best practice is staying away from such converted fs, either >>>> using pure, newly created btrfs, or convert back to ext* before any >>>> balance. >>>> >>> Unfortunately I don't have enough hard drive space to do a clean >>> btrfs, so my only way to use btrfs for that partition was a >>> conversion. >> >> If you could get your hands on a decent sized flash drive (32G or more), you >> could do an incremental conversion offline. The steps would look something >> like this: >> >> 1. Boot the system into a LiveCD or something similar that doesn't need to >> run from your regular root partition (SystemRescueCD would be my personal >> recommendation, although if you go that way, make sure to boot the >> alternative kernel, as it's a lot newer then the standard ones). >> 2. Plug in the flash drive, format it as BTRFS. >> 3. Mount both your old partition and the flash drive somewhere. >> 4. Start copying files from the old partition to the flash drive. >> 5. When you hit ENOSPC on the flash drive, unmount the old partition, shrink >> it down to the minimum size possible, and create a new partition in the free >> space produced by doing so. >> 6. Add the new partition to the BTRFS filesystem on the flash drive. >> 7. 
Repeat steps 4-6 until you have copied everything. >> 8. Wipe the old partition, and add it to the BTRFS filesystem. >> 9. Run a full balance on the new BTRFS filesystem. >> 10. Delete the partition from step 5 that is closest to the old partition >> (via btrfs device delete), then resize the old partition to fill the space >> that the deleted partition took up. >> 11. Repeat steps 9-10 until the only remaining partitions in the new BTRFS >> filesystem are the old one and the flash drive. >> 12. Delete the flash drive from the BTRFS filesystem. >> >> This takes some time and coordination, but it does work reliably as long as >> you are careful (I've done it before on multiple systems). >> >> > I suppose I could do that even without the flash as I have some free > space anyway, but moving Tbs of data with Gbs of free space will take > days, plus the repartitioning. It'd probably be easier to start with a > 1Tb drive or something. > Is this currently my best bet as conversion is not as good as I thought? > > I believe my other 2 partitions also come from conversion, though I > may have rebuilt them later from scratch. > > Thank you! > John > Yeah, you're probably better off getting a TB disk and starting with that. In theory it is possible to automate the process, but I would advise against that if at all possible, it's a lot easier to recover from an error if you're doing it manually. [-- Attachment #2: S/MIME Cryptographic Signature --] [-- Type: application/pkcs7-signature, Size: 3019 bytes --] ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: mount btrfs takes 30 minutes, btrfs check runs out of memory 2015-08-05 11:30 ` Austin S Hemmelgarn @ 2015-08-13 22:38 ` Vincent Olivier 2015-08-13 23:19 ` Chris Murphy [not found] ` <CAJ3TwYSW+SvbBrh1u_x+c3HTRx03qSR6BoH5cj_VzCXxZYv6EA@mail.gmail.com> 1 sibling, 1 reply; 54+ messages in thread From: Vincent Olivier @ 2015-08-13 22:38 UTC (permalink / raw) To: linux-btrfs Hi, I think I might be having this problem too. 12 x 4TB RAID10 (original makefs, not converted from ext or whatnot). Says it has ~6TiB left. Centos 7. Dual Xeon CPU. 32GB RAM. ELRepo Kernel 4.1.5. Fstab options: noatime,autodefrag,compress=zlib,space_cache,nossd,noauto,x-systemd.automount Sometimes (not all the time) when I cd or ls the mount point it will not return within 5 minutes (I never let it run more than 5 minutes before rebooting) and I reboot and then it takes between 10-30s. Well as I'm writing this it's already been more than 10 minutes. I don't have the problem when I mount manually without the "noauto,x-systemd.automount" options. Can anyone help ? Thanks. Vincent -----Original Message----- From: "Austin S Hemmelgarn" <ahferroin7@gmail.com> Sent: Wednesday, August 5, 2015 07:30 To: "John Ettedgui" <john.ettedgui@gmail.com> Cc: "Qu Wenruo" <quwenruo@cn.fujitsu.com>, "btrfs" <linux-btrfs@vger.kernel.org> Subject: Re: mount btrfs takes 30 minutes, btrfs check runs out of memory On 2015-08-04 13:36, John Ettedgui wrote: > On Tue, Aug 4, 2015 at 4:28 AM, Austin S Hemmelgarn > <ahferroin7@gmail.com> wrote: >> On 2015-08-04 00:58, John Ettedgui wrote: >>> >>> On Mon, Aug 3, 2015 at 8:01 PM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote: >>>> >>>> Although the best practice is staying away from such converted fs, either >>>> using pure, newly created btrfs, or convert back to ext* before any >>>> balance. >>>> >>> Unfortunately I don't have enough hard drive space to do a clean >>> btrfs, so my only way to use btrfs for that partition was a >>> conversion. 
>> >> If you could get your hands on a decent sized flash drive (32G or more), you >> could do an incremental conversion offline. The steps would look something >> like this: >> >> 1. Boot the system into a LiveCD or something similar that doesn't need to >> run from your regular root partition (SystemRescueCD would be my personal >> recommendation, although if you go that way, make sure to boot the >> alternative kernel, as it's a lot newer then the standard ones). >> 2. Plug in the flash drive, format it as BTRFS. >> 3. Mount both your old partition and the flash drive somewhere. >> 4. Start copying files from the old partition to the flash drive. >> 5. When you hit ENOSPC on the flash drive, unmount the old partition, shrink >> it down to the minimum size possible, and create a new partition in the free >> space produced by doing so. >> 6. Add the new partition to the BTRFS filesystem on the flash drive. >> 7. Repeat steps 4-6 until you have copied everything. >> 8. Wipe the old partition, and add it to the BTRFS filesystem. >> 9. Run a full balance on the new BTRFS filesystem. >> 10. Delete the partition from step 5 that is closest to the old partition >> (via btrfs device delete), then resize the old partition to fill the space >> that the deleted partition took up. >> 11. Repeat steps 9-10 until the only remaining partitions in the new BTRFS >> filesystem are the old one and the flash drive. >> 12. Delete the flash drive from the BTRFS filesystem. >> >> This takes some time and coordination, but it does work reliably as long as >> you are careful (I've done it before on multiple systems). >> >> > I suppose I could do that even without the flash as I have some free > space anyway, but moving Tbs of data with Gbs of free space will take > days, plus the repartitioning. It'd probably be easier to start with a > 1Tb drive or something. > Is this currently my best bet as conversion is not as good as I thought? 
> > I believe my other 2 partitions also come from conversion, though I > may have rebuilt them later from scratch. > > Thank you! > John > Yeah, you're probably better off getting a TB disk and starting with that. In theory it is possible to automate the process, but I would advise against that if at all possible, it's a lot easier to recover from an error if you're doing it manually. ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: mount btrfs takes 30 minutes, btrfs check runs out of memory 2015-08-13 22:38 ` Vincent Olivier @ 2015-08-13 23:19 ` Chris Murphy 2015-08-14 0:30 ` Duncan 2015-08-14 2:39 ` Vincent Olivier 0 siblings, 2 replies; 54+ messages in thread From: Chris Murphy @ 2015-08-13 23:19 UTC (permalink / raw) To: Btrfs BTRFS On Thu, Aug 13, 2015 at 4:38 PM, Vincent Olivier <vincent@up4.com> wrote: > Hi, > > I think I might be having this problem too. 12 x 4TB RAID10 (original makefs, not converted from ext or whatnot). Says it has ~6TiB left. Centos 7. Dual Xeon CPU. 32GB RAM. ELRepo Kernel 4.1.5. Fstab options: noatime,autodefrag,compress=zlib,space_cache,nossd,noauto,x-systemd.automount Well I think others have suggested 3000 snapshots and quite a few things will get very slow. But then also you have autodefrag and I forget the interaction of this with many snapshots since the snapshot aware defrag code was removed. I'd say file a bug with the full details of the hardware from the ground up to the Btrfs file system. And include as an attachment, dmesg with sysrq+t during this "hang". Usually I see t asked if there's just slowness/delays, and w if there's already a kernel message saying there's a blocked task for 120 seconds. -- Chris Murphy ^ permalink raw reply [flat|nested] 54+ messages in thread
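Chris's sysrq suggestion can be wrapped in a small helper. This is only a sketch (the log filename scheme is made up); it defines a function and runs nothing until you call it, as root, during the hang:

```shell
#!/bin/sh
# Capture a task-state dump for a bug report. 't' dumps the state of
# every task; 'w' dumps only blocked (uninterruptible) tasks, useful
# once the kernel has already warned about a 120-second hung task.
capture_sysrq() {
    key=${1:-t}
    echo 1 > /proc/sys/kernel/sysrq        # make sure sysrq is enabled
    echo "$key" > /proc/sysrq-trigger      # dump goes to the kernel log
    dmesg > "sysrq-$key-$(date +%s).log"   # attach this to the report
}
# Usage (as root, while the mount or ls is hanging):
#   capture_sysrq t
```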
* Re: mount btrfs takes 30 minutes, btrfs check runs out of memory 2015-08-13 23:19 ` Chris Murphy @ 2015-08-14 0:30 ` Duncan 2015-08-14 2:42 ` Vincent Olivier 2015-08-14 2:39 ` Vincent Olivier 1 sibling, 1 reply; 54+ messages in thread From: Duncan @ 2015-08-14 0:30 UTC (permalink / raw) To: linux-btrfs Chris Murphy posted on Thu, 13 Aug 2015 17:19:41 -0600 as excerpted: > Well I think others have suggested 3000 snapshots and quite a few things > will get very slow. But then also you have autodefrag and I forget the > interaction of this with many snapshots since the snapshot aware defrag > code was removed. Autodefrag shouldn't have any snapshots mount-time-related interaction, with snapshot-aware-defrag disabled. The interaction between defrag (auto or not) and snapshots will be additional data space usage, since with snapshot-aware disabled, defrag only works with the current copy, thus forcing it to COW the extents elsewhere while not freeing the old extents as they're still referenced by the snapshots, but it shouldn't affect mount-time. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: mount btrfs takes 30 minutes, btrfs check runs out of memory 2015-08-14 0:30 ` Duncan @ 2015-08-14 2:42 ` Vincent Olivier 2015-08-18 17:36 ` Vincent Olivier 0 siblings, 1 reply; 54+ messages in thread From: Vincent Olivier @ 2015-08-14 2:42 UTC (permalink / raw) To: Duncan; +Cc: linux-btrfs I'll try without autodefrag anyways tomorrow just to make sure. And then file a bug report too with however it decides to behave. Vincent -----Original Message----- From: "Duncan" <1i5t5.duncan@cox.net> Sent: Thursday, August 13, 2015 20:30 To: linux-btrfs@vger.kernel.org Subject: Re: mount btrfs takes 30 minutes, btrfs check runs out of memory Chris Murphy posted on Thu, 13 Aug 2015 17:19:41 -0600 as excerpted: > Well I think others have suggested 3000 snapshots and quite a few things > will get very slow. But then also you have autodefrag and I forget the > interaction of this with many snapshots since the snapshot aware defrag > code was removed. Autodefrag shouldn't have any snapshots mount-time-related interaction, with snapshot-aware-defrag disabled. The interaction between defrag (auto or not) and snapshots will be additional data space usage, since with snapshot-aware disabled, defrag only works with the current copy, thus forcing it to COW the extents elsewhere while not freeing the old extents as they're still referenced by the snapshots, but it shouldn't affect mount-time. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: mount btrfs takes 30 minutes, btrfs check runs out of memory 2015-08-14 2:42 ` Vincent Olivier @ 2015-08-18 17:36 ` Vincent Olivier 0 siblings, 0 replies; 54+ messages in thread From: Vincent Olivier @ 2015-08-18 17:36 UTC (permalink / raw) To: Duncan, linux-btrfs It appears that it might be related to mounting by label/UUID from fstab at boot. When I mount manually without the "noauto,x-systemd.automount" options, using the first device reported by "btrfs fi show" after a "btrfs device scan", I never get the problem. Does this sound familiar? I thought I was safe with a UUID mount in fstab. I can (temporarily) live with mounting this filesystem manually, but I would appreciate being able to mount it at boot time via fstab. Thanks, Vincent -----Original Message----- From: "Vincent Olivier" <vincent@up4.com> Sent: Thursday, August 13, 2015 22:42 To: "Duncan" <1i5t5.duncan@cox.net> Cc: linux-btrfs@vger.kernel.org Subject: Re: mount btrfs takes 30 minutes, btrfs check runs out of memory I'll try without autodefrag anyway tomorrow just to make sure, and then file a bug report too with however it decides to behave. Vincent -----Original Message----- From: "Duncan" <1i5t5.duncan@cox.net> Sent: Thursday, August 13, 2015 20:30 To: linux-btrfs@vger.kernel.org Subject: Re: mount btrfs takes 30 minutes, btrfs check runs out of memory Chris Murphy posted on Thu, 13 Aug 2015 17:19:41 -0600 as excerpted: > Well I think others have suggested 3000 snapshots and quite a few things > will get very slow. But then also you have autodefrag and I forget the > interaction of this with many snapshots since the snapshot aware defrag > code was removed. Autodefrag shouldn't have any snapshots mount-time-related interaction, with snapshot-aware-defrag disabled.
The interaction between defrag (auto or not) and snapshots will be additional data space usage, since with snapshot-aware disabled, defrag only works with the current copy, thus forcing it to COW the extents elsewhere while not freeing the old extents as they're still referenced by the snapshots, but it shouldn't affect mount-time. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 54+ messages in thread
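For reference, a plain fstab entry for a volume like Vincent's, minus the automount options, might look like the sketch below. The UUID and mount point are placeholders. One plausible reading of his report is that x-systemd.automount defers the mount to first access, which can race with btrfs device scan registering all member devices of a multi-device filesystem; a synchronous boot-time mount avoids that window:

```
# /etc/fstab -- hypothetical entry; UUID and mount point are placeholders.
# Same options as before, with noauto,x-systemd.automount dropped.
UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /mnt/raid10  btrfs  noatime,autodefrag,compress=zlib,space_cache,nossd  0 0
```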
* Re: mount btrfs takes 30 minutes, btrfs check runs out of memory 2015-08-13 23:19 ` Chris Murphy 2015-08-14 0:30 ` Duncan @ 2015-08-14 2:39 ` Vincent Olivier 1 sibling, 0 replies; 54+ messages in thread From: Vincent Olivier @ 2015-08-14 2:39 UTC (permalink / raw) To: linux-btrfs I have 2 snapshots a few days apart for incrementally backing up the volume but that's it. I'll try without autodefrag tomorrow. Vincent -----Original Message----- From: "Chris Murphy" <lists@colorremedies.com> Sent: Thursday, August 13, 2015 19:19 To: "Btrfs BTRFS" <linux-btrfs@vger.kernel.org> Subject: Re: mount btrfs takes 30 minutes, btrfs check runs out of memory On Thu, Aug 13, 2015 at 4:38 PM, Vincent Olivier <vincent@up4.com> wrote: > Hi, > > I think I might be having this problem too. 12 x 4TB RAID10 (original makefs, not converted from ext or whatnot). Says it has ~6TiB left. Centos 7. Dual Xeon CPU. 32GB RAM. ELRepo Kernel 4.1.5. Fstab options: noatime,autodefrag,compress=zlib,space_cache,nossd,noauto,x-systemd.automount Well I think others have suggested 3000 snapshots and quite a few things will get very slow. But then also you have autodefrag and I forget the interaction of this with many snapshots since the snapshot aware defrag code was removed. I'd say file a bug with the full details of the hardware from the ground up to the Btrfs file system. And include as an attachment, dmesg with sysrq+t during this "hang". Usually I see t asked if there's just slowness/delays, and w if there's already a kernel message saying there's a blocked task for 120 seconds. -- Chris Murphy ^ permalink raw reply [flat|nested] 54+ messages in thread
[parent not found: <CAJ3TwYSW+SvbBrh1u_x+c3HTRx03qSR6BoH5cj_VzCXxZYv6EA@mail.gmail.com>]
* Re: mount btrfs takes 30 minutes, btrfs check runs out of memory [not found] ` <CAJ3TwYSW+SvbBrh1u_x+c3HTRx03qSR6BoH5cj_VzCXxZYv6EA@mail.gmail.com> @ 2016-07-15 3:56 ` Qu Wenruo [not found] ` <CAJ3TwYRXwDVVfT0TRRiM9dEw-7TvY8qG=WvMYKczZOv6wkFWAQ@mail.gmail.com> 2016-07-15 11:29 ` Christian Rohmann 0 siblings, 2 replies; 54+ messages in thread From: Qu Wenruo @ 2016-07-15 3:56 UTC (permalink / raw) To: John Ettedgui, Austin S Hemmelgarn; +Cc: btrfs Sorry for the late reply. [Slow mount] In fact we also reproduced the same problem, and found the cause. It's related to the size of the extent tree. If the extent tree is large enough, mount needs to do quite a lot of IO to read out all block group items. Such reads are small and random (the default leaf size is just 16K), and considering the per-GB cost, spinning rust is the normal choice for such a large fs, which makes small random reads even slower. The good news is, we have a patch to slightly speed up the mount by avoiding reading unrelated tree blocks. In our test environment, it takes 15% less time to mount a fs filled with 16K files (2T used space). https://patchwork.kernel.org/patch/9021421/ And given that only the extent tree size is related to the problem, any method that reduces the extent tree size will help, including defrag and nodatacow. [Btrfsck OOM] Lu Fengqi is developing a low-memory-usage mode for btrfsck. It's not merged into mainline btrfs-progs and not fully complete, but it shows quite positive results for large filesystems. It may need some time to stabilize, but IMHO it's going in the right direction. Thanks, Qu At 07/12/2016 04:31 AM, John Ettedgui wrote: > On Wed, Aug 5, 2015 at 4:30 AM Austin S Hemmelgarn <ahferroin7@gmail.com > <mailto:ahferroin7@gmail.com>> wrote: >> Yeah, you're probably better off getting a TB disk and starting with >> that.
In theory it is possible to automate the process, but I would >> advise against that if at all possible, it's a lot easier to recover >> from an error if you're doing it manually. > > Hello, > > Has there been any progress on this issue? > > My btrfs partitions are now all cleanly made, not converted from ext4, > and yet they still take up to 30s to mount. Interestingly they're all > about the same size, but some take quite longer than others.. I guess > differences FS related. > > > Thank you! > John ^ permalink raw reply [flat|nested] 54+ messages in thread
[parent not found: <CAJ3TwYRXwDVVfT0TRRiM9dEw-7TvY8qG=WvMYKczZOv6wkFWAQ@mail.gmail.com>]
* Re: mount btrfs takes 30 minutes, btrfs check runs out of memory [not found] ` <CAJ3TwYRXwDVVfT0TRRiM9dEw-7TvY8qG=WvMYKczZOv6wkFWAQ@mail.gmail.com> @ 2016-07-15 5:24 ` Qu Wenruo 2016-07-15 6:56 ` Kai Krakow [not found] ` <CAJ3TwYSTnQfj=qmBLtnmtXQKexMMD4x=9Gk3p3anf4uF+G26kw@mail.gmail.com> 0 siblings, 2 replies; 54+ messages in thread From: Qu Wenruo @ 2016-07-15 5:24 UTC (permalink / raw) To: John Ettedgui, Austin S Hemmelgarn; +Cc: btrfs At 07/15/2016 12:39 PM, John Ettedgui wrote: > On Thu, Jul 14, 2016 at 8:56 PM Qu Wenruo <quwenruo@cn.fujitsu.com > <mailto:quwenruo@cn.fujitsu.com>> wrote: > > Sorry for the late reply. > > Oh it's all good, it's only a been a few days. > > [Slow mount] > In fact we also reproduce the same problem, and found the problem. > > Awesome! > > It's related to the size of extent tree. > > If the extent tree is large enough, mount needs to do quite a lot of IO > to read out all block group items. > And such read is random small read (default leaf size is just 16K), and > considering the per GB cost, spinning rust is the normal choice for such > large fs, which makes random small read even more slower. > > > The good news is, we have patch to slightly speedup the mount, by > avoiding reading out unrelated tree blocks. > > In our test environment, it takes 15% less time to mount a fs filled > with 16K files(2T used space). > > https://patchwork.kernel.org/patch/9021421/ > > > Great, I will try this and report on it. > > And according to the facts that only extent size is related to the > problem, any method to reduce extent tree size will help, including > defrag, nodatacow. > > Would increasing the leaf size help as well? May help. But didn't test it, and since leafsize can only be determined at mkfs time, it's not an easy thing to try it. > nodatacow seems unsafe Nodatacow is not that unsafe, as btrfs will still do data cow if it's needed, like rewriting data of another subvolume/snapshot. 
That would be one of the most obvious methods if you do a lot of rewrites. > > as for defrag, all my partitions are already on > > autodefrag, so I assume that should be good. Or is manual once in a > > while a good idea as well? AFAIK autodefrag will only help if you're doing appending writes. A manual one will help more, but since btrfs has problems defragmenting extents shared by different subvolumes, I doubt the effect if you have a lot of subvolumes/snapshots. Another method is to disable compression. With compression, the file extent size upper limit is 128K, while for the non-compressed case it's 128M. So the same 1G file would use about 8K (8192) extents with compression, but only 8 extents without compression. > > Is there a way to display the tree size? that would help knowing what > worked and what didn't. You can dump the whole extent tree to get the accurate size: # btrfs-debug-tree -t 2 <your dev> > some_file It may be quite long, so output redirection is highly recommended. You can do it online (mounted), but if the fs is very large, it's recommended to do it offline (unmounted), or at least make sure there is not much write activity while mounted. Check the first few lines and you can already get the overall size: ------ btrfs-progs v4.6.1 extent tree key (EXTENT_TREE ROOT_ITEM 0) node 30441472 level *1* items 41 free 452 generation 7 owner 2 ------ If the level is high (7 is the highest possible value), it's almost certain that's the problem. For an accurate size, use the following script to count the extent tree blocks: ------ $ egrep -e "^node" -e "^leaf" some_file | wc -l ------ Then multiply that count by the nodesize to get the accurate size of the extent tree. Thanks, Qu > > > [Btrfsck OOM] > Lu Fengqi is developing a low-memory-usage mode for btrfsck. > It's not merged into mainline btrfs-progs and not fully complete, but > shows quite positive results for large filesystems. > > It may need some time to stabilize, but IMHO it's going in the right > direction.
> > Well that is great news as well, thank you for sharing it! > > Thanks, > Qu > > > Thank you! > John ^ permalink raw reply [flat|nested] 54+ messages in thread
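Qu's sizing recipe can be wrapped into a small helper that works on a dump file. A sketch, assuming the dump was already produced with `btrfs-debug-tree -t 2 <dev> > some_file`; 16384 is only the common mkfs default nodesize, not a value read from the filesystem:

```shell
#!/bin/sh
# Count the tree blocks in a btrfs-debug-tree extent-tree dump (every
# line starting with "node" or "leaf" is one block) and multiply by
# the nodesize to get the tree's on-disk size in bytes.
extent_tree_bytes() {
    dump_file=$1
    nodesize=${2:-16384}
    blocks=$(grep -c -E '^(node|leaf)' "$dump_file")
    echo $((blocks * nodesize))
}
# Usage (take the dump while the fs is unmounted or idle):
#   btrfs-debug-tree -t 2 /dev/sdb1 > extent-tree.txt
#   extent_tree_bytes extent-tree.txt 16384
```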
* Re: mount btrfs takes 30 minutes, btrfs check runs out of memory 2016-07-15 5:24 ` Qu Wenruo @ 2016-07-15 6:56 ` Kai Krakow [not found] ` <CAJ3TwYSTnQfj=qmBLtnmtXQKexMMD4x=9Gk3p3anf4uF+G26kw@mail.gmail.com> 1 sibling, 0 replies; 54+ messages in thread From: Kai Krakow @ 2016-07-15 6:56 UTC (permalink / raw) To: linux-btrfs Am Fri, 15 Jul 2016 13:24:45 +0800 schrieb Qu Wenruo <quwenruo@cn.fujitsu.com>: > > as for defrag, all my partitions are already on > > autodefrag, so I assume that should be good. Or is manual once in a > > while a good idea as well? > AFAIK autodefrag will only help if you're doing appending write. > > Manual one will help more, but since btrfs has problem defraging > extents shared by different subvolumes, I doubt the effect if you > have a lot of subvolumes/snapshots. "btrfs fi defrag" is said to only defrag metadata if you are pointing it to directories only without recursion. It could maybe help that case without unsharing the extents: find /btrfs-subvol0 -type d -print0 | xargs -0 btrfs fi defrag -- Regards, Kai Replies to list-only preferred. ^ permalink raw reply [flat|nested] 54+ messages in thread
[parent not found: <CAJ3TwYSTnQfj=qmBLtnmtXQKexMMD4x=9Gk3p3anf4uF+G26kw@mail.gmail.com>]
[parent not found: <CAJ3TwYTnMPVwkrZEU-=Q_Nq+9Bn0vM3z+EFC8RP=RTyaufSoqw@mail.gmail.com>]
* Re: mount btrfs takes 30 minutes, btrfs check runs out of memory [not found] ` <CAJ3TwYTnMPVwkrZEU-=Q_Nq+9Bn0vM3z+EFC8RP=RTyaufSoqw@mail.gmail.com> @ 2016-07-18 1:13 ` Qu Wenruo [not found] ` <CAJ3TwYRpc_R-wVur0T6+Uy_aPVXTGpvp_ag1Ar9K2HoB0H1ySQ@mail.gmail.com> 0 siblings, 1 reply; 54+ messages in thread From: Qu Wenruo @ 2016-07-18 1:13 UTC (permalink / raw) To: John Ettedgui, Austin S Hemmelgarn; +Cc: btrfs At 07/16/2016 07:17 PM, John Ettedgui wrote: > On Thu, Jul 14, 2016 at 10:54 PM John Ettedgui <john.ettedgui@gmail.com > <mailto:john.ettedgui@gmail.com>> wrote: > > On Thu, Jul 14, 2016 at 10:26 PM Qu Wenruo <quwenruo@cn.fujitsu.com > <mailto:quwenruo@cn.fujitsu.com>> wrote: > > > > Would increasing the leaf size help as well? > > > nodatacow seems unsafe > > > Nodatacow is not that unsafe, as btrfs will still do data cow if > it's > needed, like rewriting data of another subvolume/snapshot. > > Alright. > > That would be one of the most obvious method if you do a lot of > rewrite. > > > as for defrag, all my partitions are already on > > autodefrag, so I assume that should be good. Or is manual once > in a > > while a good idea as well? > AFAIK autodefrag will only help if you're doing appending write. > > Manual one will help more, but since btrfs has problem defraging > extents > shared by different subvolumes, I doubt the effect if you have a > lot of > subvolumes/snapshots. > > I don't have any subvolume/snapshot for the big partitions, my usage > there is fairly simple. I'll have to add a regular defrag job then. > > > Another method is to disable compression. > For compression, file extent size up limit is 128K, while for > non-compress case, it's 128M. > > So for the same 1G sized file, it would cause 8K extents using > compression, while only 8 extents without compression. > > Now that might be something important, I do use LZO compression on > all of them. 
> Does this limit apply to only compressed files, or any file if the > fs is mounted using the compression option? > Would mounting these partitions without the compression option and then > defragmenting them reverse the compression? > > I've tried this for the slowest-to-mount partition. > I changed its mount option to compression=no, then ran defrag and balance. > Not sure if the latter was needed but I thought to try... like in the > past it worked fine up to dusage=99 but with 100% I get a crash, oh well. > The result of defrag + nocompress (I don't know how much it actually > decompressed, and if it changed the limit Qu mentioned before) is about > 26% less time spent mounting the partition, and it's no longer my slowest > partition to mount! Well, compression=no only affects writes made after mounting with that option. And balance won't help convert compressed extents to non-compressed ones. But maybe the defrag will convert them to normal extents. The best method to decompress them is to read the files out and rewrite them under the compression=no mount option. > > I'll try just defragmenting another partition but keeping the > compression on and see what difference the same changes make there. > > I've tried the patch, which applied fine to my kernel (4.6.4) but I > don't see any difference in mounting time; maybe I made a mistake or my > issue is not really the same? It's quite possible that there is another problem causing the slow mount. The best way to verify is to ftrace the btrfs mount.
Here is the script I tested my patch: ------ #!/bin/bash trace_dir=/sys/kernel/debug/tracing init_trace () { echo 0 > $trace_dir/tracing_on echo > $trace_dir/trace echo function_graph > $trace_dir/current_tracer echo > $trace_dir/set_ftrace_filter echo open_ctree >> $trace_dir/set_ftrace_filter echo btrfs_read_chunk_tree >> $trace_dir/set_ftrace_filter echo btrfs_read_block_groups >> $trace_dir/set_ftrace_filter # This will generate tons of trace, better to comment it out echo find_block_group >> $trace_dir/set_ftrace_filter echo 1 > $trace_dir/tracing_on } end_trace () { cp $trace_dir/trace $(dirname $0) echo 0 > $trace_dir/tracing_on echo > $trace_dir/set_ftrace_filter echo > $trace_dir/trace } init_trace echo start mounting time mount /dev/sdb /mnt/test echo mount done end_trace ------ After executing the script, you got a file named "trace" at the same directory of the script. The content will be like: ------ # tracer: function_graph # # CPU DURATION FUNCTION CALLS # | | | | | | | 1) $ 7670856 us | open_ctree [btrfs](); 2) * 13533.45 us | btrfs_read_chunk_tree [btrfs](); 2) # 1320.981 us | btrfs_init_space_info [btrfs](); 2) | btrfs_read_block_groups [btrfs]() { 2) * 10127.35 us | find_block_group [btrfs](); 2) 4.951 us | find_block_group [btrfs](); 2) * 26225.17 us | find_block_group [btrfs](); ...... 3) * 26450.28 us | find_block_group [btrfs](); 3) * 11590.29 us | find_block_group [btrfs](); 3) $ 7557210 us | } /* btrfs_read_block_groups [btrfs] */ <<< ------ And you can see open_ctree() function, the main part of btrfs mount, takes about 7.67 seconds to execute, while btrfs_read_block_groups() takes 7.56 seconds, about 98.6% of the open_ctree() executing time. If your result are much the same as mine, then that's the same problem. And after applying my patch, please try to compare the executing time of btrfs_read_block_groups() to see if there is any obvious(>5%) change. 
Thanks, Qu > > Thank you, > John ^ permalink raw reply [flat|nested] 54+ messages in thread
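Once the trace file exists, comparing durations by hand is tedious. Below is a hypothetical post-processing helper (not part of btrfs-progs or trace-cmd) that pulls per-function totals out of a function_graph trace like the one above and computes how much of open_ctree() is spent in btrfs_read_block_groups():

```python
import re

# A duration column like "7670856 us |" plus a btrfs function name on
# the same line; function_graph prints leaf calls as "func();" and
# timed closing braces as "} /* func */", both with that column.
DUR = re.compile(r'([0-9][0-9.]*)\s*us\s*\|')
NAME = re.compile(r'(\w+)\s*\[btrfs\]')

def total_us(trace_text: str, func: str) -> float:
    """Sum the durations reported for one function in a trace."""
    total = 0.0
    for line in trace_text.splitlines():
        dur, name = DUR.search(line), NAME.search(line)
        if dur and name and name.group(1) == func:
            total += float(dur.group(1))
    return total

def block_group_ratio(trace_text: str) -> float:
    """Fraction of open_ctree() time spent reading block groups."""
    return (total_us(trace_text, "btrfs_read_block_groups")
            / total_us(trace_text, "open_ctree"))
```

Pointing it at the trace file that Qu's script copies next to itself should give the kind of open_ctree / btrfs_read_block_groups ratio being compared in this thread.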
[parent not found: <CAJ3TwYRpc_R-wVur0T6+Uy_aPVXTGpvp_ag1Ar9K2HoB0H1ySQ@mail.gmail.com>]
* Re: mount btrfs takes 30 minutes, btrfs check runs out of memory [not found] ` <CAJ3TwYRpc_R-wVur0T6+Uy_aPVXTGpvp_ag1Ar9K2HoB0H1ySQ@mail.gmail.com> @ 2016-07-18 8:41 ` Qu Wenruo [not found] ` <CAJ3TwYRH8JVkuv2Hu7FYb+BSwKGrq1spx079zwOF_FO1y=9NFA@mail.gmail.com> 0 siblings, 1 reply; 54+ messages in thread From: Qu Wenruo @ 2016-07-18 8:41 UTC (permalink / raw) To: John Ettedgui, Austin S Hemmelgarn; +Cc: btrfs At 07/18/2016 04:20 PM, John Ettedgui wrote: > On Sun, Jul 17, 2016 at 6:14 PM Qu Wenruo <quwenruo@cn.fujitsu.com > <mailto:quwenruo@cn.fujitsu.com>> wrote: > > > Well, compression=no only affects any write after the mount option. > And balance won't help to convert compressed extent to > non-compressed one. > > But maybe the defrag will convert them to normal extents. > > The best method to de-compress them is, to read them out and rewrite > them with compression=no mount option. > > Right, I just don't have the extra storage for that right now, though I > suppose I could do little by little, but manually that would take a > really long time, so I went with the defrag route :) > > > > > > I'll try just defragmenting another partition but keeping the > > compression on and see what difference I get there the same changes. > > > > So following that, another partition got its mounting time reduced by > about 70% by running a manual defrag (I kept compression on and used > -clzo for this defragmentation). > So maybe a manual defrag is really the best thing to do so far. Seems to be the case. For further investigation, it would be quite nice if you could upload "btrfs-debug-tree -t 2" dumps of your fs, both before and after. The dump doesn't contain anything meaningful (no file names or relations, only extent allocation info), so it should be quite safe to upload. I'm really surprised by the mount time reduction, since in the compression case the max extent size is limited to 128K, so IMHO defrag shouldn't have helped much.
> > > I've tried the patch, which applied fine to my kernel (4.6.4) but I > > don't see any difference in mounting time, maybe I made a mistake > or my > > issue is not really the same? > > Pretty possible that there is another problem causing the slow mount. > > The best method to verify is to do a ftrace on the btrfs mount. > Here is the script I tested my patch: > > .... > .... > > Thank you for the script, that makes it a lot easier for me! > > And you can see open_ctree() function, the main part of btrfs mount, > takes about 7.67 seconds to execute, while btrfs_read_block_groups() > takes 7.56 seconds, about 98.6% of the open_ctree() executing time. > > If your result are much the same as mine, then that's the same problem. > > They are similar, 99% is spent in btrfs_read_block_groups. > > And after applying my patch, please try to compare the executing time of > btrfs_read_block_groups() to see if there is any obvious(>5%) change. > > Here's what I have for one partition: > > no patch: > open_ctree: 16952419 > btrfs_read_block_groups: 16844453 > ratio: 0.9936312333950689 > > patch: > open_ctree: 16680173 > btrfs_read_block_groups: 16600532 > ratio: 0.9952254092328659 > > ratio no patch/patch: 0.9983981761086407 OK, almost no improvement. So in your case, most BLOCK_GROUP_ITEMS are not at the tail of a extent tree leaf. And in our test environment, it seems that quite some BLOCK_GROUPS_ITEMS are at the tail of an extent tree leaf, and make the improvement quite obvious. But anyway, if we can change the on-disk format to introduce a specific block group items tree, then I assume the mount time would reduce to less than 5 seconds. Thanks, Qu > > Thank you, > John > ^ permalink raw reply [flat|nested] 54+ messages in thread
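Qu's test script is elided above (".... ...."), but a minimal function_graph session along the same lines might look like the sketch below. The tracefs path, the traced function, and the device/mount point are assumptions; it needs root on a real system, and `open_ctree` is an alternative filter if it appears in `available_filter_functions`.

```shell
# Sketch: time the internals of a btrfs mount with the function_graph
# tracer. TRACEFS and the arguments are placeholders; run as root.
trace_btrfs_mount() {
    TRACEFS=${TRACEFS:-/sys/kernel/debug/tracing}
    echo function_graph > "$TRACEFS/current_tracer"
    # Only graph the function being measured.
    echo btrfs_read_block_groups > "$TRACEFS/set_graph_function"
    echo 1 > "$TRACEFS/tracing_on"
    mount "$1" "$2"                 # the window being traced
    echo 0 > "$TRACEFS/tracing_on"
    cat "$TRACEFS/trace"            # per-call durations, in microseconds
}
# e.g. trace_btrfs_mount /dev/sdb1 /mnt/data > mount-trace.txt
```

The `DURATION` column of the resulting graph then shows the total time spent in `btrfs_read_block_groups()` during the mount, which can be compared against the overall mount time as in the numbers above.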
[parent not found: <CAJ3TwYRH8JVkuv2Hu7FYb+BSwKGrq1spx079zwOF_FO1y=9NFA@mail.gmail.com>]
* Re: mount btrfs takes 30 minutes, btrfs check runs out of memory [not found] ` <CAJ3TwYRH8JVkuv2Hu7FYb+BSwKGrq1spx079zwOF_FO1y=9NFA@mail.gmail.com> @ 2016-07-18 9:07 ` Qu Wenruo 2016-07-18 15:31 ` Duncan [not found] ` <CAJ3TwYS6UTkWf=PNku3RG7hPrXMKz3yhk2WqCRLix4v_VwgrmA@mail.gmail.com> 0 siblings, 2 replies; 54+ messages in thread From: Qu Wenruo @ 2016-07-18 9:07 UTC (permalink / raw) To: John Ettedgui, Austin S Hemmelgarn; +Cc: btrfs At 07/18/2016 04:53 PM, John Ettedgui wrote: > On Mon, Jul 18, 2016 at 1:42 AM Qu Wenruo <quwenruo@cn.fujitsu.com > <mailto:quwenruo@cn.fujitsu.com>> wrote: > > > > So following that, another partition got its mounting time reduced by > > about 70% by running a manual defrag (I kept compression on and used > > -clzo for this defragmentation). > > So maybe a manual defrag is really the best thing to do so far. > > Seems to be the case. > > For further investigation, it would be quite nice for you to upload the > output of "btrfs-debug-tree -t 2" dump of your fs. > Both before and after, and it doesn't containing anything meaningful(no > file name/relation, only extent allocation info), so it's should be > quite safe to upload. > > What do you mean by before and after? > Before defragmentation? Yes, to compare the extent size and verify my assumption. But I'm afraid you don't have any fs with that slow mount time any more. > > Since I'm really surprised on the mount time reduce, especially > considering the fact that for compression case, max extent size is > limited to 128K, IMHO defrag won't help much. > > Is the 128K limit for the whole FS or only for files that btrfs deemed > worth to compress? If it's the latter, that could explain why defrag helped. The latter. But the 128K is not for compressed size, but raw size. So no matter the compressed size, any extent whose uncompressed size is larger than 128K will be split. 
The main reason I'm surprised by the mount time reduction is that, considering the sectorsize (4K for x86_64 and x86), the number of fragments shouldn't increase that much. The smallest extent size is determined by the sectorsize (4K for most arches). The upper limit for a compressed extent is 128K, and 4K -> 128K is only 32 times. For the non-compressed case, the extent size upper limit is 128M: 32K times larger than the sector size, or 1024 times larger than the compressed extent limit. So I'm quite surprised that defrag helps so much. > > > > And after applying my patch, please try to compare the > executing time of > > btrfs_read_block_groups() to see if there is any obvious(>5%) > change. > > > > Here's what I have for one partition: > > > > no patch: > > open_ctree: 16952419 > > btrfs_read_block_groups: 16844453 > > ratio: 0.9936312333950689 > > > > patch: > > open_ctree: 16680173 > > btrfs_read_block_groups: 16600532 > > ratio: 0.9952254092328659 > > > > ratio no patch/patch: 0.9983981761086407 > > OK, almost no improvement. So in your case, most BLOCK_GROUP_ITEMS are > not at the tail of a extent tree leaf. > And in our test environment, it seems that quite some BLOCK_GROUPS_ITEMS > are at the tail of an extent tree leaf, and make the improvement quite > obvious. > > But anyway, if we can change the on-disk format to introduce a specific > block group items tree, then I assume the mount time would reduce to > less than 5 seconds. > > Less than 5 seconds without regular defrag would be nice. > It would be even nicer to be able to convert from one format to another > and not need to do it at mkfs time, but I don't know how feasible that > will be. If it's possible, it may work just like METADATA_ITEM (the skinny_metadata feature), and in that case the time reduction will depend on how many BLOCK_GROUP_ITEMs are in the new tree. Thanks, Qu > > Another option would be to use something like bcache to have the extent > tree on a SSD while the data stays on the HD. No idea how feasible that > would be though...
> > Thank you, > John ^ permalink raw reply [flat|nested] 54+ messages in thread
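The requested "btrfs-debug-tree -t 2" dumps (before and after the defrag) can be collected with something like this sketch; the device path and output names are placeholders:

```shell
# Sketch: dump the extent tree (tree id 2) and compress it for upload.
# The dump holds only extent allocation info, no file names.
dump_extent_tree() {
    btrfs-debug-tree -t 2 "$1" | gzip > "$2"
}
# e.g., before and after defragging:
#   dump_extent_tree /dev/sdb1 extent-tree-pre.txt.gz
#   btrfs filesystem defragment -r -clzo /mnt/data
#   dump_extent_tree /dev/sdb1 extent-tree-post.txt.gz
```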
* Re: mount btrfs takes 30 minutes, btrfs check runs out of memory 2016-07-18 9:07 ` Qu Wenruo @ 2016-07-18 15:31 ` Duncan [not found] ` <CAJ3TwYS6UTkWf=PNku3RG7hPrXMKz3yhk2WqCRLix4v_VwgrmA@mail.gmail.com> 1 sibling, 0 replies; 54+ messages in thread From: Duncan @ 2016-07-18 15:31 UTC (permalink / raw) To: linux-btrfs Qu Wenruo posted on Mon, 18 Jul 2016 17:07:47 +0800 as excerpted: >> Since I'm really surprised on the mount time reduce, especially >> considering the fact that for compression case, max extent size is >> limited to 128K, IMHO defrag won't help much. >> >> Is the 128K limit for the whole FS or only for files that btrfs deemed >> worth to compress? If it's the latter, that could explain why defrag >> helped. > > The latter. But the 128K is not for compressed size, but raw size. > > So no matter the compressed size, any extent whose uncompressed size is > larger than 128K will be split. > > The main reason I'm surprised about the mount time reduce, is that > considering the sectorsize (4K for x86_64 and x86), the fragments won't > increase too much. > The smallest extent size is determined by sectorsize(4K for most arch). > Compressed extent up limit is 128K, 4K -> 128K is only 32 times. While > for non-compress case, its extent size up limit is 128M. > 32K times larger than sector size, or 1024 times larger than compressed > extent size. > > So I'm quite surprised that defrag helps so much. [I'm only seeing your posts here, not his (yet?), so I'm only seeing what you quote from his posts and may be missing part of the story. Never-the- less...] I think what he's referring to is that he's only running compress, not compress-force, and presumably not autodefrag. So there's probably a lot of uncompressed files that were heavily fragmented and thus in many extents, that a defrag run helped consolidate, thus reducing mount time substantially. -- Duncan - List replies preferred. No HTML msgs. 
"Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 54+ messages in thread
[parent not found: <CAJ3TwYS6UTkWf=PNku3RG7hPrXMKz3yhk2WqCRLix4v_VwgrmA@mail.gmail.com>]
* Re: mount btrfs takes 30 minutes, btrfs check runs out of memory [not found] ` <CAJ3TwYS6UTkWf=PNku3RG7hPrXMKz3yhk2WqCRLix4v_VwgrmA@mail.gmail.com> @ 2016-07-21 8:10 ` Qu Wenruo [not found] ` <CAJ3TwYQ47SVpbO1Pb-TWjhaTCCpMFFmijwTgmV8=7+1_a6_3Ww@mail.gmail.com> 0 siblings, 1 reply; 54+ messages in thread From: Qu Wenruo @ 2016-07-21 8:10 UTC (permalink / raw) To: John Ettedgui, Austin S Hemmelgarn; +Cc: btrfs Thanks for the info, pretty helpful. After a simple analysis, the defrag did do a pretty good job. ----------------------------------------------------------------------- | Avg Extent size | Median Extent size | Data Extents | ----------------------------------------------------------------------- Predefrag | 2.6M | 512K | 1043589 | Postdefrag | 7.4M | 80K | 359823 | Defrag reduced the number of extents to 34% of the original! Quite awesome. I still see quite a lot of small extents (in fact, there are more small extents after defrag), so I assume there is room for further improvement. But considering that the mount time is only affected by the number of extents (data and meta, and the amount of meta is not affected by defrag), the improvement is already quite obvious. Much more so than I expected. Thanks, Qu At 07/20/2016 06:44 PM, John Ettedgui wrote: > On Mon, Jul 18, 2016 at 2:07 AM Qu Wenruo <quwenruo@cn.fujitsu.com > <mailto:quwenruo@cn.fujitsu.com>> wrote: > > Yes, to compare the extent size and verify my assumption. > > But I'm afraid you don't have any fs with that slow mount time any more. > > > Here are 2 links for the information you requested, I've gzipped each > file as it was quite big...
> > https://mega.nz/#!QhQSHBhb!RwN3kDBK6ZOkq3e5UkNhzB0XnbfgZgql4c5fvjfDq1w > <https://mega.nz/#%21QhQSHBhb%21RwN3kDBK6ZOkq3e5UkNhzB0XnbfgZgql4c5fvjfDq1w> > https://mega.nz/#!M5gVAbLA!S_TxIls1_q6MqMVlCRK5XxTXifXPE76tdJWsf5XRxYE > <https://mega.nz/#%21M5gVAbLA%21S_TxIls1_q6MqMVlCRK5XxTXifXPE76tdJWsf5XRxYE> > > I didn't look at their content, but just comparing their size, there is > quite a difference there. > > Thank you, > John ^ permalink raw reply [flat|nested] 54+ messages in thread
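The numbers in the table above (extent count, average and median extent size) can be approximated from such a dump with a short pipeline. This is a sketch under the assumption that each data extent appears in the dump as a `key (<start> EXTENT_ITEM <length>)` line (skinny METADATA_ITEM keys are ignored here):

```shell
# Sketch: rough extent stats from a "btrfs-debug-tree -t 2" dump.
# Assumes one "EXTENT_ITEM <length>" per data extent key.
extent_stats() {
    grep -o 'EXTENT_ITEM [0-9]*' "$1" | awk '{print $2}' | sort -n |
        awk '{ n++; sum += $1; v[n] = $1 }
             END { printf "extents=%d avg=%.0f median=%d\n",
                   n, sum / n, v[int((n + 1) / 2)] }'
}
# e.g. extent_stats extent-tree-pre.txt
```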
[parent not found: <CAJ3TwYQ47SVpbO1Pb-TWjhaTCCpMFFmijwTgmV8=7+1_a6_3Ww@mail.gmail.com>]
* Re: mount btrfs takes 30 minutes, btrfs check runs out of memory [not found] ` <CAJ3TwYQ47SVpbO1Pb-TWjhaTCCpMFFmijwTgmV8=7+1_a6_3Ww@mail.gmail.com> @ 2016-07-21 8:19 ` Qu Wenruo 2016-07-21 15:47 ` Graham Cobb 2018-02-13 10:21 ` John Ettedgui 0 siblings, 2 replies; 54+ messages in thread From: Qu Wenruo @ 2016-07-21 8:19 UTC (permalink / raw) To: John Ettedgui, Austin S Hemmelgarn; +Cc: btrfs At 07/21/2016 04:13 PM, John Ettedgui wrote: > On Thu, Jul 21, 2016 at 1:10 AM Qu Wenruo <quwenruo@cn.fujitsu.com > <mailto:quwenruo@cn.fujitsu.com>> wrote: > > Thanks for the info, pretty helpful. > > After a simple analysis, the defrag did do a pretty good job. > > ----------------------------------------------------------------------- > | Avg Extent size | Median Extent size | Data Extents | > ----------------------------------------------------------------------- > Predefrag | 2.6M | 512K | 1043589 | > Postdefrag | 7.4M | 80K | 359823 | > > Defrag reduced the number of extents to 34%! > > Quite awesome. > > While I still see quite a lot small extents (In fact, small extents are > more after defrag), so I assume there can be more improvement. > > But considering the mount time is only affected by number of extents > (data and meta, but amount of meta is not affect by defrag), so the > improvement is already quite obvious now. > > Much more obvious than my expectation. > > Thanks, > Qu > > I'm glad to be of help. > Is there anything else you'd like me to try? > I don't have any non-defragmented partitions anymore, but you already > got that information so that should be ok. > > Thank you, > John No more. The dump is already good enough for me to dig into for some time. We don't usually get such a large extent tree dump from a real-world use case. It will help us in several ways, from determining how fragmented a block group is to deciding whether a defrag will help. Thanks, Qu ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: mount btrfs takes 30 minutes, btrfs check runs out of memory 2016-07-21 8:19 ` Qu Wenruo @ 2016-07-21 15:47 ` Graham Cobb 2017-04-10 0:52 ` Qu Wenruo 2018-02-13 10:21 ` John Ettedgui 1 sibling, 1 reply; 54+ messages in thread From: Graham Cobb @ 2016-07-21 15:47 UTC (permalink / raw) To: btrfs On 21/07/16 09:19, Qu Wenruo wrote: > We don't usually get such large extent tree dump from a real world use > case. Let us know if you want some more :-) I have a heavily used single disk BTRFS filesystem with about 3.7TB in use and about 9 million extents. I am happy to provide an extent dump if it is useful to you. Particularly if you don't need me to actually unmount it (i.e. you can live with some inconsistencies). ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: mount btrfs takes 30 minutes, btrfs check runs out of memory 2016-07-21 15:47 ` Graham Cobb @ 2017-04-10 0:52 ` Qu Wenruo 0 siblings, 0 replies; 54+ messages in thread From: Qu Wenruo @ 2017-04-10 0:52 UTC (permalink / raw) To: Graham Cobb, btrfs At 07/21/2016 11:47 PM, Graham Cobb wrote: > On 21/07/16 09:19, Qu Wenruo wrote: >> We don't usually get such large extent tree dump from a real world use >> case. > > Let us know if you want some more :-) > > I have a heavily used single disk BTRFS filesystem with about 3.7TB in > use and about 9 million extents. I am happy to provide an extent dump > if it is useful to you. Particularly if you don't need me to actually > unmount it (i.e. you can live with some inconsistencies). Btrfs-debug-tree can dump the fs online, but I'm not sure whether the result is consistent. BTW, on the original problems (slow mount and fsck OOM): we have made no progress on the slow mount beyond using defrag to reduce the number of extents, but for the fsck OOM, lowmem mode can now handle all the trees. So lowmem mode should no longer run out of memory, though it may take much longer if you have multiple snapshots. It would be nice if you could try it, especially to see if lowmem mode really lives up to its name. Thanks, Qu > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > > ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: mount btrfs takes 30 minutes, btrfs check runs out of memory 2016-07-21 8:19 ` Qu Wenruo 2016-07-21 15:47 ` Graham Cobb @ 2018-02-13 10:21 ` John Ettedgui 2018-02-13 11:04 ` Qu Wenruo 1 sibling, 1 reply; 54+ messages in thread From: John Ettedgui @ 2018-02-13 10:21 UTC (permalink / raw) To: Qu Wenruo; +Cc: Austin S Hemmelgarn, btrfs On Thu, Jul 21, 2016 at 1:19 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote: > > > No more. > > The dump is already good enough for me to dig for some time. > > We don't usually get such large extent tree dump from a real world use case. > > It would help us in several ways, from determine how fragmented a block > group is to determine if a defrag will help. > > Thanks, > Qu > > Hello there, have you found anything good since then? With a default system, the behavior is pretty much still the same, though I have not recreated the partitions since. Defrag helps, but I think balance helps even more. clear_cache may help too, but I'm not really sure, as I've not tried it on its own. I was actually able to get a 4TB partition on a 5400rpm HDD to mount in around 500ms, even faster than some GB-sized partitions I have on SSDs! Alas, I then wrote some files to it and it's taking over a second again, so no more magic there. The workarounds do work, so it's still not a major issue, but they're slow, and sometimes I have to work around the "no space left on device" error, which takes even more time. Thank you! John ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: mount btrfs takes 30 minutes, btrfs check runs out of memory 2018-02-13 10:21 ` John Ettedgui @ 2018-02-13 11:04 ` Qu Wenruo 2018-02-13 11:25 ` John Ettedgui 0 siblings, 1 reply; 54+ messages in thread From: Qu Wenruo @ 2018-02-13 11:04 UTC (permalink / raw) To: John Ettedgui, Qu Wenruo; +Cc: Austin S Hemmelgarn, btrfs [-- Attachment #1.1: Type: text/plain, Size: 3368 bytes --] On 2018年02月13日 18:21, John Ettedgui wrote: > On Thu, Jul 21, 2016 at 1:19 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote: >> >> >> No more. >> >> The dump is already good enough for me to dig for some time. >> >> We don't usually get such large extent tree dump from a real world use case. >> >> It would help us in several ways, from determine how fragmented a block >> group is to determine if a defrag will help. >> >> Thanks, >> Qu >> >> > > > Hello there, > > have you found anything good since then? Unfortunately, not much to speed it up so far. This reminds me of the old (and crazy) idea of skipping the block group build for RO mounts, but that's not really helpful here. > With a default system, the behavior is pretty much still the same, > though I have not recreated the partitions since. > > Defrag helps, but I think balance helps even more. > clear_cache may help too, but I'm not really sure as I've not tried it > on its own. > I was actually able to get a 4TB partition on a 5400rpm HDD to mount > in around 500ms, quite faster that even some Gb partitions I have on > SSDs! Alas I wrote some files to it and it's taking over a second > again, so no more magic there. The problem is not how much space the data takes, but how many extents there are in the filesystem. For a new fs filled with normal data, I'm pretty sure the data extents will be as large as the maximum size (256M), putting very little or even no pressure on the block group search.
> > The workarounds do work, so it's still not a major issue, but they're > slow and sometimes I have to workaround the "no space left on device" > which then takes even more time. And since I went to SUSE, some mail/info was lost during the move. Despite that, I have several more assumptions about this problem: 1) Metadata usage bumped by inline files If there are a lot of small files (<2K by default) and your metadata usage is quite high (generally speaking, the meta:data ratio should be way below 1:8), that may be the cause. If so, try mounting the fs with the "max_inline=0" mount option and then rewriting such small files. 2) SSD write amplification along with dynamic remapping To be honest, I'm not really buying this idea, since mount doesn't involve any writes. But running fstrim won't harm anyway. 3) Rewrite the existing files (extreme defrag) In fact, defrag doesn't work well if there are subvolumes/snapshots/reflinks involved. The most stupid and mindless way is to write a small script that finds all regular files, reads them out and rewrites them back. This should act much better than traditional defrag, although it's time-consuming and makes snapshots completely meaningless. (And since you're already hitting ENOSPC, I don't think the idea will really work for you.) And since you're already hitting ENOSPC, either it's caused by unbalanced meta/data usage or you're really about to hit the limit; I would recommend enlarging the fs or deleting some files to see if it helps. Thanks, Qu > > Thank you! > John > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 520 bytes --] ^ permalink raw reply [flat|nested] 54+ messages in thread
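Point 3 above (find all regular files, read them out, rewrite them back) could be scripted roughly as follows. This is a sketch under the stated assumptions: no snapshots or reflinks, and no other writers while it runs. `--reflink=never` matters on newer coreutils, where a plain `cp` may otherwise reflink on btrfs and leave the old extents in place.

```shell
# Sketch: rewrite every regular file under a directory so each gets
# freshly allocated (ideally larger) extents. Destroys reflink sharing;
# do not run while the files are in use.
rewrite_all() {
    find "$1" -type f ! -name '*.rewrite.tmp' | while IFS= read -r f; do
        tmp="$f.rewrite.tmp"
        cp -p --reflink=never "$f" "$tmp" && mv "$tmp" "$f"
    done
}
# e.g. rewrite_all /mnt/data/storage
```

(Filenames containing newlines are not handled; a `find -print0` variant would be needed for that.)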
* Re: mount btrfs takes 30 minutes, btrfs check runs out of memory 2018-02-13 11:04 ` Qu Wenruo @ 2018-02-13 11:25 ` John Ettedgui 2018-02-13 11:40 ` Qu Wenruo 0 siblings, 1 reply; 54+ messages in thread From: John Ettedgui @ 2018-02-13 11:25 UTC (permalink / raw) To: Qu Wenruo; +Cc: Qu Wenruo, Austin S Hemmelgarn, btrfs On Tue, Feb 13, 2018 at 3:04 AM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > > > On 2018年02月13日 18:21, John Ettedgui wrote: >> Hello there, >> >> have you found anything good since then? > > Unfortunately, not really much to speed it up. Oh :/ > > This reminds me of the old (and crazy) idea to skip block group build > for RO mount. > But not really helpful for it. > >> With a default system, the behavior is pretty much still the same, >> though I have not recreated the partitions since. >> >> Defrag helps, but I think balance helps even more. >> clear_cache may help too, but I'm not really sure as I've not tried it >> on its own. >> I was actually able to get a 4TB partition on a 5400rpm HDD to mount >> in around 500ms, quite faster that even some Gb partitions I have on >> SSDs! Alas I wrote some files to it and it's taking over a second >> again, so no more magic there. > > The problem is not about how much space it takes, but how many extents > are here in the filesystem. > > For new fs filled with normal data, I'm pretty sure data extents will be > as large as its maximum size (256M), causing very little or even no > pressure to block group search. > What do you mean by "new fs", was there any change that would improve the behavior if I were to recreate the FS? Last time we talked I believe max extent was 128M for non-compressed files, so maybe there's been some good change. >> >> The workarounds do work, so it's still not a major issue, but they're >> slow and sometimes I have to workaround the "no space left on device" >> which then takes even more time. > > And since I went to SUSE, some mail/info is lost during the procedure. 
I still have all mails, if you want it. No dump left though. > > Despite that, I have several more assumption to this problem: > > 1) Metadata usage bumped by inline files What are inline files? Should I view this as inline in C, in that the small files are stored in the tree directly? > If there are a lot of small files (<2K as default), Of the slow to mount partitions: 2 partitions have less than a dozen files smaller than 2K. 1 has about 5 thousand and the last one 15 thousand. Are the latter considered a lot? > and your metadata > usage is quite high (generally speaking, it meta:data ratio should be > way below 1:8), that may be the cause. > The ratio is about 1:900 on average so that should be ok I guess. > If so, try mount the fs with "max_inline=0" mount option and then > try to rewrite such small files. > Should I try that? > 2) SSD write amplification along with dynamic remapping > To be honest, I'm not really buying this idea, since mount doesn't > have anything related to write. > But running fstrim won't harm anyway. > Oh I am not complaining about slow SSDs mounting. I was just amazed that a partition on a slow HDD mounted faster. Without any specific work, my SSDs partitions tend to mount around 1 sec or so. Of course I'd be happy to worry about them once all the partitions on HDDs mount in a handful of ms :) > 3) Rewrite the existing files (extreme defrag) > In fact, defrag doesn't work well if there are subvolumes/snapshots I have no subvolume or snapshot so that's not a problem. > /reflink involved. > The most stupid and mindless way, is to write a small script and find > all regular files, read them out and rewrite it back. > That's fairly straightforward to do, though it should be quite slow so I'd hope not to have to do that too often. > This should acts much better than traditional defrag, although it's > time-consuming and makes snapshot completely meaningless. 
> (and since you're already hitting ENOSPC, I don't think the idea is > really working for you) > > And since you're already hitting ENOSPC, either it's caused by > unbalanced meta/data usage, or it's really going hit the limit, I would > recommend to enlarge the fs or delete some files to see if it helps. > Yup, I usually either slowly ramp up the {d,m}usage to pass it, or when that does not work I free some space, then balance will finish. Or did you mean to free some space to see about mount speed? > Thanks, > Qu > Thank you for the quick reply! ^ permalink raw reply [flat|nested] 54+ messages in thread
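The {d,m}usage ramp John describes can be automated with a loop like this sketch: each pass only rewrites block groups that are at most `pct` percent full, so the cheap early passes free space for the heavier later ones. The mount point is a placeholder.

```shell
# Sketch: incremental balance, lowest-usage block groups first.
balance_ramp() {
    for pct in 5 10 25 50 75; do
        btrfs balance start -dusage="$pct" -musage="$pct" "$1"
    done
}
# e.g. balance_ramp /mnt/data
```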
* Re: mount btrfs takes 30 minutes, btrfs check runs out of memory 2018-02-13 11:25 ` John Ettedgui @ 2018-02-13 11:40 ` Qu Wenruo 2018-02-13 12:06 ` John Ettedgui 2018-02-13 12:26 ` Holger Hoffstätte 0 siblings, 2 replies; 54+ messages in thread From: Qu Wenruo @ 2018-02-13 11:40 UTC (permalink / raw) To: John Ettedgui; +Cc: Qu Wenruo, Austin S Hemmelgarn, btrfs [-- Attachment #1.1: Type: text/plain, Size: 5778 bytes --] On 2018年02月13日 19:25, John Ettedgui wrote: > On Tue, Feb 13, 2018 at 3:04 AM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: >> >> >> On 2018年02月13日 18:21, John Ettedgui wrote: >>> Hello there, >>> >>> have you found anything good since then? >> >> Unfortunately, not really much to speed it up. > Oh :/ >> >> This reminds me of the old (and crazy) idea to skip block group build >> for RO mount. >> But not really helpful for it. >> >>> With a default system, the behavior is pretty much still the same, >>> though I have not recreated the partitions since. >>> >>> Defrag helps, but I think balance helps even more. >>> clear_cache may help too, but I'm not really sure as I've not tried it >>> on its own. >>> I was actually able to get a 4TB partition on a 5400rpm HDD to mount >>> in around 500ms, quite faster that even some Gb partitions I have on >>> SSDs! Alas I wrote some files to it and it's taking over a second >>> again, so no more magic there. >> >> The problem is not about how much space it takes, but how many extents >> are here in the filesystem. >> >> For new fs filled with normal data, I'm pretty sure data extents will be >> as large as its maximum size (256M), causing very little or even no >> pressure to block group search. >> > What do you mean by "new fs", I mean the 4TB partition on that 5400rpm HDD. > was there any change that would improve > the behavior if I were to recreate the FS? If you backed up your fs, and recreate a new, empty btrfs on your original SSD, then copying all data back, I believe it would be much faster to mount. 
> Last time we talked I believe max extent was 128M for non-compressed > files, so maybe there's been some good change. My fault, 128M is correct. >>> >>> The workarounds do work, so it's still not a major issue, but they're >>> slow and sometimes I have to workaround the "no space left on device" >>> which then takes even more time. >> >> And since I went to SUSE, some mail/info is lost during the procedure. > I still have all mails, if you want it. No dump left though. >> >> Despite that, I have several more assumption to this problem: >> >> 1) Metadata usage bumped by inline files > What are inline files? Should I view this as inline in C, in that the > small files are stored in the tree directly? Exactly. >> If there are a lot of small files (<2K as default), > Of the slow to mount partitions: > 2 partitions have less than a dozen files smaller than 2K. > 1 has about 5 thousand and the last one 15 thousand. > Are the latter considered a lot? If using the default 16K nodesize, 8 small files take one leaf, so 15K small files means about 2K tree extents. Not that much in my opinion; it can't even fill half of a metadata chunk. >> and your metadata >> usage is quite high (generally speaking, it meta:data ratio should be >> way below 1:8), that may be the cause. >> > The ratio is about 1:900 on average so that should be ok I guess. Yep, that should be fine. So metadata is not to blame. Then it's purely fragmented data extents. >> If so, try mount the fs with "max_inline=0" mount option and then >> try to rewrite such small files. >> > Should I try that? No need, it won't make much difference. >> 2) SSD write amplification along with dynamic remapping >> To be honest, I'm not really buying this idea, since mount doesn't >> have anything related to write. >> But running fstrim won't harm anyway. >> > Oh I am not complaining about slow SSDs mounting. I was just amazed > that a partition on a slow HDD mounted faster.
> Without any specific work, my SSDs partitions tend to mount around 1 sec or so. > Of course I'd be happy to worry about them once all the partitions on > HDDs mount in a handful of ms :) > >> 3) Rewrite the existing files (extreme defrag) >> In fact, defrag doesn't work well if there are subvolumes/snapshots > I have no subvolume or snapshot so that's not a problem. >> /reflink involved. >> The most stupid and mindless way, is to write a small script and find >> all regular files, read them out and rewrite it back. >> > That's fairly straightforward to do, though it should be quite slow so > I'd hope not to have to do that too often. Then it could be tried on just the most frequently updated files. And since you don't use snapshots, locating such files and setting "chattr +C" on them would make them nodatacow, reducing future fragmentation. >> This should acts much better than traditional defrag, although it's >> time-consuming and makes snapshot completely meaningless. >> (and since you're already hitting ENOSPC, I don't think the idea is >> really working for you) >> >> And since you're already hitting ENOSPC, either it's caused by >> unbalanced meta/data usage, or it's really going hit the limit, I would >> recommend to enlarge the fs or delete some files to see if it helps. >> > Yup, I usually either slowly ramp up the {d,m}usage to pass it, or > when that does not work I free some space, then balance will finish. > Or did you mean to free some space to see about mount speed? Kind of; just do such freeing in advance, and try to make sure btrfs always has some unallocated space in reserve. And finally, use the latest kernel if possible. IIRC old kernels don't have automatic empty block group removal, which forces the user to run balance manually to free some space. Thanks, Qu
> -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 520 bytes --] ^ permalink raw reply [flat|nested] 54+ messages in thread
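The `chattr +C` suggestion above has one catch worth handling: on btrfs the `C` (nodatacow) attribute only takes effect on an empty file, so an existing file has to be copied into a fresh `+C` file. A sketch (the file path is a placeholder, and again no snapshots/reflinks are assumed):

```shell
# Sketch: convert an existing file to nodatacow. +C must be set while
# the destination is still empty, hence the copy-and-rename dance.
# (Permissions/ownership are not preserved in this sketch.)
make_nocow() {
    f=$1
    tmp=$(mktemp "$f.XXXXXX")
    chattr +C "$tmp" 2>/dev/null || echo "note: +C not supported here" >&2
    cat "$f" > "$tmp" && mv "$tmp" "$f"
}
# e.g. make_nocow /mnt/data/vm-disk.img
```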
* Re: mount btrfs takes 30 minutes, btrfs check runs out of memory 2018-02-13 11:40 ` Qu Wenruo @ 2018-02-13 12:06 ` John Ettedgui 2018-02-13 12:46 ` Qu Wenruo 2018-02-13 12:26 ` Holger Hoffstätte 1 sibling, 1 reply; 54+ messages in thread From: John Ettedgui @ 2018-02-13 12:06 UTC (permalink / raw) To: Qu Wenruo; +Cc: Austin S Hemmelgarn, btrfs On Tue, Feb 13, 2018 at 3:40 AM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > > > On 2018年02月13日 19:25, John Ettedgui wrote: >> On Tue, Feb 13, 2018 at 3:04 AM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: >>> >>> >>> >>> The problem is not about how much space it takes, but how many extents >>> are here in the filesystem. >>> >>> For new fs filled with normal data, I'm pretty sure data extents will be >>> as large as its maximum size (256M), causing very little or even no >>> pressure to block group search. >>> >> What do you mean by "new fs", > > I mean the 4TB partition on that 5400rpm HDD. > >> was there any change that would improve >> the behavior if I were to recreate the FS? > > If you backed up your fs, and recreate a new, empty btrfs on your > original SSD, then copying all data back, I believe it would be much > faster to mount. > Alright, I'll have to wait on getting some more drives for that but I look forward to trying that. >> Last time we talked I believe max extent was 128M for non-compressed >> files, so maybe there's been some good change. > > My fault, 128M is correct. > >>> And since I went to SUSE, some mail/info is lost during the procedure. >> I still have all mails, if you want it. No dump left though. >>> >>> Despite that, I have several more assumption to this problem: >>> >>> 1) Metadata usage bumped by inline files >> What are inline files? Should I view this as inline in C, in that the >> small files are stored in the tree directly? > > Exactly. > >>> If there are a lot of small files (<2K as default), >> Of the slow to mount partitions: >> 2 partitions have less than a dozen files smaller than 2K. 
>> 1 has about 5 thousand and the last one 15 thousand. >> Are the latter considered a lot? > > If using the default 16K nodesize, 8 small files take one leaf. > And 15K small files mean about 2K tree extents. > > Not that much in my opinion; it can't even fill half of a metadata chunk. > >>> and your metadata >>> usage is quite high (generally speaking, the meta:data ratio should be >>> way below 1:8), that may be the cause. >>> >> The ratio is about 1:900 on average so that should be ok I guess. > > Yep, that should be fine. > So metadata is not to blame. > > Then it's purely fragmented data extents. > >>> If so, try mounting the fs with the "max_inline=0" mount option and then >>> try to rewrite such small files. >>> >> Should I try that? > > No need, it won't make much difference. Alright! >>> 2) SSD write amplification along with dynamic remapping >>> To be honest, I'm not really buying this idea, since mount doesn't >>> have anything related to writes. >>> But running fstrim won't harm anyway. >>> >> Oh I am not complaining about slow SSDs mounting. I was just amazed >> that a partition on a slow HDD mounted faster. >> Without any specific work, my SSD partitions tend to mount in around 1 sec or so. >> Of course I'd be happy to worry about them once all the partitions on >> HDDs mount in a handful of ms :) >> >>> 3) Rewrite the existing files (extreme defrag) >>> In fact, defrag doesn't work well if there are subvolumes/snapshots >> I have no subvolume or snapshot so that's not a problem. >>> /reflink involved. >>> The most stupid and mindless way is to write a small script, find >>> all regular files, read them out and rewrite them back. >>> >> That's fairly straightforward to do, though it should be quite slow so >> I'd hope not to have to do that too often. > > Then it could be tried on the most frequently updated files. That's an interesting idea. More than 3/4 of the data is just storage, so that should be very ok. 
> > And since you don't use snapshots, locating such files and setting "chattr +C" > would make them nodatacow, reducing later fragments. I don't understand, why would that reduce later fragments? > >>> This should act much better than traditional defrag, although it's >>> time-consuming and makes snapshots completely meaningless. >>> (and since you're already hitting ENOSPC, I don't think the idea is >>> really working for you) >>> >>> And since you're already hitting ENOSPC, either it's caused by >>> unbalanced meta/data usage, or it's really going to hit the limit; I would >>> recommend enlarging the fs or deleting some files to see if it helps. >>> >> Yup, I usually either slowly ramp up the {d,m}usage to pass it, or >> when that does not work I free some space, then balance will finish. >> Or did you mean to free some space to see about mount speed? > > Kind of, just do such freeing in advance, and try to make sure btrfs always > has unallocated space, just in case. > I actually have very little free space on those partitions, usually under 90GB, maybe that's part of my problem. > And finally, use the latest kernel if possible. > IIRC old kernels don't have empty block group auto-removal, which means > users need to balance manually to free some space. > > Thanks, > Qu > I am on 4.15 so no problem there. So manual defrag and a new FS to try. Thank you! ^ permalink raw reply [flat|nested] 54+ messages in thread
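The "read them out and rewrite them back" script Qu describes could look roughly like the sketch below. This is untested and the target path is an example; rewriting files this way breaks any reflink/snapshot sharing, needs enough free space for a copy of the largest file, and upsets anything holding the old files open.

```shell
# Rewrite every regular file under a directory so btrfs allocates fresh
# extents for it.  Sketch only: no error recovery, filenames containing
# newlines are not handled, and the example path is hypothetical.
rewrite_tree() {
    find "$1" -xdev -type f ! -name '*.rewrite.*' | while IFS= read -r f; do
        tmp="$f.rewrite.$$"
        # cp -p copies the data into newly allocated extents and preserves
        # mode, ownership and timestamps; mv then replaces the original.
        cp -p "$f" "$tmp" && mv "$tmp" "$f"
    done
}

# Example invocation (adjust to the slow-to-mount filesystem):
# rewrite_tree /mnt/storage
```

As discussed above, it may be enough to run this only over the most frequently updated directories rather than the whole filesystem.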
* Re: mount btrfs takes 30 minutes, btrfs check runs out of memory 2018-02-13 12:06 ` John Ettedgui @ 2018-02-13 12:46 ` Qu Wenruo 2018-02-13 12:52 ` John Ettedgui 0 siblings, 1 reply; 54+ messages in thread From: Qu Wenruo @ 2018-02-13 12:46 UTC (permalink / raw) To: John Ettedgui; +Cc: Austin S Hemmelgarn, btrfs [-- Attachment #1.1: Type: text/plain, Size: 2232 bytes --] On 2018年02月13日 20:06, John Ettedgui wrote: >>>> >>> That's fairly straightforward to do, though it should be quite slow so >>> I'd hope not to have to do that too often. >> >> Then it could be tried on the most frequently updated files then. > > That's an interesting idea. > More than 3/4 of the data is just storage, so that should be very ok. BTW, how was the initial data created? If the initial data was all written once and doesn't get modified later, then the problem may not be fragments. > >> >> And since you don't use snapshot, locate such files and then "chattr +C" >> would make them nodatacow, reducing later fragments. > > I don't understand, why would that reduce later fragments? A later overwrite will not create a new extent, but will overwrite the existing extents in place, instead of CoWing and creating new extents (fragments). An extending write will still create a new extent, but that's unavoidable anyway. Thanks, Qu > >> >>>> This should act much better than traditional defrag, although it's >>>> time-consuming and makes snapshots completely meaningless. >>>> (and since you're already hitting ENOSPC, I don't think the idea is >>>> really working for you) >>>> >>>> And since you're already hitting ENOSPC, either it's caused by >>>> unbalanced meta/data usage, or it's really going to hit the limit, I would >>>> recommend to enlarge the fs or delete some files to see if it helps. >>>> >>> Yup, I usually either slowly ramp up the {d,m}usage to pass it, or >>> when that does not work I free some space, then balance will finish. >>> Or did you mean to free some space to see about mount speed? 
>> >> Kind of, just do such freeing in advance, and try to make btrfs always >> have unallocated space in case. >> > > I actually have very little free space on those partitions, usually > under 90Gb, maybe that's part of my problem. > >> And finally, use latest kernel if possible. >> IIRC old kernel doesn't have empty block group auto remove, which makes >> user need to manually balance to free some space. >> >> Thanks, >> Qu >> > > I am on 4.15 so no problem there. > > So manual defrag and new FS to try. > > Thank you! > [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 520 bytes --] ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: mount btrfs takes 30 minutes, btrfs check runs out of memory 2018-02-13 12:46 ` Qu Wenruo @ 2018-02-13 12:52 ` John Ettedgui 0 siblings, 0 replies; 54+ messages in thread From: John Ettedgui @ 2018-02-13 12:52 UTC (permalink / raw) To: Qu Wenruo; +Cc: Austin S Hemmelgarn, btrfs On Tue, Feb 13, 2018 at 4:46 AM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > > > On 2018年02月13日 20:06, John Ettedgui wrote: >>>>> >>>> That's fairly straightforward to do, though it should be quite slow so >>>> I'd hope not to have to do that too often. >>> >>> Then it could be tried on the most frequently updated files then. >> >> That's an interesting idea. >> More than 3/4 of the data is just storage, so that should be very ok. > > BTW, how was the initial data created? > > If the initial data was all written once and doesn't get modified later, > then the problem may not be fragments. > Mostly at once when I recreated the FS a few years ago, and then I added to it slowly. Though I do try to somewhat balance the free space on all partitions of similar drives, so it may be a bit further from its original condition. >> >>> >>> And since you don't use snapshot, locate such files and then "chattr +C" >>> would make them nodatacow, reducing later fragments. >> >> I don't understand, why would that reduce later fragments? > > A later overwrite will not create a new extent, but will overwrite the existing extents in place, > instead of CoWing and creating new extents (fragments). > > An extending write will still create a new extent, but that's > unavoidable anyway. > That's why I didn't understand. Fair enough! Thank you! John ^ permalink raw reply [flat|nested] 54+ messages in thread
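One practical detail behind the "chattr +C" suggestion: per chattr(1), the C (No_COW) attribute only reliably takes effect on files that are still empty, so converting an existing file means creating a new +C file and copying the data into it. An untested sketch (the helper name is made up; on non-btrfs filesystems chattr simply fails and this degrades to a plain rewrite, and any reflink/snapshot sharing of the file is lost):

```shell
# Convert an existing file to nodatacow.  The +C attribute must be set
# while the file is empty; setting it on a file that already has data
# does not affect the existing extents.
make_nodatacow() {
    src="$1"
    tmp="$src.nocow.$$"
    touch "$tmp"
    chattr +C "$tmp" 2>/dev/null || true        # a no-op outside btrfs
    chmod --reference="$src" "$tmp" 2>/dev/null || true
    cat "$src" > "$tmp" && mv "$tmp" "$src"
}
```

Note that nodatacow also disables checksumming and compression for the file, which is the trade-off for fewer fragments on rewrite-heavy files.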
* Re: mount btrfs takes 30 minutes, btrfs check runs out of memory 2018-02-13 11:40 ` Qu Wenruo 2018-02-13 12:06 ` John Ettedgui @ 2018-02-13 12:26 ` Holger Hoffstätte 2018-02-13 12:54 ` Qu Wenruo 1 sibling, 1 reply; 54+ messages in thread From: Holger Hoffstätte @ 2018-02-13 12:26 UTC (permalink / raw) To: Qu Wenruo, John Ettedgui; +Cc: Qu Wenruo, Austin S Hemmelgarn, btrfs [-- Attachment #1.1: Type: text/plain, Size: 1094 bytes --] On 02/13/18 12:40, Qu Wenruo wrote: >>> The problem is not about how much space it takes, but how many extents >>> are here in the filesystem. I have no idea why btrfs' mount even needs to touch all block groups to get going (which seems to be the root of the problem), but here's a not so crazy idea for more "mechanical sympathy". Feel free to mock me if this is terribly wrong or not possible. ;) Mounting of even large filesystems (with many extents) seems to be fine on SSDs, but not so fine on rotational storage. We've heard that from several people with large (multi-TB) filesystems, and obviously it's even more terrible on 5400RPM drives because their seeks are sooo sloow. If the problem is that the bgs are touched/iterated in "tree order", would it then not be possible to sort the block groups in physical order before trying to load whatever mount needs to load? That way the entire process would involve less seeking (no backward seeks for one) and the drive could very likely get more done during a rotation before stepping further. cheers, Holger [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 236 bytes --] ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: mount btrfs takes 30 minutes, btrfs check runs out of memory 2018-02-13 12:26 ` Holger Hoffstätte @ 2018-02-13 12:54 ` Qu Wenruo 2018-02-13 16:24 ` Holger Hoffstätte 0 siblings, 1 reply; 54+ messages in thread From: Qu Wenruo @ 2018-02-13 12:54 UTC (permalink / raw) To: Holger Hoffstätte, John Ettedgui Cc: Qu Wenruo, Austin S Hemmelgarn, btrfs [-- Attachment #1.1: Type: text/plain, Size: 1848 bytes --] On 2018年02月13日 20:26, Holger Hoffstätte wrote: > On 02/13/18 12:40, Qu Wenruo wrote: >>>> The problem is not about how much space it takes, but how many extents >>>> are here in the filesystem. > > I have no idea why btrfs' mount even needs to touch all block groups to > get going (which seems to be the root of the problem), but here's a > not so crazy idea for more "mechanical sympathy". Feel free to mock > me if this is terribly wrong or not possible. ;) > > Mounting of even large filesystems (with many extents) seems to be fine > on SSDS, but not so fine on rotational storage. We've heard that from > several people with large (multi-TB) filesystems, and obviously it's > even more terrible on 5400RPM drives because their seeks are sooo sloow. > > If the problem is that the bgs are touched/iterated in "tree order", > would it then not be possible to sort the block groups in physical order > before trying to load whatever mount needs to load? This is in fact a good idea: move block groups into their own tree. But it will take a lot of work, since we would be modifying the on-disk format. In that case, a leaf with the default leaf size (16K) can store 678 block group items. And that many block groups can contain data from 169G (256M metadata size) to 1.6T (10G for max data chunk size). And even for tens of terabytes, a level-2 tree should handle it without problems, and searching them should be quite fast. The only problem is, I'm not sure if there will be enough developer interest in this idea, and this idea may have hidden extra problems. 
Thanks, Qu > That way the entire > process would involve less seeking (no backward seeks for one) and the > drive could very likely get more done during a rotation before stepping > further. > > cheers, > Holger > [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 520 bytes --] ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: mount btrfs takes 30 minutes, btrfs check runs out of memory 2018-02-13 12:54 ` Qu Wenruo @ 2018-02-13 16:24 ` Holger Hoffstätte 2018-02-14 0:43 ` Qu Wenruo 0 siblings, 1 reply; 54+ messages in thread From: Holger Hoffstätte @ 2018-02-13 16:24 UTC (permalink / raw) To: Qu Wenruo, John Ettedgui; +Cc: Qu Wenruo, Austin S Hemmelgarn, btrfs [-- Attachment #1.1: Type: text/plain, Size: 2970 bytes --] On 02/13/18 13:54, Qu Wenruo wrote: > On 2018年02月13日 20:26, Holger Hoffstätte wrote: >> On 02/13/18 12:40, Qu Wenruo wrote: >>>>> The problem is not about how much space it takes, but how many extents >>>>> are here in the filesystem. >> >> I have no idea why btrfs' mount even needs to touch all block groups to >> get going (which seems to be the root of the problem), but here's a >> not so crazy idea for more "mechanical sympathy". Feel free to mock >> me if this is terribly wrong or not possible. ;) >> >> Mounting of even large filesystems (with many extents) seems to be fine >> on SSDS, but not so fine on rotational storage. We've heard that from >> several people with large (multi-TB) filesystems, and obviously it's >> even more terrible on 5400RPM drives because their seeks are sooo sloow. >> >> If the problem is that the bgs are touched/iterated in "tree order", >> would it then not be possible to sort the block groups in physical order >> before trying to load whatever mount needs to load? > > This is in fact a good idea. > Make block group into its own tree. Well, that's not what I was thinking about at all..yet. :) (keep in mind I'm not really that familiar with the internals). Out of curiosity I ran a bit of perf on my own mount process, which is fast (~700 ms) despite being a ~1.1TB fs, mixture of lots of large and small files. Unfortunately it's also very fresh since I recreated it just this weekend, so everything is neatly packed together and fast. 
In contrast a friend's fs is ~800 GB, but has 11 GB metadata and is pretty old and fragmented (but running an up-to-date kernel). His fs mounts in ~5s. My perf run shows that the only interesting part responsible for mount time is the nested loop in btrfs_read_block_groups calling find_first_block_group (which got inlined & is not in the perf callgraph) over and over again, accounting for 75% of time spent. I now understand your comment why the real solution to this problem is to move bgs into their own tree, and agree: both kitchens and databases have figured out a long time ago that the key to fast scan and lookup performance is to not put different things in the same storage container; in the case of analytical DBMS this is columnar storage. :) But what I originally meant was something much simpler and more brute-force-ish. I see that btrfs_read_block_groups adds readahead (is that actually effective?) but what I was looking for was the equivalent of a DBMS' sequential scan. Right now finding (and loading) a bg seems to involve a nested loop of tree lookups. It seems easier to rip through the entire tree in nice 8MB chunks and discard what you don't need instead of seeking around trying to find all the right bits in scattered order. Could we alleviate cold mounts by starting more readaheads in btrfs_read_block_groups, so that the extent tree is scanned more linearly? cheers, Holger [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 236 bytes --] ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: mount btrfs takes 30 minutes, btrfs check runs out of memory 2018-02-13 16:24 ` Holger Hoffstätte @ 2018-02-14 0:43 ` Qu Wenruo 0 siblings, 0 replies; 54+ messages in thread From: Qu Wenruo @ 2018-02-14 0:43 UTC (permalink / raw) To: Holger Hoffstätte, John Ettedgui Cc: Qu Wenruo, Austin S Hemmelgarn, btrfs [-- Attachment #1.1: Type: text/plain, Size: 3470 bytes --] On 2018年02月14日 00:24, Holger Hoffstätte wrote: > On 02/13/18 13:54, Qu Wenruo wrote: >> On 2018年02月13日 20:26, Holger Hoffstätte wrote: >>> On 02/13/18 12:40, Qu Wenruo wrote: >>>>>> The problem is not about how much space it takes, but how many extents >>>>>> are here in the filesystem. >>> >>> I have no idea why btrfs' mount even needs to touch all block groups to >>> get going (which seems to be the root of the problem), but here's a >>> not so crazy idea for more "mechanical sympathy". Feel free to mock >>> me if this is terribly wrong or not possible. ;) >>> >>> Mounting of even large filesystems (with many extents) seems to be fine >>> on SSDS, but not so fine on rotational storage. We've heard that from >>> several people with large (multi-TB) filesystems, and obviously it's >>> even more terrible on 5400RPM drives because their seeks are sooo sloow. >>> >>> If the problem is that the bgs are touched/iterated in "tree order", >>> would it then not be possible to sort the block groups in physical order >>> before trying to load whatever mount needs to load? >> >> This is in fact a good idea. >> Make block group into its own tree. > > Well, that's not what I was thinking about at all..yet. :) > (keep in mind I'm not really that familiar with the internals). > > Out of curiosity I ran a bit of perf on my own mount process, which is > fast (~700 ms) despite being a ~1.1TB fs, mixture of lots of large and > small files. Unfortunately it's also very fresh since I recreated it just > this weekend, so everything is neatly packed together and fast. 
> > In contrast a friend's fs is ~800 GB, but has 11 GB metadata and is pretty > old and fragmented (but running an up-to-date kernel). His fs mounts in ~5s. > > My perf run shows that the only interesting part responsible for mount time > is the nested loop in btrfs_read_block_groups calling find_first_block_group > (which got inlined & is not in the perf callgraph) over and over again, > accounting for 75% of time spent. > > I now understand your comment why the real solution to this problem > is to move bgs into their own tree, and agree: both kitchens and databases > have figured out a long time ago that the key to fast scan and lookup > performance is to not put different things in the same storage container; > in the case of analytical DBMS this is columnar storage. :) > > But what I originally meant was something much simpler and more > brute-force-ish. I see that btrfs_read_block_groups adds readahead > (is that actually effective?) but what I was looking for was the equivalent > of a DBMS' sequential scan. Right now finding (and loading) a bg seems to > involve a nested loop of tree lookups. It seems easier to rip through the > entire tree in nice 8MB chunks and discard what you don't need instead > of seeking around trying to find all the right bits in scattered order. The problem is, the tree (extent tree) containing block groups is very, very very large. It's a tree shared by all subvolumes. And since tree nodes and leaves can be scattered around the whole disk, it's pretty hard to do batch readahead. > > Could we alleviate cold mounts by starting more readaheads in > btrfs_read_block_groups, so that the extent tree is scanned more linearly? Since extent tree is not linear, it won't be as effective as we believe. Thanks, Qu > > cheers, > Holger > [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 520 bytes --] ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: mount btrfs takes 30 minutes, btrfs check runs out of memory 2016-07-15 3:56 ` Qu Wenruo [not found] ` <CAJ3TwYRXwDVVfT0TRRiM9dEw-7TvY8qG=WvMYKczZOv6wkFWAQ@mail.gmail.com> @ 2016-07-15 11:29 ` Christian Rohmann 2016-07-16 23:53 ` Qu Wenruo 1 sibling, 1 reply; 54+ messages in thread From: Christian Rohmann @ 2016-07-15 11:29 UTC (permalink / raw) To: Qu Wenruo, John Ettedgui, Austin S Hemmelgarn; +Cc: btrfs Hey Qu, all On 07/15/2016 05:56 AM, Qu Wenruo wrote: > > The good news is, we have a patch to slightly speed up the mount, by > avoiding reading out unrelated tree blocks. > > In our test environment, it takes 15% less time to mount a fs filled > with 16K files (2T used space). > > https://patchwork.kernel.org/patch/9021421/ I have a 30TB RAID6 filesystem with compression on and I've seen mount times of up to 20 minutes (!). I don't want to sound unfair, but a 15% improvement is good, but not in the league where BTRFS needs to be. Do I understand your comments correctly that further improvement would result in a change of the on-disk format? Thanks and with regards Christian ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: mount btrfs takes 30 minutes, btrfs check runs out of memory 2016-07-15 11:29 ` Christian Rohmann @ 2016-07-16 23:53 ` Qu Wenruo 2016-07-18 13:42 ` Josef Bacik 0 siblings, 1 reply; 54+ messages in thread From: Qu Wenruo @ 2016-07-16 23:53 UTC (permalink / raw) To: Christian Rohmann, Qu Wenruo, John Ettedgui, Austin S Hemmelgarn Cc: btrfs, Chris Mason, David Sterba, Josef Bacik On 07/15/2016 07:29 PM, Christian Rohmann wrote: > Hey Qu, all > > On 07/15/2016 05:56 AM, Qu Wenruo wrote: >> >> The good news is, we have patch to slightly speedup the mount, by >> avoiding reading out unrelated tree blocks. >> >> In our test environment, it takes 15% less time to mount a fs filled >> with 16K files(2T used space). >> >> https://patchwork.kernel.org/patch/9021421/ > > I have a 30TB RAID6 filesystem with compression on and I've seen mount > times of up to 20 minutes (!). > > I don't want to sound unfair, but 15% improvement is good, but not in > the league where BTRFS needs to be. > Do I understand you comments correctly that further improvement would > result in a change of the on-disk format? Yes, that's the case. The problem is, we put BLOCK_GROUP_ITEMs into the extent tree, along with tons of EXTENT_ITEMs/METADATA_ITEMs. This makes searching for BLOCK_GROUP_ITEMs very, very slow if the extent tree is really big. On the other hand, we search for CHUNK_ITEMs very fast, because CHUNK_ITEMs are in their own tree. (CHUNK_ITEM and BLOCK_GROUP_ITEM are 1:1 mapped) So to completely fix it, btrfs needs an on-disk format change to put BLOCK_GROUP_ITEMs into their own tree. IMHO there may be some objections from other devs though. Anyway, I've added the three maintainers to Cc, and hope we can get a better idea of how to fix it. 
Thanks, Qu > > > > Thanks and with regards > > Christian > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 54+ messages in thread
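A rough sense of scale for why every BLOCK_GROUP_ITEM lookup hurts on a filesystem like Christian's: each block group needs its own search in the shared extent tree at mount time. The numbers below are assumptions for illustration (data block groups are commonly around 1 GiB; actual chunk sizes vary):

```shell
# Back-of-the-envelope count of block group items a mount must locate,
# assuming a 30 TiB filesystem and 1 GiB data block groups (both figures
# are illustrative assumptions, not measured values).
awk 'BEGIN {
    fs_tib  = 30   # filesystem size in TiB
    bg_gib  = 1    # assumed size of one data block group in GiB
    bgs = fs_tib * 1024 / bg_gib
    printf "~%d block group items, each a separate extent-tree search\n", bgs
}'
```

Tens of thousands of scattered searches through a tree that also holds every extent item is what a dedicated block group tree would avoid.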
* Re: mount btrfs takes 30 minutes, btrfs check runs out of memory 2016-07-16 23:53 ` Qu Wenruo @ 2016-07-18 13:42 ` Josef Bacik 2016-07-19 0:35 ` Qu Wenruo 2016-07-25 13:01 ` David Sterba 0 siblings, 2 replies; 54+ messages in thread From: Josef Bacik @ 2016-07-18 13:42 UTC (permalink / raw) To: Qu Wenruo, Christian Rohmann, Qu Wenruo, John Ettedgui, Austin S Hemmelgarn Cc: btrfs, Chris Mason, David Sterba On 07/16/2016 07:53 PM, Qu Wenruo wrote: > > > On 07/15/2016 07:29 PM, Christian Rohmann wrote: >> Hey Qu, all >> >> On 07/15/2016 05:56 AM, Qu Wenruo wrote: >>> >>> The good news is, we have patch to slightly speedup the mount, by >>> avoiding reading out unrelated tree blocks. >>> >>> In our test environment, it takes 15% less time to mount a fs filled >>> with 16K files(2T used space). >>> >>> https://patchwork.kernel.org/patch/9021421/ >> >> I have a 30TB RAID6 filesystem with compression on and I've seen mount >> times of up to 20 minutes (!). >> >> I don't want to sound unfair, but 15% improvement is good, but not in >> the league where BTRFS needs to be. >> Do I understand you comments correctly that further improvement would >> result in a change of the on-disk format? > > Yes, that's the case. > > The problem is, we put BLOCK_GROUP_ITEM into extent tree, along with tons of > EXTENT_ITEM/METADATA_ITEM. > > This makes search for BLOCK_GROUP_ITEM very very very slow if extent tree is > really big. > > On the handle, we search CHUNK_ITEM very very fast, because CHUNK_ITEM are in > their own tree. > (CHUNK_ITEM and BLOCK_GROUP_ITEM are 1:1 mapped) > > So to completely fix it, btrfs needs on-disk format change to put > BLOCK_GROUP_ITEM into their own tree. > > IMHO there maybe be some objection from other devs though. > > Anyway, I add the three maintainers to Cc, and hopes we can get a better idea to > fix it. Yeah I'm going to fix this when I do the per-block group extent tree thing. Thanks, Josef ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: mount btrfs takes 30 minutes, btrfs check runs out of memory 2016-07-18 13:42 ` Josef Bacik @ 2016-07-19 0:35 ` Qu Wenruo 2016-07-25 13:01 ` David Sterba 1 sibling, 0 replies; 54+ messages in thread From: Qu Wenruo @ 2016-07-19 0:35 UTC (permalink / raw) To: Josef Bacik, Qu Wenruo, Christian Rohmann, John Ettedgui, Austin S Hemmelgarn Cc: btrfs, Chris Mason, David Sterba At 07/18/2016 09:42 PM, Josef Bacik wrote: > On 07/16/2016 07:53 PM, Qu Wenruo wrote: >> >> >> On 07/15/2016 07:29 PM, Christian Rohmann wrote: >>> Hey Qu, all >>> >>> On 07/15/2016 05:56 AM, Qu Wenruo wrote: >>>> >>>> The good news is, we have patch to slightly speedup the mount, by >>>> avoiding reading out unrelated tree blocks. >>>> >>>> In our test environment, it takes 15% less time to mount a fs filled >>>> with 16K files(2T used space). >>>> >>>> https://patchwork.kernel.org/patch/9021421/ >>> >>> I have a 30TB RAID6 filesystem with compression on and I've seen mount >>> times of up to 20 minutes (!). >>> >>> I don't want to sound unfair, but 15% improvement is good, but not in >>> the league where BTRFS needs to be. >>> Do I understand you comments correctly that further improvement would >>> result in a change of the on-disk format? >> >> Yes, that's the case. >> >> The problem is, we put BLOCK_GROUP_ITEM into extent tree, along with >> tons of >> EXTENT_ITEM/METADATA_ITEM. >> >> This makes search for BLOCK_GROUP_ITEM very very very slow if extent >> tree is >> really big. >> >> On the handle, we search CHUNK_ITEM very very fast, because CHUNK_ITEM >> are in >> their own tree. >> (CHUNK_ITEM and BLOCK_GROUP_ITEM are 1:1 mapped) >> >> So to completely fix it, btrfs needs on-disk format change to put >> BLOCK_GROUP_ITEM into their own tree. >> >> IMHO there maybe be some objection from other devs though. >> >> Anyway, I add the three maintainers to Cc, and hopes we can get a >> better idea to >> fix it. > > Yeah I'm going to fix this when I do the per-block group extent tree > thing. 
Thanks, > > Josef Awesome! Can't wait to see the implementation. Thanks, Qu ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: mount btrfs takes 30 minutes, btrfs check runs out of memory 2016-07-18 13:42 ` Josef Bacik 2016-07-19 0:35 ` Qu Wenruo @ 2016-07-25 13:01 ` David Sterba 2016-07-25 13:38 ` Josef Bacik 1 sibling, 1 reply; 54+ messages in thread From: David Sterba @ 2016-07-25 13:01 UTC (permalink / raw) To: Josef Bacik Cc: Qu Wenruo, Christian Rohmann, Qu Wenruo, John Ettedgui, Austin S Hemmelgarn, btrfs, Chris Mason On Mon, Jul 18, 2016 at 09:42:50AM -0400, Josef Bacik wrote: > > > > This makes search for BLOCK_GROUP_ITEM very very very slow if extent tree is > > really big. > > > > On the handle, we search CHUNK_ITEM very very fast, because CHUNK_ITEM are in > > their own tree. > > (CHUNK_ITEM and BLOCK_GROUP_ITEM are 1:1 mapped) > > > > So to completely fix it, btrfs needs on-disk format change to put > > BLOCK_GROUP_ITEM into their own tree. > > > > IMHO there maybe be some objection from other devs though. > > > > Anyway, I add the three maintainers to Cc, and hopes we can get a better idea to > > fix it. > > Yeah I'm going to fix this when I do the per-block group extent tree thing. Thanks, Will it be capable of "per-subvolume-set" extent trees? IOW, a set of subvolumes could share extents only among the members of the same group. The use case is to start an isolated subvolume and allow snapshots (and obviously forbid reflinks outside of the group). ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: mount btrfs takes 30 minutes, btrfs check runs out of memory 2016-07-25 13:01 ` David Sterba @ 2016-07-25 13:38 ` Josef Bacik 0 siblings, 0 replies; 54+ messages in thread From: Josef Bacik @ 2016-07-25 13:38 UTC (permalink / raw) To: dsterba, Qu Wenruo, Christian Rohmann, Qu Wenruo, John Ettedgui, Austin S Hemmelgarn, btrfs, Chris Mason On 07/25/2016 09:01 AM, David Sterba wrote: > On Mon, Jul 18, 2016 at 09:42:50AM -0400, Josef Bacik wrote: >>> >>> This makes search for BLOCK_GROUP_ITEM very very very slow if extent tree is >>> really big. >>> >>> On the handle, we search CHUNK_ITEM very very fast, because CHUNK_ITEM are in >>> their own tree. >>> (CHUNK_ITEM and BLOCK_GROUP_ITEM are 1:1 mapped) >>> >>> So to completely fix it, btrfs needs on-disk format change to put >>> BLOCK_GROUP_ITEM into their own tree. >>> >>> IMHO there maybe be some objection from other devs though. >>> >>> Anyway, I add the three maintainers to Cc, and hopes we can get a better idea to >>> fix it. >> >> Yeah I'm going to fix this when I do the per-block group extent tree thing. Thanks, > > Will it be capable of "per- subvolume set" extent trees? IOW, a set of > subvolumes will could share extents only among the members of the same > group. The usecase is to start an isolate subvolume and allow snapshots > (and obviously forbid reflinks outside of the group). > I suppose. The problem is the btrfs_header doesn't have much room for verbose descriptions of which root owns it. We have objectid since it was always unique, but in the case of per bg extents we can't use that anymore, so we have to abuse flags to say this is an extent root block, and then we know that btrfs_header_owner(eb) is really the offset of the root and not the objectid. Doing something like having it per subvolume would mean having another flag that says this block belongs to a subvolume root, and then have the btrfs_header_owner(eb) set to the new offset. 
The point I'm making is we can do whatever we want here, but it'll be a little strange since we have to use flag bits to indicate what type of root the owner points to, so any future modifications will also be format changes. At least once I get this work done we'll be able to more easily add new variations on the per-whatever setup. Thanks, Josef ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: mount btrfs takes 30 minutes, btrfs check runs out of memory 2015-08-04 3:01 ` Qu Wenruo 2015-08-04 4:58 ` John Ettedgui @ 2015-08-04 14:38 ` Chris Murphy 1 sibling, 0 replies; 54+ messages in thread From: Chris Murphy @ 2015-08-04 14:38 UTC (permalink / raw) To: Qu Wenruo; +Cc: btrfs, Hugo Mills On Mon, Aug 3, 2015 at 9:01 PM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote: > Oh, converted... > That's too bad. :( > > [[What's wrong with convert]] > Although btrfs is in theory flexible enough to fit itself into the free space of > ext* and work fine, in practice ext* is too fragmented by btrfs standards, not to > mention it also enables mixed-blockgroup. There is an -f flag for mkfs to help users avoid accidents. Is there a case to be made for btrfs-convert having either a -f flag, or an interactive "Convert has limitations that could increase risk to data, please see the wiki. Continue? y/n" OR "Convert has limitations, is not recommended for production usage, please see the wiki. Continue? y/n" It just seems users are jumping into convert without reading the wiki warning. Is it a good idea to reduce problems for less experienced users by actively discouraging btrfs-convert for production use? -- Chris Murphy ^ permalink raw reply [flat|nested] 54+ messages in thread
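The interactive warning Chris proposes could be prototyped as a tiny wrapper like the one below. This is purely illustrative: btrfs-convert has no such prompt or -f flag at the time of this thread, and the function name and wording are hypothetical.

```shell
# Hypothetical wrapper sketching the confirmation prompt proposed above.
# Defaults to "no" so an accidental Enter does not start the conversion.
confirm_convert() {
    printf 'Convert has limitations that could increase risk to data,\n'
    printf 'please see the wiki. Continue? [y/N] '
    read -r answer
    case "$answer" in
        [yY]*) echo "proceeding" ;;   # here one would exec btrfs-convert "$@"
        *)     echo "aborted" ;;
    esac
}

# echo n | confirm_convert    # declines unless the user explicitly types y
```

In btrfs-progs itself this would of course be done in C, but the control flow is the same: warn, read, and bail out unless explicitly confirmed.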
* mount btrfs takes 30 minutes, btrfs check runs out of memory @ 2015-07-29 5:46 Georgi Georgiev 2015-07-29 6:19 ` Qu Wenruo 0 siblings, 1 reply; 54+ messages in thread From: Georgi Georgiev @ 2015-07-29 5:46 UTC (permalink / raw) To: linux-btrfs; +Cc: Georgi Georgiev [-- Attachment #1: Type: text/plain, Size: 5910 bytes --] We are using BTRFS on a very large filesystem, and as we put more and more data on it, the time it takes to mount has grown to, presently, about 30 minutes. Is there something wrong with the filesystem? Is there a way to bring this time down? ... Here is a snippet from dmesg, showing how long it takes to mount (the EXT4-fs line is the filesystem mounted next in the boot sequence): $ dmesg | grep -A1 btrfs [ 12.215764] TECH PREVIEW: btrfs may not be fully supported. [ 12.215766] Please review provided documentation for limitations. -- [ 12.220266] btrfs: use zlib compression [ 12.220815] btrfs: disk space caching is enabled [ 22.427258] btrfs: bdev /dev/mapper/datavg-backuplv errs: wr 0, rd 0, flush 0, corrupt 0, gen 0 [ 2022.397318] EXT4-fs (dm-2): mounted filesystem with ordered data mode. 
Opts:

The btrfs filesystem is quite large:

$ sudo btrfs filesystem usage /dev/mapper/datavg-backuplv
Overall:
    Device size:          82.58TiB
    Device allocated:     82.58TiB
    Device unallocated:      0.00B
    Device missing:          0.00B
    Used:                 62.01TiB
    Free (estimated):     17.76TiB  (min: 17.76TiB)
    Data ratio:               1.00
    Metadata ratio:           2.00
    Global reserve:          0.00B  (used: 0.00B)

Data,single: Size:79.28TiB, Used:61.52TiB
   /dev/mapper/datavg-backuplv  79.28TiB

Metadata,single: Size:8.00MiB, Used:0.00B
   /dev/mapper/datavg-backuplv   8.00MiB

Metadata,DUP: Size:1.65TiB, Used:252.68GiB
   /dev/mapper/datavg-backuplv   3.30TiB

System,single: Size:4.00MiB, Used:0.00B
   /dev/mapper/datavg-backuplv   4.00MiB

System,DUP: Size:40.00MiB, Used:8.66MiB
   /dev/mapper/datavg-backuplv  80.00MiB

Unallocated:
   /dev/mapper/datavg-backuplv     0.00B

Other facts about the filesystem: it has a rather large number of
files, subvolumes, and read-only snapshots, which started from about
zero in March and grew to the current state of 3000 snapshots and no
idea how many files (filesystem usage is quite stable at the moment).

I also noticed that while the machine is rebooted on a weekly basis,
the time it takes to come up after a reboot has been growing. This is
likely correlated with how long it takes to mount the filesystem, and
perhaps with how much data there is on it.

Reboot time used to be about 3 minutes, then it jumped to 8 minutes on
March 21, and over the following weeks it went like this:
8 minutes, 11 minutes, 15 minutes...
19, 19, 19, 19, 23, 21, 22
32, 33, 36, 42, 46, 37, 30

This is on CentOS 6.6, and while I understand that this version of
btrfs is definitely oldish, even trying to mount the filesystem on a
much more recent kernel (3.14.43) brings no improvement. Switching the
regular OS kernel from the CentOS one (2.6.32-504.12.2.el6.x86_64) to
something more recent is also feasible.
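For what it's worth, the 30-minute figure can be read straight off the bracketed dmesg timestamps quoted above; a quick sketch (the awk one-liner is my own illustration, not something from this thread):

```shell
# The bracketed dmesg numbers are seconds since boot: the last btrfs
# message lands at ~22.4s and the next filesystem (EXT4) mounts at
# ~2022.4s, so the gap is roughly the btrfs mount time.
start=22.427258
end=2022.397318
awk -v s="$start" -v e="$end" \
    'BEGIN { printf "mount took ~%.0f seconds (~%.0f minutes)\n", e - s, (e - s) / 60 }'
# prints: mount took ~2000 seconds (~33 minutes)
```

That lines up with the "about 30 minutes" reported in the message.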
I wanted to check the system for problems, so I tried an offline "btrfs
check" using the latest btrfs-progs (version 4.1.2, freshly compiled
from source), but "btrfs check" ran out of memory after about 30
minutes.

The only output I got is this (timestamps added by me):

2015-07-28 18:14:45 $ sudo btrfs check /dev/datavg/backuplv
2015-07-28 18:33:05 checking extents

And at 19:04:55 btrfs was killed by the OOM killer (abbreviated log
below, full excerpt as an attachment):

2015-07-28T19:04:55.224855+09:00 localhost kernel: [11689.692680] htop invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
...
2015-07-28T19:04:55.225855+09:00 localhost kernel: [11689.801354] 631 total pagecache pages
2015-07-28T19:04:55.225857+09:00 localhost kernel: [11689.801829] 0 pages in swap cache
2015-07-28T19:04:55.225859+09:00 localhost kernel: [11689.802305] Swap cache stats: add 0, delete 0, find 0/0
2015-07-28T19:04:55.225861+09:00 localhost kernel: [11689.802781] Free swap = 0kB
2015-07-28T19:04:55.225863+09:00 localhost kernel: [11689.803341] Total swap = 0kB
2015-07-28T19:04:55.225864+09:00 localhost kernel: [11689.946223] 16777215 pages RAM
2015-07-28T19:04:55.225867+09:00 localhost kernel: [11689.946724] 295175 pages reserved
2015-07-28T19:04:55.225869+09:00 localhost kernel: [11689.947223] 5173 pages shared
2015-07-28T19:04:55.225871+09:00 localhost kernel: [11689.947721] 16369184 pages non-shared
2015-07-28T19:04:55.225874+09:00 localhost kernel: [11689.948222] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
...
2015-07-28T19:04:55.225970+09:00 localhost kernel: [11689.994240] [16291] 0 16291 47166 177 18 0 0 sudo
2015-07-28T19:04:55.225972+09:00 localhost kernel: [11689.995232] [16292] 1000 16292 981 20 3 0 0 tai64n
2015-07-28T19:04:55.225974+09:00 localhost kernel: [11689.996241] [16293] 0 16293 47166 177 22 0 0 sudo
2015-07-28T19:04:55.225978+09:00 localhost kernel: [11689.997230] [16294] 1000 16294 1018 21 1 0 0 tai64nlocal
2015-07-28T19:04:55.225993+09:00 localhost kernel: [11689.998227] [16295] 0 16295 16122385 16118611 7 0 0 btrfs
2015-07-28T19:04:55.225995+09:00 localhost kernel: [11689.999210] [16296] 0 16296 25228 25 5 0 0 tee
2015-07-28T19:04:55.225997+09:00 localhost kernel: [11690.000201] [16297] 1000 16297 27133 162 1 0 0 bash
...
2015-07-28T19:04:55.226030+09:00 localhost kernel: [11690.008288] Out of memory: Kill process 16295 (btrfs) score 949 or sacrifice child
2015-07-28T19:04:55.226031+09:00 localhost kernel: [11690.009300] Killed process 16295, UID 0, (btrfs) total-vm:64489540kB, anon-rss:64474408kB, file-rss:36kB

Thanks in advance for any advice,
--
Georgi

[-- Attachment #2: oom.log --]
[-- Type: text/plain, Size: 28271 bytes --]

2015-07-28T19:04:55.224855+09:00 localhost kernel: [11689.692680] htop invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
2015-07-28T19:04:55.225076+09:00 localhost kernel: [11689.693636] htop cpuset=/ mems_allowed=0-1
2015-07-28T19:04:55.225269+09:00 localhost kernel: [11689.694114] Pid: 16323, comm: htop Tainted: G --------------- T 2.6.32-504.12.2.el6.x86_64 #1
2015-07-28T19:04:55.225274+09:00 localhost kernel: [11689.695062] Call Trace:
2015-07-28T19:04:55.225278+09:00 localhost kernel: [11689.695551] [<ffffffff810d40c1>] ? cpuset_print_task_mems_allowed+0x91/0xb0
2015-07-28T19:04:55.225281+09:00 localhost kernel: [11689.696045] [<ffffffff81127300>] ? dump_header+0x90/0x1b0
2015-07-28T19:04:55.225283+09:00 localhost kernel: [11689.696534] [<ffffffff8122eb5c>] ?
security_real_capable_noaudit+0x3c/0x70 2015-07-28T19:04:55.225285+09:00 localhost kernel: [11689.697021] [<ffffffff81127782>] ? oom_kill_process+0x82/0x2a0 2015-07-28T19:04:55.225288+09:00 localhost kernel: [11689.697507] [<ffffffff811276c1>] ? select_bad_process+0xe1/0x120 2015-07-28T19:04:55.225290+09:00 localhost kernel: [11689.697991] [<ffffffff81127bc0>] ? out_of_memory+0x220/0x3c0 2015-07-28T19:04:55.225292+09:00 localhost kernel: [11689.698479] [<ffffffff811344df>] ? __alloc_pages_nodemask+0x89f/0x8d0 2015-07-28T19:04:55.225295+09:00 localhost kernel: [11689.698967] [<ffffffff8116c69a>] ? alloc_pages_current+0xaa/0x110 2015-07-28T19:04:55.225297+09:00 localhost kernel: [11689.699451] [<ffffffff811246f7>] ? __page_cache_alloc+0x87/0x90 2015-07-28T19:04:55.225300+09:00 localhost kernel: [11689.699929] [<ffffffff811240de>] ? find_get_page+0x1e/0xa0 2015-07-28T19:04:55.225302+09:00 localhost kernel: [11689.700413] [<ffffffff81125697>] ? filemap_fault+0x1a7/0x500 2015-07-28T19:04:55.225305+09:00 localhost kernel: [11689.700896] [<ffffffff8114eae4>] ? __do_fault+0x54/0x530 2015-07-28T19:04:55.225307+09:00 localhost kernel: [11689.701377] [<ffffffff8114f0b7>] ? handle_pte_fault+0xf7/0xb00 2015-07-28T19:04:55.225310+09:00 localhost kernel: [11689.701862] [<ffffffff811b07e0>] ? mntput_no_expire+0x30/0x110 2015-07-28T19:04:55.225314+09:00 localhost kernel: [11689.702348] [<ffffffff8118b18f>] ? __dentry_open+0x23f/0x360 2015-07-28T19:04:55.225316+09:00 localhost kernel: [11689.702827] [<ffffffff8122e6ff>] ? security_inode_permission+0x1f/0x30 2015-07-28T19:04:55.225318+09:00 localhost kernel: [11689.703308] [<ffffffff8114fcea>] ? handle_mm_fault+0x22a/0x300 2015-07-28T19:04:55.225321+09:00 localhost kernel: [11689.703873] [<ffffffff8104d0d8>] ? __do_page_fault+0x138/0x480 2015-07-28T19:04:55.225323+09:00 localhost kernel: [11689.704357] [<ffffffff8129901b>] ? 
strncpy_from_user+0x5b/0x90 2015-07-28T19:04:55.225325+09:00 localhost kernel: [11689.704842] [<ffffffff8153003e>] ? do_page_fault+0x3e/0xa0 2015-07-28T19:04:55.225328+09:00 localhost kernel: [11689.705329] [<ffffffff8152d3f5>] ? page_fault+0x25/0x30 2015-07-28T19:04:55.225331+09:00 localhost kernel: [11689.705807] Mem-Info: 2015-07-28T19:04:55.225333+09:00 localhost kernel: [11689.706280] Node 0 DMA per-cpu: 2015-07-28T19:04:55.225336+09:00 localhost kernel: [11689.706756] CPU 0: hi: 0, btch: 1 usd: 0 2015-07-28T19:04:55.225338+09:00 localhost kernel: [11689.707233] CPU 1: hi: 0, btch: 1 usd: 0 2015-07-28T19:04:55.225341+09:00 localhost kernel: [11689.707709] CPU 2: hi: 0, btch: 1 usd: 0 2015-07-28T19:04:55.225344+09:00 localhost kernel: [11689.708190] CPU 3: hi: 0, btch: 1 usd: 0 2015-07-28T19:04:55.225347+09:00 localhost kernel: [11689.708667] CPU 4: hi: 0, btch: 1 usd: 0 2015-07-28T19:04:55.225349+09:00 localhost kernel: [11689.709144] CPU 5: hi: 0, btch: 1 usd: 0 2015-07-28T19:04:55.225352+09:00 localhost kernel: [11689.709622] CPU 6: hi: 0, btch: 1 usd: 0 2015-07-28T19:04:55.225354+09:00 localhost kernel: [11689.710100] CPU 7: hi: 0, btch: 1 usd: 0 2015-07-28T19:04:55.225357+09:00 localhost kernel: [11689.710577] CPU 8: hi: 0, btch: 1 usd: 0 2015-07-28T19:04:55.225359+09:00 localhost kernel: [11689.711052] CPU 9: hi: 0, btch: 1 usd: 0 2015-07-28T19:04:55.225361+09:00 localhost kernel: [11689.711531] CPU 10: hi: 0, btch: 1 usd: 0 2015-07-28T19:04:55.225364+09:00 localhost kernel: [11689.712005] CPU 11: hi: 0, btch: 1 usd: 0 2015-07-28T19:04:55.225366+09:00 localhost kernel: [11689.712485] CPU 12: hi: 0, btch: 1 usd: 0 2015-07-28T19:04:55.225368+09:00 localhost kernel: [11689.712959] CPU 13: hi: 0, btch: 1 usd: 0 2015-07-28T19:04:55.225371+09:00 localhost kernel: [11689.713440] CPU 14: hi: 0, btch: 1 usd: 0 2015-07-28T19:04:55.225373+09:00 localhost kernel: [11689.713912] CPU 15: hi: 0, btch: 1 usd: 0 2015-07-28T19:04:55.225376+09:00 localhost kernel: 
[11689.714392] CPU 16: hi: 0, btch: 1 usd: 0 2015-07-28T19:04:55.225379+09:00 localhost kernel: [11689.714868] CPU 17: hi: 0, btch: 1 usd: 0 2015-07-28T19:04:55.225381+09:00 localhost kernel: [11689.715350] CPU 18: hi: 0, btch: 1 usd: 0 2015-07-28T19:04:55.225383+09:00 localhost kernel: [11689.715830] CPU 19: hi: 0, btch: 1 usd: 0 2015-07-28T19:04:55.225386+09:00 localhost kernel: [11689.716310] CPU 20: hi: 0, btch: 1 usd: 0 2015-07-28T19:04:55.225388+09:00 localhost kernel: [11689.716788] CPU 21: hi: 0, btch: 1 usd: 0 2015-07-28T19:04:55.225390+09:00 localhost kernel: [11689.717267] CPU 22: hi: 0, btch: 1 usd: 0 2015-07-28T19:04:55.225393+09:00 localhost kernel: [11689.717838] CPU 23: hi: 0, btch: 1 usd: 0 2015-07-28T19:04:55.225396+09:00 localhost kernel: [11689.718314] CPU 24: hi: 0, btch: 1 usd: 0 2015-07-28T19:04:55.225398+09:00 localhost kernel: [11689.718787] CPU 25: hi: 0, btch: 1 usd: 0 2015-07-28T19:04:55.225401+09:00 localhost kernel: [11689.719264] CPU 26: hi: 0, btch: 1 usd: 0 2015-07-28T19:04:55.225403+09:00 localhost kernel: [11689.719741] CPU 27: hi: 0, btch: 1 usd: 0 2015-07-28T19:04:55.225406+09:00 localhost kernel: [11689.720217] CPU 28: hi: 0, btch: 1 usd: 0 2015-07-28T19:04:55.225408+09:00 localhost kernel: [11689.720698] CPU 29: hi: 0, btch: 1 usd: 0 2015-07-28T19:04:55.225411+09:00 localhost kernel: [11689.721177] CPU 30: hi: 0, btch: 1 usd: 0 2015-07-28T19:04:55.225413+09:00 localhost kernel: [11689.721655] CPU 31: hi: 0, btch: 1 usd: 0 2015-07-28T19:04:55.225416+09:00 localhost kernel: [11689.722136] Node 0 DMA32 per-cpu: 2015-07-28T19:04:55.225418+09:00 localhost kernel: [11689.722616] CPU 0: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225421+09:00 localhost kernel: [11689.723095] CPU 1: hi: 186, btch: 31 usd: 30 2015-07-28T19:04:55.225422+09:00 localhost kernel: [11689.734923] CPU 2: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225425+09:00 localhost kernel: [11689.735406] CPU 3: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225427+09:00 
localhost kernel: [11689.735880] CPU 4: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225429+09:00 localhost kernel: [11689.736359] CPU 5: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225431+09:00 localhost kernel: [11689.736834] CPU 6: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225433+09:00 localhost kernel: [11689.737312] CPU 7: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225436+09:00 localhost kernel: [11689.737793] CPU 8: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225438+09:00 localhost kernel: [11689.738275] CPU 9: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225440+09:00 localhost kernel: [11689.738755] CPU 10: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225443+09:00 localhost kernel: [11689.739233] CPU 11: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225445+09:00 localhost kernel: [11689.739713] CPU 12: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225448+09:00 localhost kernel: [11689.740192] CPU 13: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225451+09:00 localhost kernel: [11689.740670] CPU 14: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225454+09:00 localhost kernel: [11689.741264] CPU 15: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225469+09:00 localhost kernel: [11689.741744] CPU 16: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225471+09:00 localhost kernel: [11689.742224] CPU 17: hi: 186, btch: 31 usd: 31 2015-07-28T19:04:55.225473+09:00 localhost kernel: [11689.742701] CPU 18: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225476+09:00 localhost kernel: [11689.743178] CPU 19: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225478+09:00 localhost kernel: [11689.743654] CPU 20: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225480+09:00 localhost kernel: [11689.744131] CPU 21: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225483+09:00 localhost kernel: [11689.744610] CPU 22: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225485+09:00 localhost kernel: [11689.745088] CPU 23: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225488+09:00 localhost kernel: 
[11689.745572] CPU 24: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225490+09:00 localhost kernel: [11689.746144] CPU 25: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225493+09:00 localhost kernel: [11689.746618] CPU 26: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225495+09:00 localhost kernel: [11689.747089] CPU 27: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225498+09:00 localhost kernel: [11689.747570] CPU 28: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225501+09:00 localhost kernel: [11689.748051] CPU 29: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225503+09:00 localhost kernel: [11689.748531] CPU 30: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225505+09:00 localhost kernel: [11689.749007] CPU 31: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225507+09:00 localhost kernel: [11689.749489] Node 0 Normal per-cpu: 2015-07-28T19:04:55.225509+09:00 localhost kernel: [11689.749965] CPU 0: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225510+09:00 localhost kernel: [11689.750444] CPU 1: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225512+09:00 localhost kernel: [11689.750921] CPU 2: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225514+09:00 localhost kernel: [11689.751405] CPU 3: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225517+09:00 localhost kernel: [11689.751886] CPU 4: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225532+09:00 localhost kernel: [11689.752368] CPU 5: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225534+09:00 localhost kernel: [11689.752849] CPU 6: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225536+09:00 localhost kernel: [11689.753329] CPU 7: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225538+09:00 localhost kernel: [11689.753808] CPU 8: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225540+09:00 localhost kernel: [11689.754291] CPU 9: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225543+09:00 localhost kernel: [11689.754770] CPU 10: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225545+09:00 localhost kernel: [11689.755250] CPU 11: hi: 186, btch: 
31 usd: 0 2015-07-28T19:04:55.225548+09:00 localhost kernel: [11689.755731] CPU 12: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225550+09:00 localhost kernel: [11689.756214] CPU 13: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225552+09:00 localhost kernel: [11689.756692] CPU 14: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225554+09:00 localhost kernel: [11689.757174] CPU 15: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225556+09:00 localhost kernel: [11689.757652] CPU 16: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225558+09:00 localhost kernel: [11689.758129] CPU 17: hi: 186, btch: 31 usd: 5 2015-07-28T19:04:55.225561+09:00 localhost kernel: [11689.758607] CPU 18: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225563+09:00 localhost kernel: [11689.759081] CPU 19: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225570+09:00 localhost kernel: [11689.759559] CPU 20: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225573+09:00 localhost kernel: [11689.760038] CPU 21: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225575+09:00 localhost kernel: [11689.760608] CPU 22: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225578+09:00 localhost kernel: [11689.761082] CPU 23: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225580+09:00 localhost kernel: [11689.761562] CPU 24: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225582+09:00 localhost kernel: [11689.762040] CPU 25: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225584+09:00 localhost kernel: [11689.762519] CPU 26: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225585+09:00 localhost kernel: [11689.762998] CPU 27: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225587+09:00 localhost kernel: [11689.763481] CPU 28: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225589+09:00 localhost kernel: [11689.763961] CPU 29: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225591+09:00 localhost kernel: [11689.764442] CPU 30: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225593+09:00 localhost kernel: [11689.764921] CPU 31: hi: 186, btch: 31 usd: 0 
2015-07-28T19:04:55.225595+09:00 localhost kernel: [11689.765402] Node 1 Normal per-cpu: 2015-07-28T19:04:55.225597+09:00 localhost kernel: [11689.765882] CPU 0: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225599+09:00 localhost kernel: [11689.766365] CPU 1: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225601+09:00 localhost kernel: [11689.766839] CPU 2: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225603+09:00 localhost kernel: [11689.767316] CPU 3: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225605+09:00 localhost kernel: [11689.767791] CPU 4: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225607+09:00 localhost kernel: [11689.768268] CPU 5: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225609+09:00 localhost kernel: [11689.768746] CPU 6: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225611+09:00 localhost kernel: [11689.769228] CPU 7: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225614+09:00 localhost kernel: [11689.769708] CPU 8: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225616+09:00 localhost kernel: [11689.770187] CPU 9: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225618+09:00 localhost kernel: [11689.770665] CPU 10: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225620+09:00 localhost kernel: [11689.771145] CPU 11: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225622+09:00 localhost kernel: [11689.771624] CPU 12: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225625+09:00 localhost kernel: [11689.772105] CPU 13: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225774+09:00 localhost kernel: [11689.772585] CPU 14: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225777+09:00 localhost kernel: [11689.773063] CPU 15: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225780+09:00 localhost kernel: [11689.773545] CPU 16: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225782+09:00 localhost kernel: [11689.774023] CPU 17: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225784+09:00 localhost kernel: [11689.774506] CPU 18: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225786+09:00 
localhost kernel: [11689.775078] CPU 19: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225789+09:00 localhost kernel: [11689.775557] CPU 20: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225792+09:00 localhost kernel: [11689.776031] CPU 21: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225794+09:00 localhost kernel: [11689.776511] CPU 22: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225796+09:00 localhost kernel: [11689.776986] CPU 23: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225797+09:00 localhost kernel: [11689.777469] CPU 24: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225799+09:00 localhost kernel: [11689.777948] CPU 25: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225802+09:00 localhost kernel: [11689.778429] CPU 26: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225804+09:00 localhost kernel: [11689.778906] CPU 27: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225806+09:00 localhost kernel: [11689.779386] CPU 28: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225808+09:00 localhost kernel: [11689.779861] CPU 29: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225810+09:00 localhost kernel: [11689.780343] CPU 30: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225813+09:00 localhost kernel: [11689.780823] CPU 31: hi: 186, btch: 31 usd: 0 2015-07-28T19:04:55.225815+09:00 localhost kernel: [11689.781309] active_anon:16209682 inactive_anon:26 isolated_anon:0 2015-07-28T19:04:55.225817+09:00 localhost kernel: [11689.781309] active_file:124 inactive_file:123 isolated_file:0 2015-07-28T19:04:55.225819+09:00 localhost kernel: [11689.781310] unevictable:0 dirty:3 writeback:0 unstable:0 2015-07-28T19:04:55.225822+09:00 localhost kernel: [11689.781311] free:98591 slab_reclaimable:4276 slab_unreclaimable:15061 2015-07-28T19:04:55.225824+09:00 localhost kernel: [11689.781311] mapped:202 shmem:132 pagetables:32940 bounce:0 2015-07-28T19:04:55.225827+09:00 localhost kernel: [11689.783710] Node 0 DMA free:15740kB min:60kB low:72kB high:88kB active_anon:0kB inactive_anon:0kB 
active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15348kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes 2015-07-28T19:04:55.225830+09:00 localhost kernel: [11689.786576] lowmem_reserve[]: 0 2955 32245 32245 2015-07-28T19:04:55.225832+09:00 localhost kernel: [11689.787078] Node 0 DMA32 free:129148kB min:11992kB low:14988kB high:17988kB active_anon:2270248kB inactive_anon:0kB active_file:76kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3026080kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:1268kB slab_unreclaimable:1140kB kernel_stack:0kB pagetables:4288kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:124 all_unreclaimable? yes 2015-07-28T19:04:55.225835+09:00 localhost kernel: [11689.790032] lowmem_reserve[]: 0 0 29290 29290 2015-07-28T19:04:55.225837+09:00 localhost kernel: [11689.790536] Node 0 Normal free:118632kB min:118892kB low:148612kB high:178336kB active_anon:29973164kB inactive_anon:40kB active_file:0kB inactive_file:108kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:29992960kB mlocked:0kB dirty:0kB writeback:0kB mapped:4kB shmem:232kB slab_reclaimable:8184kB slab_unreclaimable:35756kB kernel_stack:4992kB pagetables:61180kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:168 all_unreclaimable? 
yes 2015-07-28T19:04:55.225839+09:00 localhost kernel: [11689.793404] lowmem_reserve[]: 0 0 0 0 2015-07-28T19:04:55.225841+09:00 localhost kernel: [11689.793904] Node 1 Normal free:130844kB min:131192kB low:163988kB high:196788kB active_anon:32595316kB inactive_anon:64kB active_file:420kB inactive_file:484kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:33095680kB mlocked:0kB dirty:12kB writeback:0kB mapped:804kB shmem:296kB slab_reclaimable:7652kB slab_unreclaimable:23348kB kernel_stack:376kB pagetables:66292kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1426 all_unreclaimable? yes 2015-07-28T19:04:55.225844+09:00 localhost kernel: [11689.796784] lowmem_reserve[]: 0 0 0 0 2015-07-28T19:04:55.225846+09:00 localhost kernel: [11689.797286] Node 0 DMA: 3*4kB 0*8kB 1*16kB 1*32kB 1*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15740kB 2015-07-28T19:04:55.225848+09:00 localhost kernel: [11689.798302] Node 0 DMA32: 348*4kB 323*8kB 294*16kB 230*32kB 189*64kB 145*128kB 92*256kB 47*512kB 26*1024kB 4*2048kB 0*4096kB = 129128kB 2015-07-28T19:04:55.225850+09:00 localhost kernel: [11689.799321] Node 0 Normal: 30009*4kB 1*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 120044kB 2015-07-28T19:04:55.225852+09:00 localhost kernel: [11689.800336] Node 1 Normal: 32983*4kB 11*8kB 1*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 132036kB 2015-07-28T19:04:55.225855+09:00 localhost kernel: [11689.801354] 631 total pagecache pages 2015-07-28T19:04:55.225857+09:00 localhost kernel: [11689.801829] 0 pages in swap cache 2015-07-28T19:04:55.225859+09:00 localhost kernel: [11689.802305] Swap cache stats: add 0, delete 0, find 0/0 2015-07-28T19:04:55.225861+09:00 localhost kernel: [11689.802781] Free swap = 0kB 2015-07-28T19:04:55.225863+09:00 localhost kernel: [11689.803341] Total swap = 0kB 2015-07-28T19:04:55.225864+09:00 localhost kernel: [11689.946223] 16777215 pages RAM 
2015-07-28T19:04:55.225867+09:00 localhost kernel: [11689.946724] 295175 pages reserved 2015-07-28T19:04:55.225869+09:00 localhost kernel: [11689.947223] 5173 pages shared 2015-07-28T19:04:55.225871+09:00 localhost kernel: [11689.947721] 16369184 pages non-shared 2015-07-28T19:04:55.225874+09:00 localhost kernel: [11689.948222] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name 2015-07-28T19:04:55.225876+09:00 localhost kernel: [11689.949309] [ 1327] 0 1327 2874 317 1 -17 -1000 udevd 2015-07-28T19:04:55.225879+09:00 localhost kernel: [11689.950308] [ 3227] 0 3227 25814 77 0 0 0 lvmetad 2015-07-28T19:04:55.225881+09:00 localhost kernel: [11689.951302] [ 8574] 0 8574 6899 61 5 -17 -1000 auditd 2015-07-28T19:04:55.225883+09:00 localhost kernel: [11689.952296] [ 8594] 0 8594 125317 287 3 0 0 rsyslogd 2015-07-28T19:04:55.225886+09:00 localhost kernel: [11689.953289] [ 8718] 0 8718 40367 243 0 0 0 pbx_exchange 2015-07-28T19:04:55.225888+09:00 localhost kernel: [11689.954284] [ 8730] 81 8730 5358 63 1 0 0 dbus-daemon 2015-07-28T19:04:55.225890+09:00 localhost kernel: [11689.955281] [ 8768] 0 8768 63314 14236 16 0 0 snmpd 2015-07-28T19:04:55.225892+09:00 localhost kernel: [11689.956277] [ 8785] 0 8785 16554 178 0 -17 -1000 sshd 2015-07-28T19:04:55.225894+09:00 localhost kernel: [11689.957265] [ 8796] 0 8796 5429 59 1 0 0 xinetd 2015-07-28T19:04:55.225897+09:00 localhost kernel: [11689.958259] [ 8823] 38 8823 6627 147 0 0 0 ntpd 2015-07-28T19:04:55.225899+09:00 localhost kernel: [11689.959254] [ 8902] 0 8902 20214 226 21 0 0 master 2015-07-28T19:04:55.225901+09:00 localhost kernel: [11689.960325] [ 8912] 0 8912 29216 156 16 0 0 crond 2015-07-28T19:04:55.225903+09:00 localhost kernel: [11689.961316] [ 8914] 89 8914 20277 238 1 0 0 qmgr 2015-07-28T19:04:55.225906+09:00 localhost kernel: [11689.962315] [ 8935] 0 8935 5276 45 1 0 0 atd 2015-07-28T19:04:55.225908+09:00 localhost kernel: [11689.963310] [ 9141] 0 9141 257570 5334 0 0 0 dsm_sa_datamgrd 
2015-07-28T19:04:55.225910+09:00 localhost kernel: [11689.964305] [ 9334] 0 9334 73207 203 17 0 0 dsm_sa_eventmgr 2015-07-28T19:04:55.225913+09:00 localhost kernel: [11689.977183] [ 9347] 0 9347 125807 2198 20 0 0 dsm_sa_snmpd 2015-07-28T19:04:55.225915+09:00 localhost kernel: [11689.978202] [ 9381] 0 9381 33145 113 3 0 0 dsm_om_connsvcd 2015-07-28T19:04:55.225929+09:00 localhost kernel: [11689.979228] [ 9382] 0 9382 889850 61373 19 0 0 dsm_om_connsvcd 2015-07-28T19:04:55.225931+09:00 localhost kernel: [11689.980218] [ 9414] 0 9414 189407 5176 0 0 0 dsm_sa_datamgrd 2015-07-28T19:04:55.225934+09:00 localhost kernel: [11689.981215] [ 9435] 0 9435 159830 1217 1 0 0 dsm_om_shrsvcd 2015-07-28T19:04:55.225936+09:00 localhost kernel: [11689.982216] [10192] 0 10192 1016 20 9 0 0 mingetty 2015-07-28T19:04:55.225938+09:00 localhost kernel: [11689.983201] [10194] 0 10194 1016 21 19 0 0 mingetty 2015-07-28T19:04:55.225940+09:00 localhost kernel: [11689.984202] [10196] 0 10196 1016 21 19 0 0 mingetty 2015-07-28T19:04:55.225942+09:00 localhost kernel: [11689.985201] [10200] 0 10200 1016 21 27 0 0 mingetty 2015-07-28T19:04:55.225944+09:00 localhost kernel: [11689.986180] [10202] 0 10202 1016 22 25 0 0 mingetty 2015-07-28T19:04:55.225946+09:00 localhost kernel: [11689.987176] [13176] 1000 13176 7468 1112 21 0 0 tmux 2015-07-28T19:04:55.225949+09:00 localhost kernel: [11689.988269] [13177] 1000 13177 27187 201 1 0 0 bash 2015-07-28T19:04:55.225951+09:00 localhost kernel: [11689.989268] [13242] 0 13242 1016 21 0 0 0 mingetty 2015-07-28T19:04:55.225961+09:00 localhost kernel: [11689.990262] [15161] 0 15161 2663 109 5 -17 -1000 udevd 2015-07-28T19:04:55.225964+09:00 localhost kernel: [11689.991245] [15179] 0 15179 2661 104 2 -17 -1000 udevd 2015-07-28T19:04:55.225966+09:00 localhost kernel: [11689.992266] [15471] 1000 15471 27133 168 0 0 0 bash 2015-07-28T19:04:55.225968+09:00 localhost kernel: [11689.993246] [15577] 89 15577 20234 218 1 0 0 pickup 2015-07-28T19:04:55.225970+09:00 
localhost kernel: [11689.994240] [16291] 0 16291 47166 177 18 0 0 sudo 2015-07-28T19:04:55.225972+09:00 localhost kernel: [11689.995232] [16292] 1000 16292 981 20 3 0 0 tai64n 2015-07-28T19:04:55.225974+09:00 localhost kernel: [11689.996241] [16293] 0 16293 47166 177 22 0 0 sudo 2015-07-28T19:04:55.225978+09:00 localhost kernel: [11689.997230] [16294] 1000 16294 1018 21 1 0 0 tai64nlocal 2015-07-28T19:04:55.225993+09:00 localhost kernel: [11689.998227] [16295] 0 16295 16122385 16118611 7 0 0 btrfs 2015-07-28T19:04:55.225995+09:00 localhost kernel: [11689.999210] [16296] 0 16296 25228 25 5 0 0 tee 2015-07-28T19:04:55.225997+09:00 localhost kernel: [11690.000201] [16297] 1000 16297 27133 162 1 0 0 bash 2015-07-28T19:04:55.225999+09:00 localhost kernel: [11690.001179] [16322] 0 16322 47166 178 19 0 0 sudo 2015-07-28T19:04:55.226015+09:00 localhost kernel: [11690.002167] [16323] 0 16323 28411 433 1 0 0 htop 2015-07-28T19:04:55.226020+09:00 localhost kernel: [11690.003270] [16329] 1000 16329 25240 38 0 0 0 iostat 2015-07-28T19:04:55.226022+09:00 localhost kernel: [11690.004244] [16436] 0 16436 24490 233 0 0 0 sshd 2015-07-28T19:04:55.226024+09:00 localhost kernel: [11690.005229] [16454] 1000 16454 24490 237 2 0 0 sshd 2015-07-28T19:04:55.226026+09:00 localhost kernel: [11690.006230] [16455] 1000 16455 27142 178 16 0 0 bash 2015-07-28T19:04:55.226028+09:00 localhost kernel: [11690.007272] [16481] 1000 16481 5925 82 18 0 0 tmux 2015-07-28T19:04:55.226030+09:00 localhost kernel: [11690.008288] Out of memory: Kill process 16295 (btrfs) score 949 or sacrifice child 2015-07-28T19:04:55.226031+09:00 localhost kernel: [11690.009300] Killed process 16295, UID 0, (btrfs) total-vm:64489540kB, anon-rss:64474408kB, file-rss:36kB ^ permalink raw reply [flat|nested] 54+ messages in thread
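One way to keep a failed check from invoking the OOM killer against the whole box (my own suggestion, not something proposed in the thread; the 32 GiB cap is an arbitrary illustrative figure) is to run btrfs check under a ulimit, so a failed allocation kills only the check itself rather than a random victim process:

```shell
# Cap the virtual address space of a subshell (ulimit -v takes KiB),
# then run the check inside it; only the check dies if the cap is hit.
limit_kib=$((32 * 1024 * 1024))   # 32 GiB
(
  ulimit -v "$limit_kib"
  btrfs check /dev/datavg/backuplv
)
```

The subshell matters: the limit applies only inside the parentheses, so the interactive shell keeps its original limits.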
* Re: mount btrfs takes 30 minutes, btrfs check runs out of memory
  2015-07-29  5:46 Georgi Georgiev
@ 2015-07-29  6:19 ` Qu Wenruo
  0 siblings, 0 replies; 54+ messages in thread
From: Qu Wenruo @ 2015-07-29  6:19 UTC (permalink / raw)
  To: Georgi Georgiev, linux-btrfs

Hi,

Georgi Georgiev wrote on 2015/07/29 14:46 +0900:
> We are using btrfs on a very large filesystem, and as we put more and
> more data onto it, the time it takes to mount has grown to, presently,
> about 30 minutes. Is there something wrong with the filesystem? Is
> there a way to bring this time down?
>
> ...
>
> Here is a snippet from dmesg, showing how long the mount takes (the
> EXT4-fs line is the filesystem mounted next in the boot sequence):
>
> $ dmesg | grep -A1 btrfs
> [   12.215764] TECH PREVIEW: btrfs may not be fully supported.
> [   12.215766] Please review provided documentation for limitations.
> --
> [   12.220266] btrfs: use zlib compression
> [   12.220815] btrfs: disk space caching is enabled
> [   22.427258] btrfs: bdev /dev/mapper/datavg-backuplv errs: wr 0, rd 0, flush 0, corrupt 0, gen 0
> [ 2022.397318] EXT4-fs (dm-2): mounted filesystem with ordered data mode. Opts:
>

Quite common, especially when the filesystem grows large.
But it would be much better to use ftrace to show which btrfs
operation takes the most time.

We have some guesses, from reading the space cache to reading the
chunk info, but we don't know which one takes most of the time.
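The head of the thread shows John later recording the mount with `trace-cmd record -e btrfs mount <PARTITION>`; a sketch of how such a session could then be summarized to see which events dominate (the mount point and the awk field position are assumptions on my part, as trace-cmd's report layout can vary between versions):

```shell
# Record every btrfs tracepoint while the slow mount runs, then count
# how often each event fired; field 4 of "trace-cmd report" output is
# normally the event name.
trace-cmd record -e 'btrfs:*' mount /dev/mapper/datavg-backuplv /mnt
trace-cmd report | awk '{ count[$4]++ } END { for (ev in count) print count[ev], ev }' | sort -rn | head
```

Event counts alone don't show elapsed time per operation, which is why Qu later asks for the function_graph tracer as well.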
> The btrfs filesystem is quite large:
>
> $ sudo btrfs filesystem usage /dev/mapper/datavg-backuplv
> Overall:
>     Device size:          82.58TiB
>     Device allocated:     82.58TiB
>     Device unallocated:      0.00B
>     Device missing:          0.00B
>     Used:                 62.01TiB
>     Free (estimated):     17.76TiB  (min: 17.76TiB)
>     Data ratio:               1.00
>     Metadata ratio:           2.00
>     Global reserve:          0.00B  (used: 0.00B)
>
> Data,single: Size:79.28TiB, Used:61.52TiB
>    /dev/mapper/datavg-backuplv  79.28TiB
>
> Metadata,single: Size:8.00MiB, Used:0.00B
>    /dev/mapper/datavg-backuplv   8.00MiB
>
> Metadata,DUP: Size:1.65TiB, Used:252.68GiB
>    /dev/mapper/datavg-backuplv   3.30TiB
>
> System,single: Size:4.00MiB, Used:0.00B
>    /dev/mapper/datavg-backuplv   4.00MiB
>
> System,DUP: Size:40.00MiB, Used:8.66MiB
>    /dev/mapper/datavg-backuplv  80.00MiB
>
> Unallocated:
>    /dev/mapper/datavg-backuplv     0.00B

Wow, nearly 100T. That is really huge.

> Other facts about the filesystem: it has a rather large number of
> files, subvolumes, and read-only snapshots, which started from about
> zero in March and grew to the current state of 3000 snapshots and no
> idea how many files (filesystem usage is quite stable at the moment).
>
> I also noticed that while the machine is rebooted on a weekly basis,
> the time it takes to come up after a reboot has been growing. This is
> likely correlated with how long it takes to mount the filesystem, and
> perhaps with how much data there is on it.
>
> Reboot time used to be about 3 minutes, then it jumped to 8 minutes on
> March 21, and over the following weeks it went like this:
> 8 minutes, 11 minutes, 15 minutes...
> 19, 19, 19, 19, 23, 21, 22
> 32, 33, 36, 42, 46, 37, 30
>
> This is on CentOS 6.6, and while I understand that this version of
> btrfs is definitely oldish, even trying to mount the filesystem on a
> much more recent kernel (3.14.43) brings no improvement.
> Switching the regular OS kernel from the CentOS one
> (2.6.32-504.12.2.el6.x86_64) to something more recent is also feasible.
>
> I wanted to check the system for problems, so I tried an offline "btrfs
> check" using the latest btrfs-progs (version 4.1.2 freshly compiled from
> source), but "btrfs check" ran out of memory after about 30 minutes.
>
> The only output I get is this (timestamps added by me):
>
> 2015-07-28 18:14:45 $ sudo btrfs check /dev/datavg/backuplv
> 2015-07-28 18:33:05 checking extents
>
> And at 19:04:55 btrfs was killed by OOM (abbreviated log below,
> full excerpt as an attachment).

Not surprised at all.

For the extent/chunk tree check, it reads all the chunks and extents,
stores the needed info in memory, and then does the cross-reference
checks.

The btrfsck process really takes a lot of memory, maybe 1/10 or more of
the metadata space. In your case, your metadata is about 250GiB, so maybe
25GiB of memory is used to hold the needed info.

That's a known issue, but we don't have a good idea, or a developer, to
reduce the memory usage yet.
Maybe we can change the behavior to do chunk-by-chunk extent cross
checking to reduce memory usage, but not now...

Thanks,
Qu

> 2015-07-28T19:04:55.224855+09:00 localhost kernel: [11689.692680] htop invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
> ...
> 2015-07-28T19:04:55.225855+09:00 localhost kernel: [11689.801354] 631 total pagecache pages
> 2015-07-28T19:04:55.225857+09:00 localhost kernel: [11689.801829] 0 pages in swap cache
> 2015-07-28T19:04:55.225859+09:00 localhost kernel: [11689.802305] Swap cache stats: add 0, delete 0, find 0/0
> 2015-07-28T19:04:55.225861+09:00 localhost kernel: [11689.802781] Free swap  = 0kB
> 2015-07-28T19:04:55.225863+09:00 localhost kernel: [11689.803341] Total swap = 0kB
> 2015-07-28T19:04:55.225864+09:00 localhost kernel: [11689.946223] 16777215 pages RAM
> 2015-07-28T19:04:55.225867+09:00 localhost kernel: [11689.946724] 295175 pages reserved
> 2015-07-28T19:04:55.225869+09:00 localhost kernel: [11689.947223] 5173 pages shared
> 2015-07-28T19:04:55.225871+09:00 localhost kernel: [11689.947721] 16369184 pages non-shared
> 2015-07-28T19:04:55.225874+09:00 localhost kernel: [11689.948222] [ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
> ...
> 2015-07-28T19:04:55.225970+09:00 localhost kernel: [11689.994240] [16291]     0 16291    47166      177  18       0             0 sudo
> 2015-07-28T19:04:55.225972+09:00 localhost kernel: [11689.995232] [16292]  1000 16292      981       20   3       0             0 tai64n
> 2015-07-28T19:04:55.225974+09:00 localhost kernel: [11689.996241] [16293]     0 16293    47166      177  22       0             0 sudo
> 2015-07-28T19:04:55.225978+09:00 localhost kernel: [11689.997230] [16294]  1000 16294     1018       21   1       0             0 tai64nlocal
> 2015-07-28T19:04:55.225993+09:00 localhost kernel: [11689.998227] [16295]     0 16295 16122385 16118611   7       0             0 btrfs
> 2015-07-28T19:04:55.225995+09:00 localhost kernel: [11689.999210] [16296]     0 16296    25228       25   5       0             0 tee
> 2015-07-28T19:04:55.225997+09:00 localhost kernel: [11690.000201] [16297]  1000 16297    27133      162   1       0             0 bash
> ...
> 2015-07-28T19:04:55.226030+09:00 localhost kernel: [11690.008288] Out of memory: Kill process 16295 (btrfs) score 949 or sacrifice child
> 2015-07-28T19:04:55.226031+09:00 localhost kernel: [11690.009300] Killed process 16295, UID 0, (btrfs) total-vm:64489540kB, anon-rss:64474408kB, file-rss:36kB
>
> Thanks in advance for any advice,

^ permalink raw reply	[flat|nested] 54+ messages in thread
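Qu's memory estimate and the OOM report are consistent with each other, which a little arithmetic shows (a sketch; the 1/10 ratio is Qu's rough rule of thumb, and 4 KiB pages are assumed, as on x86_64):

```python
# Back-of-the-envelope check of the numbers quoted in this thread.

# btrfsck rule of thumb: memory use is roughly 1/10 of the metadata space.
metadata_used_gib = 252.68            # Metadata,DUP "Used" from btrfs fi usage
check_mem_gib = metadata_used_gib / 10
print(f"estimated btrfs check memory: ~{check_mem_gib:.1f} GiB")

# The OOM process table counts in 4 KiB pages; converting the btrfs entry's
# total_vm back to kB reproduces the "total-vm:64489540kB" in the kill line.
total_vm_pages = 16122385             # total_vm of pid 16295 (btrfs)
total_vm_kb = total_vm_pages * 4
print(f"btrfs total_vm: {total_vm_kb} kB")
```

The actual kill happened at ~61 GiB of anonymous RSS, well past the ~25 GiB estimate, which fits Qu's "1/10 *or more*" wording for a filesystem with this many snapshots.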