* Runaway SLAB usage by 'bio' during 'device replace'
@ 2016-05-30 18:48 Chris Johnson
  2016-05-30 20:55 ` Duncan
  2016-05-31  7:42 ` Filipe Manana
  0 siblings, 2 replies; 5+ messages in thread
From: Chris Johnson @ 2016-05-30 18:48 UTC (permalink / raw)
  To: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 1599 bytes --]

I have a RAID6 array that had a failed HDD. The drive failed
completely and has been removed from the system. I'm running a 'device
replace' operation with a new disk. The array is ~20TB so this will
take a few days.

Yesterday the system crashed hard with OOM errors about 24 hours into
the replace. Rebooting after the crash and remounting the array
automatically resumed the replace where it left off.

Today I kept a close eye on it and have watched the memory usage creep
up slowly.

htop shows this as user process memory (green bar), but lists no user
process actually using that much memory.

free says this is almost entirely cached/buffered memory that is
taking up the space.

slabtop reveals a highly unusual amount of SLAB going to 'bio', which
apparently relates to block I/O. slabtop output is attached.

'sync && echo 3 > /proc/sys/vm/drop_caches' clears the high usage
(~4GB) from dentry but 'bio' does not release any (11GB) memory and
continues to grow slowly.
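
In case it helps with a report, something like the following could log that
cache's growth over time (a rough sketch, run as root; it just samples
/proc/slabinfo, where the caches show up by name, e.g. 'bio-1'):

  # log name, active objects, total objects and object size once a minute
  while true; do
      date
      grep '^bio' /proc/slabinfo | awk '{print $1, $2, $3, $4}'
      sleep 60
  done >> bio-slab.log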

This is running the Rockstor distro based on CentOS. The system has 16GB of RAM.

Kernel: 4.4.5-1.el7.elrepo.x86_64
btrfs-progs: 4.4.1

Kernel messages aren't showing anything of note during the replace
until it starts throwing out OOM errors.

I would like to collect enough information for a useful bug report
here, but I also can't babysit this rebuild during the work week and
reboot it once a day for OOM crashes. Should I cancel the replace
operation and use 'dev delete missing' instead? Will using 'delete
missing' cause any problem if it's done after a partially completed
and canceled replace?

[-- Attachment #2: FWbn3S3C.txt --]
[-- Type: text/plain, Size: 4026 bytes --]

# slabtop -o -s=a
 Active / Total Objects (% used)    : 33431432 / 33664160 (99.3%)
 Active / Total Slabs (% used)      : 1346736 / 1346736 (100.0%)
 Active / Total Caches (% used)     : 78 / 114 (68.4%)
 Active / Total Size (% used)       : 10512136.19K / 10737701.80K (97.9%)
 Minimum / Average / Maximum Object : 0.01K / 0.32K / 15.62K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME                   
32493650 32492775  99%    0.31K 1299746       25  10397968K bio-1                  
323505 323447  99%    0.19K  15405       21     61620K dentry                 
176680 176680 100%    0.07K   3155       56     12620K btrfs_free_space       
118208  41288  34%    0.12K   3694       32     14776K kmalloc-128            
 94528  43378  45%    0.25K   2954       32     23632K kmalloc-256            
 91872  41682  45%    0.50K   2871       32     45936K kmalloc-512            
 83048  39031  46%    4.00K  10381        8    332192K kmalloc-4096           
 69049  69049 100%    0.27K   2381       29     19048K btrfs_extent_buffer    
 46872  46385  98%    0.57K   1674       28     26784K radix_tree_node        
 23460  23460 100%    0.12K    690       34      2760K kernfs_node_cache      
 17536  17536 100%    0.98K    548       32     17536K btrfs_inode            
 16380  16007  97%    0.14K    585       28      2340K btrfs_path             
 12444  11635  93%    0.08K    244       51       976K Acpi-State             
 12404  12404 100%    0.55K    443       28      7088K inode_cache            
 11648  10851  93%    0.06K    182       64       728K kmalloc-64             
 10404   5716  54%    0.08K    204       51       816K btrfs_extent_state     
  8954   8703  97%    0.18K    407       22      1628K vm_area_struct         
  5888   4946  84%    0.03K     46      128       184K kmalloc-32             
  5632   5632 100%    0.01K     11      512        44K kmalloc-8              
  5049   4905  97%    0.08K     99       51       396K anon_vma               
  4352   4352 100%    0.02K     17      256        68K kmalloc-16             
  3723   3723 100%    0.05K     51       73       204K Acpi-Parse             
  3230   3230 100%    0.05K     38       85       152K ftrace_event_field     
  3213   2949  91%    0.19K    153       21       612K kmalloc-192            
  3120   3090  99%    0.61K    120       26      1920K proc_inode_cache       
  2814   2814 100%    0.09K     67       42       268K kmalloc-96             
  1984   1510  76%    1.00K     62       32      1984K kmalloc-1024           
  1904   1904 100%    0.07K     34       56       136K Acpi-Operand           
  1472   1472 100%    0.09K     32       46       128K trace_event_file       
  1224   1224 100%    0.04K     12      102        48K Acpi-Namespace         
  1152   1152 100%    0.64K     48       24       768K shmem_inode_cache      
   592    581  98%    2.00K     37       16      1184K kmalloc-2048           
   528    457  86%    0.36K     24       22       192K blkdev_requests        
   462    355  76%    0.38K     22       21       176K mnt_cache              
   450    433  96%    1.06K     15       30       480K signal_cache           
   429    429 100%    0.20K     11       39        88K btrfs_delayed_ref_head 
   420    420 100%    2.05K     28       15       896K idr_layer_cache        
   408    408 100%    0.04K      4      102        16K btrfs_delayed_extent_op
   400    400 100%    0.62K     16       25       256K sock_inode_cache       
   364    364 100%    0.30K     14       26       112K btrfs_delayed_node     
   351    351 100%    0.10K      9       39        36K buffer_head            
   345    312  90%    2.06K     23       15       736K sighand_cache          
   318    298  93%    5.25K     53        6      1696K task_struct            
   256    256 100%    0.06K      4       64        16K kmem_cache_node        
   256    256 100%    0.02K      1      256         4K jbd2_revoke_table_s


* Re: Runaway SLAB usage by 'bio' during 'device replace'
  2016-05-30 18:48 Runaway SLAB usage by 'bio' during 'device replace' Chris Johnson
@ 2016-05-30 20:55 ` Duncan
  2016-05-31 18:32   ` g6094199
  2016-05-31  7:42 ` Filipe Manana
  1 sibling, 1 reply; 5+ messages in thread
From: Duncan @ 2016-05-30 20:55 UTC (permalink / raw)
  To: linux-btrfs

Chris Johnson posted on Mon, 30 May 2016 11:48:02 -0700 as excerpted:

> I have a RAID6 array that had a failed HDD. The drive failed completely
> and has been removed from the system. I'm running a 'device replace'
> operation with a new disk. The array is ~20TB so this will take a few
> days.

This isn't a direct answer to your issue as I'm a user and list regular, 
not a dev, and that's beyond me, but it's something you need to know, if 
you don't already...

Btrfs raid56 mode remains, for the time being, generally negatively 
recommended, except specifically for testing with throw-away data, due to 
two critical but not immediately data-destroying bugs: one related to 
serial device replacement, the other to balance restriping.  They may or 
may not be related to each other, as neither one has been fully traced.

The serial replace bug has to do with replacing multiple devices, one at 
a time.  The first replace appears to work fine by all visible measures, 
but apparently doesn't return the array to full working condition after 
all, because an attempt to replace a second device fails, and can bring 
down the filesystem.  Unfortunately it doesn't always happen, and due to 
the size of devices these days, working arrays tend to be multi-TB 
monsters that take time to get to this point, so all we have at this 
point is multiple reports of the same issue, but no real way to reproduce 
it.  I believe but am not sure that the problem can occur regardless of 
whether btrfs replace or device add/delete was used.

The restriping bug has to do with restriping to a different width, either 
manually doing a filtered balance after adding devices, or automatically, 
as triggered by btrfs device delete.  Again, multiple reports but not 
nailed down to anything specifically reproducible yet.  The problem here 
is that the restripes, while apparently producing correct results, can 
inexplicably take an order of magnitude (or worse) longer than they 
should.  What one might expect to take hours takes over a week, and on 
the big arrays that might be expected to take 2-3 days, months.

The problem, again, isn't correctness, but the fact that over such long 
periods, the risk of device loss is increased, and if the array was 
already being reshaped/rebalanced to repair loss of one device, loss of 
another device may kill it.

Neither of these bugs affects normal runtime operation, but both are 
critical with regard to what people normally use parity-raid for in the 
first place, namely that the array /can/ take the loss of a device (or 
two with raid6) and be repaired back to normal operation.  Because of 
that, raid56 remains negatively recommended for anything but testing with 
throw-away data, until these bugs can be fully traced and fixed.


Your particular issue doesn't appear to be directly related to either of 
the above.  In fact, I know I've seen patches recently having to do with 
memory leaks that may well fix your problem (tho you'd have to be running 
4.6 at least to have them at this point, and perhaps even 4.7-rc1).

But given the situation, either be sure you have backups and are prepared 
to use them if the array goes south on you due to failed or impractical 
device replacement, or switch to something other than btrfs raid56 mode.  
The btrfs redundancy-raid modes (raid1 and raid10) are more mature and tested, and 
thus may be options if they fit your filesystem space and device layout 
needs.  Alternatively, btrfs (or other filesystems) on top of dm/md-raid 
may be an option, tho you obviously lose some features of btrfs that 
way.  And of course zfs is the closest btrfs-comparable that's reasonably 
mature and may be an option, tho there are licensing and hardware issues 
(it likes lots of memory on linux due to double-caching of some elements 
as its caching scheme doesn't work well with that of linux, and ecc 
memory is very strongly recommended) if using it on linux.

I'd suggest giving btrfs raid56 another few kernel releases, six months 
to a year, and then checking back.  I'd hope the bugs can be properly 
traced and fixed within a couple of kernel cycles, so four months or so, 
but I prefer a few further cycles with no known critical bugs before I 
recommend it (I was getting close to recommending it after the last known 
critical bug was fixed in 4.1, when these came up).  That puts the 
projected timeframe at 8-12 months before I could really consider raid56 
mode as reasonably stable as btrfs in general, which is to say 
stabilizing, but not yet fully stable.  Even then, the standard admin 
backup rule still applies, and more strongly than it would to a fully 
mature filesystem: if you don't have backups, you're treating the data as 
worth less than the time/resources/hassle of making them.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: Runaway SLAB usage by 'bio' during 'device replace'
  2016-05-30 18:48 Runaway SLAB usage by 'bio' during 'device replace' Chris Johnson
  2016-05-30 20:55 ` Duncan
@ 2016-05-31  7:42 ` Filipe Manana
  2016-05-31 13:53   ` Scott Talbert
  1 sibling, 1 reply; 5+ messages in thread
From: Filipe Manana @ 2016-05-31  7:42 UTC (permalink / raw)
  To: Chris Johnson; +Cc: linux-btrfs

On Mon, May 30, 2016 at 7:48 PM, Chris Johnson <hittingsmoke@gmail.com> wrote:
> I have a RAID6 array that had a failed HDD. The drive failed
> completely and has been removed from the system. I'm running a 'device
> replace' operation with a new disk. The array is ~20TB so this will
> take a few days.
>
> Yesterday the system crashed hard with OOM errors about 24 hours into
> the replace. Rebooting after the crash and remounting the array
> automatically resumed the replace where it left off.
>
> Today I kept a close eye on it and have watched the memory usage creep
> up slowly.
>
> htop says this is user process memory (green bar) but shows no user
> processes using this much memory
>
> free says this is almost entirely cached/buffered memory that is
> taking up the space.
>
> slabtop reveals that there is a highly unusual amount of SLAB going to
> 'bio' which has to do with block allocation apparently. slabtop output
> is attached.
>
> 'sync && echo 3 > /proc/sys/vm/drop_caches' clears the high usage
> (~4GB) from dentry but 'bio' does not release any (11GB) memory and
> continues to grow slowly.

You are probably experiencing a memory leak that was recently fixed; at
the moment the fix is available only in the 4.7-rc1 kernel:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=4673272f43ae790ab9ec04e38a7542f82bb8f020
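
For reference, given a clone of the kernel git tree, something like this
shows which release first picked the fix up (just a sketch):

  git describe --contains 4673272f43ae   # prints the earliest tag containing the commit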

>
> This is running the Rockstor distro based on CentOS. The system has 16GB of RAM.
>
> Kernel: 4.4.5-1.el7.elrepo.x86_64
> btrfs-progs: 4.4.1
>
> Kernel messages aren't showing anything of note during the replace
> until it starts throwing out OOM errors.
>
> I would like to collect enough information for a useful bug report
> here, but I also can't babysit this rebuild during the work week and
> reboot it once a day for OOM crashes. Should I cancel the replace
> operation and use 'dev delete missing' instead? Will using 'delete
> missing' cause any problem if it's done after a partially completed
> and canceled replace?



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."


* Re: Runaway SLAB usage by 'bio' during 'device replace'
  2016-05-31  7:42 ` Filipe Manana
@ 2016-05-31 13:53   ` Scott Talbert
  0 siblings, 0 replies; 5+ messages in thread
From: Scott Talbert @ 2016-05-31 13:53 UTC (permalink / raw)
  Cc: Chris Johnson, linux-btrfs



On Tue, 31 May 2016, Filipe Manana wrote:

> On Mon, May 30, 2016 at 7:48 PM, Chris Johnson <hittingsmoke@gmail.com> wrote:
>> I have a RAID6 array that had a failed HDD. The drive failed
>> completely and has been removed from the system. I'm running a 'device
>> replace' operation with a new disk. The array is ~20TB so this will
>> take a few days.
>>
>> Yesterday the system crashed hard with OOM errors about 24 hours into
>> the replace. Rebooting after the crash and remounting the array
>> automatically resumed the replace where it left off.
>>
>> Today I kept a close eye on it and have watched the memory usage creep
>> up slowly.
>>
>> htop says this is user process memory (green bar) but shows no user
>> processes using this much memory
>>
>> free says this is almost entirely cached/buffered memory that is
>> taking up the space.
>>
>> slabtop reveals that there is a highly unusual amount of SLAB going to
>> 'bio' which has to do with block allocation apparently. slabtop output
>> is attached.
>>
>> 'sync && echo 3 > /proc/sys/vm/drop_caches' clears the high usage
>> (~4GB) from dentry but 'bio' does not release any (11GB) memory and
>> continues to grow slowly.
>
> Probably you are experiencing a leak that was recently fixed and, at
> the moment, available only in the 4.7-rc1 kernel:
>
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=4673272f43ae790ab9ec04e38a7542f82bb8f020

Yes, you would almost certainly be hitting that memory leak.

>> This is running the Rockstor distro based on CentOS. The system has 16GB of RAM.
>>
>> Kernel: 4.4.5-1.el7.elrepo.x86_64
>> btrfs-progs: 4.4.1
>>
>> Kernel messages aren't showing anything of note during the replace
>> until it starts throwing out OOM errors.
>>
>> I would like to collect enough information for a useful bug report
>> here, but I also can't babysit this rebuild during the work week and
>> reboot it once a day for OOM crashes. Should I cancel the replace
>> operation and use 'dev delete missing' instead? Will using 'delete
>> missing' cause any problem if it's done after a partially completed
>> and canceled replace?

If you can't get a kernel with the memory leak patched, 'dev delete missing' 
doesn't suffer from the leak, so you could probably use that instead. 
Also, in our testing we've seen 'dev delete missing' be more reliable 
than replace.

As to whether cancelling the replace and then doing a delete missing will 
cause any problems, that I'm not sure about.
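
For concreteness, the sequence being discussed would look roughly like this 
(only a sketch; '/mnt/pool' and '/dev/sdX' are placeholders, and as said 
above, whether it is safe right after a cancelled replace is the open 
question):

  btrfs replace status /mnt/pool         # check the running replace first
  btrfs replace cancel /mnt/pool         # stop it
  btrfs device add /dev/sdX /mnt/pool    # add the new disk as a normal device
  btrfs device delete missing /mnt/pool  # drop the missing disk; its data is
                                         # rebuilt across the remaining devices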

Scott


* Re: Runaway SLAB usage by 'bio' during 'device replace'
  2016-05-30 20:55 ` Duncan
@ 2016-05-31 18:32   ` g6094199
  0 siblings, 0 replies; 5+ messages in thread
From: g6094199 @ 2016-05-31 18:32 UTC (permalink / raw)
  To: hittingsmoke, linux-btrfs

Hi Chris,


Since you are using a recent LTS kernel on your CentOS/Rockstor, I guess
the kernel errors might help track down some bugs here.

Can you give the devs the errors from your logs? Additionally, basic info
on your RAID settings would be nice too; any more specific details the
devs should ask for on demand.
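
Something along these lines should cover the basics (just a sketch;
'/mnt/pool' stands in for wherever the array is mounted):

  uname -a                            # kernel version
  btrfs --version                     # btrfs-progs version
  btrfs fi show                       # devices making up the array
  btrfs fi df /mnt/pool               # data/metadata profiles (raid6 etc.)
  dmesg | grep -iE 'btrfs|bio|oom'    # relevant kernel messages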


But generally speaking, raid5/6 works quite OK in everyday use for less
important data; there are, however, major bugs when it comes to failing
disks or, in general, when you try to replace hard drives.
I have a similar problem right now. I added a new drive to an array and,
while deleting an older drive, the new drive failed :-( So I ended up
rescuing all data (8TB) to a new array with "btrfs restore". This took
over a week, because there is currently no switch to automatically cancel
files that loop during recovery, so you have to manually issue the cancel
command for every file it starts to loop on, which might be a lot.

In general, adding a new drive and afterwards removing the old one is
safer than the replace method, at least right now (as of kernel
4.5/4.6). But major bug fixes are in the works, and there is hope that
raid5/6 will become more reliable next year.


So good luck!


On 30.05.2016 at 22:55, Duncan wrote:
> Chris Johnson posted on Mon, 30 May 2016 11:48:02 -0700 as excerpted:
>
>> I have a RAID6 array that had a failed HDD. The drive failed completely
>> and has been removed from the system. I'm running a 'device replace'
>> operation with a new disk. The array is ~20TB so this will take a few
>> days.
> This isn't a direct answer to your issue as I'm a user and list regular, 
> not a dev, and that's beyond me, but it's something you need to know, if 
> you don't already...
> [...]



