* Blocked for more than 120 seconds
@ 2013-12-14 20:30 Hans-Kristian Bakke
  2013-12-14 21:35 ` Chris Murphy
  0 siblings, 1 reply; 23+ messages in thread
From: Hans-Kristian Bakke @ 2013-12-14 20:30 UTC (permalink / raw)
  To: linux-btrfs

Hi

During high disk loads, like backups combined with lots of writers,
high-speed local rsync, or btrfs defrag, I always get these
messages, and everything grinds to a halt on the btrfs filesystem:

[ 3123.062229] INFO: task rtorrent:8431 blocked for more than 120 seconds.
[ 3123.062251]       Not tainted 3.12.4 #1
[ 3123.062263] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 3123.062284] rtorrent        D ffff88043fc12e80     0  8431   8429 0x00000000
[ 3123.062287]  ffff8804289d07b0 0000000000000082 ffffffff81610440
ffffffff810408ff
[ 3123.062290]  0000000000012e80 ffff88035f433fd8 ffff88035f433fd8
ffff8804289d07b0
[ 3123.062293]  0000000000000246 ffff88034bda8068 ffff8800ba5a39e8
ffff88035f433740
[ 3123.062295] Call Trace:
[ 3123.062302]  [<ffffffff810408ff>] ? detach_if_pending+0x18/0x6c
[ 3123.062331]  [<ffffffffa0193545>] ?
wait_current_trans.isra.30+0xbc/0x117 [btrfs]
[ 3123.062334]  [<ffffffff810515a1>] ? wake_up_atomic_t+0x22/0x22
[ 3123.062346]  [<ffffffffa0194ef4>] ? start_transaction+0x1d1/0x46b [btrfs]
[ 3123.062359]  [<ffffffffa0199537>] ? btrfs_dirty_inode+0x25/0xa6 [btrfs]
[ 3123.062362]  [<ffffffff8111afe2>] ? file_update_time+0x95/0xb5
[ 3123.062374]  [<ffffffffa01a08a0>] ? btrfs_page_mkwrite+0x68/0x2bc [btrfs]
[ 3123.062377]  [<ffffffff810c3e06>] ? filemap_fault+0x1fa/0x36e
[ 3123.062379]  [<ffffffff810dec6f>] ? __do_fault+0x15b/0x360
[ 3123.062382]  [<ffffffff810e0ffe>] ? handle_mm_fault+0x22c/0x8aa
[ 3123.062385]  [<ffffffff812c6445>] ? dev_hard_start_xmit+0x271/0x3ec
[ 3123.062388]  [<ffffffff81380c2a>] ? __do_page_fault+0x38d/0x3d7
[ 3123.062393]  [<ffffffffa04eeb2e>] ? br_dev_queue_push_xmit+0x9d/0xa1 [bridge]
[ 3123.062397]  [<ffffffffa04ed4b3>] ? br_dev_xmit+0x1c3/0x1e0 [bridge]
[ 3123.062400]  [<ffffffff81060eaa>] ? update_group_power+0xb7/0x1b9
[ 3123.062403]  [<ffffffff811c3456>] ? cpumask_next_and+0x2a/0x3a
[ 3123.062405]  [<ffffffff8106114f>] ? update_sd_lb_stats+0x1a3/0x35a
[ 3123.062407]  [<ffffffff8137e172>] ? page_fault+0x22/0x30
[ 3123.062410]  [<ffffffff811ccc80>] ? copy_user_generic_string+0x30/0x40
[ 3123.062413]  [<ffffffff811d101b>] ? memcpy_toiovec+0x2f/0x5c
[ 3123.062417]  [<ffffffff812bcc5a>] ? skb_copy_datagram_iovec+0x76/0x20d
[ 3123.062419]  [<ffffffff8137dc08>] ? _raw_spin_lock_bh+0xe/0x1c
[ 3123.062422]  [<ffffffff81059ad3>] ? should_resched+0x5/0x23
[ 3123.062426]  [<ffffffff812fa113>] ? tcp_recvmsg+0x72e/0xaa3
[ 3123.062428]  [<ffffffff810615dc>] ? load_balance+0x12c/0x6b5
[ 3123.062431]  [<ffffffff813170b1>] ? inet_recvmsg+0x5a/0x6e
[ 3123.062434]  [<ffffffff810015dc>] ? __switch_to+0x1b1/0x3c4
[ 3123.062437]  [<ffffffff812b32d9>] ? sock_recvmsg+0x54/0x71
[ 3123.062440]  [<ffffffff81139d43>] ? ep_item_poll+0x16/0x1b
[ 3123.062442]  [<ffffffff81139e6f>] ? ep_pm_stay_awake+0xf/0xf
[ 3123.062445]  [<ffffffff8111c81a>] ? fget_light+0x6b/0x7c
[ 3123.062447]  [<ffffffff812b33c0>] ? SYSC_recvfrom+0xca/0x12e
[ 3123.062449]  [<ffffffff8105c309>] ? try_to_wake_up+0x190/0x190
[ 3123.062452]  [<ffffffff81109189>] ? fput+0xf/0x9d
[ 3123.062454]  [<ffffffff8113b4b8>] ? SyS_epoll_wait+0x9c/0xc7
[ 3123.062457]  [<ffffffff81382d62>] ? system_call_fastpath+0x16/0x1b
[ 3123.062462] INFO: task kworker/u16:0:21158 blocked for more than 120 seconds.
[ 3123.062491]       Not tainted 3.12.4 #1
[ 3123.062513] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 3123.062557] kworker/u16:0   D ffff88043fcd2e80     0 21158      2 0x00000000
[ 3123.062561] Workqueue: writeback bdi_writeback_workfn (flush-btrfs-1)
[ 3123.062562]  ffff88026a163830 0000000000000046 ffff88042f0c67b0
0000001000011210
[ 3123.062565]  0000000000012e80 ffff88027067bfd8 ffff88027067bfd8
ffff88026a163830
[ 3123.062567]  0000000000000246 ffff88034bda8068 ffff8800ba5a39e8
ffff88027067b750
[ 3123.062569] Call Trace:
[ 3123.062581]  [<ffffffffa0193545>] ?
wait_current_trans.isra.30+0xbc/0x117 [btrfs]
[ 3123.062584]  [<ffffffff810515a1>] ? wake_up_atomic_t+0x22/0x22
[ 3123.062596]  [<ffffffffa0194ef4>] ? start_transaction+0x1d1/0x46b [btrfs]
[ 3123.062608]  [<ffffffffa019a114>] ? run_delalloc_nocow+0x9c/0x752 [btrfs]
[ 3123.062621]  [<ffffffffa019a82e>] ? run_delalloc_range+0x64/0x333 [btrfs]
[ 3123.062635]  [<ffffffffa01a936c>] ? free_extent_state+0x12/0x21 [btrfs]
[ 3123.062648]  [<ffffffffa01ac32f>] ? __extent_writepage+0x1e5/0x62a [btrfs]
[ 3123.062659]  [<ffffffffa018f5c8>] ? btree_submit_bio_hook+0x7e/0xd7 [btrfs]
[ 3123.062662]  [<ffffffff810c20d1>] ? find_get_pages_tag+0x66/0x121
[ 3123.062675]  [<ffffffffa01ac8be>] ?
extent_write_cache_pages.isra.23.constprop.47+0x14a/0x255 [btrfs]
[ 3123.062688]  [<ffffffffa01acc5c>] ? extent_writepages+0x49/0x60 [btrfs]
[ 3123.062700]  [<ffffffffa0198017>] ? btrfs_submit_direct+0x412/0x412 [btrfs]
[ 3123.062703]  [<ffffffff8112660b>] ? __writeback_single_inode+0x6d/0x1e8
[ 3123.062705]  [<ffffffff8112746a>] ? writeback_sb_inodes+0x1f0/0x322
[ 3123.062707]  [<ffffffff81127605>] ? __writeback_inodes_wb+0x69/0xab
[ 3123.062709]  [<ffffffff8112777d>] ? wb_writeback+0x136/0x292
[ 3123.062712]  [<ffffffff810fbffb>] ? __cache_free.isra.46+0x178/0x187
[ 3123.062714]  [<ffffffff81127a6d>] ? bdi_writeback_workfn+0xc1/0x2fe
[ 3123.062716]  [<ffffffff8105a469>] ? resched_task+0x35/0x5d
[ 3123.062718]  [<ffffffff8105a83d>] ? ttwu_do_wakeup+0xf/0xc1
[ 3123.062721]  [<ffffffff8105c2f7>] ? try_to_wake_up+0x17e/0x190
[ 3123.062723]  [<ffffffff8104bca7>] ? process_one_work+0x191/0x294
[ 3123.062725]  [<ffffffff8104c159>] ? worker_thread+0x121/0x1e7
[ 3123.062726]  [<ffffffff8104c038>] ? rescuer_thread+0x269/0x269
[ 3123.062729]  [<ffffffff81050c01>] ? kthread+0x81/0x89
[ 3123.062731]  [<ffffffff81050b80>] ? __kthread_parkme+0x5d/0x5d
[ 3123.062733]  [<ffffffff81382cbc>] ? ret_from_fork+0x7c/0xb0
[ 3123.062736]  [<ffffffff81050b80>] ? __kthread_parkme+0x5d/0x5d

These traces are repeated for several other processes trying to do work.

I have had no data loss, only availability issues during high load.
The surest way to trigger these messages is to start a copy
from my other local array while doing something like heavy torrenting
at the same time.

smartd has not reported any disk issues, and iostat -d only indicates
a lot of disk activity at the limits of the drives, with no drive
seemingly behaving any differently from the others (until the errors
hit, at which point activity drops to zero).

Mount options are the kernel 3.12.4 defaults plus compress=lzo. I have
16 GB of ECC RAM and a quad-core Xeon CPU.

I am running this on an 8-disk WD SE 4TB btrfs RAID10 system with a
single snapshot.

I have no expectations of btrfs delivering stellar performance under
heavy IOPS load on ordinary 7200rpm drives, but I do expect it to just
be slow until the load is removed, not to more or less completely stall
the entire server.

The filesystem has used about 26TB of the available 29TB (real
available data), and some of the files on it are heavily fragmented
(around 100,000 extents in files of about 25GB).

Regards,

Hans-Kristian Bakke


* Re: Blocked for more than 120 seconds
  2013-12-14 20:30 Blocked for more than 120 seconds Hans-Kristian Bakke
@ 2013-12-14 21:35 ` Chris Murphy
  2013-12-14 23:19   ` Hans-Kristian Bakke
  0 siblings, 1 reply; 23+ messages in thread
From: Chris Murphy @ 2013-12-14 21:35 UTC (permalink / raw)
  To: Hans-Kristian Bakke; +Cc: linux-btrfs


On Dec 14, 2013, at 1:30 PM, Hans-Kristian Bakke <hkbakke@gmail.com> wrote:
> 
> During high disk loads, like backups combined with lots of writers,
> high-speed local rsync, or btrfs defrag, I always get these
> messages, and everything grinds to a halt on the btrfs filesystem:
> 
> [ 3123.062229] INFO: task rtorrent:8431 blocked for more than 120 seconds
> [ 3123.062251]       Not tainted 3.12.4 #1

When these blocks happen, if this is an unknown problem, devs will often want to see dmesg after you've issued dmesg -n7 and sysrq+w. More on sysrq triggering is here:
https://www.kernel.org/doc/Documentation/sysrq.txt
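
For reference, a minimal sketch of that sequence (run as root; the
output filename is just an example):

# dmesg -n 7                     # raise the console log level to debug
# echo w > /proc/sysrq-trigger   # dump stacks of all blocked (D state) tasks
# dmesg > dmesg-sysrq-w.txt      # capture the resulting traces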

> The filesystem has used about 26TB of the available 29TB (real
> available data), and some of the files on it are heavily fragmented
> (around 100,000 extents in files of about 25GB).

Please include results from btrfs fi show, and btrfs fi df <mp>.


Chris Murphy


* Re: Blocked for more than 120 seconds
  2013-12-14 21:35 ` Chris Murphy
@ 2013-12-14 23:19   ` Hans-Kristian Bakke
  2013-12-14 23:50     ` Chris Murphy
  0 siblings, 1 reply; 23+ messages in thread
From: Hans-Kristian Bakke @ 2013-12-14 23:19 UTC (permalink / raw)
  To: linux-btrfs

I am looking into triggering the error again to capture the dmesg and
sysrq output, but here are the other two:

# btrfs fi show
Label: none  uuid: 9302fc8f-15c6-46e9-9217-951d7423927c
        Total devices 8 FS bytes used 13.00TB
        devid    4 size 3.64TB used 3.48TB path /dev/sdt
        devid    3 size 3.64TB used 3.48TB path /dev/sds
        devid    8 size 3.64TB used 3.48TB path /dev/sdr
        devid    6 size 3.64TB used 3.48TB path /dev/sdp
        devid    7 size 3.64TB used 3.48TB path /dev/sdq
        devid    5 size 3.64TB used 3.48TB path /dev/sdo
        devid    1 size 3.64TB used 3.48TB path /dev/sdl
        devid    2 size 3.64TB used 3.48TB path /dev/sdm

Btrfs v0.20-rc1


# btrfs fi df /storage/storage-vol0/
Data, RAID10: total=13.89TB, used=12.99TB
System, RAID10: total=64.00MB, used=1.19MB
System: total=4.00MB, used=0.00
Metadata, RAID10: total=21.00GB, used=17.59GB

Regards,

Hans-Kristian Bakke



* Re: Blocked for more than 120 seconds
  2013-12-14 23:19   ` Hans-Kristian Bakke
@ 2013-12-14 23:50     ` Chris Murphy
  2013-12-15  0:28       ` Hans-Kristian Bakke
  2013-12-15 23:39       ` Charles Cazabon
  0 siblings, 2 replies; 23+ messages in thread
From: Chris Murphy @ 2013-12-14 23:50 UTC (permalink / raw)
  To: Btrfs BTRFS


On Dec 14, 2013, at 4:19 PM, Hans-Kristian Bakke <hkbakke@gmail.com> wrote:

> I am looking into triggering the error again to capture the dmesg and
> sysrq output, but here are the other two:
> 
> # btrfs fi show
> Label: none  uuid: 9302fc8f-15c6-46e9-9217-951d7423927c
>        Total devices 8 FS bytes used 13.00TB
>        devid    4 size 3.64TB used 3.48TB path /dev/sdt
>        devid    3 size 3.64TB used 3.48TB path /dev/sds
>        devid    8 size 3.64TB used 3.48TB path /dev/sdr
>        devid    6 size 3.64TB used 3.48TB path /dev/sdp
>        devid    7 size 3.64TB used 3.48TB path /dev/sdq
>        devid    5 size 3.64TB used 3.48TB path /dev/sdo
>        devid    1 size 3.64TB used 3.48TB path /dev/sdl
>        devid    2 size 3.64TB used 3.48TB path /dev/sdm
> 
> Btrfs v0.20-rc1
> 
> 
> # btrfs fi df /storage/storage-vol0/
> Data, RAID10: total=13.89TB, used=12.99TB
> System, RAID10: total=64.00MB, used=1.19MB
> System: total=4.00MB, used=0.00
> Metadata, RAID10: total=21.00GB, used=17.59GB

By my count this is ~ 95.6% full (3.48TB of each 3.64TB device is allocated). My past experience with other file systems, including btree file systems, is that they get unpredictably fussy when they're this full. I start migration planning once 80% full is reached, and make it a policy to avoid going over 90% full.

I don't know what behavior Btrfs developers anticipate for this scenario. On the one hand it seems reasonable to expect it to only be slow, rather than block the whole server for 2 minutes. But on the other hand, it's reasonable to expect server storage won't get this full.


Chris Murphy


* Re: Blocked for more than 120 seconds
  2013-12-14 23:50     ` Chris Murphy
@ 2013-12-15  0:28       ` Hans-Kristian Bakke
  2013-12-15  1:59         ` Chris Murphy
  2013-12-15  3:47         ` George Mitchell
  2013-12-15 23:39       ` Charles Cazabon
  1 sibling, 2 replies; 23+ messages in thread
From: Hans-Kristian Bakke @ 2013-12-15  0:28 UTC (permalink / raw)
  To: Btrfs BTRFS

When I look at the entire FS with df-like tools it is reported as
89.4% used (26638.65 of 29808.2 GB). But this is shared amongst both
data and metadata I guess?

I do know that ~90%+ seems full, but it is still around 3TB in my
case! Are the "percentage rules" of old times still valid with modern
disk sizes? It seems extremely inconvenient that a filesystem like
btrfs starts to misbehave at "only" 3TB of available space for
RAID10 mirroring and metadata, which is probably a little over 1TB
of actual file storage counting everything in.

I would normally expect that there is no difference between 1TB of free
space on an FS that is 2TB in total and 1TB of free space on a
filesystem that is 30TB in total, other than my sense of urgency and
that you would probably expect data growth to be more rapid on the 30TB
FS, as there is obviously a need to store a lot of stuff.
Is "free space needed" really a different concept depending on the
size of your FS?

Regards,

Hans-Kristian Bakke



* Re: Blocked for more than 120 seconds
  2013-12-15  0:28       ` Hans-Kristian Bakke
@ 2013-12-15  1:59         ` Chris Murphy
  2013-12-15  2:35           ` Hans-Kristian Bakke
  2013-12-15  3:47         ` George Mitchell
  1 sibling, 1 reply; 23+ messages in thread
From: Chris Murphy @ 2013-12-15  1:59 UTC (permalink / raw)
  To: Hans-Kristian Bakke; +Cc: Btrfs BTRFS


On Dec 14, 2013, at 5:28 PM, Hans-Kristian Bakke <hkbakke@gmail.com> wrote:

> When I look at the entire FS with df-like tools it is reported as
> 89.4% used (26638.65 of 29808.2 GB). But this is shared amongst both
> data and metadata I guess?

Yes.

> 
> I do know that ~90%+ seems full, but it is still around 3TB in my
> case! Are the "percentage rules" of old times still valid with modern
> disk sizes?

Probably not. But you also reported rather significant fragmentation. And it's also still an experimental file system, even when it's not ~90% full. I think it's fair to say that this level of fullness is a less-tested use case.



> It seems extremely inconvenient that a filesystem like
> btrfs starts to misbehave at "only" 3TB of available space for
> RAID10 mirroring and metadata, which is probably a little over 1TB
> of actual file storage counting everything in.

I'm not suggesting the behavior is either desired or expected, but certainly blocking is better than an oops or a broken file system, and in the not-too-distant past such things have happened on full volumes. Given the level of fragmentation, this behavior might be expected at the current state of development, for all I know.

But if you care about this data, I'd take the blocking as a warning to back off on this usage pattern, unless of course you're intentionally trying to see at what point it breaks and why.

> 
> I would normally expect that there is no difference between 1TB of free
> space on an FS that is 2TB in total and 1TB of free space on a
> filesystem that is 30TB in total, other than my sense of urgency and
> that you would probably expect data growth to be more rapid on the 30TB
> FS, as there is obviously a need to store a lot of stuff.

Seems reasonable.


> Is "free space needed" really a different concept dependning on the
> size of your FS?

Maybe it depends more on the size and fragmentation of the files being accessed, and on the remaining free space.

Can you do an lsattr on these 25GB files that you say have ~ 100,000 extents? And what are these files?



Chris Murphy


* Re: Blocked for more than 120 seconds
  2013-12-15  1:59         ` Chris Murphy
@ 2013-12-15  2:35           ` Hans-Kristian Bakke
  2013-12-15 13:24             ` Duncan
  2013-12-16 15:18             ` Chris Mason
  0 siblings, 2 replies; 23+ messages in thread
From: Hans-Kristian Bakke @ 2013-12-15  2:35 UTC (permalink / raw)
  To: Btrfs BTRFS

I have done some more testing. I turned off everything using the disk
and only ran defrag. I have created a script that gives me a list of
the files with the most extents (a sketch follows below). I started
from the top to improve the fragmentation of the worst files. The most
fragmented file was a file of about 32GB with over 250,000 extents!
It seems that I can defrag two to three largish (15-30GB) files of
~100,000 extents just fine, but after a while the system locks up (not
a complete hard lock, but everything hangs and a restart is necessary
to get a fully working system again).
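
For illustration, a minimal sketch of such a script (hypothetical, not
the exact one used here; it assumes filefrag from e2fsprogs, and
filenames containing ':' would confuse the sort):

#!/bin/sh
# List the 20 most fragmented large files under a path.
# filefrag prints "name: N extents found"; sort numerically on N.
find "${1:-.}" -type f -size +100M -print0 \
    | xargs -0 filefrag 2>/dev/null \
    | sort -t: -k2 -rn \
    | head -n 20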

It seems like defrag operations are triggering the issue, probably in
combination with the large and heavily fragmented files.

I have slowly managed to defragment the most fragmented files,
rebooting 4 times, so one of the worst files now is this one:

# filefrag vide01.mkv
vide01.mkv: 77810 extents found
# lsattr vide01.mkv
---------------- vide01.mkv

All the large fragmented files are ordinary mkv files (video). The
reason for the heavy fragmentation was that perhaps 50 to 100 files
were written at the same time over a period of several days, with lots
of other activity going on as well. That was no problem for the system,
as it was network-limited most of the time.
Although defrag alone can trigger blocking, so can a straight rsync
from another internal array (capable of 1000 MB/s continuous reads)
combined with some random activity. It seems that the cause is just
heavy IO. Is it possible that even though I seemingly have lots of
space free in measured MBytes, it is all so fragmented that btrfs
can't allocate space efficiently enough? Or would that give other
errors?

I actually downgraded from kernel 3.13-rc2 because I could not do
anything else while copying between the internal arrays without btrfs
hanging, although that was seemingly just temporary and not as bad as
the defrag blocking.

I will try to free up some space before running more defrag too, just
to check if that is the issue.

Regards,

Hans-Kristian Bakke



* Re: Blocked for more than 120 seconds
  2013-12-15  0:28       ` Hans-Kristian Bakke
  2013-12-15  1:59         ` Chris Murphy
@ 2013-12-15  3:47         ` George Mitchell
  1 sibling, 0 replies; 23+ messages in thread
From: George Mitchell @ 2013-12-15  3:47 UTC (permalink / raw)
  To: Hans-Kristian Bakke, Btrfs BTRFS

On 12/14/2013 04:28 PM, Hans-Kristian Bakke wrote:
>
> I would normally expect that there is no difference in 1TB free space
> on a FS that is 2TB in total, and 1TB free space on a filesystem that
> is 30TB in total, other than my sense of urge and that you would
> probably expect data growth to be more rapid on the 30TB FS as there
> is obviously a need to store a lot of stuff.
> Is "free space needed" really a different concept dependning on the
> size of your FS?
I would suggest there just might be a very significant difference. In 
the case of a 30TB array as opposed to a 3TB array, you are dealing with 
a much higher ratio of used space to free space.  I believe this creates 
a higher likelihood that the free space is occurring as a larger number 
of very small pieces of drive space as opposed to a 3TB drive where 
1/3rd of the drive space free would imply actual USABLE space on the 
drives.  My concern would be that with only 1/30th of the space on the 
drives left free, that remaining space likely involves a lot of very 
small segments that create a situation where the filesystem is 
struggling to compute how to lay out new files.  And, on top of that, 
defragmentation could become a nightmare of complexity as well, since 
the filesystem first has to clear contiguous space to somewhere in order 
to defragment each file.  And then throw in the striping and mirroring 
requirements.  I know those algorithms are likely pretty sophisticated, 
but something tells me that the higher the RATIO of used space to free 
space, the more difficult things might get for the filesystem.  Just 
about everybody here knows a whole lot more about this than I do, but 
something really concerns me about this ratio issue.  Ideally of course 
it probably should work, but its just got to be significantly more 
complex than a 3TB situation. These are just my thoughts as a 
comparative novice when it comes to btrfs or filesystems in general.


* Re: Blocked for more than 120 seconds
  2013-12-15  2:35           ` Hans-Kristian Bakke
@ 2013-12-15 13:24             ` Duncan
  2013-12-15 14:51               ` Hans-Kristian Bakke
  2013-12-16 15:18             ` Chris Mason
  1 sibling, 1 reply; 23+ messages in thread
From: Duncan @ 2013-12-15 13:24 UTC (permalink / raw)
  To: linux-btrfs

Hans-Kristian Bakke posted on Sun, 15 Dec 2013 03:35:53 +0100 as
excerpted:

> I have done some more testing. I turned off everything using the disk
> and only ran defrag. I have created a script that gives me a list of the
> files with the most extents. I started from the top to improve the
> fragmentation of the worst files. The most fragmented file was a file of
> about 32GB with over 250,000 extents!
> It seems that I can defrag two to three largish (15-30GB) files of
> ~100,000 extents just fine, but after a while the system locks up (not a
> complete hard lock, but everything hangs and a restart is necessary to
> get a fully working system again).
> 
> It seems like defrag operations are triggering the issue, probably in
> combination with the large and heavily fragmented files.
> 
> I have slowly managed to defragment the most fragmented files,
> rebooting 4 times, so one of the worst files now is this one:
> 
> # filefrag vide01.mkv
> vide01.mkv: 77810 extents found
> # lsattr vide01.mkv
> ---------------- vide01.mkv
> 
> All the large fragmented files are ordinary mkv files (video). The
> reason for the heavy fragmentation was that perhaps 50 to 100 files were
> written at the same time over a period of several days, with lots of
> other activity going on as well. That was no problem for the system, as
> it was network-limited most of the time.
> Although defrag alone can trigger blocking, so can a straight rsync
> from another internal array (capable of 1000 MB/s continuous reads)
> combined with some random activity. It seems that the cause is just
> heavy IO. Is it possible that even though I seemingly have lots of space
> free in measured MBytes, it is all so fragmented that btrfs can't
> allocate space efficiently enough? Or would that give other errors?
> 
> I actually downgraded from kernel 3.13-rc2 because I could not do
> anything else while copying between the internal arrays without btrfs
> hanging, although that was seemingly just temporary and not as bad as
> the defrag blocking.
> 
> I will try to free up some space before running more defrag too, just to
> check if that is the issue.

Three points based on bits you've mentioned, the third likely being the 
most critical for this thread, plus a fourth point, not something you've 
mentioned but just in case...:

1) You mentioned compress=lzo.  It's worth noting that at present, 
filefrag interprets the file segments btrfs breaks compressed files up 
into as part of the compression process as fragments (of IIRC 128 KiB 
each, altho I'm not absolutely sure on that number), so anything that's 
compressed and over that size will be reported by filefrag as fragmented, 
even if it's not.

They're working on teaching filefrag about this sort of thing, and in 
fact I saw some proposed patches for the kernel side of things just 
yesterday, IIRC, but it'll be a few months before all the various pieces 
are in the kernel and filefrag upstreams, and it'll probably be a few 
months to a year or more beyond that before those fixes filter out to 
what the distros are shipping.

However, btrfs won't ordinarily attempt to compress known video files 
(unless the compress-force mount option is used) since they're normally 
already compressed, so that's unlikely to be the issue with your mkvs.  
Additionally, if defragging them is reducing the fragmentation 
dramatically, that's not the problem, as if it was defragging wouldn't 
make a difference.

But it might make a difference on some other files you have...

2) You mentioned backups.  Your backups aren't of the type that use lots 
and lots of hardlinks are they?  Because btrfs isn't the most efficient 
at processing large numbers of hardlinks.  For hardlink-type backups, 
etc, a filesystem other than btrfs will be preferred.  (Additionally, 
since btrfs is still experimental, it's probably a good idea to avoid 
having both your working system and backups on btrfs anyway.  Better to 
have the backups on something else, in case btrfs lives up to the risk 
level implied by its development status.)

3) Critically, the blocked task in your first post was rtorrent.  Given 
that and the filetypes (mkv video files) involved, one can guess that you 
do quite a bit of torrenting.

I'm not sure about rtorrent, but a lot of torrent clients (possibly 
optionally) pre-allocate the space required for a file, then fill in 
individual chunks at random as they are downloaded and written.

*** THIS IS ONE OF THE WORST USE-CASES POSSIBLE FOR ANY COPY-ON-WRITE 
FILESYSTEM, BTRFS INCLUDED!! ***

What happens is that each of those random chunk-writes creates a new 
extent, a new fragment of the file, since COW means it isn't rewritten in-
place and thus must be mapped to a new location on the disk.  If that's 
what you're doing, then no WONDER those files have so many extents -- a 
32-gig file with a quarter million extents in the worst-case you 
mentioned.  And especially on spinning rust, YES, something that heavily 
fragmented WILL trigger I/O blockages for minutes at a time!
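
As a minimal sketch of how to reproduce that effect (hypothetical
paths; any btrfs mount will do):

# fallocate -l 128M /mnt/btrfs/cowtest
# for i in $(seq 1 2000); do dd if=/dev/urandom of=/mnt/btrfs/cowtest \
      bs=4k count=1 seek=$((RANDOM % 32768)) conv=notrunc status=none; done
# sync; filefrag /mnt/btrfs/cowtest   # extents on the order of the rewrites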

(The other very common bad-case, tho I don't believe quite as bad as the 
torrent case as I don't think they commonly re-write the /entire/ thing, 
only large parts of it, is virtual machine images, where writes to the 
virtual disk in the VM end up being "internal file writes" in the file 
containing that image on the host filesystem.  The below recommendations 
apply there as well.)

There are several possible workarounds, including turning off the pre-
allocate option in your torrent client, if possible, and several variants 
on the theme of telling btrfs not to COW those particular files so they 
get rewritten in-place instead.

3a) Create a separate filesystem for your torrent files and either use 
something other than a COW filesystem (ext4 or xfs might be usable 
options), or if you use btrfs, mount that filesystem with the nodatacow 
mount-option.

3b) Configure your btrfs client to use a particular directory (which it 
probably already does by default, but make sure all the files are going 
there -- you're not telling it to directly write some torrent downloads 
elsewhere instead), then set the NOCOW attribute on that directory.  
Newly created files in it should inherit that NOCOW.

3c) Arrange to set NOCOW on individual files before you start writing 
into them.  Often this is done by touching the file to create it, then 
setting the NOCOW attribute, then writing into the existing zero-length 
file.  The attribute needs to be set before there's data in the file -- 
setting it after the fact doesn't really help, and this is one way to do 
it (with inherit from the directory as in 3b another).  However, this 
could be impossible or at minimum rather complicated to handle with the 
torrent client, so 3a or 3b are likely to be more practical choices.  (A 
short command sketch for these options follows this list.)

3d) As mentioned, in some clients it's possible to turn off the pre-
allocation option.  However, this can have other effects as well or pre-
allocation wouldn't be a common torrent client practice in the first 
place, so it may not be what you want in any case.  Pre-allocation is 
fine, as long as the file is set NOCOW using one of the methods above.
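
For illustration, the NOCOW options above boil down to something like 
this (a sketch with hypothetical paths; chattr +C is the NOCOW 
attribute, and for the per-file variant it must be set while the file 
is still empty):

# mount -o nodatacow /dev/sdX /mnt/torrents    # 3a: whole filesystem
# chattr +C /mnt/torrents/downloads            # 3b: new files inherit NOCOW
# touch /mnt/torrents/file.mkv                 # 3c: create empty file...
# chattr +C /mnt/torrents/file.mkv             # ...set NOCOW before writing
# lsattr -d /mnt/torrents/downloads            # verify: C flag set
---------------C /mnt/torrents/downloads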


Of course once you have that setup, you'll still have to deal with the 
existing heavily fragmented files, but at least you won't have a 
continuing regenerating problem you have to deal with. =:^)

4) This one you didn't mention but just in case...  There have been some 
issues with btrfs qgroups that I'm not sure are fully ironed out yet.  In 
general, I'd recommend staying away from quotas and their btrfs qgroups 
implementation for now.  As with hardlink-heavy use-cases, use a 
different filesystem if you are dependent on quotas, at least for the 
time being.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: Blocked for more than 120 seconds
  2013-12-15 13:24             ` Duncan
@ 2013-12-15 14:51               ` Hans-Kristian Bakke
  2013-12-15 23:08                 ` Duncan
  0 siblings, 1 reply; 23+ messages in thread
From: Hans-Kristian Bakke @ 2013-12-15 14:51 UTC (permalink / raw)
  To: Btrfs BTRFS

Thank you for your very thorough answer Duncan.

Just to clear up a couple of questions.

# Backups
The backups I am speaking of are backups of data on the btrfs filesystem
to another location. The btrfs filesystem sees this as large reads at
about 100 Mbit/s, at the time running for about a week continuously. In
other words, the backups are not storing any data on the btrfs array.
The backup is also not running when I am testing this, just so that is
said.

# Regarding torrents and preallocation
I have actually turned preallocation on specifically in rtorrent
thinking that it did btrfs a favour like with ext4
(system.file_allocate.set = yes). It is easy to turn it off.
Is the "ideal" solution for btrfs and torrenting (or any other random
writes to large files) to use preallocation and NOCOW, or use no
preallocation and NOCOW? I am thinking the first, although I still do
not understand quite why preallocation is worse than no preallocation
for btrfs with COW enabled (or is both just as bad?)
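
For reference, turning it off should just be a matter of changing that
line in ~/.rtorrent.rc (assuming an rtorrent version that uses this
option name, as quoted above):

system.file_allocate.set = no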

# qgroups
I am not using qgroups.

Regards

Hans-Kristian



* Re: Blocked for more than 120 seconds
  2013-12-15 14:51               ` Hans-Kristian Bakke
@ 2013-12-15 23:08                 ` Duncan
  2013-12-16  0:06                   ` Hans-Kristian Bakke
  0 siblings, 1 reply; 23+ messages in thread
From: Duncan @ 2013-12-15 23:08 UTC (permalink / raw)
  To: linux-btrfs

Hans-Kristian Bakke posted on Sun, 15 Dec 2013 15:51:37 +0100 as
excerpted:

> # Regarding torrents and preallocation I have actually turned
> preallocation on specifically in rtorrent thinking that it did btrfs a
> favour like with ext4 (system.file_allocate.set = yes). It is easy to
> turn it off.
> Is the "ideal" solution for btrfs and torrenting (or any other random
> writes to large files) to use preallocation and NOCOW, or no
> preallocation and NOCOW? I am thinking the first, although I still do
> not quite understand why preallocation is worse than no preallocation
> for btrfs with COW enabled (or are both just as bad?)

I'm not a dev, only an admin who follows this list as I run btrfs too, 
and thus don't claim to be an expert on the above -- it's mostly echoing 
what I've seen here previously.

That said, preallocation with nocow is the choice I'd make here.

Meanwhile, a subpoint I didn't make explicit previously, tho it's a 
logical conclusion from the explanation, is that once the writing is 
finished and the file becomes, like most media files, effectively read-
only (no further writes), NOCOW is no longer important.  That is, you can 
(sequentially) copy the file somewhere else and not have to worry about 
it.  In fact, that's a reasonably good idea, since NOCOW turns off btrfs 
checksumming too, and presumably you're still interested in maintaining 
file integrity on the thing.

So what I'd do is setup a torrent download dir (or as I mentioned, a 
dedicated partition, since I like that sort of thing because it enforces 
size discipline on the stuff I've downloaded but not fully sorted thru... 
that's what I do with binary newsgroup downloading, which I've been doing 
on and off since well before bittorrent was around), set/mount it NOCOW/
nodatacow, and use it as a temporary download "cache".  Then after a 
file is fully downloaded to "cache", I'd copy it off to a final 
destination in my normal media partition, ultimately removing my NOCOW 
copy.
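
As a concrete sketch of that workflow (hypothetical paths; the plain cp 
forces a full data copy, so the final copy is COW and checksummed 
again):

# chattr +C /storage/torrent-cache          # downloads land NOCOW here
  ... torrent completes ...
# cp /storage/torrent-cache/file.mkv /storage/media/
# rm /storage/torrent-cache/file.mkv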

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: Blocked for more than 120 seconds
  2013-12-14 23:50     ` Chris Murphy
  2013-12-15  0:28       ` Hans-Kristian Bakke
@ 2013-12-15 23:39       ` Charles Cazabon
  2013-12-16  0:16         ` Hans-Kristian Bakke
  1 sibling, 1 reply; 23+ messages in thread
From: Charles Cazabon @ 2013-12-15 23:39 UTC (permalink / raw)
  To: Btrfs BTRFS

Chris Murphy <lists@colorremedies.com> wrote:
> On Dec 14, 2013, at 4:19 PM, Hans-Kristian Bakke <hkbakke@gmail.com> wrote:
> 
> > # btrfs fi df /storage/storage-vol0/
> > Data, RAID10: total=13.89TB, used=12.99TB
> > System, RAID10: total=64.00MB, used=1.19MB
> > System: total=4.00MB, used=0.00
> > Metadata, RAID10: total=21.00GB, used=17.59GB
> 

> By my count this is ~ 95.6% full. My past experience with other file
> systems, including btree file systems, is they get unpredictably fussy when
> they're this full. I start migration planning once 80% full is reached, and
> make it a policy to avoid going over 90% full.

For what it's worth, I see exactly the same behaviour on a system where the
filesystem is only ~60% full, with more than 5TB of free space.  All I have to
do is copy a single file of several gigabytes to the filesystem (over the
network, so it's only coming in at ~30MB/s) and I get similar task-blocked
messages:

INFO: task btrfs-transacti:4118 blocked for more than 120 seconds.
Not tainted 3.12.5-custom+ #10
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
btrfs-transacti D ffff88082fd14140     0  4118      2 0x00000000
ffff880805a06040 0000000000000002 ffff8807f7665d40 ffff8808078f2040
0000000000014140 ffff8807f7665fd8 ffff8807f7665fd8 ffff880805a06040
0000000000000001 ffff88082fd14140 ffff880805a06040 ffff8807f7665c70
Call Trace:
[<ffffffff810d1a19>] ? __lock_page+0x66/0x66
[<ffffffff813b26dd>] ? io_schedule+0x56/0x6c
[<ffffffff810d1a20>] ? sleep_on_page+0x7/0xc
[<ffffffff813b0ad6>] ? __wait_on_bit+0x40/0x79
[<ffffffff810d1df1>] ? find_get_pages_tag+0x66/0x121
[<ffffffff810d1ad8>] ? wait_on_page_bit+0x72/0x77
[<ffffffff8105f540>] ? wake_atomic_t_function+0x21/0x21
[<ffffffff810d218f>] ? filemap_fdatawait_range+0x66/0xfe
[<ffffffffa0545bb5>] ? clear_extent_bit+0x25d/0x29d [btrfs]
[<ffffffffa052ff9a>] ? btrfs_wait_marked_extents+0x79/0xca [btrfs]
[<ffffffffa0530059>] ? btrfs_write_and_wait_transaction+0x6e/0x7e [btrfs]
[<ffffffffa05307ad>] ? btrfs_commit_transaction+0x651/0x843 [btrfs]
[<ffffffffa05297e8>] ? transaction_kthread+0xf4/0x191 [btrfs]
[<ffffffffa05296f4>] ? try_to_freeze_unsafe+0x30/0x30 [btrfs]
[<ffffffffa05296f4>] ? try_to_freeze_unsafe+0x30/0x30 [btrfs]
[<ffffffff8105eb45>] ? kthread+0x81/0x89
[<ffffffff81013291>] ? paravirt_sched_clock+0x5/0x8
[<ffffffff8105eac4>] ? __kthread_parkme+0x5d/0x5d
[<ffffffff813b880c>] ? ret_from_fork+0x7c/0xb0
[<ffffffff8105eac4>] ? __kthread_parkme+0x5d/0x5d


So it's not, at least in my case, due to the filesystem approaching full.

I've seen this behaviour over many kernel versions; the above is with 3.12.5.

Charles
-- 
-----------------------------------------------------------------------
Charles Cazabon
GPL'ed software available at:               http://pyropus.ca/software/
-----------------------------------------------------------------------


* Re: Blocked for more than 120 seconds
  2013-12-15 23:08                 ` Duncan
@ 2013-12-16  0:06                   ` Hans-Kristian Bakke
  2013-12-16 10:19                     ` Duncan
  0 siblings, 1 reply; 23+ messages in thread
From: Hans-Kristian Bakke @ 2013-12-16  0:06 UTC (permalink / raw)
  To: Btrfs BTRFS

Torrents are really only one of the things my storage server gets
hammered with. It also does a lot of other IO-intensive stuff. I
actually run enterprise storage drives in a Supermicro server for a
reason; even if it is my home setup, consumer stuff just doesn't cut it
with my storage abuse :)
It runs KVM virtualisation (not on btrfs though) with several VMs,
including Windows machines, does lots of manipulation of large files,
offsite backups at 100 Mbit/s for days on end, reencoding of large
amounts of audio files, runs lots of web sites, constantly streams
blu-rays to at least one computer, and chews through enormous amounts
of internet bandwidth constantly. Last week it consumed ~10TB of
internet bandwidth alone. I was at about 140 Mbit/s average throughput
on a 100/100 link over a full 7-day week, peaking at 177 Mbit/s average
over 24 hours, and that is not counting the local gigabit traffic for
all the video remuxing and stuff.
In other words, all 19 storage drives in that server are driven really
hard, and it is no wonder that this triggers some subtleties that
normal users just don't hit.

But since torrenting is clearly the worst offender when it comes to
fragmentation, I can comment on that.
Using btrfs with partitioning stops me from using the btrfs multidisk
handling that I ideally need, so that is really not an option. I also
think that if I were to use partitions (no multidisk), no COW and
hence no checksumming, I might as well use ext4, which is more
optimized for that usage scenario. Ideally I could use just a subvol
with nodatacow and a quota for this purpose, but per-subvolume
nodatacow is not available yet as far as I have understood (correct me
if I'm wrong).

What I will do now, as a way of keeping the worst offender from
messing up the general storage pool, is to shrink the btrfs array from
8x4TB drives in btrfs RAID10 to a 7-disk array, and dedicate a drive
to rtorrent, running ext4 with preallocation.
Until btrfs, I have normally just made one large array of all storage
drives with matching performance characteristics, thinking that all
the data can benefit from the extra IO performance of the array. This
has been a good compromise for a limited-budget home setup where ideal
storage tiering with SSD-hybrid SANs and such is not an option. But as
I am now experiencing with btrfs, COW kind of changes the rules in a
profound, noticeable, all-the-time way. With COW's inherent
random-write-to-large-file fragmentation penalty, I think there is no
other way than to separate the different workloads into separate
storage pools on different hardware. In my case it would probably mean
having one storage pool for general storage, one for VMs and one for
torrenting, as all of those react in their own way to COW and will be
heavily affected by the other workloads in the worst case if run from
the same drives with COW.

Your system of a "cache" is actually already implemented logically in
my setup, in the form of a post-processing script that rtorrent runs
on completion. It moves completed files in dedicated per-tracker
seeding folders, and then makes a copy (using cp --reflink=auto on
btrfs) of the file, processes it if needed (tag clean up, reencoding,
decompresssing or what not), and then moves it to another "finished"
folder. This makes it easy to know what the new stuff is, and I can
manipulate, rename and clean up all the data without messing up the
seeds.
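
For illustration, the core of such a hook might look like this (a
sketch with hypothetical paths, assuming rtorrent invokes the script
with the completed file's path as $1):

#!/bin/sh
src="$1"
mv "$src" /storage/seeding/
# --reflink=auto makes a cheap CoW copy on btrfs (shares extents) and
# falls back to a normal copy on other filesystems
cp --reflink=auto "/storage/seeding/$(basename "$src")" /storage/work/
# ... tag clean-up / reencoding on the working copy, then:
mv "/storage/work/$(basename "$src")" /storage/finished/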

I think that the "finished" folder could still be located on the
RAID10 btrfs volume with COW, as I can use an internal move into the
organized archive when I am actually sitting at the computer instead
of a drive to drive copy via the network.

Regards,
H-K



* Re: Blocked for more than 120 seconds
  2013-12-15 23:39       ` Charles Cazabon
@ 2013-12-16  0:16         ` Hans-Kristian Bakke
  0 siblings, 0 replies; 23+ messages in thread
From: Hans-Kristian Bakke @ 2013-12-16  0:16 UTC (permalink / raw)
  To: Btrfs BTRFS

There are actually more reports of this. Like this one:

http://iohq.net/index.php?title=Btrfs:RAID_5_Rsync_Freeze

It seems to be the exact same issue as mine, as I too can't do
high-speed rsyncs writing to the btrfs array without blocking (reading
is fine).

Regards,

Hans-Kristian Bakke


On 16 December 2013 00:39, Charles Cazabon
<charlesc-lists-btrfs@pyropus.ca> wrote:
> Chris Murphy <lists@colorremedies.com> wrote:
>> On Dec 14, 2013, at 4:19 PM, Hans-Kristian Bakke <hkbakke@gmail.com> wrote:
>>
>> > # btrfs fi df /storage/storage-vol0/
>> > Data, RAID10: total=13.89TB, used=12.99TB
>> > System, RAID10: total=64.00MB, used=1.19MB
>> > System: total=4.00MB, used=0.00
>> > Metadata, RAID10: total=21.00GB, used=17.59GB
>>
>
>> By my count this is ~ 95.6% full. My past experience with other file
>> systems, including btree file systems, is that they get unpredictably
>> fussy when they're this full. I start migration planning once 80% full
>> is reached, and make it a policy to avoid going over 90% full.
>
> For what it's worth, I see exactly the same behaviour on a system where the
> filesystem is only ~60% full, with more than 5TB of free space.  All I have to
> do is copy a single file of several gigabytes to the filesystem (over the
> network, so it's only coming in at ~30MB/s) and I get similar task-blocked
> messages:
>
> INFO: task btrfs-transacti:4118 blocked for more than 120 seconds.
> Not tainted 3.12.5-custom+ #10
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> btrfs-transacti D ffff88082fd14140     0  4118      2 0x00000000
> ffff880805a06040 0000000000000002 ffff8807f7665d40 ffff8808078f2040
> 0000000000014140 ffff8807f7665fd8 ffff8807f7665fd8 ffff880805a06040
> 0000000000000001 ffff88082fd14140 ffff880805a06040 ffff8807f7665c70
> Call Trace:
> [<ffffffff810d1a19>] ? __lock_page+0x66/0x66
> [<ffffffff813b26dd>] ? io_schedule+0x56/0x6c
> [<ffffffff810d1a20>] ? sleep_on_page+0x7/0xc
> [<ffffffff813b0ad6>] ? __wait_on_bit+0x40/0x79
> [<ffffffff810d1df1>] ? find_get_pages_tag+0x66/0x121
> [<ffffffff810d1ad8>] ? wait_on_page_bit+0x72/0x77
> [<ffffffff8105f540>] ? wake_atomic_t_function+0x21/0x21
> [<ffffffff810d218f>] ? filemap_fdatawait_range+0x66/0xfe
> [<ffffffffa0545bb5>] ? clear_extent_bit+0x25d/0x29d [btrfs]
> [<ffffffffa052ff9a>] ? btrfs_wait_marked_extents+0x79/0xca [btrfs]
> [<ffffffffa0530059>] ? btrfs_write_and_wait_transaction+0x6e/0x7e [btrfs]
> [<ffffffffa05307ad>] ? btrfs_commit_transaction+0x651/0x843 [btrfs]
> [<ffffffffa05297e8>] ? transaction_kthread+0xf4/0x191 [btrfs]
> [<ffffffffa05296f4>] ? try_to_freeze_unsafe+0x30/0x30 [btrfs]
> [<ffffffffa05296f4>] ? try_to_freeze_unsafe+0x30/0x30 [btrfs]
> [<ffffffff8105eb45>] ? kthread+0x81/0x89
> [<ffffffff81013291>] ? paravirt_sched_clock+0x5/0x8
> [<ffffffff8105eac4>] ? __kthread_parkme+0x5d/0x5d
> [<ffffffff813b880c>] ? ret_from_fork+0x7c/0xb0
> [<ffffffff8105eac4>] ? __kthread_parkme+0x5d/0x5d
>
>
> So it's not, at least in my case, due to the filesystem approaching full.
>
> I've seen this behaviour over many kernel versions; the above is with 3.12.5.
>
> Charles
> --
> -----------------------------------------------------------------------
> Charles Cazabon
> GPL'ed software available at:               http://pyropus.ca/software/
> -----------------------------------------------------------------------


* Re: Blocket for more than 120 seconds
  2013-12-16  0:06                   ` Hans-Kristian Bakke
@ 2013-12-16 10:19                     ` Duncan
  2013-12-16 10:55                       ` Hans-Kristian Bakke
  0 siblings, 1 reply; 23+ messages in thread
From: Duncan @ 2013-12-16 10:19 UTC (permalink / raw)
  To: linux-btrfs

Hans-Kristian Bakke posted on Mon, 16 Dec 2013 01:06:36 +0100 as
excerpted:

> Torrents are really only one of the things my storage server gets
> hammered with. It also does a lot of other IO-intensive stuff. I
> actually run enterprise storage drives in a Supermicro server for a
> reason; even if it is my home setup, consumer stuff just doesn't cut it
> with my storage abuse :)
> It runs KVM virtualisation (not on btrfs though) with several VMs,
> including Windows machines, does lots of manipulation of large files,
> offsite backups at 100 mbit/s for days on end, reencoding of large
> amounts of audio files, runs lots of web sites, constantly streams
> blu-rays to at least one computer, and chews through enormous amounts of
> internet bandwidth constantly. Last week it consumed ~10TB of internet
> bandwidth alone. I was at about 140 mbit/s average throughput on a
> 100/100 link over a full 7 day week, peaking at 177 mbit/s average over
> 24 hours, and that is not counting the local gigabit traffic for all the
> video remuxing and stuff.
> In other words, all 19 storage drives in that server are driven really
> hard, and it is no wonder that this triggers some subtleties that normal
> users just don't hit.

Wow!  Indeed!

> But since torrenting is clearly the worst offender when it comes to
> fragmentation, I can comment on that.
> Using btrfs with partitioning stops me from using the btrfs multidisk
> handling that I ideally need, so that is really not an option.

??  I'm not running near what you're running, but I *AM* running
multiple independent multi-device btrfs filesystems (raid1 mode) on a
single pair of partitioned 256 GB (238 GiB) SSDs, just as, pre-btrfs and
pre-SSD, I ran multiple 4-way md/raid1 volumes on individual partitions
on 4-physical-spindle spinning rust.

Like md/raid, btrfs' multi-device support takes generic block devices.  
It doesn't care whether they're physical devices, partitions on physical 
devices, LVM2 volumes on physical devices, md/raid volumes on physical 
devices, partitions on md-raid on lvm2 on physical devices... you get the 
idea.  As long as you can mkfs.btrfs it, you can run multiple-device 
btrfs on it.
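
As a quick sketch (hypothetical device names), a two-device raid1 btrfs
on partitions rather than whole disks is just:

  mkfs.btrfs -m raid1 -d raid1 /dev/sda2 /dev/sdb2
  mount /dev/sda2 /mnt/wherever   # any member device mounts the whole fs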

In fact, I have that pair of SSDs GPT partitioned up, with 11 independent 
btrfs, 9 of which are btrfs raid1 mode across similar partitions (one 
/var/log, plus working and primary backup for each of root, /home, gentoo 
distro packages tree with sources and binpkgs as well, and a 32-bit chroot 
that's an install image for my netbook) on each device, with the other 
two being /boot and its backup on the other device, my only two non-raid1-
mode btrfs.

So yes, you can definitely run btrfs multi-device on partition block-
devices instead of directly on the physical-device block devices, as I 
know quite well since my setup depends on that! =:^)

> I also
> think that if I were to use partitions (no multidisk), no COW and hence
> no checksumming, I might as well use ext4 which is more optimized for
> that usage scenario. Ideally I could use just a subvol with nodatacow
> and quota for this purpose, but per subvolume nodatacow is not available
> yet as far as I have understood (correct me if I'm wrong).

Well, if your base assumption, that you couldn't use btrfs multi-device 
on partitions, only on physical devices, was correct... But it's not.

Which means you /can/ partition if you like, and then use whatever 
filesystem on those partitions you want, combining multi-device btrfs on 
some of them, with ext4 on md/raid if you want multi-device support for 
it, since unlike btrfs, ext4 doesn't support multi-device natively.
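
A sketch of that combination, again with hypothetical devices:

  mdadm --create /dev/md0 --level=raid6 --raid-devices=4 /dev/sd[a-d]3
  mkfs.ext4 /dev/md0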

You could even throw lvm2 in there, if you like, giving you additional
sizing and deployment flexibility.  Before btrfs here, I actually used
reiserfs on lvm2 on mdraid on physical devices, and it worked, but it
was complex enough that I wasn't confident of my ability to manage it in
a disaster-recovery scenario.  Also, lvm2 requires userspace and thus an
initr* to handle root on lvm2, while root on mdraid can be handled
directly from the kernel commandline with no initr* required, so I kept
the mdraid and dropped lvm2.

[snipped further discussion along that invalid assumption line]

> I have, until btrfs, normally just made one large array of all storage
> drives matching in performance characteristics, thinking that all the
> data can benefit from the extra IO performance of the array. This has
> been a good compromise for a limited-budget home setup where ideal
> storage tiering with SSD hybrid SANs and such is not an option. But as I
> am now experiencing with btrfs, COW kind of changes the rules in a
> profound, noticeable, all-the-time way. With COW's inherent
> random-write-to-large-file fragmentation penalty, I think there is no
> other way than to separate the different workloads into separate storage
> pools going to different hardware. In my case it would probably mean
> having one storage pool for general storage, one for VMs and one for
> torrenting, as all of those react in their own way to COW and will get
> heavily affected by the other workloads in the worst case if run from
> the same drives with COW.

Luckily, the partitioning thing does work.  Additionally, as mentioned,
you can set NOCOW on directories and have new files in them inherit
that.  So you have quite a bit more flexibility than you might have
thought.  Tho of course it's your system and you may well prefer
administering whole physical devices to dealing with partitions, just as
I decided lvm2 wasn't appropriate for me, altho many people use it for
everything.

> Your system of a "cache" is actually already implemented logically in my
> setup, in the form of a post-processing script that rtorrent runs on
> completion. It moves completed files into dedicated per-tracker seeding
> folders, and then makes a copy (using cp --reflink=auto on btrfs) of the
> file, processes it if needed (tag clean-up, reencoding, decompressing or
> what not), and then moves it to another "finished" folder. This makes it
> easy to know what the new stuff is, and I can manipulate, rename and
> clean up all the data without messing up the seeds.
> 
> I think that the "finished" folder could still be located on the RAID10
> btrfs volume with COW, as I can use an internal move into the organized
> archive when I am actually sitting at the computer instead of a
> drive-to-drive copy via the network.

That makes sense.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: Blocket for more than 120 seconds
  2013-12-16 10:19                     ` Duncan
@ 2013-12-16 10:55                       ` Hans-Kristian Bakke
  2013-12-16 15:00                         ` Duncan
  0 siblings, 1 reply; 23+ messages in thread
From: Hans-Kristian Bakke @ 2013-12-16 10:55 UTC (permalink / raw)
  To: Btrfs BTRFS

Stupid me, I completely forgot that you can run multidisk arrays with
just block-level partitions, just like with md raid! It will introduce
a rather significant management overhead in my case though, as managing
several individual partitions per drive is quite annoying with so many
drives.

What happens if I do cp --reflink=auto from a NOCOW file in a NOCOW
folder to a folder with COW set on the same btrfs volume? Do I still
get "free" copying, and is the resulting file COW or NOCOW?


Regards,

Hans-Kristian Bakke


On 16 December 2013 11:19, Duncan <1i5t5.duncan@cox.net> wrote:
> Hans-Kristian Bakke posted on Mon, 16 Dec 2013 01:06:36 +0100 as
> excerpted:
>
>> Torrents are really only one of the things my storage server gets
>> hammered with. It also does a lot of other IO-intensive stuff. I
>> actually run enterprise storage drives in a Supermicro server for a
>> reason; even if it is my home setup, consumer stuff just doesn't cut it
>> with my storage abuse :)
>> It runs KVM virtualisation (not on btrfs though) with several VMs,
>> including Windows machines, does lots of manipulation of large files,
>> offsite backups at 100 mbit/s for days on end, reencoding of large
>> amounts of audio files, runs lots of web sites, constantly streams
>> blu-rays to at least one computer, and chews through enormous amounts of
>> internet bandwidth constantly. Last week it consumed ~10TB of internet
>> bandwidth alone. I was at about 140 mbit/s average throughput on a
>> 100/100 link over a full 7 day week, peaking at 177 mbit/s average over
>> 24 hours, and that is not counting the local gigabit traffic for all the
>> video remuxing and stuff.
>> In other words, all 19 storage drives in that server are driven really
>> hard, and it is no wonder that this triggers some subtleties that normal
>> users just don't hit.
>
> Wow!  Indeed!
>
>> But since torrenting is clearly the worst offender when it comes to
>> fragmentation, I can comment on that.
>> Using btrfs with partitioning stops me from using the btrfs multidisk
>> handling that I ideally need, so that is really not an option.
>
> ??  I'm not running near what you're running, but I *AM* running
> multiple independent multi-device btrfs filesystems (raid1 mode) on a
> single pair of partitioned 256 GB (238 GiB) SSDs, just as, pre-btrfs and
> pre-SSD, I ran multiple 4-way md/raid1 volumes on individual partitions
> on 4-physical-spindle spinning rust.
>
> Like md/raid, btrfs' multi-device support takes generic block devices.
> It doesn't care whether they're physical devices, partitions on physical
> devices, LVM2 volumes on physical devices, md/raid volumes on physical
> devices, partitions on md-raid on lvm2 on physical devices... you get the
> idea.  As long as you can mkfs.btrfs it, you can run multiple-device
> btrfs on it.
>
> In fact, I have that pair of SSDs GPT partitioned up, with 11 independent
> btrfs, 9 of which are btrfs raid1 mode across similar partitions (one
> /var/log, plus working and primary backup for each of root, /home, gentoo
> distro packages tree with sources and binpkgs as well, and a 32-bit chroot
> that's an install image for my netbook) on each device, with the other
> two being /boot and its backup on the other device, my only two non-raid1-
> mode btrfs.
>
> So yes, you can definitely run btrfs multi-device on partition block-
> devices instead of directly on the physical-device block devices, as I
> know quite well since my setup depends on that! =:^)
>
>> I also
>> think that if I were to use partitions (no multidisk), no COW and hence
>> no checksumming, I might as well use ext4 which is more optimized for
>> that usage scenario. Ideally I could use just a subvol with nodatacow
>> and quota for this purpose, but per subvolume nodatacow is not available
>> yet as far as I have understood (correct me if I'm wrong).
>
> Well, if your base assumption, that you couldn't use btrfs multi-device
> on partitions, only on physical devices, was correct... But it's not.
>
> Which means you /can/ partition if you like, and then use whatever
> filesystem on those partitions you want, combining multi-device btrfs on
> some of them, with ext4 on md/raid if you want multi-device support for
> it, since unlike btrfs, ext4 doesn't support multi-device natively.
>
> You could even throw lvm2 in there, if you like, giving you additional
> sizing and deployment flexibility.  Before btrfs here, I actually used
> reiserfs on lvm2 on mdraid on physical devices, and it worked, but it
> was complex enough that I wasn't confident of my ability to manage it in
> a disaster-recovery scenario.  Also, lvm2 requires userspace and thus an
> initr* to handle root on lvm2, while root on mdraid can be handled
> directly from the kernel commandline with no initr* required, so I kept
> the mdraid and dropped lvm2.
>
> [snipped further discussion along that invalid assumption line]
>
>> I have, until btrfs, normally just made one large array of all storage
>> drives matching in performance characteristics, thinking that all the
>> data can benefit from the extra IO performance of the array. This has
>> been a good compromise for a limited-budget home setup where ideal
>> storage tiering with SSD hybrid SANs and such is not an option. But as I
>> am now experiencing with btrfs, COW kind of changes the rules in a
>> profound, noticeable, all-the-time way. With COW's inherent
>> random-write-to-large-file fragmentation penalty, I think there is no
>> other way than to separate the different workloads into separate storage
>> pools going to different hardware. In my case it would probably mean
>> having one storage pool for general storage, one for VMs and one for
>> torrenting, as all of those react in their own way to COW and will get
>> heavily affected by the other workloads in the worst case if run from
>> the same drives with COW.
>
> Luckily, the partitioning thing does work.  Additionally, as mentioned,
> you can set NOCOW on directories and have new files in them inherit
> that.  So you have quite a bit more flexibility than you might have
> thought.  Tho of course it's your system and you may well prefer
> administering whole physical devices to dealing with partitions, just as
> I decided lvm2 wasn't appropriate for me, altho many people use it for
> everything.
>
>> Your system of a "cache" is actually already implemented logically in my
>> setup, in the form of a post-processing script that rtorrent runs on
>> completion. It moves completed files into dedicated per-tracker seeding
>> folders, and then makes a copy (using cp --reflink=auto on btrfs) of the
>> file, processes it if needed (tag clean-up, reencoding, decompressing or
>> what not), and then moves it to another "finished" folder. This makes it
>> easy to know what the new stuff is, and I can manipulate, rename and
>> clean up all the data without messing up the seeds.
>>
>> I think that the "finished" folder could still be located on the RAID10
>> btrfs volume with COW, as I can use an internal move into the organized
>> archive when I am actually sitting at the computer instead of a
>> drive-to-drive copy via the network.
>
> That makes sense.
>
> --
> Duncan - List replies preferred.   No HTML msgs.
> "Every nonfree program has a lord, a master --
> and if you use the program, he is your master."  Richard Stallman


* Re: Blocket for more than 120 seconds
  2013-12-16 10:55                       ` Hans-Kristian Bakke
@ 2013-12-16 15:00                         ` Duncan
  0 siblings, 0 replies; 23+ messages in thread
From: Duncan @ 2013-12-16 15:00 UTC (permalink / raw)
  To: linux-btrfs

Hans-Kristian Bakke posted on Mon, 16 Dec 2013 11:55:40 +0100 as
excerpted:

> Stupid me, I completely forgot that you can run multidisk arrays with
> just block-level partitions, just like with md raid! It will introduce
> a rather significant management overhead in my case though, as managing
> several individual partitions per drive is quite annoying with so many
> drives.

What I did here, both with mdraid and now with btrfs raid1, is use a 
parallel partition setup on all target drives.  In a couple special cases 
it results in some wasted space[1], but for most cases it's possible to 
simply plan partition sizes so the end result after raid combination is 
the desired size.

And some years ago I switched to GPT partitions, for checksummed/
redundant partition-table reliability, partition naming (similar to
filesystem labels but in the GPT partition table itself), and to not have
to mess with primary/extended/logical partitions.  That lets me set
partition names, which, given the scheme I use for both partition names
and filesystem labeling, means I have unique name/label IDs for
everything, across multiple machines and with thumb-drives too!

The scheme is 15 chars long, reiserfs' max label length, since I was
using it at the time I designed the scheme.  Here's the content of the
text file I keep documenting it:

>>>>>

* 15 characters long
123456789012345
ff     bbB ymd
  ssssS   t   n


Example: rt0005gmd3+9bc0


Function:
ff:     2-char function abbreviation (bt/rt/lg, etc)

Device ID (size, brand, 0-based number/letter)
ssssS:  4-digit size, 1-char multiplier (m=meg, g=gig, etc)
        This is the size of the underlying media, NOT the partition!
bbB     2-char media brand ID, 1-digit sequence number.
        pa=patriot, io=iomega, md=md/raid, etc.
        Note that /dev/md10=mda...

Target/separator
t:      1-char target ID and separator.
        .=aa1, +=workstation, %=both (bt/swap on portable disk)

Date code
ymd:    1-char-each year/month/day prepared
        y=last digit of year
        m=month (1-9abc)
        d=day (1-9a-v)

Number (working, backup-n)
        n=copy number (zero-based)

So together, we have 2 chars of function, 8 of size/mfr/n as device-id,
1 of target/separator,  3 of date prepared, 1 of copy number.

So our example: rt0005gmd3+9bc0

rt=root
0005gmd3=5-gig /dev/md3
+=targeted at the workstation
9bc0=2009.1112 (Nov. 12), first/main version.

<<<<<
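
For reference, a GPT partition name like the example above can be set
with sgdisk (device and partition number hypothetical):

  sgdisk --change-name=3:rt0005gmd3+9bc0 /dev/sda
  sgdisk -p /dev/sda   # print the table, names included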

For a multi-device btrfs, I set the "hardware" sequence number 
appropriately for the partitions, with the filesystem label identical, 
except its "hardware" sequence number is "x", indicating it's across 
multiple hardware devices.

The "filesystem" sequence number, meanwhile, is 0 for the working copy, 1 
for the primary backup, etc.

With that scheme, I have both the partitions and the filesystems on top 
of them uniquely labeled with function, hardware/media ID (brand, size, 
sequence number), target machine, and partition/filesystem ID (date of 
layout, working/backup sequence number).  If it ever got /too/ complex I 
could keep a list of them somewhere, but so far, it hasn't gotten beyond 
the context-manageable scope level, so between seeing the name/label and 
knowing the context of what I'm accessing, I've been able to track it 
without resorting to a written tracking list.

But that's not to say you gotta do what I do.  If you still find the 
administrative overhead of all those partitions too high, well so be it.
This is just the solution that I've come up with after a couple decades 
of incremental modification, to where it works pretty well for me now.  
If some of the stuff I've come up with the hard way makes useful hints 
for someone else, great.  Otherwise, just ignore it and do what works for 
you.  It's your system and you're the one dealing with it, after all, not 
mine/me. =:^)

> What happens if I do cp --reflink=auto from a NOCOW file in a NOCOW
> folder to a folder with COW set on the same btrfs volume? Do I still get
> "free" copying, and is the resulting file COW or NOCOW?

I don't actually know, as that doesn't fit my use case so well, tho a
comment I read awhile back hinted that it may spit out an error.

FWIW, I tend to either use symlinks one direction or the other, or I'm
trying to keep deliberately redundant backups where I don't want
potential damage to kill the common-due-to-COW parts of both files, so I
don't actually tend to find reflinks particularly useful here, even if I
appreciate the flexibility that option allows.
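
One way to find out on a given kernel, tho -- untested sketch: create a
file in a chattr +C dir, then try

  cp --reflink=always nocow-dir/somefile cow-dir/somefile

Unlike --reflink=auto, --reflink=always errors out rather than quietly
falling back to a full copy, so it shows directly whether the reflink
was allowed, and lsattr on the result shows whether the 'C' attribute
came along.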

---
[1] For instance, swap with a hibernate image, back before I switched to
SSD (hibernate was broken on my new machine, last I checked about a year
ago, and I've enough memory on this machine that I usually don't fill it
even with cache, so I wasn't using hibernate or swap anyway when I
switched to SSD).  The hibernate image must ordinarily be on a single
device and should be half the size of RAM or so to avoid dumping cache
to fit, but making all the parallel swap partitions that size made for a
prohibitively large swap that, even if I WERE to need it, would take far
longer to transfer all those gigs to/from spinning rust than I'd want to
wait, so one way or another I'd never actually use it all.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: Blocket for more than 120 seconds
  2013-12-15  2:35           ` Hans-Kristian Bakke
  2013-12-15 13:24             ` Duncan
@ 2013-12-16 15:18             ` Chris Mason
  2013-12-16 16:32               ` Hans-Kristian Bakke
  1 sibling, 1 reply; 23+ messages in thread
From: Chris Mason @ 2013-12-16 15:18 UTC (permalink / raw)
  To: hkbakke; +Cc: linux-btrfs


On Sun, 2013-12-15 at 03:35 +0100, Hans-Kristian Bakke wrote:
> I have done some more testing. I turned off everything using the disk
> and only did defrag. I have created a script that gives me a list of
> the files with the most extents. I started from the top to improve the
> fragmentation of the worst files. The most fragmented file was a file
> of about 32GB with over 250 000 extents!
> It seems that I can defrag two to three largish (15-30GB), ~100 000
> extent files just fine, but after a while the system locks up (not a
> complete hard lock, but everything hangs and a restart is necessary
> to get a fully working system again).
>
> It seems like defrag operations are triggering the issue. Probably in
> combination with the large and heavily fragmented files.
> 
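> (A rough way to build such a most-extents list -- untested sketch using
> filefrag from e2fsprogs, whose extent counts are only approximate on
> btrfs:
>
>   find /storage/storage-vol0 -xdev -type f -exec filefrag {} + \
>       | sort -t: -k2 -rn | head -20
>
> filefrag prints "<path>: N extents found", so a reverse numeric sort on
> the second colon-separated field lists the most fragmented files first.)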

I'm trying to understand how defrag factors into your backup workload.
Do you have autodefrag on, or are you running a defrag as part of the
backup when you see these stalls?

If not, we're seeing a different problem.

-chris



* Re: Blocket for more than 120 seconds
  2013-12-16 15:18             ` Chris Mason
@ 2013-12-16 16:32               ` Hans-Kristian Bakke
  2013-12-16 18:16                 ` Chris Mason
  0 siblings, 1 reply; 23+ messages in thread
From: Hans-Kristian Bakke @ 2013-12-16 16:32 UTC (permalink / raw)
  To: linux-btrfs

Ok, I guess the essence has been lost in the meta discussion.

Basically I get blocking for more than 120 seconds during these workloads:
- defragmenting several large fragmented files in succession (leaving
time for btrfs to finish writing each file). This has *always*
happened on my array, even when it just consisted of 4x4TB drives.
or
- rsyncing *to* the btrfs array from another internal array (rsync -a
<source_on_ext4_mdadm_array> <dest_on_btrfs_raid10_array>)

rsyncing *from* the btrfs array is not a problem, so my issue seems to
be limited to heavy writing.
This is happening even if the server is doing nothing else, no
backups, no torrenting, no copying. The only "external" thing that is
happening is a regular poll from smartd to the drives and regular
filesystem size checks from check_mk (Icinga monitoring).

The FS has a little over 3 TB free (of 29 TB available for RAID10 data
and metadata) and contains mainly largish files like FLAC files,
photos and large mkv files, ranging from 250 MB to around 70 GB, with
one subvolume and one snapshot of that subvolume.
"find /storage/storage-vol0/ -xdev -type f | wc -l" gives a result of
131 820 files. No hard linking is used.

I am currently removing a drive from the array, reducing the number of
drives from 8 to 7. The rebalance has not blocked for more than 120
seconds yet, but it is clearly blocking for quite a few seconds once
in a while, as all other software using the drives can't get anything
through and hangs for a period.
I do expect slowdowns during heavy load, but not blocking. The ext4
mdadm RAID6 array in the same server has only been slow during heavy
load, but never blocked noticeably.

Regards,

Hans-Kristian Bakke


On 16 December 2013 16:18, Chris Mason <clm@fb.com> wrote:
> On Sun, 2013-12-15 at 03:35 +0100, Hans-Kristian Bakke wrote:
>> I have done some more testing. I turned off everything using the disk
>> and only did defrag. I have created a script that gives me a list of
>> the files with the most extents. I started from the top to improve the
>> fragmentation of the worst files. The most fragmented file was a file
>> of about 32GB with over 250 000 extents!
>> It seems that I can defrag two to three largish (15-30GB), ~100 000
>> extent files just fine, but after a while the system locks up (not a
>> complete hard lock, but everything hangs and a restart is necessary
>> to get a fully working system again).
>>
>> It seems like defrag operations are triggering the issue. Probably in
>> combination with the large and heavily fragmented files.
>>
>
> I'm trying to understand how defrag factors into your backup workload.
> Do you have autodefrag on, or are you running a defrag as part of the
> backup when you see these stalls?
>
> If not, we're seeing a different problem.
>
> -chris
>


* Re: Blocket for more than 120 seconds
  2013-12-16 16:32               ` Hans-Kristian Bakke
@ 2013-12-16 18:16                 ` Chris Mason
  2013-12-16 18:22                   ` Hans-Kristian Bakke
  0 siblings, 1 reply; 23+ messages in thread
From: Chris Mason @ 2013-12-16 18:16 UTC (permalink / raw)
  To: hkbakke; +Cc: linux-btrfs


On Mon, 2013-12-16 at 17:32 +0100, Hans-Kristian Bakke wrote:
> Ok, I guess the essence has been lost in the meta discussion.
> 
> Basically I get blocking for more than 120 seconds during these workloads:
> - defragmenting several large fragmented files in succession (leaving
> time for btrfs to finish writing each file). This has *always*
> happened on my array, even when it just consisted of 4x4TB drives.
> or
> - rsyncing *to* the btrfs array from another internal array (rsync -a
> <source_on_ext4_mdadm_array> <dest_on_btrfs_raid10_array>)
> 

Ok, and do you have autodefrag enabled on the btrfs FS you are copying
to?  Also, how much RAM do you have?

-chris



* Re: Blocket for more than 120 seconds
  2013-12-16 18:16                 ` Chris Mason
@ 2013-12-16 18:22                   ` Hans-Kristian Bakke
  2013-12-16 18:33                     ` Chris Mason
  0 siblings, 1 reply; 23+ messages in thread
From: Hans-Kristian Bakke @ 2013-12-16 18:22 UTC (permalink / raw)
  To: linux-btrfs

I have explicitly set compress=lzo, and later added noatime just to
test; otherwise it's just the default 3.12.4 options (or 3.13-rc2 when
I tested that).

To make sure, here are my btrfs mounts from /proc/mounts:
/dev/sdl /btrfs btrfs rw,noatime,compress=lzo,space_cache 0 0
/dev/sdl /storage/storage-vol0 btrfs rw,noatime,compress=lzo,space_cache 0 0

/etc/fstab:
UUID=9302fc8f-15c6-46e9-9217-951d7423927c  /btrfs                 btrfs  defaults,compress=lzo,noatime         0  2
UUID=9302fc8f-15c6-46e9-9217-951d7423927c  /storage/storage-vol0  btrfs  defaults,subvol=storage-vol0,noatime  0  2

Hardware:
CPU: Intel Xeon X3430 (Quad Core)
MB: Supermicro X8SI6-F
RAM: 16GB (4x4GB) Samsung ECC/Unbuffered DDR3 1333MHz CL9 (MEM-DR340L-SL01-EU13)
HDDs in btrfs RAID10: 8 x Western Digital Se 4TB 64MB 7200RPM SATA
6Gb/s (WD4000F9YZ)
HBAs: LSI SAS 9211-8i, LSI SAS 9201-16i

Regards,

Hans-Kristian Bakke


On 16 December 2013 19:16, Chris Mason <clm@fb.com> wrote:
> On Mon, 2013-12-16 at 17:32 +0100, Hans-Kristian Bakke wrote:
>> Ok, I guess the essence has been lost in the meta discussion.
>>
>> Basically I get blocking for more than 120 seconds during these workloads:
>> - defragmenting several large fragmented files in succession (leaving
>> time for btrfs to finish writing each file). This has *always*
>> happened on my array, even when it just consisted of 4x4TB drives.
>> or
>> - rsyncing *to* the btrfs array from another internal array (rsync -a
>> <source_on_ext4_mdadm_array> <dest_on_btrfs_raid10_array>)
>>
>
> Ok, and do you have autodefrag enabled on the btrfs FS you are copying
> to?  Also, how much RAM do you have?
>
> -chris
>


* Re: Blocket for more than 120 seconds
  2013-12-16 18:22                   ` Hans-Kristian Bakke
@ 2013-12-16 18:33                     ` Chris Mason
  2013-12-16 18:41                       ` Hans-Kristian Bakke
  0 siblings, 1 reply; 23+ messages in thread
From: Chris Mason @ 2013-12-16 18:33 UTC (permalink / raw)
  To: hkbakke; +Cc: linux-btrfs


On Mon, 2013-12-16 at 19:22 +0100, Hans-Kristian Bakke wrote:
> I have explicitly set compress=lzo, and later added noatime just to
> test; otherwise it's just the default 3.12.4 options (or 3.13-rc2 when
> I tested that).
> 
> To make sure, here are my btrfs mounts from /proc/mounts:
> /dev/sdl /btrfs btrfs rw,noatime,compress=lzo,space_cache 0 0
> /dev/sdl /storage/storage-vol0 btrfs rw,noatime,compress=lzo,space_cache 0 0
> 
> /etc/fstab:
> UUID=9302fc8f-15c6-46e9-9217-951d7423927c  /btrfs                 btrfs  defaults,compress=lzo,noatime         0  2
> UUID=9302fc8f-15c6-46e9-9217-951d7423927c  /storage/storage-vol0  btrfs  defaults,subvol=storage-vol0,noatime  0  2
> 
> Hardware:
> CPU: Intel Xeon X3430 (Quad Core)
> MB: Supermicro X8SI6-F
> RAM: 16GB (4x4GB) Samsung ECC/Unbuffered DDR3 1333MHz CL9 (MEM-DR340L-SL01-EU13)
> HDDs in btrfs RAID10: 8 x Western Digital Se 4TB 64MB 7200RPM SATA
> 6Gb/s (WD4000F9YZ)
> HBAs: LSI SAS 9211-8i, LSI SAS 9201-16i
> 

Ok, could you please capture the dmesg output after a sysrq-w during one
of the stalls during rsync writing?  We want to see all the stack traces
of all the waiting procs.
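
For reference, assuming the magic SysRq interface is enabled, that is
roughly:

  echo 1 > /proc/sys/kernel/sysrq   # allow all sysrq functions
  echo w > /proc/sysrq-trigger      # dump blocked (uninterruptible) tasks
  dmesg > sysrq-w.txt               # capture the resulting stack traces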

Defrag is a slightly different use case, so I want to address that
separately.

-chris



* Re: Blocket for more than 120 seconds
  2013-12-16 18:33                     ` Chris Mason
@ 2013-12-16 18:41                       ` Hans-Kristian Bakke
  0 siblings, 0 replies; 23+ messages in thread
From: Hans-Kristian Bakke @ 2013-12-16 18:41 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-btrfs

No problem. You have to wait a bit though, as the volume is currently
going through a reduction in the number of drives from 8 to 7 and I do
not feel comfortable stalling the volume while that is happening. I
will report back with the logs later on.

Regards,

Hans-Kristian Bakke


On 16 December 2013 19:33, Chris Mason <clm@fb.com> wrote:
> On Mon, 2013-12-16 at 19:22 +0100, Hans-Kristian Bakke wrote:
>> I have explicitly set compress=lzo, and later added noatime just to
>> test; otherwise it's just the default 3.12.4 options (or 3.13-rc2 when
>> I tested that).
>>
>> To make sure, here are my btrfs mounts from /proc/mounts:
>> /dev/sdl /btrfs btrfs rw,noatime,compress=lzo,space_cache 0 0
>> /dev/sdl /storage/storage-vol0 btrfs rw,noatime,compress=lzo,space_cache 0 0
>>
>> /etc/fstab:
>> UUID=9302fc8f-15c6-46e9-9217-951d7423927c  /btrfs                 btrfs  defaults,compress=lzo,noatime         0  2
>> UUID=9302fc8f-15c6-46e9-9217-951d7423927c  /storage/storage-vol0  btrfs  defaults,subvol=storage-vol0,noatime  0  2
>>
>> Hardware:
>> CPU: Intel Xeon X3430 (Quad Core)
>> MB: Supermicro X8SI6-F
>> RAM: 16GB (4x4GB) Samsung ECC/Unbuffered DDR3 1333MHz CL9 (MEM-DR340L-SL01-EU13)
>> HDDs in btrfs RAID10: 8 x Western Digital Se 4TB 64MB 7200RPM SATA
>> 6Gb/s (WD4000F9YZ)
>> HBAs: LSI SAS 9211-8i, LSI SAS 9201-16i
>>
>
> Ok, could you please capture the dmesg output after a sysrq-w during one
> of the stalls during rsync writing?  We want to see all the stack traces
> of all the waiting procs.
>
> Defrag is a slightly different use case, so I want to address that
> separately.
>
> -chris
>


end of thread

Thread overview: 23+ messages
2013-12-14 20:30 Blocket for more than 120 seconds Hans-Kristian Bakke
2013-12-14 21:35 ` Chris Murphy
2013-12-14 23:19   ` Hans-Kristian Bakke
2013-12-14 23:50     ` Chris Murphy
2013-12-15  0:28       ` Hans-Kristian Bakke
2013-12-15  1:59         ` Chris Murphy
2013-12-15  2:35           ` Hans-Kristian Bakke
2013-12-15 13:24             ` Duncan
2013-12-15 14:51               ` Hans-Kristian Bakke
2013-12-15 23:08                 ` Duncan
2013-12-16  0:06                   ` Hans-Kristian Bakke
2013-12-16 10:19                     ` Duncan
2013-12-16 10:55                       ` Hans-Kristian Bakke
2013-12-16 15:00                         ` Duncan
2013-12-16 15:18             ` Chris Mason
2013-12-16 16:32               ` Hans-Kristian Bakke
2013-12-16 18:16                 ` Chris Mason
2013-12-16 18:22                   ` Hans-Kristian Bakke
2013-12-16 18:33                     ` Chris Mason
2013-12-16 18:41                       ` Hans-Kristian Bakke
2013-12-15  3:47         ` George Mitchell
2013-12-15 23:39       ` Charles Cazabon
2013-12-16  0:16         ` Hans-Kristian Bakke
