* Poor performance of btrfs. Suspected unidentified btrfs housekeeping process which writes a lot
@ 2013-01-30 14:57 Adam Ryczkowski
  2013-01-30 23:58 ` Chris Murphy
  0 siblings, 1 reply; 10+ messages in thread
From: Adam Ryczkowski @ 2013-01-30 14:57 UTC (permalink / raw)
  To: linux-btrfs

Welcome,

I've been using btrfs for over 3 months to store my personal data on 
my NAS server. Almost all interactions with files on the server are done 
using the unison synchronizer. After another run of bedup 
(https://github.com/g2p/bedup) on my btrfs volume I experienced a huge 
performance loss with synchronization. It now takes over 3 hours for 
what used to take only 15 minutes! File browsing is not affected, but it 
takes forever to read the contents of the files!

When I use `iotop -o -d 30` (which measures I/O activity for 30-second 
interval) I can see:

Total DISK READ:      98.66 K/s | Total DISK WRITE:     826.55 K/s
   TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO> COMMAND
  4296 be/4 root        3.99 K/s  408.59 K/s  0.00 % 98.64 % 
[btrfs-transacti]
  6407 be/4 adam       94.14 K/s    0.00 B/s  0.00 % 85.24 % unison -server
   311 be/4 root        0.00 B/s    0.00 B/s  0.00 % 58.20 % [md1_raid6]
   354 be/3 root        0.00 B/s    2.26 K/s  0.00 % 24.29 % [jbd2/md0-8]
   306 be/4 root        0.00 B/s    0.00 B/s  0.00 %  4.79 % [md0_raid1]
  1229 be/4 syslog      0.00 B/s  136.15 B/s  0.00 %  0.00 % rsyslogd -c5
  1744 be/4 root        0.00 B/s  136.15 B/s  0.00 %  0.00 % 
console-kit-daemon --no-daemon

I expect no writes at all, since the statistics were taken during the 
"Looking for changes" phase. Normally, the `unison -server` process should 
read from disk at at least 5 MB/s. (The block device the btrfs is built 
on has a measured sequential throughput of 50 MB/s.)

When I pause the `unison -server` process (with htop), the disk activity 
persists for another 5-30 seconds, so I infer that btrfs is doing some 
housekeeping work, and this is the reason I decided to post to this 
list. I suspect that this housekeeping work has a time granularity of 
5-30 seconds, and that during this time access to the filesystem is 
delayed. The problem is not specific to unison; the background activity 
is triggered by merely reading file contents. Once the system is through 
and the file has been read, all subsequent attempts to read it are fine, 
even if I drop the cache (i.e. echo 3 > /proc/sys/vm/drop_caches). But 
after a while (after a reboot) the performance hit recurs.

The questions are:
1. What sort of work is btrfs doing? What is it writing (and why is it 
writing 100x more bytes than it reads)?
2. Why does it take so long?
3. What can I do to speed up the process?
4. What can I do to prevent it from happening again?

Here are details about my system that might help with the diagnosis. If 
they are not enough, I can provide more.

I suspect it has something to do with snapshots I make for backup. I 
have 35 of them, and I ask bedup to find duplicates across all 
subvolumes. But on the other hand that is supposed to work since kernel 
3.5, and the filesystem has never seen a kernel older than 3.6.

My filesystem /dev/vg-adama-docs/lv-adama-docs is 372GB in size, and is 
quite a complex setup:
It is based on a logical volume (LVM2), which has a single physical 
volume made from the dm-crypt device /dev/dm-1, which in turn sits on 
top of /dev/md1, a Linux raid 6, built from 4 identical 186GB GPT 
partitions, one on each of my 3TB SATA hard drives.
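Roughly, the stack was assembled along these lines (the device names and 
options below are illustrative, reconstructed from memory, not the exact 
commands I ran):

   mdadm --create /dev/md1 --level=6 --raid-devices=4 \
         /dev/sda3 /dev/sdb3 /dev/sdc3 /dev/sdd3    # the 186GB GPT partitions
   cryptsetup luksFormat /dev/md1
   cryptsetup luksOpen /dev/md1 md1-crypt           # shows up as /dev/dm-1
   pvcreate /dev/mapper/md1-crypt
   vgcreate vg-adama-docs /dev/mapper/md1-crypt
   lvcreate -l 100%FREE -n lv-adama-docs vg-adama-docs
   mkfs.btrfs /dev/vg-adama-docs/lv-adama-docs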

There are 272k files on the filesystem (excluding the 35 snapshots), 
23k folders, and 104 GB of data.
$ df /mnt/adama-docs -h
Filesystem                                   Size  Used Avail Use% 
Mounted on
/dev/mapper/vg--adama--docs-lv--adama--docs  373G   85G  288G  23% 
/mnt/adama-docs

I have always used the latest kernel (it's 3.7.1-030701-generic at the 
moment) on my Ubuntu Quantal server.

-- 

Adam Ryczkowski
Skype: sisteczko



* Re: Poor performance of btrfs. Suspected unidentified btrfs housekeeping process which writes a lot
  2013-01-30 14:57 Poor performance of btrfs. Suspected unidentified btrfs housekeeping process which writes a lot Adam Ryczkowski
@ 2013-01-30 23:58 ` Chris Murphy
  2013-01-31  1:02   ` Adam Ryczkowski
  0 siblings, 1 reply; 10+ messages in thread
From: Chris Murphy @ 2013-01-30 23:58 UTC (permalink / raw)
  To: Adam Ryczkowski; +Cc: linux-btrfs


On Jan 30, 2013, at 7:57 AM, Adam Ryczkowski <adam.ryczkowski@statystyka.net> wrote:
> 
> I suspect it has something to do with snapshots I make for backup. I have 35 of them, and I ask bedup to find duplicates across all subvolumes.


Assuming most files do have identical duplicates, implies the same file in all 35 subvolumes is actually in the same physical location; it differs only in subvol reference. But it's not btrfs that determines the "duplicate" vs "unique" state of those 35 file instances, but unison. The fs still must send all 35x instances for the state to be determined, as if they were unique files.

Another thing, I'd expect this to scale very poorly if the 35 subvolumes contain any appreciable uniqueness, because searches can't be done in parallel. So the more subvolumes you add, the more disk contention you get, but also enormous amounts of latency as possibly 35 locations on the disk are being searched if they happen to be unique.

So in either case "duplicate" vs "unique" you have a problem, just different kinds. And as the storage grows, it increasingly encounters both problems at the same time. Small problem. What size are the files?

And that's on a bare drive before you went and did this:

> My filesystem /dev/vg-adama-docs/lv-adama-docs is 372GB in size, and is quite a complex setup:
> It is based on a logical volume (LVM2), which has a single physical volume made from the dm-crypt device /dev/dm-1, which in turn sits on top of /dev/md1, a Linux raid 6, built from 4 identical 186GB GPT partitions, one on each of my 3TB SATA hard drives.

Why are you using raid6 for four disks, instead of raid10?  What's the chunk size for the raid 6? What's the btrfs leaf size? What's the dedup chunk size?

Why are you using LVM at all, while the /dev/dm-1 is the same size as the LV? You say the btrfs volume on LV is on dm-1 which means they're all the same size, obviating the need for LVM in this case entirely.

Chris Murphy



* Re: Poor performance of btrfs. Suspected unidentified btrfs housekeeping process which writes a lot
  2013-01-30 23:58 ` Chris Murphy
@ 2013-01-31  1:02   ` Adam Ryczkowski
  2013-01-31  1:50     ` Chris Murphy
  0 siblings, 1 reply; 10+ messages in thread
From: Adam Ryczkowski @ 2013-01-31  1:02 UTC (permalink / raw)
  To: Chris Murphy; +Cc: linux-btrfs

Thank you, Chris, for your time.


On 2013-01-31 00:58, Chris Murphy wrote:
> On Jan 30, 2013, at 7:57 AM, Adam Ryczkowski<adam.ryczkowski@statystyka.net>  wrote:
>> I suspect it has something to do with snapshots I make for backup. I have 35 of them, and I ask bedup to find duplicates across all subvolumes.
> Assuming most files do have identical duplicates, implies the same file in all 35 subvolumes is actually in the same physical location; it differs only in subvol reference. But it's not btrfs that determines the "duplicate" vs "unique" state of those 35 file instances, but unison. The fs still must send all 35x instances for the state to be determined, as if they were unique files.
I'm sorry if I didn't put my question clearly enough. I tried to say 
that the problem is not specific to unison; I am able to reproduce the 
problem using other means of reading file contents. I tried to 'cat' 
many small files, and to preview some large ones under Midnight 
Commander. I didn't take precise measurements, but I can tell that 
reading 500 50-byte files (ca. 25kB of data) took way longer than 
reading one 3MB file, so I suspect the problem is with metadata access 
times rather than with data.

I am aware that reading 1MB spread over small files takes longer than 
reading 1MB sequentially. The problem is that _suddenly_ these reads 
became at least 20 times slower than usual. And from what iotop and 
sysstat told me, the hard drives were busy _writing_ something, not 
_reading_! The time I wait for unison to scan the whole filesystem is 
comparable with the time a full balance takes.

Anyway, I synchronize only the "working copy" part of my file system. 
All the backup subvolumes sit in a separate path, not seen by unison.
Moreover, once I wait long enough for the system to finish scanning the 
file system, file access speeds are back to normal, even after I drop 
the read cache or even reboot the system. It is only after making 
another snapshot that the problem recurs.
> Another thing, I'd expect this to scale very poorly if the 35 subvolumes contain any appreciable uniqueness, because searches can't be done in parallel. So the more subvolumes you add, the more disk contention you get, but also enormous amounts of latency as possibly 35 locations on the disk are being searched if they happen to be unique.

*The severity of my problem is proportional to time*. It happens 
immediately after making a snapshot, and persists for each file until I 
try to read its contents. Then, even after a reboot, timing is back to 
normal. With my limited knowledge of the internals of btrfs I suspect 
that bedup has messed up my metadata somehow. Maybe I should balance 
only the metadata part (if that is possible at all)?
> So in either case "duplicate" vs "unique" you have a problem, just different kinds. And as the storage grows, it increasingly encounters both problems at the same time. Small problem. What size are the files?
>
> And that's on a bare drive before you went and did this:
>
>> My filesystem /dev/vg-adama-docs/lv-adama-docs is 372GB in size, and is quite a complex setup:
>> It is based on a logical volume (LVM2), which has a single physical volume made from the dm-crypt device /dev/dm-1, which in turn sits on top of /dev/md1, a Linux raid 6, built from 4 identical 186GB GPT partitions, one on each of my 3TB SATA hard drives.
> Why are you using raid6 for four disks, instead of raid10?
Because I plan to add another 4 in the future. It's way easier to add 
another disk to the array, than to change the RAID layout.
> What's the chunk size for the raid 6? What's the btrfs leaf size? What's the dedup chunk size?
I'll tell you tomorrow, but I hardly think that misalignment could be 
the problem here. As I said, everything was fine, and the problem did 
not appear gradually.
> Why are you using LVM at all, while the /dev/dm-1 is the same size as the LV? You say the btrfs volume on LV is on dm-1 which means they're all the same size, obviating the need for LVM in this case entirely.
Yes, I agree that at the moment I don't need it. But when the partition 
sits on a logical volume I keep the option to extend the filesystem when 
the need comes.
My current needs are more complex; I don't keep all the data at the same 
redundancy and security level. It is also hard to tell in advance the 
relative sizes of each combination of redundancy and security levels. So 
I allocate only as much space on the GPT partitions as I immediately 
need, and in the future, when the need comes, I can relatively easily 
make more partitions, arrange them in the appropriate raid/dm-crypt 
combination, and expand the filesystem that ran out of space.

I am aware that this setup is very complex. I can say that my 
application is not life-critical, and this complexity has served me well 
on another Linux server, which I have been using for over 5 years 
(without btrfs, of course).


> Chris Murphy
>


-- 

Adam Ryczkowski
+48505919892
Skype: sisteczko



* Re: Poor performance of btrfs. Suspected unidentified btrfs housekeeping process which writes a lot
  2013-01-31  1:02   ` Adam Ryczkowski
@ 2013-01-31  1:50     ` Chris Murphy
  2013-01-31 10:56       ` Adam Ryczkowski
  0 siblings, 1 reply; 10+ messages in thread
From: Chris Murphy @ 2013-01-31  1:50 UTC (permalink / raw)
  To: Adam Ryczkowski; +Cc: linux-btrfs


On Jan 30, 2013, at 6:02 PM, Adam Ryczkowski <adam.ryczkowski@statystyka.net> wrote:

>  I didn't take precise measurements, but I can tell that reading 500 50-byte files (ca. 25kB of data) took way longer than reading one 3MB file, so I suspect the problem is with metadata access times rather than with data.

For 50 byte files, btrfs writes the data with metadata. Depending on their location relative to each other, this could mean 250MB of reads because of the large raid6 chunk size, yet only ~ 2MB is needed by btrfs.
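(Back-of-the-envelope, assuming one 512KB md chunk has to be read per scattered file: 500 files x 512KB is roughly 250MB off the disk, whereas the metadata btrfs actually wants is on the order of 500 x 4KB leaves, i.e. about 2MB.)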


> I am aware that reading 1MB spread over small files takes longer than reading 1MB sequentially. The problem is that _suddenly_ these reads became at least 20 times slower than usual.

How does dedup work on 50 byte files? How does it contribute to fragmentation? And then how does that fragmentation turn into gross read inefficiencies at the md chunk level?


> And from what iotop and sysstat told me, the hard drives were busy _writing_ something, not _reading_!

Seems like you need to find out what's being written, how many and how big the requests are. Small writes mean a huge RMW penalty on raid6, especially a 4 disk raid 6 where you're practically guaranteed to have either data or metadata request halted for a parity rewrite.
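Something along these lines should give a first approximation of the 
request counts and sizes on the md device (assuming the sysstat and 
blktrace packages are installed; treat the exact flags as illustrative):

   iostat -x 5                                  # watch w/s and avgrq-sz on the md1/dm lines
   blktrace -d /dev/md1 -o - | blkparse -i -    # per-request detail, much more verbose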

> 
> Anyway, I synchronize only the "working copy" part of my file system. All the backup subvolumes sit in a separate path, not seen by unison.

You're syncing what to what, in physical terms? I know one of the whats is a btrfs volume on top of LVM, on top of LUKS, on top of md raid6, on top of partitions located on four 3TB drives. You said there are other partitions on these drives, so are there other reads/writes occurring on those drives at the same time? It doesn't look like that's the case from iotop; the md0 array shows essentially no activity.


> Moreover, once I wait long enough for the system to finish scanning the file system, file access speeds are back to normal, even after I drop the read cache or even reboot the system. It is only after making another snapshot that the problem recurs.
>> Another thing, I'd expect this to scale very poorly if the 35 subvolumes contain any appreciable uniqueness, because searches can't be done in parallel. So the more subvolumes you add, the more disk contention you get, but also enormous amounts of latency as possibly 35 locations on the disk are being searched if they happen to be unique.
> 
> *The severity of my problem is proportional to time*. It happens immediately after making a snapshot, and persists for each file until I try to read its contents. Then, even after a reboot, timing is back to normal. With my limited knowledge of the internals of btrfs I suspect that bedup has messed up my metadata somehow. Maybe I should balance only the metadata part (if that is possible at all)?

It's possible to balance just metadata chunks. But I think this is a spaghetti on the wall approach, rather than understanding how all of these layers are interacting with each other.
https://btrfs.wiki.kernel.org/index.php/Balance_Filters
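If you do want to try it, it would be something along these lines 
(illustrative, using the mount point from your df output):

   btrfs balance start -m /mnt/adama-docs

That should rewrite only the metadata block groups, but again, treat it 
as an experiment rather than a fix.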

>>> 
>> Why are you using raid6 for four disks, instead of raid10?
> Because I plan to add another 4 in the future. It's way easier to add another disk to the array, than to change the RAID layout.

Perhaps, if that is happening imminently. In the meantime, you have a terribly inefficient raid setup.

>> What's the chunk size for the raid 6? What's the btrfs leaf size? What's the dedup chunk size?
> I'll tell you tomorrow, but I hardly think that misalignment could be the problem here. As I said, everything was fine, and the problem did not appear gradually.

It also depends on what mysterious stuff is being written during what's ostensibly a read only event.


>> Why are you using LVM at all, while the /dev/dm-1 is the same size as the LV? You say the btrfs volume on LV is on dm-1 which means they're all the same size, obviating the need for LVM in this case entirely.
> Yes, I agree that at the moment I don't need it. But when the partition sits on a logical volume I keep the option to extend the filesystem when the need comes.

This is not an ideal way to extend a btrfs file system however. You're adding unnecessary layers and complexity while also not taking advantage of what LVM can do that btrfs cannot when it comes to logical volume management.


> My current needs are more complex; I don't keep all the data at the same redundancy and security level. It is also hard to tell in advance the relative sizes of each combination of redundancy and security levels. So I allocate only as much space on the GPT partitions as I immediately need, and in the future, when the need comes, I can relatively easily make more partitions, arrange them in the appropriate raid/dm-crypt combination, and expand the filesystem that ran out of space.

It sounds unnecessarily complex, but what do I know. Hopefully you have everything backed up to something that is comparatively simple. There are more failure points here than I can count.

> 
> I am aware that this setup is very complex. I can say that my application is not life-critical, and this complexity has served me well on another Linux server, which I have been using for over 5 years (without btrfs, of course).

Well, btrfs plus dedup adds a lot. And if the problem is disk contention, you may find drive heads dying a lot sooner than you'd otherwise expect.

When this problem is happening, with the low bandwidth writing, can you hear disk chatter? On all of the drives at the same time or just one or two at a time?


Chris Murphy


* Re: Poor performance of btrfs. Suspected unidentified btrfs housekeeping process which writes a lot
  2013-01-31  1:50     ` Chris Murphy
@ 2013-01-31 10:56       ` Adam Ryczkowski
  2013-01-31 19:08         ` Chris Murphy
  0 siblings, 1 reply; 10+ messages in thread
From: Adam Ryczkowski @ 2013-01-31 10:56 UTC (permalink / raw)
  To: linux-btrfs

My original problem got solved, but your answer has a set of interesting 
performance hints, and I am very grateful for your input. Here are my 
answers, and further questions if you are willing to continue this topic.


On 2013-01-31 02:50, Chris Murphy wrote:
> On Jan 30, 2013, at 6:02 PM, Adam Ryczkowski <adam.ryczkowski@statystyka.net> wrote:
>
>>   I didn't take precise measurements, but I can tell that reading 500 50-byte files (ca. 25kB of data) took way longer than reading one 3MB file, so I suspect the problem is with metadata access times rather than with data.
> For 50 byte files, btrfs writes the data with metadata. Depending on their location relative to each other, this could mean 250MB of reads because of the large raid6 chunk size, yet only ~ 2MB is needed by btrfs.
Yes, good point. I never stated that my setup gives me the best I can 
get from my hardware.
>> I am aware that reading 1MB spread over small files takes longer than reading 1MB sequentially. The problem is that _suddenly_ these reads became at least 20 times slower than usual.
> How does dedup work on 50 byte files? How does it contribute to fragmentation? And then how does that fragmentation turn into gross read inefficiencies at the md chunk level?
I really don't know. It would be interesting to know, though. But 
whatever the results are, in the current state of affairs defrag would 
ruin all the benefits of bedup, so even if the filesystem gets 
fragmented, I can do nothing about it.
>> And from what iotop and sysstat told me, the hard drives were busy _writing_ something, not _reading_!
> Seems like you need to find out what's being written, how many and how big the requests are. Small writes mean a huge RMW penalty on raid6, especially a 4 disk raid 6 where you're practically guaranteed to have either data or metadata request halted for a parity rewrite.
Yes, you are right. It is an important contributing factor to why the 
relatime mount option killed my performance so badly.
>> Anyway, I synchronize only the "working copy" part of my file system. All the backup subvolumes sit in a separate path, not seen by unison.
> You're syncing what to what, in physical terms? I know one of the whats is a btrfs volume on top of LVM, on top of LUKS, on top of md raid6, on top of partitions located on four 3TB drives. You said there are other partitions on these drives, so are there other reads/writes occurring on those drives at the same time? It doesn't look like that's the case from iotop; the md0 array shows essentially no activity.
No, I synchronize across the network with my desktop machines and a 
backup file server :-). But even if I didn't, unison is kind enough to 
detect a local sync and perform the transfers in sequence (not 
asynchronously).
>>> What's the chunk size for the raid 6? What's the btrfs leaf size? What's the dedup chunk size?
>> I'll tell you tomorrow, but I hardly think that misalignment could be the problem here. As I said, everything was fine, and the problem did not appear gradually.
> It also depends on what mysterious stuff is being written during what's ostensibly a read only event.
The dedup chunk size isn't clearly stated, but from the README I infer 
that it deduplicates files as a whole; here is an excerpt from the 
README (https://github.com/g2p/bedup/blob/master/README.rst):
> Deduplication is implemented using a Btrfs feature that allows for 
> cloning data from one file to the other. The cloned ranges become 
> shared on disk, saving space.

This is a summary of the granularity of the allocation units in the 
storage hierarchy:
On mdadm I have a chunk size of 512K,
the dm-crypt volume uses 512-byte sectors,
and all LVM physical volumes have a PE size of 4MiB, but that shouldn't 
affect efficiency.
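For reference, this is roughly how I checked each layer (the dm-crypt 
sector size I simply took to be 512 bytes, as I am not aware of any 
other option):

   mdadm --detail /dev/md1 | grep -i chunk      # md chunk size
   pvdisplay | grep -i 'PE Size'                # LVM physical extent size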

I couldn't find any command that tells me the leaf size of an already 
created btrfs filesystem. Maybe you can tell me?

I will also check whether there is an alignment problem as well. When I 
was reading the manual for each of the layers I came to the conclusion 
that each layer is supposed to align to the underlying one 
automatically, but I will try to check it.
>>> Why are you using LVM at all, while the /dev/dm-1 is the same size as the LV? You say the btrfs volume on LV is on dm-1 which means they're all the same size, obviating the need for LVM in this case entirely.
>> Yes, I agree that at the moment I don't need it. But when the partition sits on a logical volume I keep the option to extend the filesystem when the need comes.
> This is not an ideal way to extend a btrfs file system however. You're adding unnecessary layers and complexity while also not taking advantage of what LVM can do that btrfs cannot when it comes to logical volume management.
Can you tell me more? Because I have only learned that btrfs 
multi-device support cannot join two volumes without striping. And 
striping in this case is equivalent to fragmentation, which we want to 
avoid. In contrast, LVM can concatenate the underlying storage together 
without striping.

-- 

Adam Ryczkowski
www.statystyka.net
+48505919892
Skype: sisteczko
Current calendar 
<https://www.google.com/calendar/b/0/embed?src=adam.ryczkowski@statystyka.net&ctz&gsessionid=OK>



* Re: Poor performance of btrfs. Suspected unidentified btrfs housekeeping process which writes a lot
  2013-01-31 10:56       ` Adam Ryczkowski
@ 2013-01-31 19:08         ` Chris Murphy
  2013-01-31 19:17           ` Adam Ryczkowski
  0 siblings, 1 reply; 10+ messages in thread
From: Chris Murphy @ 2013-01-31 19:08 UTC (permalink / raw)
  To: Adam Ryczkowski; +Cc: linux-btrfs


On Jan 31, 2013, at 2:45 AM, Adam Ryczkowski <adam.ryczkowski@statystyka.net> wrote:
>> 
> Yes, you are right. It is an important contributing factor to why the relatime mount option killed my performance so badly.

So is this what was causing the problem?

>> 
> The dedup chunk size isn't clearly stated, but from the README I infer that it deduplicates files as a whole; here is an excerpt from the README (https://github.com/g2p/bedup/blob/master/README.rst)

I wouldn't expect reading file metadata, 

> This is a summary of the granularity of the allocation units in the storage hierarchy:
> On mdadm I have a chunk size of 512K,

It's quite large for your use case. It's large for most any use case, actually.

> I couldn't find any command that tells me the leaf size of an already created btrfs filesystem. Maybe you can tell me?

I don't know that it's easily determined after mkfs time; someone else can maybe answer. The default is 4KB. Otherwise you use flags to set it.
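At mkfs time the relevant flags would be -l (leafsize) and -n 
(nodesize), e.g. something like:

   mkfs.btrfs -l 16384 -n 16384 <device>

(values purely illustrative). For an already-created filesystem, if your 
btrfs-progs ships btrfs-show-super, that may dump the superblock 
including a leafsize field, but I haven't verified that on your version.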


> I will also check whether there is an alignment problem as well. When I was reading the manual for each of the layers I came to the conclusion that each layer is supposed to align to the underlying one automatically, but I will try to check it.

I'm not thinking of an alignment problem, but a poorly chosen chunk size for the usage. Changing 50 bytes (could be metadata or data) means in your case at least 2MB of RMW with a 512KB chunk. And this gets worse with more disks, because you have more chunks to read. The whole stripe is read, modified, and written on md raid6 currently. You're planning to add four more disks, so that's then 8 disks, and a 4MB full-stripe RMW for 50 bytes of changed data.

Depending on what GPT partitioned these 3TB disks, it's remotely possible they aren't aligned to 4K sectors however. gdisk should do this correctly by starting the first partition at LBA 2048, and aligning to 16 sector boundaries. parted of recent versions does something similar, but I forget the details. Older versions can misalign by starting at LBA 63, as can other older non-Linux tools. OS X's Disk Utility starts the first partition at LBA 40 which is OK.
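If you want to double-check the alignment, something like this should do 
it (assuming reasonably recent parted and gdisk; repeat per disk):

   parted /dev/sda align-check optimal 1   # reports whether partition 1 is optimally aligned
   gdisk -l /dev/sda                       # start sectors divisible by 8 imply 4K alignment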

> Can you tell me more? Because I have only learned that btrfs multi-device support cannot join two volumes without striping. And striping in this case is equivalent to fragmentation, which we want to avoid. In contrast, LVM can concatenate the underlying storage together without striping.

When you create a btrfs file system, by default the data profile is single, and metadata profile is dup. When you add another device to the volume, it stays this way. The single data profile behaves similar to LVM linear, except btrfs will alternate chunk allocations between devices, so that one isn't just sitting there spinning for a month and not being used at all. 

So it's not striping. But even if it were striping, that would help you on write performance in particular because now it's effectively RAID 60. I don't see why striping is considered fragmentation.

To change the profile for the volume, you use -dconvert and/or -mconvert with a rebalance operation.
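For example (purely illustrative; pick whatever profiles you actually 
want):

   btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/adama-docs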

Chris Murphy


* Re: Poor performance of btrfs. Suspected unidentified btrfs housekeeping process which writes a lot
  2013-01-31 19:08         ` Chris Murphy
@ 2013-01-31 19:17           ` Adam Ryczkowski
  2013-01-31 20:35             ` Chris Murphy
  0 siblings, 1 reply; 10+ messages in thread
From: Adam Ryczkowski @ 2013-01-31 19:17 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Chris Murphy

On 2013-01-31 20:08, Chris Murphy wrote:
> On Jan 31, 2013, at 2:45 AM, Adam Ryczkowski <adam.ryczkowski@statystyka.net> wrote:
>> Yes, you are right. It is an important contributing factor to why the relatime mount option killed my performance so badly.
> So is this what was causing the problem?
Yes.
>> Can you tell me more? Because I have only learned that btrfs multi-device support cannot join two volumes without striping. And striping in this case is equivalent to fragmentation, which we want to avoid. In contrast, LVM can concatenate the underlying storage together without striping.
> When you create a btrfs file system, by default the data profile is single, and metadata profile is dup. When you add another device to the volume, it stays this way. The single data profile behaves similar to LVM linear, except btrfs will alternate chunk allocations between devices, so that one isn't just sitting there spinning for a month and not being used at all.
>
> So it's not striping. But even if it were striping, that would help you on write performance in particular because now it's effectively RAID 60. I don't see why striping is considered fragmentation.
Well, if the devices are on the same physical hard drive, then 
sequential file reading would cause the hard drive heads to seek between 
the first and the other partition on every extent. This is effectively 
equivalent to fragmentation; it is only good if the partitions are on 
separate hard drives.
> To change the profile for the volume, you use -dconvert and/or -mconvert with a rebalance operation.
Once again, thank you very much, Chris.
-- 

Adam Ryczkowski
+48505919892
Skype: sisteczko



* Re: Poor performance of btrfs. Suspected unidentified btrfs housekeeping process which writes a lot
  2013-01-31 19:17           ` Adam Ryczkowski
@ 2013-01-31 20:35             ` Chris Murphy
  0 siblings, 0 replies; 10+ messages in thread
From: Chris Murphy @ 2013-01-31 20:35 UTC (permalink / raw)
  To: Adam Ryczkowski; +Cc: linux-btrfs


On Jan 31, 2013, at 12:17 PM, Adam Ryczkowski <adam.ryczkowski@statystyka.net> wrote:
>>> 
>> When you create a btrfs file system, by default the data profile is single, and metadata profile is dup. When you add another device to the volume, it stays this way. The single data profile behaves similar to LVM linear, except btrfs will alternate chunk allocations between devices, so that one isn't just sitting there spinning for a month and not being used at all.
>> 
>> So it's not striping. But even if it were striping, that would help you on write performance in particular because now it's effectively RAID 60. I don't see why striping is considered fragmentation.
> Well, if the devices are on the same physical hard drive, then sequential file reading would cause the hard drive heads to seek between the first and the other partition on every extent. This is effectively equivalent to fragmentation;

You wouldn't make the volume larger by adding devices in this case regardless of the profile used. You'd first grow the underlying layers. And then resize the file system.

> it is only good if the partitions are on separate hard drives.

Yes obviously. But even better is to not partition your devices at all if you're concerned about efficiency. Just use the whole drive as the device.


Chris Murphy



* Re: Poor performance of btrfs. Suspected unidentified btrfs housekeeping process which writes a lot
  2013-01-31  9:45 ` Adam Ryczkowski
@ 2013-01-31 19:06   ` Gabriel
  0 siblings, 0 replies; 10+ messages in thread
From: Gabriel @ 2013-01-31 19:06 UTC (permalink / raw)
  To: linux-btrfs

Hi,

> After mounting the system with noatime the problem disappeared, like 
> magic.

Incidentally, the current version of bedup uses a private mountpoint with 
noatime whenever you don't give it the path to a mounted volume.  You can 
use it with no arguments or designate a filesystem by its uuid or /dev 
path.
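For example (roughly; check bedup --help for the exact syntax of your 
version):

   bedup dedup                             # no arguments: let bedup find the filesystems
   bedup dedup /dev/mapper/vg--adama--docs-lv--adama--docs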

> All the writes must have come from the delayed metadata copy process. 
> Once all the metadata copy-updates were done, file system speed was back 
> to normal, but once a new day broke, all the copying business needed to 
> be done again... This explains 100% of the odd behavior.
> 
> In particular apparently the problem had nothing to do with my complex 
> block device setup, nor with bedup, nor with unison.
> 
> Thank you again, Andrew!
> 
> P.S. Maybe it is not for me to decide, but this small note about 
> performance (not even labeled as a warning) at 
> https://btrfs.wiki.kernel.org/index.php/Mount_options should IMHO be 
> made more conspicuous, maybe placed where the snapshot mechanism is 
> described, or in the FAQ. I'll try to fix it.



* Re: Poor performance of btrfs. Suspected unidentified btrfs housekeeping process which writes a lot
       [not found] <CAAuLxcbVXFjzvZ+Oj4MUEHnsOhhbVPTeKx-34En2ym37J2wuuA@mail.gmail.com>
@ 2013-01-31  9:45 ` Adam Ryczkowski
  2013-01-31 19:06   ` Gabriel
  0 siblings, 1 reply; 10+ messages in thread
From: Adam Ryczkowski @ 2013-01-31  9:45 UTC (permalink / raw)
  To: andrew.j.wade, linux-btrfs

On 2013-01-31 04:33, Andrew Wade wrote:
> Hi Adam,
>
> Is btrfs mounted relatime? I'm wondering if you're seeing metadata
> writes from atime updates. I've got my filesystem mounted noatime to
> avoid breaking metadata sharing between subvolumes.
>
> Apologies for the broken threading - I'm not subscribed to the list.
>
> regards,
> Andrew

Thank you, thank you!!! Hurray!! That was the problem!! I'm so happy 
you've helped me out!!!

After mounting the system with noatime the problem disappeared, like 
magic.
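For the record, what I did was essentially this (paths are specific to 
my setup, adapt as needed):

   mount -o remount,noatime /mnt/adama-docs

plus the corresponding noatime entry in /etc/fstab, e.g.:

   /dev/mapper/vg--adama--docs-lv--adama--docs /mnt/adama-docs btrfs noatime 0 0

(cat /proc/mounts shows whether relatime or noatime is actually in 
effect.)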

All the writes must have come from the delayed metadata copy process. 
Once all the metadata copy-updates were done, file system speed was back 
to normal, but once a new day broke, all the copying business needed to 
be done again... This explains 100% of the odd behavior.

In particular apparently the problem had nothing to do with my complex 
block device setup, nor with bedup, nor with unison.

Thank you again, Andrew!

P.S. Maybe it is not for me to decide, but this small note about 
performance (not even labeled as a warning) at 
https://btrfs.wiki.kernel.org/index.php/Mount_options should IMHO be 
made more conspicuous, maybe placed where the snapshot mechanism is 
described, or in the FAQ. I'll try to fix it.

-- 

Adam Ryczkowski
+48505919892
Skype: sisteczko



Thread overview: 10+ messages
2013-01-30 14:57 Poor performance of btrfs. Suspected unidentified btrfs housekeeping process which writes a lot Adam Ryczkowski
2013-01-30 23:58 ` Chris Murphy
2013-01-31  1:02   ` Adam Ryczkowski
2013-01-31  1:50     ` Chris Murphy
2013-01-31 10:56       ` Adam Ryczkowski
2013-01-31 19:08         ` Chris Murphy
2013-01-31 19:17           ` Adam Ryczkowski
2013-01-31 20:35             ` Chris Murphy
     [not found] <CAAuLxcbVXFjzvZ+Oj4MUEHnsOhhbVPTeKx-34En2ym37J2wuuA@mail.gmail.com>
2013-01-31  9:45 ` Adam Ryczkowski
2013-01-31 19:06   ` Gabriel
