* btrfs performance, sudden drop to 0 IOPs
@ 2015-02-09 17:26 P. Remek
2015-02-09 19:56 ` Kai Krakow
` (3 more replies)
0 siblings, 4 replies; 20+ messages in thread
From: P. Remek @ 2015-02-09 17:26 UTC (permalink / raw)
To: linux-btrfs
Hello,
I am benchmarking Btrfs and when benchmarking random writes with the fio
utility, I noticed the following two things:
1) On the first run, when the target file doesn't exist yet, performance is
about 8000 IOPs. On the second and every subsequent run, performance goes
up to 70000 IOPs. It's a massive difference. The target file is the one
created during the first run.
2) There are windows during the test where IOPs drop to 0 and stay at 0
for about 10 seconds, then recover, and after a couple of seconds drop to
0 again. This is reproducible 100% of the time.
Can somebody shed some light on what's happening?
Command: fio --randrepeat=1 --ioengine=libaio --direct=1
--gtod_reduce=1 --name=test9 --filename=test9 --bs=4k --iodepth=256
--size=10G --numjobs=1 --readwrite=randwrite
Environment:
CPU: dual socket: E5-2630 v2
RAM: 32 GB ram
OS: Ubuntu server 14.10
Kernel: 3.19.0-031900rc2-generic
btrfs tools: Btrfs v3.14.1
2x LSI 9300 HBAs - SAS3 12 Gb/s
8x SSD Ultrastar SSD1600MM 400GB SAS3 12 Gb/s
Regards,
Premek
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: btrfs performance, sudden drop to 0 IOPs
2015-02-09 17:26 btrfs performance, sudden drop to 0 IOPs P. Remek
@ 2015-02-09 19:56 ` Kai Krakow
2015-02-09 22:21 ` P. Remek
2015-02-10 4:42 ` Duncan
` (2 subsequent siblings)
3 siblings, 1 reply; 20+ messages in thread
From: Kai Krakow @ 2015-02-09 19:56 UTC (permalink / raw)
To: linux-btrfs
P. Remek <p.remek1@googlemail.com> schrieb:
> Hello,
>
> I am benchmarking Btrfs and when benchmarking random writes with the fio
> utility, I noticed the following two things:
>
> 1) On the first run, when the target file doesn't exist yet, performance
> is about 8000 IOPs. On the second and every subsequent run, performance
> goes up to 70000 IOPs. It's a massive difference. The target file is the
> one created during the first run.
>
> 2) There are windows during the test where IOPs drop to 0 and stay at 0
> for about 10 seconds, then recover, and after a couple of seconds drop
> to 0 again. This is reproducible 100% of the time.
>
> Can somebody shed some light on what's happening?
I'm not an expert or dev but it's probably due to btrfs doing some
housekeeping under the hood. Could you check the output of "btrfs filesystem
usage /mountpoint" while running the test? I'd guess there's some pressure
on the global reserve during those times.
> Command: fio --randrepeat=1 --ioengine=libaio --direct=1
> --gtod_reduce=1 --name=test9 --filename=test9 --bs=4k --iodepth=256
> --size=10G --numjobs=1 --readwrite=randwrite
>
> Environment:
> CPU: dual socket: E5-2630 v2
> RAM: 32 GB ram
> OS: Ubuntu server 14.10
> Kernel: 3.19.0-031900rc2-generic
> btrfs tools: Btrfs v3.14.1
> 2x LSI 9300 HBAs - SAS3 12 Gb/s
> 8x SSD Ultrastar SSD1600MM 400GB SAS3 12 Gb/s
>
> Regards,
> Premek
--
Replies to list only preferred.
* Re: btrfs performance, sudden drop to 0 IOPs
2015-02-09 19:56 ` Kai Krakow
@ 2015-02-09 22:21 ` P. Remek
2015-02-10 6:58 ` Kai Krakow
0 siblings, 1 reply; 20+ messages in thread
From: P. Remek @ 2015-02-09 22:21 UTC (permalink / raw)
To: Kai Krakow; +Cc: linux-btrfs
Not sure if it helps, but here it is:
root@lab1:/mnt/vol1# btrfs filesystem df /mnt/vol1/
Data, RAID10: total=116.00GiB, used=110.03GiB
Data, single: total=8.00MiB, used=0.00
System, RAID1: total=8.00MiB, used=16.00KiB
System, single: total=4.00MiB, used=0.00
Metadata, RAID1: total=2.00GiB, used=563.72MiB
Metadata, single: total=8.00MiB, used=0.00
unknown, single: total=192.00MiB, used=0.00
On Mon, Feb 9, 2015 at 8:56 PM, Kai Krakow <hurikhan77@gmail.com> wrote:
> P. Remek <p.remek1@googlemail.com> schrieb:
>
>> Hello,
>>
>> I am benchmarking Btrfs and when benchmarking random writes with the fio
>> utility, I noticed the following two things:
>>
>> 1) On the first run, when the target file doesn't exist yet, performance
>> is about 8000 IOPs. On the second and every subsequent run, performance
>> goes up to 70000 IOPs. It's a massive difference. The target file is the
>> one created during the first run.
>>
>> 2) There are windows during the test where IOPs drop to 0 and stay at 0
>> for about 10 seconds, then recover, and after a couple of seconds drop
>> to 0 again. This is reproducible 100% of the time.
>>
>> Can somebody shed some light on what's happening?
>
> I'm not an expert or dev but it's probably due to btrfs doing some
> housekeeping under the hood. Could you check the output of "btrfs filesystem
> usage /mountpoint" while running the test? I'd guess there's some pressure
> on the global reserve during those times.
>
>> Command: fio --randrepeat=1 --ioengine=libaio --direct=1
>> --gtod_reduce=1 --name=test9 --filename=test9 --bs=4k --iodepth=256
>> --size=10G --numjobs=1 --readwrite=randwrite
>>
>> Environment:
>> CPU: dual socket: E5-2630 v2
>> RAM: 32 GB ram
>> OS: Ubuntu server 14.10
>> Kernel: 3.19.0-031900rc2-generic
>> btrfs tools: Btrfs v3.14.1
>> 2x LSI 9300 HBAs - SAS3 12 Gb/s
>> 8x SSD Ultrastar SSD1600MM 400GB SAS3 12 Gb/s
>>
>> Regards,
>> Premek
>
> --
> Replies to list only preferred.
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: btrfs performance, sudden drop to 0 IOPs
2015-02-09 17:26 btrfs performance, sudden drop to 0 IOPs P. Remek
2015-02-09 19:56 ` Kai Krakow
@ 2015-02-10 4:42 ` Duncan
2015-02-10 17:44 ` P. Remek
2015-02-11 12:40 ` Austin S Hemmelgarn
2015-02-12 4:59 ` Liu Bo
3 siblings, 1 reply; 20+ messages in thread
From: Duncan @ 2015-02-10 4:42 UTC (permalink / raw)
To: linux-btrfs
P. Remek posted on Mon, 09 Feb 2015 18:26:49 +0100 as excerpted:
> Hello,
>
> I am benchmarking Btrfs and when benchmarking random writes with the fio
> utility, I noticed the following two things:
>
> 1) On the first run, when the target file doesn't exist yet, performance
> is about 8000 IOPs. On the second and every subsequent run, performance
> goes up to 70000 IOPs. It's a massive difference. The target file is the
> one created during the first run.
You say a file size of 10 GiB with a block size of 4 KiB, but don't say
whether you're using the autodefrag mount option, or whether you had set
nocow on the file at creation (generally done by setting it on the
directory, so new files inherit the option, chattr +C).
What I /suspect/ is happening, is that at the 10 GiB file size, on
original file creation, btrfs is creating a large file of several
comparatively large extents (possibly 1 GiB each, the nominal data chunk
size, tho it can be larger on large enough filesystems). Note that btrfs
will normally wait to sync, accumulating further writes into the file
before actually writing it. By default it's 30 seconds, but there's a
mount option to change that. So btrfs is probably waiting, then writing
out all changes for the last 30 seconds at once, allowing it to use
fairly large extents when it does so.
Then when the file already exists, keeping in mind that btrfs is COW
(copy-on-write) and that by default it keeps two copies of metadata (dup
on a single device, or one each on two separate devices, on a multi-
device filesystem), one copy of data (single on a single device, I
believe raid0 on multi-device), it's having to COW individual 4K blocks
within the file as they are rewritten.
This is going to massively fragment the file, driving up IOPs
tremendously. On top of that, each time a data fragment is written,
there's going to be two metadata updates due to the dup/raid1 metadata
default, and while they won't be updated immediately, every commit (30
seconds), those metadata changes are going to replicate up the metadata
tree to its root.
So instead of having a few orderly GiB-ish size extents written, along
with their metadata, as at file-create, now you're writing a new extent
for each changed 4 KiB block, plus 2X metadata updates for each one, plus
every commit, the updated metadata chain up to the root.
Those 70K IOPs are all the extra work the filesystem is doing in order
to track those 4 KiB COWed writes!
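Duncan's fragmentation theory can be checked directly with filefrag (from e2fsprogs); the path below assumes the test file from the fio command earlier in the thread:

```shell
# Extent count right after the first fio run (file creation):
sync
filefrag /mnt/vol1/test9

# Rerun the random-rewrite job, then count extents again:
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 \
    --name=test9 --filename=/mnt/vol1/test9 --bs=4k --iodepth=256 \
    --size=10G --numjobs=1 --readwrite=randwrite
sync
filefrag /mnt/vol1/test9
```

If the COW-fragmentation explanation holds, the second count should be orders of magnitude higher than the first; on a nocow file it should stay roughly flat.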
The autodefrag option will likely increase this even further, as it
doesn't prevent the COWs, but instead, queues up any files it detects as
fragmented, for later cleanup via autodefrag worker thread. This is one
reason this option isn't recommended for large (say quarter to half-gig-
plus) heavy-internal-rewrite-pattern use-cases (typically VM images or
large database files), tho it works quite well for files up to a couple
hundred MiB or so (typical of firefox sqlite database files, etc), since
those get rewritten pretty fast.
The nocow file attribute can be used on these larger files, but it does
have additional implications. Nocow turns off btrfs compression for that
file, if you had it enabled (mount option), and also turns off
checksumming. Turning off checksumming means btrfs will no longer detect
file corruption, but many databases and vm tools have their own
corruption detection and possibly correction schemes already, since they
use them on filesystems such as ext* that don't have builtin
checksumming, so turning off the btrfs checksumming and error detection
for these files isn't as bad as it would otherwise seem, and in many
cases prevents the filesystem duplicating work that the application is
already doing. (Also, on btrfs, nocow must be set at file creation, when
it is still zero-sized. As mentioned above, this is usually accomplished
by setting it on the directory and letting new files and subdirs inherit
the attribute.)
But with the nocow file attribute properly applied, these random rewrites
will be done in-place, no cascading fragmentation and metadata updates,
and my guess is that you'll see the IOPs on existing nocow files reduce
to something far more sane as a result.
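A minimal sketch of the nocow setup described above (paths are examples; the +C attribute must be in place before the file contains any data):

```shell
# Create a directory whose new files inherit the nocow attribute:
mkdir /mnt/vol1/nocow
chattr +C /mnt/vol1/nocow

# Files created inside it are nocow from birth:
touch /mnt/vol1/nocow/test9
lsattr /mnt/vol1/nocow/test9    # the 'C' flag should be shown
```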
> 2) There are windows during the test where IOPs drop to 0 and stay at 0
> for about 10 seconds, then recover, and after a couple of seconds drop
> to 0 again. This is reproducible 100% of the time.
I recall this periodic behavior coming up in at least one earlier thread
as well, but I'm not a dev, just a btrfs user and list regular, and I
don't recall what the explanation was, unless it was related to internal
btrfs bookkeeping due to that 30-second commit cycle I mentioned above.
But I'm guessing that if you properly set nocow on the file, you'll
probably see this go away as well, since you won't be overwhelming btrfs
and the hardware with IOPs any longer.
Perhaps someone with a better understanding of the situation will jump in
and explain this bit better than I can...
> Can somebody shed some light on what's happening?
>
>
> Command: fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1
> --name=test9 --filename=test9 --bs=4k --iodepth=256 --size=10G
> --numjobs=1 --readwrite=randwrite
>
> Environment:
> CPU: dual socket: E5-2630 v2
> RAM: 32 GB ram
> OS: Ubuntu server 14.10
> Kernel: 3.19.0-031900rc2-generic
> btrfs tools: Btrfs v3.14.1
> 2x LSI 9300 HBAs - SAS3 12 Gb/s
> 8x SSD Ultrastar SSD1600MM 400GB SAS3 12 Gb/s
I suppose you're already aware that you're running a rather outdated
userspace/btrfs-progs (what I assume you meant by tools). Userspace
versions sync with the kernel cycle, with a particular 3.x.0 version
typically being released a couple weeks after the kernel of the same
version, usually with a couple 3.x.y, y-update releases following before
the next kernel-synced x-version bump.
So userspace/progs v3.19.0 isn't out yet (tho rc2 is available), but
3.18.2 is current, well beyond your 3.14.1.
FWIW, a current kernel is most important during normal operation, as the
userspace simply tells the kernel what to do at a high level and the
kernel follows thru with its lower level code. So for normal operation,
userspace getting a bit behind isn't a major issue unless you want a
feature only available in a newer version.
But if something goes wrong and you're trying to diagnose and repair from
userspace, THAT is when the userspace low-level code is run, and thus
when userspace version becomes vitally important.
So as long as nothing's going wrong, you're probably OK with that 3.14
userspace. But I'd still recommend updating to current and keeping
current, because you don't want to be scrambling to build a newer
userspace after something goes wrong, in order to have the best chance
at recovery.
Kudos on having a current kernel, at least. There have been quite a few
kernel bugs fixed since 3.14 era, and you're running a current kernel so
at least aren't needlessly risking the known bugs of the older ones where
it's operationally important. =:^)
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: btrfs performance, sudden drop to 0 IOPs
2015-02-09 22:21 ` P. Remek
@ 2015-02-10 6:58 ` Kai Krakow
0 siblings, 0 replies; 20+ messages in thread
From: Kai Krakow @ 2015-02-10 6:58 UTC (permalink / raw)
To: linux-btrfs
P. Remek <p.remek1@googlemail.com> schrieb:
> Not sure if it helps, but here it is:
>
> root@lab1:/mnt/vol1# btrfs filesystem df /mnt/vol1/
> Data, RAID10: total=116.00GiB, used=110.03GiB
> Data, single: total=8.00MiB, used=0.00
> System, RAID1: total=8.00MiB, used=16.00KiB
> System, single: total=4.00MiB, used=0.00
> Metadata, RAID1: total=2.00GiB, used=563.72MiB
> Metadata, single: total=8.00MiB, used=0.00
> unknown, single: total=192.00MiB, used=0.00
This looks completely different from my output. Do you use the latest
btrfs-progs?
$ btrfs --version
Btrfs v3.18.2
$ btrfs fi us /
Overall:
Device size: 2.71TiB
Device allocated: 1.50TiB
Device unallocated: 1.21TiB
Used: 1.37TiB
Free (estimated): 1.33TiB (min: 745.87GiB)
Data ratio: 1.00
Metadata ratio: 2.00
Global reserve: 512.00MiB (used: 0.00B)
Data,RAID0: Size:1.49TiB, Used:1.36TiB
/dev/bcache0 507.00GiB
/dev/bcache1 507.00GiB
/dev/bcache2 507.00GiB
Metadata,RAID1: Size:6.00GiB, Used:3.99GiB
/dev/bcache0 4.00GiB
/dev/bcache1 4.00GiB
/dev/bcache2 4.00GiB
System,RAID1: Size:32.00MiB, Used:100.00KiB
/dev/bcache1 32.00MiB
/dev/bcache2 32.00MiB
Unallocated:
/dev/bcache0 414.51GiB
/dev/bcache1 414.48GiB
/dev/bcache2 414.48GiB
> On Mon, Feb 9, 2015 at 8:56 PM, Kai Krakow <hurikhan77@gmail.com> wrote:
>> P. Remek <p.remek1@googlemail.com> schrieb:
>>
>>> Hello,
>>>
>>> I am benchmarking Btrfs and when benchmarking random writes with the fio
>>> utility, I noticed the following two things:
>>>
>>> 1) On the first run, when the target file doesn't exist yet, performance
>>> is about 8000 IOPs. On the second and every subsequent run, performance
>>> goes up to 70000 IOPs. It's a massive difference. The target file is the
>>> one created during the first run.
>>>
>>> 2) There are windows during the test where IOPs drop to 0 and stay at 0
>>> for about 10 seconds, then recover, and after a couple of seconds drop
>>> to 0 again. This is reproducible 100% of the time.
>>>
>>> Can somebody shed some light on what's happening?
>>
>> I'm not an expert or dev but it's probably due to btrfs doing some
>> housekeeping under the hood. Could you check the output of "btrfs
>> filesystem usage /mountpoint" while running the test? I'd guess there's
>> some pressure on the global reserve during those times.
>>
>>> Command: fio --randrepeat=1 --ioengine=libaio --direct=1
>>> --gtod_reduce=1 --name=test9 --filename=test9 --bs=4k --iodepth=256
>>> --size=10G --numjobs=1 --readwrite=randwrite
>>>
>>> Environment:
>>> CPU: dual socket: E5-2630 v2
>>> RAM: 32 GB ram
>>> OS: Ubuntu server 14.10
>>> Kernel: 3.19.0-031900rc2-generic
>>> btrfs tools: Btrfs v3.14.1
>>> 2x LSI 9300 HBAs - SAS3 12 Gb/s
>>> 8x SSD Ultrastar SSD1600MM 400GB SAS3 12 Gb/s
>>>
>>> Regards,
>>> Premek
>>
>> --
>> Replies to list only preferred.
>>
--
Replies to list only preferred.
* Re: btrfs performance, sudden drop to 0 IOPs
2015-02-10 4:42 ` Duncan
@ 2015-02-10 17:44 ` P. Remek
2015-02-12 2:10 ` Duncan
0 siblings, 1 reply; 20+ messages in thread
From: P. Remek @ 2015-02-10 17:44 UTC (permalink / raw)
To: Duncan; +Cc: linux-btrfs
> What I /suspect/ is happening, is that at the 10 GiB files size, on
> original file creation, btrfs is creating a large file of several
> comparatively large extents (possibly 1 GiB each, the nominal data chunk
> size, tho it can be larger on large enough filesystems). Note that btrfs
> will normally wait to sync, accumulating further writes into the file
> before actually writing it. By default it's 30 seconds, but there's a
> mount option to change that. So btrfs is probably waiting, then writing
> out all changes for the last 30 seconds at once, allowing it to use
> fairly large extents when it does so.
In the test, I use the --direct=1 parameter for fio, which basically does
O_DIRECT on the target file. O_DIRECT should guarantee that the
filesystem cache is bypassed and IO is sent directly to the
underlying storage. Are you saying that btrfs buffers writes despite
O_DIRECT?
I also tried to mount the filesystem with the commit parameter set to a) 1
second and b) 1000 seconds, as follows:
root@lab1:/# mount -o autodefrag,commit=1 /dev/mapper/prm-0 /mnt/vol1
It didn't change the behavior: after about 30-40 seconds of running,
there is a drop to 0 IOPs that lasts about 20 seconds.
> Those 70K IOPs are all the extra work the filesystem is doing in order
> to track those 4 KiB COWed writes!
This sounds like you are thinking that getting 70K IOPs is a bad thing,
but I am testing performance, which means higher IOPS = better result.
In other words, after the second run, when the target file already
existed, the performance improved significantly.
In light of what you are saying, it looks more like there is some
higher overhead when allocating a completely new block of data for the
file, compared to the overhead of a COW operation on an already existing
block of data.
> I suppose you're already aware that you're running a rather outdated
> userspace/btrfs-progs (what I assume you meant by tools). Userspace
> versions sync with the kernel cycle, with a particular 3.x.0 version
> typically being released a couple weeks after the kernel of the same
> version, usually with a couple 3.x.y, y-update releases following before
> the next kernel-synced x-version bump.
I was hoping that btrfs-progs doesn't have any influence on the runtime
properties of the btrfs filesystem. As I am doing performance tests, I
hope that the btrfs-progs version doesn't have any impact on the results.
Regards,
P.
* Re: btrfs performance, sudden drop to 0 IOPs
2015-02-09 17:26 btrfs performance, sudden drop to 0 IOPs P. Remek
2015-02-09 19:56 ` Kai Krakow
2015-02-10 4:42 ` Duncan
@ 2015-02-11 12:40 ` Austin S Hemmelgarn
2015-02-12 4:59 ` Liu Bo
3 siblings, 0 replies; 20+ messages in thread
From: Austin S Hemmelgarn @ 2015-02-11 12:40 UTC (permalink / raw)
To: P. Remek, linux-btrfs
On 2015-02-09 12:26, P. Remek wrote:
> Hello,
>
> I am benchmarking Btrfs and when benchmarking random writes with the fio
> utility, I noticed the following two things:
>
Based on what I know about BTRFS, I think that these issues actually
have distinct causes.
> 1) On the first run, when the target file doesn't exist yet, performance
> is about 8000 IOPs. On the second and every subsequent run, performance
> goes up to 70000 IOPs. It's a massive difference. The target file is the
> one created during the first run.
I've noticed that almost always, file creation on BTRFS is slower than
file re-writes. This seems to especially be the case when using AIO
and/or O_DIRECT (although O_DIRECT on a COW filesystem is _really_
complicated to get right). I don't know that there is really any way
currently to solve this, although it would be interesting to see if
fallocat'ing the files prior to the initial run would have any
significant performance impact.
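The fallocate experiment suggested above could look like this (paths are examples; fio's own --fallocate option is another way to achieve the same thing):

```shell
# Preallocate the target file before the first benchmark run, so the
# initial run measures rewrites rather than extent allocation:
fallocate -l 10G /mnt/vol1/test9
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 \
    --name=test9 --filename=/mnt/vol1/test9 --bs=4k --iodepth=256 \
    --size=10G --numjobs=1 --readwrite=randwrite
```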
>
> 2) There are windows during the test where IOPs drop to 0 and stay at 0
> for about 10 seconds, then recover, and after a couple of seconds drop
> to 0 again. This is reproducible 100% of the time.
I've seen this same behavior on a number of filesystems (not just BTRFS)
when using the default I/O scheduler with its default parameters,
especially on systems with high performance storage. IIRC, Ubuntu 13.10
switched from using the upstream default I/O scheduler (CFQ) to using
the Deadline I/O scheduler because it has better performance (and is
more deterministic) on most cheap commodity desktop/laptop hardware.
I've found however that the Deadline scheduler actually tends to perform
worse than CFQ when used on higher-end server systems and/or SSD's,
although CFQ with default parameters only does marginally better. I'd
suggest experimenting with some of the parameters under /sys/block
(check the files in the Documentation/block directory of the Linux
kernel sources for information about what (almost) everything there does).
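For example (device names are placeholders; the bracketed entry is the active scheduler):

```shell
# Show which scheduler a device is using:
cat /sys/block/sda/queue/scheduler     # e.g. "noop deadline [cfq]"

# Switch a single device to another scheduler, often worth trying on SSDs:
echo noop > /sys/block/sda/queue/scheduler

# Other queue tunables worth experimenting with:
cat /sys/block/sda/queue/nr_requests
echo 1024 > /sys/block/sda/queue/nr_requests
```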
* Re: btrfs performance, sudden drop to 0 IOPs
2015-02-10 17:44 ` P. Remek
@ 2015-02-12 2:10 ` Duncan
2015-02-12 4:33 ` Kai Krakow
0 siblings, 1 reply; 20+ messages in thread
From: Duncan @ 2015-02-12 2:10 UTC (permalink / raw)
To: linux-btrfs
P. Remek posted on Tue, 10 Feb 2015 18:44:33 +0100 as excerpted:
> In the test, I use the --direct=1 parameter for fio, which basically does
> O_DIRECT on the target file. O_DIRECT should guarantee that the
> filesystem cache is bypassed and IO is sent directly to the underlying
> storage. Are you saying that btrfs buffers writes despite O_DIRECT?
I'm out of my (admin, no claims at developer) league on that. I see
someone else replied, and would defer to them on this.
>> Those 70K IOPs are all the extra work the filesystem is doing in
>> order to track those 4 KiB COWed writes!
>
> This sounds like you are thinking that getting 70K IOPs is a bad thing,
> but I am testing performance, which means higher IOPS = better result. In
> other words, after the second run, when the target file already existed,
> the performance improved significantly.
Perhaps I'm wrong (I /did/ emphasize "suspect") here, but what I was
suggesting was...
Those higher IOPs are I believe fake, manufactured by the filesystem as a
result of splitting up the few larger extents into many smaller extents
due to COW-fragmentation. If I'm correct, the physical device and the
below-filesystem-level kernel levels (where I expect your IOPs measure is
sourced) are seeing this orders of magnitude increased number of IOPs due
to breaking one original filesystem operation into perhaps hundreds of
effectively random individual 4k block operations, but the actual thruput
at the above-filesystem-level is reduced.
There's certainly a potential in theory for such an effect on btrfs due
to COWing rewrites and faced with those results, it is how I'd explain
them in a rather hand-wavy not too low-level technical way.
But if it doesn't match reality, then my understanding is insufficient
and I'm wrong. Wouldn't be the first time. =:^P
>> I suppose you're already aware that you're running a rather outdated
>> userspace/btrfs-progs (what I assume you meant by tools).
>
> I was hoping that btrfs-progs doesn't have any influence on the runtime
> properties of the btrfs filesystem. As I am doing performance tests, I
> hope that the btrfs-progs version doesn't have any impact on the results.
I was simply pointing out the mismatch, in case you intended to actually
deploy, and potentially try to fix any problems with that old a
userspace. As long as you're aware of the issue and won't be trying to
btrfs check --repair or the like with that old userspace, for runtime
testing, indeed, it shouldn't matter.
So you're "hoping correctly". =:^)
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: btrfs performance, sudden drop to 0 IOPs
2015-02-12 2:10 ` Duncan
@ 2015-02-12 4:33 ` Kai Krakow
2015-02-12 12:21 ` Austin S Hemmelgarn
2015-02-13 2:46 ` Liu Bo
0 siblings, 2 replies; 20+ messages in thread
From: Kai Krakow @ 2015-02-12 4:33 UTC (permalink / raw)
To: linux-btrfs
Duncan <1i5t5.duncan@cox.net> schrieb:
> P. Remek posted on Tue, 10 Feb 2015 18:44:33 +0100 as excerpted:
>
>> In the test, I use the --direct=1 parameter for fio, which basically does
>> O_DIRECT on the target file. O_DIRECT should guarantee that the
>> filesystem cache is bypassed and IO is sent directly to the underlying
>> storage. Are you saying that btrfs buffers writes despite O_DIRECT?
>
> I'm out of my (admin, no claims at developer) league on that. I see
> someone else replied, and would defer to them on this.
I don't think that O_DIRECT can work efficiently on COW filesystems. It
probably has a negative effect and cannot be faster than normal access.
Linus himself once said that O_DIRECT is broken and should go away, and
that cache hinting should be used instead.
Think of this: for the _unbuffered_ direct-io request to be fulfilled, the
file system has to go through its COW logic first, which it would otherwise
have buffered and done in the background. Bypassing the cache is probably
only a side-effect of O_DIRECT, not its purpose.
At least I'd try with a nocow-file for the benchmark if you still have to
use O_DIRECT.
--
Replies to list only preferred.
* Re: btrfs performance, sudden drop to 0 IOPs
2015-02-09 17:26 btrfs performance, sudden drop to 0 IOPs P. Remek
` (2 preceding siblings ...)
2015-02-11 12:40 ` Austin S Hemmelgarn
@ 2015-02-12 4:59 ` Liu Bo
2015-02-13 13:06 ` P. Remek
3 siblings, 1 reply; 20+ messages in thread
From: Liu Bo @ 2015-02-12 4:59 UTC (permalink / raw)
To: P. Remek; +Cc: linux-btrfs
On Mon, Feb 09, 2015 at 06:26:49PM +0100, P. Remek wrote:
> Hello,
>
> I am benchmarking Btrfs and when benchmarking random writes with fio
> utility, I noticed following two things:
>
> 1) On the first run, when the target file doesn't exist yet, performance
> is about 8000 IOPs. On the second and every subsequent run, performance
> goes up to 70000 IOPs. It's a massive difference. The target file is the
> one created during the first run.
I was doing similar tests in the last few days; the huge performance
difference comes from the AIO+DIO path. From fs/direct-io.c:1170:

	/*
	 * For file extending writes updating i_size before data writeouts
	 * complete can expose uninitialized blocks in dumb filesystems.
	 * In that case we need to wait for I/O completion even if asked
	 * for an asynchronous write.
	 */
	if (is_sync_kiocb(iocb))
		dio->is_async = false;
	else if (!(dio->flags & DIO_ASYNC_EXTEND) &&
	    (rw & WRITE) && end > i_size_read(inode))
		dio->is_async = false;
	else
		dio->is_async = true;
So you may want to play with fio's fallocate option; it defaults to
'posix', which should already set a proper i_size for you, but I wouldn't
trust that unless I set it explicitly.
>
> 2) There are windows during the test where IOPs drop to 0 and stay at 0
> for about 10 seconds, then recover, and after a couple of seconds drop
> to 0 again. This is reproducible 100% of the time.
>
> Can somebody shed some light on what's happening?
>
I'd use a blktrace based tool like iowatcher or seekwatcher to see
what's really happening on the performance drops.
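That suggestion in concrete commands (device and output names are examples; blktrace needs root and a mounted debugfs, and seekwatcher needs matplotlib):

```shell
# Trace the underlying device while the benchmark runs:
blktrace -d /dev/sdb -o fio-trace &
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 \
    --name=test9 --filename=/mnt/vol1/test9 --bs=4k --iodepth=256 \
    --size=10G --numjobs=1 --readwrite=randwrite
kill %1; wait

# Render an IOPs/seek timeline to line up the stalls with block activity:
seekwatcher -t fio-trace -o fio-trace.png
```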
>
> Command: fio --randrepeat=1 --ioengine=libaio --direct=1
> --gtod_reduce=1 --name=test9 --filename=test9 --bs=4k --iodepth=256
> --size=10G --numjobs=1 --readwrite=randwrite
Since this is just a libaio-dio random write, I think it has nothing to do
with the btrfs-progs side.
Thanks,
-liubo
>
> Environment:
> CPU: dual socket: E5-2630 v2
> RAM: 32 GB ram
> OS: Ubuntu server 14.10
> Kernel: 3.19.0-031900rc2-generic
> btrfs tools: Btrfs v3.14.1
> 2x LSI 9300 HBAs - SAS3 12 Gb/s
> 8x SSD Ultrastar SSD1600MM 400GB SAS3 12 Gb/s
>
> Regards,
> Premek
* Re: btrfs performance, sudden drop to 0 IOPs
2015-02-12 4:33 ` Kai Krakow
@ 2015-02-12 12:21 ` Austin S Hemmelgarn
2015-02-12 19:42 ` Kai Krakow
2015-02-13 13:08 ` P. Remek
2015-02-13 2:46 ` Liu Bo
1 sibling, 2 replies; 20+ messages in thread
From: Austin S Hemmelgarn @ 2015-02-12 12:21 UTC (permalink / raw)
To: Kai Krakow, linux-btrfs
On 2015-02-11 23:33, Kai Krakow wrote:
> Duncan <1i5t5.duncan@cox.net> schrieb:
>
>> P. Remek posted on Tue, 10 Feb 2015 18:44:33 +0100 as excerpted:
>>
>>> In the test, I use the --direct=1 parameter for fio, which basically does
>>> O_DIRECT on the target file. O_DIRECT should guarantee that the
>>> filesystem cache is bypassed and IO is sent directly to the underlying
>>> storage. Are you saying that btrfs buffers writes despite O_DIRECT?
>>
>> I'm out of my (admin, no claims at developer) league on that. I see
>> someone else replied, and would defer to them on this.
>
> I don't think that O_DIRECT can work efficiently on COW filesystems. It
> probably has a negative effect and cannot be faster than normal access.
> Linus himself once said that O_DIRECT is broken and should go away, and
> that cache hinting should be used instead.
>
> Think of this: for the _unbuffered_ direct-io request to be fulfilled, the
> file system has to go through its COW logic first, which it would otherwise
> have buffered and done in the background. Bypassing the cache is probably
> only a side-effect of O_DIRECT, not its purpose.
IIUC, the original purpose of O_DIRECT was to allow the application to
handle caching itself, instead of having the kernel do it. The issue is
that it is (again, IIUC) a hard requirement for AIO, which is a
performance booster for many use cases.
>
> At least I'd try with a nocow-file for the benchmark if you still have to
> use O_DIRECT.
>
I'd definitely suggest using NOCOW for any file you are doing O_DIRECT
with, as you should see _much_ better performance that way, and also
don't run the (theoretical) risk of some of the same types of corruption
that swapfiles on BTRFS can cause.
* Re: btrfs performance, sudden drop to 0 IOPs
2015-02-12 12:21 ` Austin S Hemmelgarn
@ 2015-02-12 19:42 ` Kai Krakow
2015-02-13 13:16 ` P. Remek
2015-02-13 13:08 ` P. Remek
1 sibling, 1 reply; 20+ messages in thread
From: Kai Krakow @ 2015-02-12 19:42 UTC (permalink / raw)
To: linux-btrfs
Austin S Hemmelgarn <ahferroin7@gmail.com> schrieb:
> On 2015-02-11 23:33, Kai Krakow wrote:
>> Duncan <1i5t5.duncan@cox.net> schrieb:
>>
>>> P. Remek posted on Tue, 10 Feb 2015 18:44:33 +0100 as excerpted:
>>>
>>>> In the test, I use the --direct=1 parameter for fio, which basically does
>>>> O_DIRECT on the target file. O_DIRECT should guarantee that the
>>>> filesystem cache is bypassed and IO is sent directly to the underlying
>>>> storage. Are you saying that btrfs buffers writes despite O_DIRECT?
>>>
>>> I'm out of my (admin, no claims at developer) league on that. I see
>>> someone else replied, and would defer to them on this.
>>
>> I don't think that O_DIRECT can work efficiently on COW filesystems. It
>> probably has a negative effect and cannot be faster than normal access.
>> Linus himself once said that O_DIRECT is broken and should go away, and
>> that cache hinting should be used instead.
>>
>> Think of this: For the _unbuffered_ direct-io request to be fulfilled the
>> file system has to go through its COW logic first which it otherwise had
>> buffered and done in background. Bypassing the cache is probably only a
>> side-effect of O_DIRECT, not its purpose.
> IIUC, the original purpose of O_DIRECT was to allow the application to
> handle caching itself, instead of having the kernel do it. The issue is
> that it is (again, IIUC) a hard requirement for AIO, which is a
> performance booster for many use cases.
Yes, it was implemented for the purpose of allowing an application to
implement its own caching - probably for the sole purpose of doing it
"better" or more efficiently. But it simply does not work out that well, at
least on a COW fs. The original goal, "performance", is more or less eaten
away in a COW scenario - or worse. And that in turn is why Linus said
O_DIRECT is broken and should go away, use cache hinting instead.
From that perspective, I concluded what I wrote: bypassing the cache is only
a side-effect. It didn't solve the problem the right way - it
unintentionally solved something else. So, to alleviate the design flaw, you
can only use it for its intended purpose on nocow files (or nocow
filesystems).
>> At least I'd try with a nocow-file for the benchmark if you still have to
>> use O_DIRECT.
>>
> I'd definitely suggest using NOCOW for any file you are doing O_DIRECT
> with, as you should see _much_ better performance that way, and also
> don't run the (theoretical) risk of some of the same types of corruption
> that swapfiles on BTRFS can cause.
Ditto.
--
Replies to list only preferred.
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: btrfs performance, sudden drop to 0 IOPs
2015-02-12 4:33 ` Kai Krakow
2015-02-12 12:21 ` Austin S Hemmelgarn
@ 2015-02-13 2:46 ` Liu Bo
2015-02-13 3:55 ` Wang Shilong
1 sibling, 1 reply; 20+ messages in thread
From: Liu Bo @ 2015-02-13 2:46 UTC (permalink / raw)
To: Kai Krakow; +Cc: linux-btrfs
On Thu, Feb 12, 2015 at 05:33:41AM +0100, Kai Krakow wrote:
> Duncan <1i5t5.duncan@cox.net> schrieb:
>
> > P. Remek posted on Tue, 10 Feb 2015 18:44:33 +0100 as excerpted:
> >
> >> In the test, I use --direct=1 parameter for fio which basically does
> >> O_DIRECT on target file. The O_DIRECT should guarantee that the
> >> filesystem cache is bypassed and IO is sent directly to the underlying
> >> storage. Are you saying that btrfs buffers writes despite O_DIRECT?
> >
> > I'm out of my (admin, no claims at developer) league on that. I see
> > someone else replied, and would defer to them on this.
>
> I don't think that O_DIRECT can work efficiently on COW filesystems. It
> probably has a negative effect and cannot be faster than normal access. Linus
> himself said at one point that O_DIRECT is broken and should go away, and instead
> cache hinting should be used.
>
> Think of this: For the _unbuffered_ direct-io request to be fulfilled the
> file system has to go through its COW logic first which it otherwise had
> buffered and done in background. Bypassing the cache is probably only a
> side-effect of O_DIRECT, not its purpose.
Hmm, that's not true in btrfs: the COW logic mentioned above is nothing but
allocating a NEW extent, and it's not done in the background.
Compared to the nocow logic, the main difference comes from
a) COW files calculating checksums of the dirty data in DIO pages, which nocow files don't need to do;
b) their endio handlers.
Or am I missing something?
Thanks,
-liubo
>
> At least I'd try with a nocow-file for the benchmark if you still have to
> use O_DIRECT.
>
> --
> Replies to list only preferred.
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: btrfs performance, sudden drop to 0 IOPs
2015-02-13 2:46 ` Liu Bo
@ 2015-02-13 3:55 ` Wang Shilong
2015-02-13 13:18 ` P. Remek
0 siblings, 1 reply; 20+ messages in thread
From: Wang Shilong @ 2015-02-13 3:55 UTC (permalink / raw)
To: bo.li.liu; +Cc: Kai Krakow, linux-btrfs
Hello guys,
>
> On Thu, Feb 12, 2015 at 05:33:41AM +0100, Kai Krakow wrote:
>> Duncan <1i5t5.duncan@cox.net> schrieb:
>>
>>> P. Remek posted on Tue, 10 Feb 2015 18:44:33 +0100 as excerpted:
>>>
>>>> In the test, I use --direct=1 parameter for fio which basically does
>>>> O_DIRECT on target file. The O_DIRECT should guarantee that the
>>>> filesystem cache is bypassed and IO is sent directly to the underlying
>>>> storage. Are you saying that btrfs buffers writes despite O_DIRECT?
>>>
>>> I'm out of my (admin, no claims at developer) league on that. I see
>>> someone else replied, and would defer to them on this.
>>
>> I don't think that O_DIRECT can work efficiently on COW filesystems. It
>> probably has a negative effect and cannot be faster than normal access. Linus
>> himself said at one point that O_DIRECT is broken and should go away, and instead
>> cache hinting should be used.
>>
>> Think of this: For the _unbuffered_ direct-io request to be fulfilled the
>> file system has to go through its COW logic first which it otherwise had
>> buffered and done in background. Bypassing the cache is probably only a
>> side-effect of O_DIRECT, not its purpose.
>
> Hmm, that's not true in btrfs: the COW logic mentioned above is nothing but
> allocating a NEW extent, and it's not done in the background.
>
> Compared to the nocow logic, the main difference comes from
> a) COW files calculating checksums of the dirty data in DIO pages, which nocow files don't need to do;
> b) their endio handlers.
>
> Or am I missing something?
We benchmarked Btrfs aio/dio performance before. One big difference we
noticed between COW and nocow is not only the checksumming itself but that
checksums cost more metadata, which can make Btrfs performance drop suddenly
for a while because of metadata reservation.
>
> Thanks,
>
> -liubo
>>
>> At least I'd try with a nocow-file for the benchmark if you still have to
>> use O_DIRECT.
>>
>> --
>> Replies to list only preferred.
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
Best Regards,
Wang Shilong
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: btrfs performance, sudden drop to 0 IOPs
2015-02-12 4:59 ` Liu Bo
@ 2015-02-13 13:06 ` P. Remek
2015-02-13 14:08 ` Liu Bo
0 siblings, 1 reply; 20+ messages in thread
From: P. Remek @ 2015-02-13 13:06 UTC (permalink / raw)
To: bo.li.liu; +Cc: linux-btrfs
> I'd use a blktrace based tool like iowatcher or seekwatcher to see
> what's really happening on the performance drops.
So I used the following command to see whether there are any outstanding
requests in the I/O scheduler queue when the performance drops to 0 IOPs:
root@lab1:/# iostat -c -d -x -t -m /dev/sdi 1 10000
The output is:
Device:  rrqm/s  wrqm/s   r/s   w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sdi        0.00    0.00  0.00  0.00   0.00   0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
"avgqu-sz" gives the queue length (1 second average). So it really
seems that the system is not stuck in the block I/O layer but in an upper
layer instead (most likely the filesystem layer).
I also created ext4 filesystem on another pair of disks - so I was
able to run simultaneous benchmark - one for ext4 and one for btrfs
(each having 4 SSDs assigned) and when btrfs went down to 0 IOPs the
ext4 fio benchmark kept generating high IOPs.
I also tried to mount the system with nodatacow:
/dev/sdi on /mnt/btrfs type btrfs (rw,nodatacow)
It didn't help with the performance drops.
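Since the earlier suggestion was a blktrace-based tool, here is a minimal sketch of capturing a trace across one of the stall windows (device name taken from the thread; needs root, and iowatcher/seekwatcher can render the resulting trace graphically):

```shell
# Capture ~30s of block-layer events on the benchmark disk while fio
# runs, long enough to cover at least one 10s stall window.
blktrace -d /dev/sdi -o stall -w 30

# Summarize the trace; if the queue is truly empty during the stalls,
# no requests will be issued to the device in those windows.
blkparse -i stall | tail -n 20
```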
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: btrfs performance, sudden drop to 0 IOPs
2015-02-12 12:21 ` Austin S Hemmelgarn
2015-02-12 19:42 ` Kai Krakow
@ 2015-02-13 13:08 ` P. Remek
1 sibling, 0 replies; 20+ messages in thread
From: P. Remek @ 2015-02-13 13:08 UTC (permalink / raw)
To: Austin S Hemmelgarn; +Cc: Kai Krakow, linux-btrfs
> I'd definitely suggest using NOCOW for any file you are doing O_DIRECT with,
> as you should see _much_ better performance that way, and also don't run the
> (theoretical) risk of some of the same types of corruption that swapfiles on
> BTRFS can cause.
I mounted the filesystem with nodatacow as follows and it didn't help
- it still drops to 0 IOPs every couple of seconds.
/dev/sdi on /mnt/btrfs type btrfs (rw,nodatacow)
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: btrfs performance, sudden drop to 0 IOPs
2015-02-12 19:42 ` Kai Krakow
@ 2015-02-13 13:16 ` P. Remek
2015-02-13 18:26 ` Kai Krakow
0 siblings, 1 reply; 20+ messages in thread
From: P. Remek @ 2015-02-13 13:16 UTC (permalink / raw)
To: Kai Krakow; +Cc: linux-btrfs
> Yes, it was implemented for the purpose of allowing an application to
> implement its own caching - probably for the sole purpose of doing it
> "better" or more efficient. But it simply does not work out that well, at
> least with COW fs. The original idea "performance" is more or less eaten
> away in a COW scenario - or worse. And that in turn is why Linus said
> O_DIRECT is broken and should go away, use cache hinting instead.
Linus says to use things like madvise, but the fact is that in
reality people are using O_DIRECT instead, so it is important to
get it right. The case I am interested in is KVM. A virtual machine
disk file is opened with O_DIRECT so that when the virtual machine is
doing IO, it is not cached twice - first at the guest operating
system level, and a second time at the hypervisor host operating
system level. With O_DIRECT it is only cached in the guest.
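For reference, this is QEMU's cache=none case; a sketch (image path and memory size are hypothetical) of how the double caching is avoided at the hypervisor level:

```shell
# cache=none opens the disk image with O_DIRECT, so the host page
# cache is bypassed and data is cached only inside the guest.
# aio=native enables Linux native AIO, which requires O_DIRECT.
qemu-system-x86_64 -enable-kvm -m 4096 \
    -drive file=/mnt/btrfs/vm.img,format=raw,cache=none,aio=native
```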
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: btrfs performance, sudden drop to 0 IOPs
2015-02-13 3:55 ` Wang Shilong
@ 2015-02-13 13:18 ` P. Remek
0 siblings, 0 replies; 20+ messages in thread
From: P. Remek @ 2015-02-13 13:18 UTC (permalink / raw)
To: Wang Shilong; +Cc: bo.li.liu, Kai Krakow, linux-btrfs
> We did benchmark Btrfs aio/dio performance before, we noticed one big differences
> from COW and nocow is not only checksum but checksum cost more metadata, which will
> make Btrfs performance drop suddenly for a while, because of metadata reservation.
I mounted the filesystem with nodatacow, which should also switch off
the checksumming, but it didn't help - the sudden drops are still there.
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: btrfs performance, sudden drop to 0 IOPs
2015-02-13 13:06 ` P. Remek
@ 2015-02-13 14:08 ` Liu Bo
0 siblings, 0 replies; 20+ messages in thread
From: Liu Bo @ 2015-02-13 14:08 UTC (permalink / raw)
To: P. Remek; +Cc: linux-btrfs
On Fri, Feb 13, 2015 at 02:06:27PM +0100, P. Remek wrote:
> > I'd use a blktrace based tool like iowatcher or seekwatcher to see
> > what's really happening on the performance drops.
>
> So I used this command to see if there are any outstanding requests in
> the I/O scheduler queue when the performance drops to 0 IOPs
> root@lab1:/# iostat -c -d -x -t -m /dev/sdi 1 10000
>
> The output is:
>
> Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s
> avgrq-sz avgqu-sz await r_await w_await svctm %util
>
> sdi 0.00 0.00 0.00 0.00 0.00 0.00
> 0.00 0.00 0.00 0.00 0.00 0.00 0.00
>
> "avgqu-sz" gives the queue length (1 second average). So it really
> seems that the system is not stuck in the block I/O layer but in an upper
> layer instead (most likely the filesystem layer).
>
> I also created ext4 filesystem on another pair of disks - so I was
> able to run simultaneous benchmark - one for ext4 and one for btrfs
> (each having 4 SSDs assigned) and when btrfs went down to 0 IOPs the
> ext4 fio benchmark kept generating high IOPs.
>
> I also tried to mount the system with nodatacow:
>
> /dev/sdi on /mnt/btrfs type btrfs (rw,nodatacow)
>
> It didn't help with the performance drops.
It's just weird, since 10s is far too long for a filesystem stall. I don't
know what's happening and never saw anything like it in my tests.
Perhaps try "perf record -a -g" to see what's going on.
Thanks,
-liubo
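A sketch of that perf workflow, assuming a 15-second capture window is wide enough to overlap one stall:

```shell
# Record system-wide call graphs while the benchmark runs, timed so
# the capture overlaps a 0-IOPs window.
perf record -a -g -o perf.data -- sleep 15

# Inspect where the kernel/btrfs threads spend their time.
perf report -i perf.data --sort comm,dso,symbol
```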
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: btrfs performance, sudden drop to 0 IOPs
2015-02-13 13:16 ` P. Remek
@ 2015-02-13 18:26 ` Kai Krakow
0 siblings, 0 replies; 20+ messages in thread
From: Kai Krakow @ 2015-02-13 18:26 UTC (permalink / raw)
To: linux-btrfs
P. Remek <p.remek1@googlemail.com> schrieb:
>> Yes, it was implemented for the purpose of allowing an application to
>> implement its own caching - probably for the sole purpose of doing it
>> "better" or more efficient. But it simply does not work out that well, at
>> least with COW fs. The original idea "performance" is more or less eaten
>> away in a COW scenario - or worse. And that in turn is why Linus said
>> O_DIRECT is broken and should go away, use cache hinting instead.
>
> Linus is saying to use things like madvise but the fact is that in
> reality people are using O_DIRECT instead of it, so it is important to
> get it right.
Yeah, quite true - apparently... But as you already found, the O_DIRECT
implementation of btrfs is probably not the culprit.
> The case which I am interested is KVM. Virtual machine
> disk file is opened with O_DIRECT so that when Virtual machine is
> doing IO, it is not cached twice - first time on guest operating
> system level, and second time on hypervisor host operating system
> level. With O_DIRECT it is only cached in guest.
In VirtualBox I enabled host-side caching on purpose and instead lowered the
RAM. I don't know if VirtualBox does something like memory ballooning, but
usually I'd expect ballooning to push cache out of RAM - so host-side
caching may make sense.
I never measured it, but it feels a bit snappier to work inside the
VirtualBox machine. Of course, the recommendation depends on whether you
are using ballooning and on your VM density.
In VirtualBox, this setting probably just turns off O_DIRECT. And my VM
images are set to nocow.
--
Replies to list only preferred.
^ permalink raw reply [flat|nested] 20+ messages in thread
end of thread, other threads:[~2015-02-13 19:11 UTC | newest]
Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-02-09 17:26 btrfs performance, sudden drop to 0 IOPs P. Remek
2015-02-09 19:56 ` Kai Krakow
2015-02-09 22:21 ` P. Remek
2015-02-10 6:58 ` Kai Krakow
2015-02-10 4:42 ` Duncan
2015-02-10 17:44 ` P. Remek
2015-02-12 2:10 ` Duncan
2015-02-12 4:33 ` Kai Krakow
2015-02-12 12:21 ` Austin S Hemmelgarn
2015-02-12 19:42 ` Kai Krakow
2015-02-13 13:16 ` P. Remek
2015-02-13 18:26 ` Kai Krakow
2015-02-13 13:08 ` P. Remek
2015-02-13 2:46 ` Liu Bo
2015-02-13 3:55 ` Wang Shilong
2015-02-13 13:18 ` P. Remek
2015-02-11 12:40 ` Austin S Hemmelgarn
2015-02-12 4:59 ` Liu Bo
2015-02-13 13:06 ` P. Remek
2015-02-13 14:08 ` Liu Bo