linux-btrfs.vger.kernel.org archive mirror
* Confused by performance
@ 2010-05-24 21:08 K. Richard Pixley
  2010-05-25  3:59 ` Mike Fedyk
  2010-05-28  1:45 ` K. Richard Pixley
  0 siblings, 2 replies; 10+ messages in thread
From: K. Richard Pixley @ 2010-05-24 21:08 UTC (permalink / raw)
  To: linux-btrfs

I've just started to work with btrfs so I started with a benchmark.  On 
four identical servers, (2 dual core cpus, single local disk), I built 
filesystems - ext3, ext4, nilfs2, and btrfs.  I checked out a sizable 
code tree and timed a build.  The build is parallelized to use 4 threads 
when possible.

I'm seeing similar build times on ext[34] and nilfs2 but I'm seeing 
almost double the times for btrfs using default options.  And I'm having 
trouble reconciling this performance cost with the benchmarks I'm seeing 
around the net.

Is this a common result?  Is there a trick to getting ext4 competitive 
performance out of btrfs?  Is my application a poor choice for btrfs?  
Am I missing something obvious here?

--rich



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Confused by performance
  2010-05-24 21:08 Confused by performance K. Richard Pixley
@ 2010-05-25  3:59 ` Mike Fedyk
  2010-05-28  1:45 ` K. Richard Pixley
  1 sibling, 0 replies; 10+ messages in thread
From: Mike Fedyk @ 2010-05-25  3:59 UTC (permalink / raw)
  To: K. Richard Pixley; +Cc: linux-btrfs

On Mon, May 24, 2010 at 2:08 PM, K. Richard Pixley <rich@noir.com> wrote:
> I've just started to work with btrfs so I started with a benchmark.  On four
> identical servers, (2 dual core cpus, single local disk), I built
> filesystems - ext3, ext4, nilfs2, and btrfs.  I checked out a sizable code
> tree and timed a build.  The build is parallelized to use 4 threads when
> possible.
>
> I'm seeing similar build times on ext[34] and nilfs2 but I'm seeing almost
> double the times for btrfs using default options.  And I'm having trouble
> reconciling this performance cost with the benchmarks I'm seeing around the
> net.
>
> Is this a common result?  Is there a trick to getting ext4 competitive
> performance out of btrfs?  Is my application a poor choice for btrfs?  Am I
> missing something obvious here?
>

Please make sure you're testing with the latest btrfs from git or
Linus' latest kernel.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Confused by performance
  2010-05-24 21:08 Confused by performance K. Richard Pixley
  2010-05-25  3:59 ` Mike Fedyk
@ 2010-05-28  1:45 ` K. Richard Pixley
  2010-06-16 18:08   ` K. Richard Pixley
  1 sibling, 1 reply; 10+ messages in thread
From: K. Richard Pixley @ 2010-05-28  1:45 UTC (permalink / raw)
  To: linux-btrfs

Just as a followup, my problem appears to be hardware related.  It's not 
clear yet whether it's a strange failure mode or a configuration snafu, 
disk or controller, but elsewhere I'm seeing a btrfs single disk 
performance penalty more like 2% over ext[34], which seems completely 
reasonable.

Sorry for the panic.

--rich

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Confused by performance
  2010-05-28  1:45 ` K. Richard Pixley
@ 2010-06-16 18:08   ` K. Richard Pixley
  2010-06-16 19:21     ` Roberto Ragusa
                       ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: K. Richard Pixley @ 2010-06-16 18:08 UTC (permalink / raw)
  To: linux-btrfs

Once again I'm stumped by some performance numbers and hoping for some 
insight.

Using an 8-core server, building in parallel, I'm building some code.  
Using ext2 over a 5-way, (5 disk), lvm partition, I can build that code 
in 35 minutes.  Tests with dd on the raw disk and lvm partitions show me 
that I'm getting near linear improvement from the raw stripe, even with 
dd runs exceeding 10G, so I think that convinces me that my disks and 
controller subsystem are capable of operating in parallel and in 
concert.  hdparm -t numbers seem to support what I'm seeing from dd.

Running the same build, same parallelism, over a btrfs (defaults) 
partition on a single drive, I'm seeing very consistent build times 
around an hour, which is reasonable.  I get a little under an hour on 
ext4 single disk, again, very consistently.

However, if I build a btrfs file system across the 5 disks, my build 
times increase to around 1.5 - 2hrs, although there's about a 30min 
variation between different runs.

If I build a btrfs file system across the 5-way lvm stripe, I get even 
worse performance at around 2.5hrs per build, with about a 45min 
variation between runs.

I can't explain these last two results.  Any theories?

--rich

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Confused by performance
  2010-06-16 18:08   ` K. Richard Pixley
@ 2010-06-16 19:21     ` Roberto Ragusa
       [not found]       ` <AANLkTinM6ab_KEynfgvVT9v5TmcogoLZ0PLAz2oPnsiS@mail.gmail.com>
  2010-06-16 21:44     ` Daniel J Blueman
  2010-06-17  9:57     ` Chris Mason
  2 siblings, 1 reply; 10+ messages in thread
From: Roberto Ragusa @ 2010-06-16 19:21 UTC (permalink / raw)
  To: linux-btrfs; +Cc: K. Richard Pixley

K. Richard Pixley wrote:
> Once again I'm stumped by some performance numbers and hoping for some
> insight.
> 
> Using an 8-core server, building in parallel, I'm building some code. 
> Using ext2 over a 5-way, (5 disk), lvm partition, I can build that code
> in 35 minutes.  Tests with dd on the raw disk and lvm partitions show me
> that I'm getting near linear improvement from the raw stripe, even with
> dd runs exceeding 10G, so I think that convinces me that my disks and
> controller subsystem are capable of operating in parallel and in
> concert.  hdparm -t numbers seem to support what I'm seeing from dd.
> 
> Running the same build, same parallelism, over a btrfs (defaults)
> partition on a single drive, I'm seeing very consistent build times
> around an hour, which is reasonable.  I get a little under an hour on
> ext4 single disk, again, very consistently.
> 
> However, if I build a btrfs file system across the 5 disks, my build
> times increase to around 1.5 - 2hrs, although there's about a 30min
> variation between different runs.
> 
> If I build a btrfs file system across the 5-way lvm stripe, I get even
> worse performance at around 2.5hrs per build, with about a 45min
> variation between runs.
> 
> I can't explain these last two results.  Any theories?

If you just want theory, I can try. :-)

Theory of striping follows (numbers invented).
If you have a stripe size of 8 sectors, 40 successive sectors are
divided into 5 groups of 8 sectors, with each group on a different disk.
Suppose you want to read 40 sectors; with one disk and no striping
you need:
    time_to_place_disk_head_and_rotational_latency (10ms)
  + time to read 40 sectors (around 0ms)
Suppose you want to read 40 sectors with your 5disk striped volume,
you need
    time_to_place_disk_head_and_rotational_latency (10ms)
  + time to read 8 sectors (around 0ms)
    time_to_place_disk_head_and_rotational_latency (10ms)
  + time to read 8 sectors (around 0ms)
    time_to_place_disk_head_and_rotational_latency (10ms)
  + time to read 8 sectors (around 0ms)
    time_to_place_disk_head_and_rotational_latency (10ms)
  + time to read 8 sectors (around 0ms)
    time_to_place_disk_head_and_rotational_latency (10ms)
  + time to read 8 sectors (around 0ms)
so you are 5 times slower.  Now, it could be that you submit the
5 requests together; in that case you do not pay a 5-times penalty,
but a 2-times penalty.  Why?  Because of rotational latency: if a
disk takes 10ms to do one rotation, your data will be ready after a
random time uniformly distributed between 0 and 10ms (average 5ms).
If you submit 5 commands to 5 disks, each of them will have an
(independent!) uniform distribution between 0 and 10ms; since you
need all 5 pieces, you have to wait for the unluckiest of the disks,
so your average will be near 10ms.
So in general striping costs you a 2x to 5x speed penalty.
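Roberto's max-of-several-uniforms argument can be checked numerically. Here is a minimal Monte Carlo sketch (numbers invented to match the 10ms rotation example above; this is an illustration, not code from the thread):

```python
import random

ROTATION_MS = 10.0   # one full platter rotation, as in the example above
TRIALS = 200_000     # Monte Carlo samples

def striped_read_latency(n_disks: int) -> float:
    """Average time until *all* n_disks have delivered their stripe;
    each disk's rotational delay is uniform on [0, ROTATION_MS)."""
    total = 0.0
    for _ in range(TRIALS):
        total += max(random.uniform(0, ROTATION_MS) for _ in range(n_disks))
    return total / TRIALS

# E[max of n independent U(0, T)] = T * n / (n + 1):
# 1 disk averages ~5ms, 5 disks ~8.3ms, approaching the full 10ms rotation.
for n in (1, 5):
    print(f"{n} disk(s): ~{striped_read_latency(n):.1f}ms "
          f"(theory: {ROTATION_MS * n / (n + 1):.1f}ms)")
```

So waiting for the slowest of 5 disks roughly doubles the average rotational latency of a single disk, exactly the 2x end of the penalty range above.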

If your build is really parallel, while one process is waiting for
data, another one will make requests.  But remember that all the disks
are already busy because of the first process, so it is not unreasonable
that multiple processes gain no speed at all.
In reality, the first 5 requests and the second 5 could be evaluated
at the same time, so as to give precedence to whichever of the two is
easier for the drive (maybe the second one is lucky from a rotational
point of view, so it is better to serve it before the first).  In this
case the disks are better utilized, but the net effect on the overall
build is not so easy to establish, because when you give precedence
to the second request you are delaying the first, so the entire
first 40-sector read could see its timing worsen from
0-10ms_almost_surely_10 to 0-20ms_maybe_15.

There is a lot of maths you can study (queueing theory and scheduling
algorithms) and a lot of factors that can be important (disk queue size,
NCQ, caching) at various levels (OS, controller, disk).

In my opinion, the basic rule in these cases should be:
  == use a stripe size bigger than the sizes of your random reads ==
In case of seeky load I would personally use a stripe size of 64MiB,
for example. One read should only involve one disk.
Stripe size is often configured with very small values (such as 4KiB),
because it produces very big numbers when you read sequentially
(as you are really using all the disks together).
But latency sucks.
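That rule of thumb can be made quantitative with a back-of-envelope model (my illustration, not from the thread): a read of r bytes starting at a uniformly random byte offset within a stripe of S bytes straddles a stripe boundary, and hence touches a second disk, with probability (r-1)/S:

```python
def multi_disk_fraction(read_size: int, stripe_size: int) -> float:
    """Fraction of randomly placed reads that cross a stripe boundary
    and therefore involve more than one disk (back-of-envelope model:
    uniformly random byte-aligned start offset within the stripe)."""
    if read_size >= stripe_size:
        return 1.0  # the read can never fit inside a single stripe
    return (read_size - 1) / stripe_size

KiB, MiB = 1024, 1024 * 1024
# A 128KiB random read on a 4KiB stripe always spans several disks...
print(multi_disk_fraction(128 * KiB, 4 * KiB))            # 1.0
# ...while on a 64MiB stripe it almost always stays on one disk.
print(f"{multi_disk_fraction(128 * KiB, 64 * MiB):.4f}")  # ~0.0020
```

With a stripe much bigger than the typical read, almost every random read pays one seek on one disk instead of seeks on all of them.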

In your case, the build is probably very seeky, and the "seekiness"
could be exacerbated by having many writes (and things become even worse
when the filesystems involve a journal...).

(sorry for the long mail. you asked for a theory :-) )
-- 
   Roberto Ragusa    mail at robertoragusa.it

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Confused by performance
       [not found]       ` <AANLkTinM6ab_KEynfgvVT9v5TmcogoLZ0PLAz2oPnsiS@mail.gmail.com>
@ 2010-06-16 19:35         ` Freddie Cash
  2010-06-16 19:56           ` Roberto Ragusa
  2010-06-17  6:57           ` David Brown
  0 siblings, 2 replies; 10+ messages in thread
From: Freddie Cash @ 2010-06-16 19:35 UTC (permalink / raw)
  To: linux-btrfs

<snip a lot of fancy math that missed the point>

That's all well and good, but you missed the part where he said ext2
on a 5-way LVM stripeset is many times faster than btrfs on a 5-way
btrfs stripeset.

IOW, same 5-way stripeset, different filesystems and volume managers,
and very different performance.

And he's wondering why the btrfs method used for striping is so much
slower than the lvm method used for striping.

--
Freddie Cash
fjwcash@gmail.com

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Confused by performance
  2010-06-16 19:35         ` Freddie Cash
@ 2010-06-16 19:56           ` Roberto Ragusa
  2010-06-17  6:57           ` David Brown
  1 sibling, 0 replies; 10+ messages in thread
From: Roberto Ragusa @ 2010-06-16 19:56 UTC (permalink / raw)
  To: linux-btrfs

Freddie Cash wrote:
> <snip a lot of fancy math that missed the point>
>
> That's all well and good, but you missed the part where he said ext2
> on a 5-way LVM stripeset is many times faster than btrfs on a 5-way
> btrfs stripeset.
>
> IOW, same 5-way stripeset, different filesystems and volume managers,
> and very different performance.
>
> And he's wondering why the btrfs method used for striping is so much
> slower than the lvm method used for striping.

Sorry, I missed the first line, where ext2 on a 5-disk lvm stripe is said to be fast.
I was commenting as if the last two results were ext2 on 5 disks and btrfs
on 5 disks.

I'd say the great variation between successive runs is important and seems
to point to some bigger problem.

-- 
   Roberto Ragusa    mail at robertoragusa.it

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Confused by performance
  2010-06-16 18:08   ` K. Richard Pixley
  2010-06-16 19:21     ` Roberto Ragusa
@ 2010-06-16 21:44     ` Daniel J Blueman
  2010-06-17  9:57     ` Chris Mason
  2 siblings, 0 replies; 10+ messages in thread
From: Daniel J Blueman @ 2010-06-16 21:44 UTC (permalink / raw)
  To: K. Richard Pixley; +Cc: linux-btrfs

On Wed, Jun 16, 2010 at 7:08 PM, K. Richard Pixley <rich@noir.com> wrote:
> Once again I'm stumped by some performance numbers and hoping for some
> insight.
>
> Using an 8-core server, building in parallel, I'm building some code.  Using
> ext2 over a 5-way, (5 disk), lvm partition, I can build that code in 35
> minutes.  Tests with dd on the raw disk and lvm partitions show me that I'm
> getting near linear improvement from the raw stripe, even with dd runs
> exceeding 10G, so I think that convinces me that my disks and controller
> subsystem are capable of operating in parallel and in concert.  hdparm -t
> numbers seem to support what I'm seeing from dd.
>
> Running the same build, same parallelism, over a btrfs (defaults) partition
> on a single drive, I'm seeing very consistent build times around an hour,
> which is reasonable.  I get a little under an hour on ext4 single disk,
> again, very consistently.
>
> However, if I build a btrfs file system across the 5 disks, my build times
> increase to around 1.5 - 2hrs, although there's about a 30min variation
> between different runs.
>
> If I build a btrfs file system across the 5-way lvm stripe, I get even worse
> performance at around 2.5hrs per build, with about a 45min variation between
> runs.
>
> I can't explain these last two results.  Any theories?

Try mounting the BTRFS filesystem with 'nobarrier', since this may be
an obvious difference. Also, for metadata-write-intensive workloads,
when creating the filesystem try 'mkfs.btrfs -m single'. Of course,
all this doesn't explain the variance.

I'd say it's worth employing 'blktrace' to see what's happening at a
lower level, and even e.g. varying between the deadline/CFQ I/O schedulers.

Daniel
-- 
Daniel J Blueman

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Confused by performance
  2010-06-16 19:35         ` Freddie Cash
  2010-06-16 19:56           ` Roberto Ragusa
@ 2010-06-17  6:57           ` David Brown
  1 sibling, 0 replies; 10+ messages in thread
From: David Brown @ 2010-06-17  6:57 UTC (permalink / raw)
  To: linux-btrfs

On 16/06/2010 21:35, Freddie Cash wrote:
> <snip a lot of fancy math that missed the point>
>
> That's all well and good, but you missed the part where he said ext2
> on a 5-way LVM stripeset is many times faster than btrfs on a 5-way
> btrfs stripeset.
>
> IOW, same 5-way stripeset, different filesystems and volume managers,
> and very different performance.
>
> And he's wondering why the btrfs method used for striping is so much
> slower than the lvm method used for striping.
>

This could easily be explained by Roberto's theory and maths - if the 
lvm stripe set used large stripe sizes so that the random reads were 
mostly read from a single disk, it would be fast.  If the btrfs stripes 
were small, then it would be slow due to all the extra seeks.

Do we know anything about the stripe sizes used?



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Confused by performance
  2010-06-16 18:08   ` K. Richard Pixley
  2010-06-16 19:21     ` Roberto Ragusa
  2010-06-16 21:44     ` Daniel J Blueman
@ 2010-06-17  9:57     ` Chris Mason
  2 siblings, 0 replies; 10+ messages in thread
From: Chris Mason @ 2010-06-17  9:57 UTC (permalink / raw)
  To: K. Richard Pixley; +Cc: linux-btrfs

On Wed, Jun 16, 2010 at 11:08:48AM -0700, K. Richard Pixley wrote:
> Once again I'm stumped by some performance numbers and hoping for
> some insight.
> 
> Using an 8-core server, building in parallel, I'm building some
> code.  Using ext2 over a 5-way, (5 disk), lvm partition, I can build
> that code in 35 minutes.  Tests with dd on the raw disk and lvm
> partitions show me that I'm getting near linear improvement from the
> raw stripe, even with dd runs exceeding 10G, so I think that
> convinces me that my disks and controller subsystem are capable of
> operating in parallel and in concert.  hdparm -t numbers seem to
> support what I'm seeing from dd.
> 
> Running the same build, same parallelism, over a btrfs (defaults)
> partition on a single drive, I'm seeing very consistent build times
> around an hour, which is reasonable.  I get a little under an hour
> on ext4 single disk, again, very consistently.
> 
> However, if I build a btrfs file system across the 5 disks, my build
> times increase to around 1.5 - 2hrs, although there's about a 30min
> variation between different runs.
> 
> If I build a btrfs file system across the 5-way lvm stripe, I get
> even worse performance at around 2.5hrs per build, with about a
> 45min variation between runs.
> 
> I can't explain these last two results.  Any theories?

I suspect they come down to different raid levels done by btrfs, and
maybe barriers.

By default btrfs will duplicate metadata, so ext2 is doing much less
metadata IO than btrfs does.

Try mkfs.btrfs -m raid0 -d raid0 /dev/xxx /dev/xxy ...

Then try mount -o nobarrier /dev/xxx /mnt

Someone else mentioned blktrace; it would help explain things if you're
interested in tracking this down.

-chris


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2010-06-17  9:57 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-05-24 21:08 Confused by performance K. Richard Pixley
2010-05-25  3:59 ` Mike Fedyk
2010-05-28  1:45 ` K. Richard Pixley
2010-06-16 18:08   ` K. Richard Pixley
2010-06-16 19:21     ` Roberto Ragusa
     [not found]       ` <AANLkTinM6ab_KEynfgvVT9v5TmcogoLZ0PLAz2oPnsiS@mail.gmail.com>
2010-06-16 19:35         ` Freddie Cash
2010-06-16 19:56           ` Roberto Ragusa
2010-06-17  6:57           ` David Brown
2010-06-16 21:44     ` Daniel J Blueman
2010-06-17  9:57     ` Chris Mason
