* replication write speed
@ 2011-05-09  2:43 Simon Tian
  2011-05-09  2:50 ` Gregory Farnum
  0 siblings, 1 reply; 8+ messages in thread
From: Simon Tian @ 2011-05-09  2:43 UTC (permalink / raw)
  To: ceph-devel

Hi folks,

      I am testing the replication performance of ceph-0.26 with
libceph: I write 1 GB of data with ceph_write() and read it back with
ceph_read():

rep_size    1            2            3            4
write:      78.8 MB/s    39.38 MB/s   27.7 MB/s    20.90 MB/s
read:       85.3 MB/s    85.33 MB/s   78.77 MB/s   78.77 MB/s
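
(For reference, the test loop is essentially the following. This is only a
sketch in the style of the libcephfs C API, not my exact code; the 0.26
signatures may differ, the path, chunk size and flags are placeholders, and
error handling is dropped.)

#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <cephfs/libcephfs.h>

int main(void)
{
    struct ceph_mount_info *cmount;
    const int64_t chunk = 4 << 20;      /* 4 MB per call, placeholder */
    const int64_t total = 1LL << 30;    /* 1 GB in total */
    char *buf = malloc(chunk);
    memset(buf, 'x', chunk);

    ceph_create(&cmount, NULL);         /* default client id */
    ceph_conf_read_file(cmount, NULL);  /* default ceph.conf locations */
    ceph_mount(cmount, "/");

    int fd = ceph_open(cmount, "/bench.dat", O_CREAT | O_WRONLY, 0644);
    for (int64_t off = 0; off < total; off += chunk)
        ceph_write(cmount, fd, buf, chunk, off);   /* this loop is timed */
    ceph_close(cmount, fd);

    /* ... then the file is reopened read-only and read back the same
     * way with ceph_read() to get the read numbers above. */

    ceph_unmount(cmount);
    ceph_shutdown(cmount);
    free(buf);
    return 0;
}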

I think that if the replication strategy is splay or primary-copy rather
than chain, as the thesis says, then the write speed for 3, 4 or even more
replicas should be only a little worse than with 2 replicas, i.e. close to
39.38 MB/s. But the write performance I got is affected heavily by the
replication size.

What is the replication strategy in ceph-0.26? Is it not splay? And if it
is splay, why is the speed not close to 39.38 MB/s?

There are 5 OSDs on 2 hosts: 2 on one host and 3 on the other.

Thx!


* Re: replication write speed
  2011-05-09  2:43 replication write speed Simon Tian
@ 2011-05-09  2:50 ` Gregory Farnum
  2011-05-09  3:04   ` Simon Tian
  0 siblings, 1 reply; 8+ messages in thread
From: Gregory Farnum @ 2011-05-09  2:50 UTC (permalink / raw)
  To: Simon Tian; +Cc: ceph-devel

On Sun, May 8, 2011 at 7:43 PM, Simon Tian <aixt2006@gmail.com> wrote:
> Hi folks,
>
>      I am testing the replication performance of ceph-0.26 with
> libceph: I write 1 GB of data with ceph_write() and read it back with
> ceph_read():
>
> rep_size    1            2            3            4
> write:      78.8 MB/s    39.38 MB/s   27.7 MB/s    20.90 MB/s
> read:       85.3 MB/s    85.33 MB/s   78.77 MB/s   78.77 MB/s
>
> I think that if the replication strategy is splay or primary-copy rather
> than chain, as the thesis says, then the write speed for 3, 4 or even more
> replicas should be only a little worse than with 2 replicas, i.e. close to
> 39.38 MB/s. But the write performance I got is affected heavily by the
> replication size.
>
> What is the replication strategy in ceph-0.26? Is it not splay? And if it
> is splay, why is the speed not close to 39.38 MB/s?
>
> There are 5 OSDs on 2 hosts: 2 on one host and 3 on the other.

The replication strategy has been fixed at primary copy for several
years now. At expected replication levels (2-3) there just isn't a big
difference between the strategies, and limiting it to primary-copy
replication makes a lot of the bookkeeping for data safety much easier
to handle.
-Greg


* Re: replication write speed
  2011-05-09  2:50 ` Gregory Farnum
@ 2011-05-09  3:04   ` Simon Tian
  2011-05-09  3:31     ` Gregory Farnum
  2011-05-09  3:48     ` Simon Tian
  0 siblings, 2 replies; 8+ messages in thread
From: Simon Tian @ 2011-05-09  3:04 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel

2011/5/9 Gregory Farnum <gregf@hq.newdream.net>:
> On Sun, May 8, 2011 at 7:43 PM, Simon Tian <aixt2006@gmail.com> wrote:
>> Hi folks,
>>
>>      I am testing the replication performance of ceph-0.26 with
>> libceph: I write 1 GB of data with ceph_write() and read it back with
>> ceph_read():
>>
>> rep_size    1            2            3            4
>> write:      78.8 MB/s    39.38 MB/s   27.7 MB/s    20.90 MB/s
>> read:       85.3 MB/s    85.33 MB/s   78.77 MB/s   78.77 MB/s
>>
>> I think that if the replication strategy is splay or primary-copy rather
>> than chain, as the thesis says, then the write speed for 3, 4 or even more
>> replicas should be only a little worse than with 2 replicas, i.e. close to
>> 39.38 MB/s. But the write performance I got is affected heavily by the
>> replication size.
>>
>> What is the replication strategy in ceph-0.26? Is it not splay? And if it
>> is splay, why is the speed not close to 39.38 MB/s?
>>
>> There are 5 OSDs on 2 hosts: 2 on one host and 3 on the other.
>
> The replication strategy has been fixed at primary copy for several
> years now. At expected replication levels (2-3) there just isn't a big
> difference between the strategies, and limiting it to primary-copy
> replication makes a lot of the bookkeeping for data safety much easier
> to handle.

As you know, I am new to Ceph, haha

For primary copy, I think that when the replication size is 3, 4, or even
more, the write speed should also be close to that of 2 replicas, because
the 2nd, 3rd, 4th, ... replicas are written in parallel. But the speed
I got for 3 and 4 replicas is not close to the speed for 2; in fact it
drops almost linearly.

Thx very much!


* Re: replication write speed
  2011-05-09  3:04   ` Simon Tian
@ 2011-05-09  3:31     ` Gregory Farnum
  2011-05-09  7:01       ` Simon Tian
  2011-05-09  3:48     ` Simon Tian
  1 sibling, 1 reply; 8+ messages in thread
From: Gregory Farnum @ 2011-05-09  3:31 UTC (permalink / raw)
  To: Simon Tian; +Cc: ceph-devel

On Sun, May 8, 2011 at 8:04 PM, Simon Tian <aixt2006@gmail.com> wrote:
> For primary copy, I think that when the replication size is 3, 4, or even
> more, the write speed should also be close to that of 2 replicas, because
> the 2nd, 3rd, 4th, ... replicas are written in parallel. But the speed
> I got for 3 and 4 replicas is not close to the speed for 2; in fact it
> drops almost linearly.
You're hitting your network limits there. With primary copy then the
primary needs to send out the data to each of the replicas, which caps
the write speed at (network bandwidth) / (num replicas). Presumably
you're using a gigabit network (or at least your nodes have gigabit
connections):
1 replica: ~125MB/s (really a bit less due to protocol overhead)
2 replicas:~62MB/s
3 replicas: ~40MB/s
4 replicas: ~31MB/s
etc.
Of course, you can also be limited by the speed of your disks (don't
forget to take journaling into account); and your situation is further
complicated by having multiple daemons per physical node. But I
suspect you get the idea. :)
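
To spell the arithmetic out (a toy calculation, not Ceph code; it just
divides a raw ~1Gb/s link by the replica count, so it lands near, not
exactly on, the numbers above):

#include <stdio.h>

int main(void)
{
    /* Rule of thumb from above: client write speed is capped at
     * (network bandwidth) / (number of replicas). */
    const double link_mb_s = 125.0;   /* ~1 Gb/s expressed in MB/s */
    for (int replicas = 1; replicas <= 4; replicas++)
        printf("%d replica(s): ~%.0f MB/s\n",
               replicas, link_mb_s / replicas);
    return 0;
}
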
-Greg


* Re: replication write speed
  2011-05-09  3:04   ` Simon Tian
  2011-05-09  3:31     ` Gregory Farnum
@ 2011-05-09  3:48     ` Simon Tian
  1 sibling, 0 replies; 8+ messages in thread
From: Simon Tian @ 2011-05-09  3:48 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel

So, I want speeds like this, not a linear reduction as the replication
size increases:

rep_size   1           2            3             4              5              ...   n
write:     78.8 MB/s   39.38 MB/s   (38.7 MB/s)   (37.90 MB/s)   (37.03 MB/s)   ...   (32.13 MB/s)
read:      85.3 MB/s   85.33 MB/s   78.77 MB/s    78.77 MB/s     (78.77 MB/s)   ...   (78.77 MB/s)

Hmm, so how can I get a perfect result like this with primary-copy replication?

Thx very much!

Simon

2011/5/9 Simon Tian <aixt2006@gmail.com>:
> 2011/5/9 Gregory Farnum <gregf@hq.newdream.net>:
>> On Sun, May 8, 2011 at 7:43 PM, Simon Tian <aixt2006@gmail.com> wrote:
>>> Hi folks,
>>>
>>>      I am testing the replication performance of ceph-0.26 with
>>> libceph: I write 1 GB of data with ceph_write() and read it back with
>>> ceph_read():
>>>
>>> rep_size    1            2            3            4
>>> write:      78.8 MB/s    39.38 MB/s   27.7 MB/s    20.90 MB/s
>>> read:       85.3 MB/s    85.33 MB/s   78.77 MB/s   78.77 MB/s
>>>
>>> I think that if the replication strategy is splay or primary-copy rather
>>> than chain, as the thesis says, then the write speed for 3, 4 or even more
>>> replicas should be only a little worse than with 2 replicas, i.e. close to
>>> 39.38 MB/s. But the write performance I got is affected heavily by the
>>> replication size.
>>>
>>> What is the replication strategy in ceph-0.26? Is it not splay? And if it
>>> is splay, why is the speed not close to 39.38 MB/s?
>>>
>>> There are 5 OSDs on 2 hosts: 2 on one host and 3 on the other.
>>
>> The replication strategy has been fixed at primary copy for several
>> years now. At expected replication levels (2-3) there just isn't a big
>> difference between the strategies, and limiting it to primary-copy
>> replication makes a lot of the bookkeeping for data safety much easier
>> to handle.
>
> As you know, I am new to Ceph, haha
>
> For primary copy, I think that when the replication size is 3, 4, or even
> more, the write speed should also be close to that of 2 replicas, because
> the 2nd, 3rd, 4th, ... replicas are written in parallel. But the speed
> I got for 3 and 4 replicas is not close to the speed for 2; in fact it
> drops almost linearly.
>
> Thx very much!
>


* Re: replication write speed
  2011-05-09  3:31     ` Gregory Farnum
@ 2011-05-09  7:01       ` Simon Tian
  2011-05-09 16:02         ` Gregory Farnum
  0 siblings, 1 reply; 8+ messages in thread
From: Simon Tian @ 2011-05-09  7:01 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel


2011/5/9 Gregory Farnum <gregf@hq.newdream.net>:
> On Sun, May 8, 2011 at 8:04 PM, Simon Tian <aixt2006@gmail.com> wrote:
>> For primary copy, I think that when the replication size is 3, 4, or even
>> more, the write speed should also be close to that of 2 replicas, because
>> the 2nd, 3rd, 4th, ... replicas are written in parallel. But the speed
>> I got for 3 and 4 replicas is not close to the speed for 2; in fact it
>> drops almost linearly.
> You're hitting your network limits there. With primary copy then the
> primary needs to send out the data to each of the replicas, which caps
> the write speed at (network bandwidth) / (num replicas). Presumably
> you're using a gigabit network (or at least your nodes have gigabit
> connections):
> 1 replica: ~125MB/s (really a bit less due to protocol overhead)
> 2 replicas:~62MB/s
> 3 replicas: ~40MB/s
> 4 replicas: ~31MB/s
> etc.
> Of course, you can also be limited by the speed of your disks (don't
> forget to take journaling into account); and your situation is further
> complicated by having multiple daemons per physical node. But I
> suspect you get the idea. :)


  Yes, you are quite right! The client throughput at any replication size
will be limited by the network bandwidth of the primary OSD.


I have some other questions:
1. If I have to write or read a sparse file randomly, will the
performance drop much?

2. Is an RBD image a sparse file?

3. As the attachment shows, the read throughput increases as the I/O
size increases.
   What does this I/O size mean? Is there any relationship between the I/O
size and the object size?
   In the latest Ceph, what will the read throughput of different file
systems look like with different I/O sizes?

Thx very much!
Simon

[-- Attachment #2: fs_throughput.jpg --]
[-- Type: image/jpeg, Size: 27839 bytes --]


* Re: replication write speed
  2011-05-09  7:01       ` Simon Tian
@ 2011-05-09 16:02         ` Gregory Farnum
  2011-05-10  2:25           ` Simon Tian
  0 siblings, 1 reply; 8+ messages in thread
From: Gregory Farnum @ 2011-05-09 16:02 UTC (permalink / raw)
  To: Simon Tian; +Cc: ceph-devel

On Mon, May 9, 2011 at 12:01 AM, Simon Tian <aixt2006@gmail.com> wrote:
> 2011/5/9 Gregory Farnum <gregf@hq.newdream.net>:
>> On Sun, May 8, 2011 at 8:04 PM, Simon Tian <aixt2006@gmail.com> wrote:
>>> For primary copy, I think that when the replication size is 3, 4, or even
>>> more, the write speed should also be close to that of 2 replicas, because
>>> the 2nd, 3rd, 4th, ... replicas are written in parallel. But the speed
>>> I got for 3 and 4 replicas is not close to the speed for 2; in fact it
>>> drops almost linearly.
>> You're hitting your network limits there. With primary copy then the
>> primary needs to send out the data to each of the replicas, which caps
>> the write speed at (network bandwidth) / (num replicas). Presumably
>> you're using a gigabit network (or at least your nodes have gigabit
>> connections):
>> 1 replica: ~125MB/s (really a bit less due to protocol overhead)
>> 2 replicas:~62MB/s
>> 3 replicas: ~40MB/s
>> 4 replicas: ~31MB/s
>> etc.
>> Of course, you can also be limited by the speed of your disks (don't
>> forget to take journaling into account); and your situation is further
>> complicated by having multiple daemons per physical node. But I
>> suspect you get the idea. :)
>
>
>  Yes, you are quite right! The client throughput at any replication size
> will be limited by the network bandwidth of the primary OSD.
>
>
> I have some other questions:
> 1. If I have to write or read a sparse file randomly, will the
> performance drop much?
That depends on how large your random IOs are, how much of the file is
cached in-memory on the OSDs, etc. In general, random IO does not look
a lot different than sequential IO to the OSDs -- since the OSDs store
files in 4MB blocks then any large file read will involve retrieving
random 4MB blocks from the OSD anyway. On the client you might see a
bigger difference, though -- there is a limited amount of prefetching
going on client-side and it will work much better with sequential than
random reads.

But behavior under different workloads is an area that still needs
more study and refinement.

> 2. Is an RBD image a sparse file?
Yes! As with files in the POSIX-compatible Ceph layer, RBD images are
stored in blocks (4MB by default) on the OSDs. Only those chunks with
data actually exist, and depending on your options and the backing
filesystem, only the piece of the chunk with data is actually stored.
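
To make the chunking concrete: with the default layout, mapping a file or
image offset to an object is just integer division. A quick sketch (it
assumes the default 4MB object size and no custom striping; the numbers
are only an example):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    const uint64_t object_size = 4ULL << 20;      /* 4 MB default */
    uint64_t offset = 10ULL << 20;                /* e.g. 10 MB into the file */

    uint64_t object_no  = offset / object_size;   /* which 4 MB object */
    uint64_t obj_offset = offset % object_size;   /* where inside that object */

    printf("file offset %llu -> object #%llu, offset %llu within it\n",
           (unsigned long long)offset,
           (unsigned long long)object_no,
           (unsigned long long)obj_offset);
    return 0;
}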

> 3. As the attachment shows, the read throughput increases as the I/O
> size increases.
>   What does this I/O size mean? Is there any relationship between the I/O
> size and the object size?
>   In the latest Ceph, what will the read throughput of different file
> systems look like with different I/O sizes?
Is that one of the illustrations from Sage's thesis?
In general larger IOs will have higher throughput for many of the same
reasons that larger IOs have higher throughput on hard drives: the OSD
still needs to retrieve the data from off-disk, and a larger IO size
will minimize the impact of the seek latency there. With very large
IOs, the client can dispatch multiple read requests at once, allowing
the seek latency on the OSDs to happen simultaneously rather than
sequentially.
You can obviously do IOs of any size without regard for the size of
the object; the client layers handle all the necessary translation.
In all versions of Ceph, you can expect higher throughput with larger
IO sizes. I'm not sure if that's what you mean?
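
A rough way to see the IO-size effect (a toy disk model with made-up seek
and streaming numbers, nothing Ceph-specific): each request pays a fixed
positioning cost plus transfer time, so throughput climbs toward the
streaming rate as the IO size grows.

#include <stdio.h>

int main(void)
{
    const double seek_s    = 0.008;   /* ~8 ms positioning cost, assumed */
    const double stream_mb = 100.0;   /* ~100 MB/s streaming rate, assumed */
    const double sizes_mb[] = { 0.064, 0.256, 1.0, 4.0, 16.0 };

    for (int i = 0; i < 5; i++) {
        double secs = seek_s + sizes_mb[i] / stream_mb;   /* time per IO */
        printf("%6.3f MB IOs -> ~%5.1f MB/s\n",
               sizes_mb[i], sizes_mb[i] / secs);
    }
    return 0;
}
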
-Greg


* Re: replication write speed
  2011-05-09 16:02         ` Gregory Farnum
@ 2011-05-10  2:25           ` Simon Tian
  0 siblings, 0 replies; 8+ messages in thread
From: Simon Tian @ 2011-05-10  2:25 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel

Thx a lot!  I've got it!
You've really pulled me deeper into Ceph! haha


2011/5/10 Gregory Farnum <gregf@hq.newdream.net>:
> On Mon, May 9, 2011 at 12:01 AM, Simon Tian <aixt2006@gmail.com> wrote:
>> 2011/5/9 Gregory Farnum <gregf@hq.newdream.net>:
>>> On Sun, May 8, 2011 at 8:04 PM, Simon Tian <aixt2006@gmail.com> wrote:
>>>> For primary copy, I think that when the replication size is 3, 4, or even
>>>> more, the write speed should also be close to that of 2 replicas, because
>>>> the 2nd, 3rd, 4th, ... replicas are written in parallel. But the speed
>>>> I got for 3 and 4 replicas is not close to the speed for 2; in fact it
>>>> drops almost linearly.
>>> You're hitting your network limits there. With primary copy then the
>>> primary needs to send out the data to each of the replicas, which caps
>>> the write speed at (network bandwidth) / (num replicas). Presumably
>>> you're using a gigabit network (or at least your nodes have gigabit
>>> connections):
>>> 1 replica: ~125MB/s (really a bit less due to protocol overhead)
>>> 2 replicas:~62MB/s
>>> 3 replicas: ~40MB/s
>>> 4 replicas: ~31MB/s
>>> etc.
>>> Of course, you can also be limited by the speed of your disks (don't
>>> forget to take journaling into account); and your situation is further
>>> complicated by having multiple daemons per physical node. But I
>>> suspect you get the idea. :)
>>
>>
>>  Yes, you are quite right! The client throughput at any replication size
>> will be limited by the network bandwidth of the primary OSD.
>>
>>
>> I have some other questions:
>> 1. If I have to write or read a sparse file randomly, will the
>> performance drop much?
> That depends on how large your random IOs are, how much of the file is
> cached in-memory on the OSDs, etc. In general, random IO does not look
> a lot different than sequential IO to the OSDs -- since the OSDs store
> files in 4MB blocks then any large file read will involve retrieving
> random 4MB blocks from the OSD anyway. On the client you might see a
> bigger difference, though -- there is a limited amount of prefetching
> going on client-side and it will work much better with sequential than
> random reads.
>
> But behavior under different workloads is an area that still needs
> more study and refinement.
>
>> 2. Is an RBD image a sparse file?
> Yes! As with files in the POSIX-compatible Ceph layer, RBD images are
> stored in blocks (4MB by default) on the OSDs. Only those chunks with
> data actually exist, and depending on your options and the backing
> filesystem, only the piece of the chunk with data is actually stored.
>
>> 3. As the attachment shows, the read throughput increases as the I/O
>> size increases.
>>   What does this I/O size mean? Is there any relationship between the I/O
>> size and the object size?
>>   In the latest Ceph, what will the read throughput of different file
>> systems look like with different I/O sizes?
> Is that one of the illustrations from Sage's thesis?
> In general larger IOs will have higher throughput for many of the same
> reasons that larger IOs have higher throughput on hard drives: the OSD
> still needs to retrieve the data from off-disk, and a larger IO size
> will minimize the impact of the seek latency there. With very large
> IOs, the client can dispatch multiple read requests at once, allowing
> the seek latency on the OSDs to happen simultaneously rather than
> sequentially.
> You can obviously do IOs of any size without regard for the size of
> the object; the client layers handle all the necessary translation.
> In all versions of Ceph, you can expect higher throughput with larger
> IO sizes. I'm not sure if that's what you mean?
> -Greg
>

