* ceph-0.77-900.gce9bfb8 Testing Rados EC/Tiering & CephFS ...
@ 2014-03-20 10:49 Andreas Joachim Peters
  2014-03-20 13:09 ` Mark Nelson
  2014-03-25 18:04 ` Gregory Farnum
  0 siblings, 2 replies; 8+ messages in thread
From: Andreas Joachim Peters @ 2014-03-20 10:49 UTC (permalink / raw)
  To: ceph-devel

Hi, 

I did some Firefly ceph-0.77-900.gce9bfb8 testing of EC/tiering, deploying 64 OSDs on in-memory filesystems (RapidDisk with ext4) on a single 256 GB box. The raw write performance of this box is ~3 GB/s in aggregate and ~450 MB/s per OSD, and it provides 250k IOPS per OSD.

I compared several algorithms and configurations ...

Here are the results with 4M objects and 32 client threads (there is no significant performance difference between 64 and 10 OSDs; I tried both, except for 24+8):

1 rep: 1.1 GB/s
2 rep: 886 MB/s
3 rep: 750 MB/s
cauchy 4+2: 880 MB/s
liber8tion 4+2: 875 MB/s
cauchy 6+3: 780 MB/s
cauchy 16+8: 520 MB/s
cauchy 24+8: 450 MB/s
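
For reference, the pools behind these numbers were created along these lines; the profile name and the exact option values below are illustrative rather than a record of my setup:

  # EC profile and pool for e.g. the cauchy 4+2 case (jerasure plugin)
  ceph osd erasure-code-profile set cauchy_4_2 \
      plugin=jerasure technique=cauchy_good k=4 m=2 ruleset-failure-domain=osd
  ceph osd pool create ecbench 1024 1024 erasure cauchy_4_2
  # write benchmark: 60 seconds, 32 concurrent 4M IOs
  rados bench -p ecbench 60 write -t 32 -b 4194304 --no-cleanup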

Then I added a single replica cache pool in front of cauchy 4+2.

The write performance is now 1.1 GB/s, as expected, while the cache is not full. If I shrink the cache pool to force continuous eviction during the benchmark, it degrades to a stable 140 MB/s.

Single-threaded client throughput drops from 260 MB/s to 165 MB/s.

What is strange to me is that after a "rados bench" run there are objects left in the cache and in the back-end tier. They only disappear if I set the cache mode to "forward" and force eviction. Is it the desired behaviour, by design, that the deletion is not applied immediately?
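
For context, the tiering setup and the manual flush were along these lines; the pool names and the target size are placeholders, not my exact values:

  # 1x replicated cache pool in front of the EC pool
  ceph osd tier add ecbench cache
  ceph osd tier cache-mode cache writeback
  ceph osd tier set-overlay ecbench cache
  ceph osd pool set cache target_max_bytes 10000000000   # shrink this to force eviction
  # what finally made the leftover objects disappear:
  ceph osd tier cache-mode cache forward
  rados -p cache cache-flush-evict-all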

Some observations:
- I think it is important to document the alignment requirements for appends to EC pools (e.g. a "rados put" needs aligned appends, and 4M blocks are not aligned for every combination of (k,m)).
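
A quick illustration of the alignment point, assuming a 4 KiB stripe unit (the pool's actual stripe unit may differ):

  $ echo $((4*1024*1024 % (4*4096))) $((4*1024*1024 % (6*4096)))
  0 16384

i.e. with k=4 a 4M append ends exactly on a stripe boundary, with k=6 it does not, so it would have to be padded up to a multiple of k * stripe_unit.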

- another observation is that it seems difficult to run 64 OSDs on one box. I hit no obvious memory limitation, but the setup requires ~30k threads, and it was difficult to create several pools with many PGs without OSDs core dumping because resources were not available.
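
The usual knobs for this kind of thread/fd exhaustion on a many-OSD box are roughly the following (example values only, not what this cluster ran with):

  sysctl -w kernel.pid_max=4194303      # the default of 32768 is easily exceeded by ~30k threads
  sysctl -w kernel.threads-max=1000000
  sysctl -w vm.max_map_count=524288
  ulimit -n 131072                      # per-process open files for each ceph-osd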

- when OSDs get 100% full they core dump most of the time. In my case all OSDs became full at the same time, and when this happened there was no way to get the cluster up again without manually deleting objects in the OSD directories to make some space.
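
For now the only workaround seems to be keeping OSDs from ever hitting the hard limit by lowering the full thresholds in ceph.conf (example values; the defaults are 0.85 nearfull / 0.95 full):

  [global]
      mon osd nearfull ratio = 0.80
      mon osd full ratio = 0.90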

- I get a syntax error from the Ceph CentOS (RHEL6) startup script:

awk: { d=$2/1073741824 ; r = sprintf(\"%.2f\", d); print r }
awk:                                 ^ backslash not last character on line
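
The error suggests awk literally receives the backslashes; the same one-liner parses fine without them, e.g.:

  $ echo "pool 2147483648" | awk '{ d=$2/1073741824 ; r = sprintf("%.2f", d); print r }'
  2.00

so presumably the quoting/escaping in the init script just needs adjusting.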

- I have several times run into a situation where the only way out was to delete the whole cluster and set it up again from scratch.

- I got this reproducible stack trace with an EC pool and a front-end tier:
osd/ReplicatedPG.cc: 5554: FAILED assert(cop->data.length() + cop->temp_cursor.data_offset == cop->cursor.data_offset)

 ceph version 0.77-900-gce9bfb8 (ce9bfb879c32690d030db6b2a349b7b6f6e6a468)
 1: (ReplicatedPG::_write_copy_chunk(boost::shared_ptr<ReplicatedPG::CopyOp>, PGBackend::PGTransaction*)+0x7dd) [0x8a376d]
 2: (ReplicatedPG::_build_finish_copy_transaction(boost::shared_ptr<ReplicatedPG::CopyOp>, PGBackend::PGTransaction*)+0x114) [0x8a3954]
 3: (ReplicatedPG::process_copy_chunk(hobject_t, unsigned long, int)+0x507) [0x8f1097]
 4: (C_Copyfrom::finish(int)+0xb7) [0x93fa67]
 5: (Context::complete(int)+0x9) [0x65d4b9]
 6: (Finisher::finisher_thread_entry()+0x1d8) [0xa9a528]
 7: /lib64/libpthread.so.0() [0x3386a079d1]
 8: (clone()+0x6d) [0x33866e8b6d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Moreover, I did some trivial testing of the metadata part of CephFS with ceph-fuse:

- I created a directory hierarchy of 10/1000/100 = roughly 1 million directories. After creation the MDS uses 5.5 GB of memory and ceph-fuse 1.8 GB. It takes 33 minutes to do "find /ceph" on this hierarchy. If I restart the MDS and repeat it, it takes 18 minutes. After this operation the MDS uses ~10 GB of memory (~10 KB per directory with a single entry).
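
The tree was generated with something along these lines (a sketch, not the exact script I used):

  # 10 x 1000 x 100 = 1,000,000 directories under the ceph-fuse mount at /ceph
  for a in $(seq 1 10); do
    for b in $(seq 1 1000); do
      for c in $(seq 1 100); do
        mkdir -p /ceph/$a/$b/$c
      done
    done
  done
  time find /ceph -type d | wc -l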

If I do "ls -laRt /ceph" I get "no such file or directory" after some time. When this happened one can pick one of the directory and do a single "ls -la <dir>". The first time one get's again "no such file or directory", the second time it eventually works and shows the contents.

Cheers Andreas.



 




* Re: ceph-0.77-900.gce9bfb8 Testing Rados EC/Tiering & CephFS ...
  2014-03-20 10:49 ceph-0.77-900.gce9bfb8 Testing Rados EC/Tiering & CephFS Andreas Joachim Peters
@ 2014-03-20 13:09 ` Mark Nelson
  2014-03-20 13:43   ` Andreas Joachim Peters
  2014-03-25 18:04 ` Gregory Farnum
  1 sibling, 1 reply; 8+ messages in thread
From: Mark Nelson @ 2014-03-20 13:09 UTC (permalink / raw)
  To: Andreas Joachim Peters; +Cc: ceph-devel

On 03/20/2014 05:49 AM, Andreas Joachim Peters wrote:
> Hi,
>
> I did some Firefly ceph-0.77-900.gce9bfb8 testing of EC/Tiering deploying 64 OSD with in-memory filesystems (RapidDisk with ext4) on a single 256 GB box. The raw write performance of this box is ~3 GB/s for all and ~450 MB/s per OSD. It provides 250k IOPS per OSD.
>
> I compared several algorithms and configurations ...
>
> Here are the results (there is no significant difference between 64 or 10 OSDS for the performance, tried both but not for 24+8 !) with 4M objects, 32 client threads ....
>
> 1 rep: 1.1 GB/s
> 2 rep: 886 MB/s
> 3 rep: 750 MB/s
> cauchy 4+2: 880 MB/s
> liber8tion: 4+2: 875 MB/s
> cauchy 6+3: 780 MB/s
> cauchy 16+8: 520 MB/s
> cauchy 24+8: 450 MB/s

How many copies of rados bench are you using and how much concurrency? 
Also, were the rados bench processes on an external host?

Here's what I'm seeing internally for 4MB writes on a box with 30 
spinning disks, 6 Intel 520 SSDs for journals, and bonded 10GbE.  This 
was using 4 copies of rados bench and 32 concurrent IOs per process. 
Note that 4MB writes are probably the best-case scenario for EC compared 
to replication as far as performance goes right now.

4MB READ

ec-4k1m:	1546.21 MB/s
ec-6k2m:	1308.52 MB/s
ec-10k3m:	1060.73 MB/s
2X rep:		2110.92 MB/s
3X rep:		2131.03 MB/s

4MB WRITE

ec-4k1m:	1063.72 MB/s
ec-6k2m:	750.381 MB/s
ec-10k3m:	493.615 MB/s
2X rep:		1171.35 MB/s
3X rep:		755.148 MB/s
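
For reference, that's roughly this invocation pattern per pool; the runtime below is a placeholder, and since rados bench normally includes the host name and PID in its object prefix the parallel writers shouldn't collide:

  for i in 1 2 3 4; do
      rados bench -p ec-4k1m 300 write -t 32 -b 4194304 --no-cleanup &
  done
  wait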


>
> Then I added a single replica cache pool in front of cauchy 4+2.
>
> The write performance is now 1.1 GB/s as expected when the cache is not full. If I shrink the cache pool in front forcing continuous eviction during the benchmark it degrades to stable 140 MB/s.

Mind trying that same test with just a simple 1x rep pool tiered to 
another 1x rep pool, and seeing what happens?

>
> The single threaded client reduces from 260 MB/s to 165 MB/s.
>
> What is strange to me is that after a "rados bench" there are objects left in the cache and the back-end tier. They only disappear if I set the "forward" and force the eviction. Is that by design the desired behaviour to not apply the deletion?
>
> Some observations:
> - I think it is important to document the alignment requirements for appends (e.g. if you do rados put it needs aligned appends and the 4M blocks are not aligned for every combination of (k,m) ).
>
> - another observation is that seems difficult to run 64 OSDs on a box. I have no obvious memory limitation but it requires ~30k threads and it was difficult to create several pools with many PGs without having OSDs core dumping because resources are not available.
>
> - when OSD get 100% full they core dump most of the time. In my case all OSDs become full at the same time and when this happended there is no way to get the cluster up again without manually deleting objects in the OSD directories and make some space.
>
> - I get a syntax error in the CEPH CENTOS(RHEL6) startup script:
>
> awk: { d=$2/1073741824 ; r = sprintf(\"%.2f\", d); print r }
> awk:                                 ^ backslash not last character on line
>
> - I have run several times into a situation where the only way out was to delete the whole cluster and set it up from scratch
>
> - I got this reproducable stack trace with a EC pool and a front end tier:
> osd/ReplicatedPG.cc: 5554: FAILED assert(cop->data.length() + cop->temp_cursor.data_offset == cop->cursor.data_offset)
>
>   ceph version 0.77-900-gce9bfb8 (ce9bfb879c32690d030db6b2a349b7b6f6e6a468)
>   1: (ReplicatedPG::_write_copy_chunk(boost::shared_ptr<ReplicatedPG::CopyOp>, PGBackend::PGTransaction*)+0x7dd) [0x8a376d]
>   2: (ReplicatedPG::_build_finish_copy_transaction(boost::shared_ptr<ReplicatedPG::CopyOp>, PGBackend::PGTransaction*)+0x114) [0x8a3954]
>   3: (ReplicatedPG::process_copy_chunk(hobject_t, unsigned long, int)+0x507) [0x8f1097]
>   4: (C_Copyfrom::finish(int)+0xb7) [0x93fa67]
>   5: (Context::complete(int)+0x9) [0x65d4b9]
>   6: (Finisher::finisher_thread_entry()+0x1d8) [0xa9a528]
>   7: /lib64/libpthread.so.0() [0x3386a079d1]
>   8: (clone()+0x6d) [0x33866e8b6d]
>   NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Haven't seen that one yet!  Can you create an urgent bug with the full 
stack trace and the pool settings for both the cache and the backing 
tier pool?  Thanks!

>
> Moreover I did some trivial testing of the meta data part of CephFS and ceph-fuse:
>
> - I created a directory hierarchy with like 10/1000/100 = 1 Mio directories. After creation the MDS uses 5.5 GB of memory, ceph-fuse 1.8 GB. It takes 33 minutes to do "find /ceph" on this hierarchy. If I restart the MDS and do the same it takes 18 minutes. After this operation the MDS uses ~10 GB of memory (10k per directory for one entry).
>
> If I do "ls -laRt /ceph" I get "no such file or directory" after some time. When this happened one can pick one of the directory and do a single "ls -la <dir>". The first time one get's again "no such file or directory", the second time it eventually works and shows the contents.
>
> Cheers Andreas.
>
>
>
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>



* RE: ceph-0.77-900.gce9bfb8 Testing Rados EC/Tiering & CephFS ...
  2014-03-20 13:09 ` Mark Nelson
@ 2014-03-20 13:43   ` Andreas Joachim Peters
  2014-03-20 13:55     ` Mark Nelson
  0 siblings, 1 reply; 8+ messages in thread
From: Andreas Joachim Peters @ 2014-03-20 13:43 UTC (permalink / raw)
  To: Mark Nelson; +Cc: ceph-devel

Hi Mark, 

I tested write performance with a single rados bench (32 threads), everything on localhost, so there is minimal networking and IO latency in this setup. The test runs CPU-bound with 32 parallel IOs. I used 4MB blocks; with bigger or smaller blocks the IO volume goes down. I don't need several instances of rados bench to saturate the host; I already reach comparable speed with 16 IOs. So this is not a realistic scenario, but it is a good baseline measurement and concurrency test.

Your write results look very comparable to mine. Since I run client and server together and everything in-memory, all these tests are probably limited by memory bandwidth and CPU speed. The machine is a 2x6-core 2 GHz Xeon with 1.3 GHz ECC RAM.

I will test the single-replica front- and back-end case and post the result. I was more surprised by the IOPS I can reach in this setup: not more than ~2.5k IOPS when writing, even for tiny blocks. On your setup you don't seem to get much more in the 4M case.

Cheers Andreas.

________________________________________
From: Mark Nelson [mark.nelson@inktank.com]
Sent: 20 March 2014 14:09
To: Andreas Joachim Peters
Cc: ceph-devel@vger.kernel.org
Subject: Re: ceph-0.77-900.gce9bfb8 Testing Rados EC/Tiering & CephFS ...

On 03/20/2014 05:49 AM, Andreas Joachim Peters wrote:
> Hi,
>
> I did some Firefly ceph-0.77-900.gce9bfb8 testing of EC/Tiering deploying 64 OSD with in-memory filesystems (RapidDisk with ext4) on a single 256 GB box. The raw write performance of this box is ~3 GB/s for all and ~450 MB/s per OSD. It provides 250k IOPS per OSD.
>
> I compared several algorithms and configurations ...
>
> Here are the results (there is no significant difference between 64 or 10 OSDS for the performance, tried both but not for 24+8 !) with 4M objects, 32 client threads ....
>
> 1 rep: 1.1 GB/s
> 2 rep: 886 MB/s
> 3 rep: 750 MB/s
> cauchy 4+2: 880 MB/s
> liber8tion: 4+2: 875 MB/s
> cauchy 6+3: 780 MB/s
> cauchy 16+8: 520 MB/s
> cauchy 24+8: 450 MB/s

How many copies of rados bench are you using and how much concurrency?
Also, were the rados bench processes on an external host?

Here's what I'm seeing internally for 4MB writes on a box with 30
spinning disks, 6 Intel 520 SSDs for journals, and bonded 10GbE.  This
was using 4 copies of rados bench and 32 concurrent IOs per process.
Note that 4MB writes is probably the best case scenario for EC compared
to replication as far as performance goes right now.

4MB READ

ec-4k1m:        1546.21 MB/s
ec-6k2m:        1308.52 MB/s
ec-10k3m:       1060.73 MB/s
2X rep:         2110.92 MB/s
3X rep:         2131.03 MB/s

4MB WRITE

ec-4k1m:        1063.72 MB/s
ec-6k2m:        750.381 MB/s
ec-10k3m:       493.615 MB/s
2X rep:         1171.35 MB/s
3X rep:         755.148 MB/s




>
> Then I added a single replica cache pool in front of cauchy 4+2.
>
> The write performance is now 1.1 GB/s as expected when the cache is not full. If I shrink the cache pool in front forcing continuous eviction during the benchmark it degrades to stable 140 MB/s.

Mind trying that same test with just a simple 1x rep pool to another 1x
rep pool and see what happens?

>
> The single threaded client reduces from 260 MB/s to 165 MB/s.
>
> What is strange to me is that after a "rados bench" there are objects left in the cache and the back-end tier. They only disappear if I set the "forward" and force the eviction. Is that by design the desired behaviour to not apply the deletion?
>
> Some observations:
> - I think it is important to document the alignment requirements for appends (e.g. if you do rados put it needs aligned appends and the 4M blocks are not aligned for every combination of (k,m) ).
>
> - another observation is that seems difficult to run 64 OSDs on a box. I have no obvious memory limitation but it requires ~30k threads and it was difficult to create several pools with many PGs without having OSDs core dumping because resources are not available.
>
> - when OSD get 100% full they core dump most of the time. In my case all OSDs become full at the same time and when this happended there is no way to get the cluster up again without manually deleting objects in the OSD directories and make some space.
>
> - I get a syntax error in the CEPH CENTOS(RHEL6) startup script:
>
> awk: { d=$2/1073741824 ; r = sprintf(\"%.2f\", d); print r }
> awk:                                 ^ backslash not last character on line
>
> - I have run several times into a situation where the only way out was to delete the whole cluster and set it up from scratch
>
> - I got this reproducable stack trace with a EC pool and a front end tier:
> osd/ReplicatedPG.cc: 5554: FAILED assert(cop->data.length() + cop->temp_cursor.data_offset == cop->cursor.data_offset)
>
>   ceph version 0.77-900-gce9bfb8 (ce9bfb879c32690d030db6b2a349b7b6f6e6a468)
>   1: (ReplicatedPG::_write_copy_chunk(boost::shared_ptr<ReplicatedPG::CopyOp>, PGBackend::PGTransaction*)+0x7dd) [0x8a376d]
>   2: (ReplicatedPG::_build_finish_copy_transaction(boost::shared_ptr<ReplicatedPG::CopyOp>, PGBackend::PGTransaction*)+0x114) [0x8a3954]
>   3: (ReplicatedPG::process_copy_chunk(hobject_t, unsigned long, int)+0x507) [0x8f1097]
>   4: (C_Copyfrom::finish(int)+0xb7) [0x93fa67]
>   5: (Context::complete(int)+0x9) [0x65d4b9]
>   6: (Finisher::finisher_thread_entry()+0x1d8) [0xa9a528]
>   7: /lib64/libpthread.so.0() [0x3386a079d1]
>   8: (clone()+0x6d) [0x33866e8b6d]
>   NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Haven't seen that one yet!  Can you create an urgent bug with the full
stack trace and the pool settings both for the cache and the tier pool?
  Thanks!

>
> Moreover I did some trivial testing of the meta data part of CephFS and ceph-fuse:
>
> - I created a directory hierarchy with like 10/1000/100 = 1 Mio directories. After creation the MDS uses 5.5 GB of memory, ceph-fuse 1.8 GB. It takes 33 minutes to do "find /ceph" on this hierarchy. If I restart the MDS and do the same it takes 18 minutes. After this operation the MDS uses ~10 GB of memory (10k per directory for one entry).
>
> If I do "ls -laRt /ceph" I get "no such file or directory" after some time. When this happened one can pick one of the directory and do a single "ls -la <dir>". The first time one get's again "no such file or directory", the second time it eventually works and shows the contents.
>
> Cheers Andreas.
>
>
>
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>



* Re: ceph-0.77-900.gce9bfb8 Testing Rados EC/Tiering & CephFS ...
  2014-03-20 13:43   ` Andreas Joachim Peters
@ 2014-03-20 13:55     ` Mark Nelson
  2014-03-20 16:42       ` Andreas Joachim Peters
  0 siblings, 1 reply; 8+ messages in thread
From: Mark Nelson @ 2014-03-20 13:55 UTC (permalink / raw)
  To: Andreas Joachim Peters; +Cc: ceph-devel

On 03/20/2014 08:43 AM, Andreas Joachim Peters wrote:
> Hi Mark,
>
> I tested write performance with a single rados bench (32 threads), everything on localhost, there is minimal latency on networking and IO in this setup. The test is running CPU bound with 32 paralell io's.  I used 4MB blocks, using bigger or smaller block IO volume goes down. I don't need several instances of rados bench to saturate the host. I reach already with 16 IOs comparable speed. So, that is anyway not a realistic scenario but a good baseline measurement and concurrency test.

Have you checked CPU usage during these tests?  EC is pretty intense, 
and running rados bench on the same host doesn't help.  Having multiple 
copies of rados bench going may potentially help for a couple of 
reasons, but moving it off to a separate host may be important too 
(assuming you've got the network throughput to make it work).

>
> Your write results look very compatible to mine. Since I run client and server together and all in-memory all this tests are probably limited by the memory bandwidth and the CPU speed. The machine is a 2x6 core 2GHz Xeon with 1.3GHz ECC RAM.

The machine in my test has 2x E5-2630L CPUs (6 core at 2GHz each).

>
> I will test with single replica front- and back-end case and post the result. I was more surprised by the IOPS I can reach in this setup. Not more then ~2.5k IOPS when writing even for tiny blocks. On your setup you seem not to get much more in the 4M case.

As time goes on we'll be able to optimize small IO performance with EC, 
but there's a lot more processing involved so I suspect it's always 
going to be slower (especially for reads!) than simple replication when 
latency is critical.

>
> Cheers Andreas.
>
> ________________________________________
> From: Mark Nelson [mark.nelson@inktank.com]
> Sent: 20 March 2014 14:09
> To: Andreas Joachim Peters
> Cc: ceph-devel@vger.kernel.org
> Subject: Re: ceph-0.77-900.gce9bfb8 Testing Rados EC/Tiering & CephFS ...
>
> On 03/20/2014 05:49 AM, Andreas Joachim Peters wrote:
>> Hi,
>>
>> I did some Firefly ceph-0.77-900.gce9bfb8 testing of EC/Tiering deploying 64 OSD with in-memory filesystems (RapidDisk with ext4) on a single 256 GB box. The raw write performance of this box is ~3 GB/s for all and ~450 MB/s per OSD. It provides 250k IOPS per OSD.
>>
>> I compared several algorithms and configurations ...
>>
>> Here are the results (there is no significant difference between 64 or 10 OSDS for the performance, tried both but not for 24+8 !) with 4M objects, 32 client threads ....
>>
>> 1 rep: 1.1 GB/s
>> 2 rep: 886 MB/s
>> 3 rep: 750 MB/s
>> cauchy 4+2: 880 MB/s
>> liber8tion: 4+2: 875 MB/s
>> cauchy 6+3: 780 MB/s
>> cauchy 16+8: 520 MB/s
>> cauchy 24+8: 450 MB/s
>
> How many copies of rados bench are you using and how much concurrency?
> Also, were the rados bench processes on an external host?
>
> Here's what I'm seeing internally for 4MB writes on a box with 30
> spinning disks, 6 Intel 520 SSDs for journals, and bonded 10GbE.  This
> was using 4 copies of rados bench and 32 concurrent IOs per process.
> Note that 4MB writes is probably the best case scenario for EC compared
> to replication as far as performance goes right now.
>
> 4MB READ
>
> ec-4k1m:        1546.21 MB/s
> ec-6k2m:        1308.52 MB/s
> ec-10k3m:       1060.73 MB/s
> 2X rep:         2110.92 MB/s
> 3X rep:         2131.03 MB/s
>
> 4MB WRITE
>
> ec-4k1m:        1063.72 MB/s
> ec-6k2m:        750.381 MB/s
> ec-10k3m:       493.615 MB/s
> 2X rep:         1171.35 MB/s
> 3X rep:         755.148 MB/s
>
>
>
>
>>
>> Then I added a single replica cache pool in front of cauchy 4+2.
>>
>> The write performance is now 1.1 GB/s as expected when the cache is not full. If I shrink the cache pool in front forcing continuous eviction during the benchmark it degrades to stable 140 MB/s.
>
> Mind trying that same test with just a simple 1x rep pool to another 1x
> rep pool and see what happens?
>
>>
>> The single threaded client reduces from 260 MB/s to 165 MB/s.
>>
>> What is strange to me is that after a "rados bench" there are objects left in the cache and the back-end tier. They only disappear if I set the "forward" and force the eviction. Is that by design the desired behaviour to not apply the deletion?
>>
>> Some observations:
>> - I think it is important to document the alignment requirements for appends (e.g. if you do rados put it needs aligned appends and the 4M blocks are not aligned for every combination of (k,m) ).
>>
>> - another observation is that seems difficult to run 64 OSDs on a box. I have no obvious memory limitation but it requires ~30k threads and it was difficult to create several pools with many PGs without having OSDs core dumping because resources are not available.
>>
>> - when OSD get 100% full they core dump most of the time. In my case all OSDs become full at the same time and when this happended there is no way to get the cluster up again without manually deleting objects in the OSD directories and make some space.
>>
>> - I get a syntax error in the CEPH CENTOS(RHEL6) startup script:
>>
>> awk: { d=$2/1073741824 ; r = sprintf(\"%.2f\", d); print r }
>> awk:                                 ^ backslash not last character on line
>>
>> - I have run several times into a situation where the only way out was to delete the whole cluster and set it up from scratch
>>
>> - I got this reproducable stack trace with a EC pool and a front end tier:
>> osd/ReplicatedPG.cc: 5554: FAILED assert(cop->data.length() + cop->temp_cursor.data_offset == cop->cursor.data_offset)
>>
>>    ceph version 0.77-900-gce9bfb8 (ce9bfb879c32690d030db6b2a349b7b6f6e6a468)
>>    1: (ReplicatedPG::_write_copy_chunk(boost::shared_ptr<ReplicatedPG::CopyOp>, PGBackend::PGTransaction*)+0x7dd) [0x8a376d]
>>    2: (ReplicatedPG::_build_finish_copy_transaction(boost::shared_ptr<ReplicatedPG::CopyOp>, PGBackend::PGTransaction*)+0x114) [0x8a3954]
>>    3: (ReplicatedPG::process_copy_chunk(hobject_t, unsigned long, int)+0x507) [0x8f1097]
>>    4: (C_Copyfrom::finish(int)+0xb7) [0x93fa67]
>>    5: (Context::complete(int)+0x9) [0x65d4b9]
>>    6: (Finisher::finisher_thread_entry()+0x1d8) [0xa9a528]
>>    7: /lib64/libpthread.so.0() [0x3386a079d1]
>>    8: (clone()+0x6d) [0x33866e8b6d]
>>    NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> Haven't seen that one yet!  Can you create an urgent bug with the full
> stack trace and the pool settings both for the cache and the tier pool?
>    Thanks!
>
>>
>> Moreover I did some trivial testing of the meta data part of CephFS and ceph-fuse:
>>
>> - I created a directory hierarchy with like 10/1000/100 = 1 Mio directories. After creation the MDS uses 5.5 GB of memory, ceph-fuse 1.8 GB. It takes 33 minutes to do "find /ceph" on this hierarchy. If I restart the MDS and do the same it takes 18 minutes. After this operation the MDS uses ~10 GB of memory (10k per directory for one entry).
>>
>> If I do "ls -laRt /ceph" I get "no such file or directory" after some time. When this happened one can pick one of the directory and do a single "ls -la <dir>". The first time one get's again "no such file or directory", the second time it eventually works and shows the contents.
>>
>> Cheers Andreas.
>>
>>
>>
>>
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>



* RE: ceph-0.77-900.gce9bfb8 Testing Rados EC/Tiering & CephFS ...
  2014-03-20 13:55     ` Mark Nelson
@ 2014-03-20 16:42       ` Andreas Joachim Peters
  2014-03-20 17:11         ` Mark Nelson
  0 siblings, 1 reply; 8+ messages in thread
From: Andreas Joachim Peters @ 2014-03-20 16:42 UTC (permalink / raw)
  To: Mark Nelson; +Cc: ceph-devel

Yes, 
I checked CPU usage ... the machine burns 70-100% CPU during writing. (4+2) is faster than 3 replicas, probably because there is only half the bandwidth to fan out behind the primary replica, although more computation is involved.

Several instances of rados bench give the same total bandwidth or IOPS as a single instance.

I get 2.2k IOPS with a single replica and 4k blocks; this uses 70% CPU, mainly in the ceph-osd processes ... rados bench itself uses only 2% of the CPU. With Cauchy (4,2) I get 700 IOPS for 4k blocks.

Just for fun I also looked at the read now:

Reading cached single-replica 4k blocks with 16 parallel IOs I get 12k IOPS. 

If I evict the blocks beforehand I get 900 IOPS reading via the tiering setup.
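
(The 4k read test is the usual populate-then-read-back rados bench pattern, roughly like this; durations are illustrative, not my exact runs:)

  rados bench -p cache 60 write -t 16 -b 4096 --no-cleanup
  rados bench -p cache 60 seq -t 16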

I tried, as you suggested, 1 replica in the cache and 1 in the backend, and I get very mixed results. When the cache is already full, it runs well for the first seconds, then stalls for a few seconds, then continues ... 

For reads it does matter to have several rados bench instances running ... I can push it to 16k IOPS reading cached 4k blocks.

Cheers Andreas.




________________________________________
From: Mark Nelson [mark.nelson@inktank.com]
Sent: 20 March 2014 14:55
To: Andreas Joachim Peters
Cc: ceph-devel@vger.kernel.org
Subject: Re: ceph-0.77-900.gce9bfb8 Testing Rados EC/Tiering & CephFS ...

On 03/20/2014 08:43 AM, Andreas Joachim Peters wrote:
> Hi Mark,
>
> I tested write performance with a single rados bench (32 threads), everything on localhost, there is minimal latency on networking and IO in this setup. The test is running CPU bound with 32 paralell io's.  I used 4MB blocks, using bigger or smaller block IO volume goes down. I don't need several instances of rados bench to saturate the host. I reach already with 16 IOs comparable speed. So, that is anyway not a realistic scenario but a good baseline measurement and concurrency test.

Have you checked CPU usage during these tests?  EC is pretty intense,
and running rados bench on the same host doesn't help.  Having multiple
copies of rados bench going potentially may help for a couple of
reasons, but moving it off to a separate host may be important too
(assuming you've got the network throughput to make it work).

>
> Your write results look very compatible to mine. Since I run client and server together and all in-memory all this tests are probably limited by the memory bandwidth and the CPU speed. The machine is a 2x6 core 2GHz Xeon with 1.3GHz ECC RAM.

The machine in my test has 2x E5-2630L CPUs (6 core at 2GHz each).

>
> I will test with single replica front- and back-end case and post the result. I was more surprised by the IOPS I can reach in this setup. Not more then ~2.5k IOPS when writing even for tiny blocks. On your setup you seem not to get much more in the 4M case.

As time goes on we'll be able to optimize small IO performance with EC,
but there's a lot more processing involved so I suspect it's always
going to be slower (especially for reads!) than simple replication when
latency is critical.

>
> Cheers Andreas.
>
> ________________________________________
> From: Mark Nelson [mark.nelson@inktank.com]
> Sent: 20 March 2014 14:09
> To: Andreas Joachim Peters
> Cc: ceph-devel@vger.kernel.org
> Subject: Re: ceph-0.77-900.gce9bfb8 Testing Rados EC/Tiering & CephFS ...
>
> On 03/20/2014 05:49 AM, Andreas Joachim Peters wrote:
>> Hi,
>>
>> I did some Firefly ceph-0.77-900.gce9bfb8 testing of EC/Tiering deploying 64 OSD with in-memory filesystems (RapidDisk with ext4) on a single 256 GB box. The raw write performance of this box is ~3 GB/s for all and ~450 MB/s per OSD. It provides 250k IOPS per OSD.
>>
>> I compared several algorithms and configurations ...
>>
>> Here are the results (there is no significant difference between 64 or 10 OSDS for the performance, tried both but not for 24+8 !) with 4M objects, 32 client threads ....
>>
>> 1 rep: 1.1 GB/s
>> 2 rep: 886 MB/s
>> 3 rep: 750 MB/s
>> cauchy 4+2: 880 MB/s
>> liber8tion: 4+2: 875 MB/s
>> cauchy 6+3: 780 MB/s
>> cauchy 16+8: 520 MB/s
>> cauchy 24+8: 450 MB/s
>
> How many copies of rados bench are you using and how much concurrency?
> Also, were the rados bench processes on an external host?
>
> Here's what I'm seeing internally for 4MB writes on a box with 30
> spinning disks, 6 Intel 520 SSDs for journals, and bonded 10GbE.  This
> was using 4 copies of rados bench and 32 concurrent IOs per process.
> Note that 4MB writes is probably the best case scenario for EC compared
> to replication as far as performance goes right now.
>
> 4MB READ
>
> ec-4k1m:        1546.21 MB/s
> ec-6k2m:        1308.52 MB/s
> ec-10k3m:       1060.73 MB/s
> 2X rep:         2110.92 MB/s
> 3X rep:         2131.03 MB/s
>
> 4MB WRITE
>
> ec-4k1m:        1063.72 MB/s
> ec-6k2m:        750.381 MB/s
> ec-10k3m:       493.615 MB/s
> 2X rep:         1171.35 MB/s
> 3X rep:         755.148 MB/s
>
>
>
>
>>
>> Then I added a single replica cache pool in front of cauchy 4+2.
>>
>> The write performance is now 1.1 GB/s as expected when the cache is not full. If I shrink the cache pool in front forcing continuous eviction during the benchmark it degrades to stable 140 MB/s.
>
> Mind trying that same test with just a simple 1x rep pool to another 1x
> rep pool and see what happens?
>
>>
>> The single threaded client reduces from 260 MB/s to 165 MB/s.
>>
>> What is strange to me is that after a "rados bench" there are objects left in the cache and the back-end tier. They only disappear if I set the "forward" and force the eviction. Is that by design the desired behaviour to not apply the deletion?
>>
>> Some observations:
>> - I think it is important to document the alignment requirements for appends (e.g. if you do rados put it needs aligned appends and the 4M blocks are not aligned for every combination of (k,m) ).
>>
>> - another observation is that seems difficult to run 64 OSDs on a box. I have no obvious memory limitation but it requires ~30k threads and it was difficult to create several pools with many PGs without having OSDs core dumping because resources are not available.
>>
>> - when OSD get 100% full they core dump most of the time. In my case all OSDs become full at the same time and when this happended there is no way to get the cluster up again without manually deleting objects in the OSD directories and make some space.
>>
>> - I get a syntax error in the CEPH CENTOS(RHEL6) startup script:
>>
>> awk: { d=$2/1073741824 ; r = sprintf(\"%.2f\", d); print r }
>> awk:                                 ^ backslash not last character on line
>>
>> - I have run several times into a situation where the only way out was to delete the whole cluster and set it up from scratch
>>
>> - I got this reproducable stack trace with a EC pool and a front end tier:
>> osd/ReplicatedPG.cc: 5554: FAILED assert(cop->data.length() + cop->temp_cursor.data_offset == cop->cursor.data_offset)
>>
>>    ceph version 0.77-900-gce9bfb8 (ce9bfb879c32690d030db6b2a349b7b6f6e6a468)
>>    1: (ReplicatedPG::_write_copy_chunk(boost::shared_ptr<ReplicatedPG::CopyOp>, PGBackend::PGTransaction*)+0x7dd) [0x8a376d]
>>    2: (ReplicatedPG::_build_finish_copy_transaction(boost::shared_ptr<ReplicatedPG::CopyOp>, PGBackend::PGTransaction*)+0x114) [0x8a3954]
>>    3: (ReplicatedPG::process_copy_chunk(hobject_t, unsigned long, int)+0x507) [0x8f1097]
>>    4: (C_Copyfrom::finish(int)+0xb7) [0x93fa67]
>>    5: (Context::complete(int)+0x9) [0x65d4b9]
>>    6: (Finisher::finisher_thread_entry()+0x1d8) [0xa9a528]
>>    7: /lib64/libpthread.so.0() [0x3386a079d1]
>>    8: (clone()+0x6d) [0x33866e8b6d]
>>    NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> Haven't seen that one yet!  Can you create an urgent bug with the full
> stack trace and the pool settings both for the cache and the tier pool?
>    Thanks!
>
>>
>> Moreover I did some trivial testing of the meta data part of CephFS and ceph-fuse:
>>
>> - I created a directory hierarchy with like 10/1000/100 = 1 Mio directories. After creation the MDS uses 5.5 GB of memory, ceph-fuse 1.8 GB. It takes 33 minutes to do "find /ceph" on this hierarchy. If I restart the MDS and do the same it takes 18 minutes. After this operation the MDS uses ~10 GB of memory (10k per directory for one entry).
>>
>> If I do "ls -laRt /ceph" I get "no such file or directory" after some time. When this happened one can pick one of the directory and do a single "ls -la <dir>". The first time one get's again "no such file or directory", the second time it eventually works and shows the contents.
>>
>> Cheers Andreas.
>>
>>
>>
>>
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>



* Re: ceph-0.77-900.gce9bfb8 Testing Rados EC/Tiering & CephFS ...
  2014-03-20 16:42       ` Andreas Joachim Peters
@ 2014-03-20 17:11         ` Mark Nelson
  0 siblings, 0 replies; 8+ messages in thread
From: Mark Nelson @ 2014-03-20 17:11 UTC (permalink / raw)
  To: Andreas Joachim Peters; +Cc: ceph-devel

On 03/20/2014 11:42 AM, Andreas Joachim Peters wrote:
> Yes,
> I checked CPU usage  ... the machine burns 70-100% during writing. (4+2) is faster than  3 replica probably because there is only half the bandwidth to fan out behind the primary replica although more computation is involved.
>
> Several instances of rados bench give the same sum of bandwidth or IOPS like a single instance.

OK, just wanted to be sure.  Especially as the client-side throughput 
goes up, having more instances becomes important.

>
> The 2.2k IOPS I get with a single replica and 4k blocks and this uses 70% CPU mainly used by the ceph-osd processes ... rados bench uses only 2% of all CPU. With Cauchy (4,2) I get 700 IOPS for the 4k blocks.
>
> Just for fun I also looked at the read now:
>
> Reading cached single 4k replica blocks with 16 parallel IOs I get 12k IOPS.
>
> If I evict the blocks beforehand I get 900 IOPS reading via the tiering setup.
>
> I tried as you said with 1 replica in cache and 1 in backend and I get very mixed results. When the cache is already full then the first seconds it runs well, then it stalls for few seconds, then continues ...
>
> For read it matters to have several rados bench running ... I can push it to 16k IOPS to read cached 4k blocks.

Oh, for reads be careful with rados bench to read from separate pools; 
otherwise you can end up serving many of the reads from the page cache 
in the subsequent processes.
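
(On a single box, the standard way to take the page cache out of the equation between runs is simply, as root:

  sync && echo 3 > /proc/sys/vm/drop_caches

before each read pass.)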

>
> Cheers Andreas.
>
>
>
>
> ________________________________________
> From: Mark Nelson [mark.nelson@inktank.com]
> Sent: 20 March 2014 14:55
> To: Andreas Joachim Peters
> Cc: ceph-devel@vger.kernel.org
> Subject: Re: ceph-0.77-900.gce9bfb8 Testing Rados EC/Tiering & CephFS ...
>
> On 03/20/2014 08:43 AM, Andreas Joachim Peters wrote:
>> Hi Mark,
>>
>> I tested write performance with a single rados bench (32 threads), everything on localhost, there is minimal latency on networking and IO in this setup. The test is running CPU bound with 32 paralell io's.  I used 4MB blocks, using bigger or smaller block IO volume goes down. I don't need several instances of rados bench to saturate the host. I reach already with 16 IOs comparable speed. So, that is anyway not a realistic scenario but a good baseline measurement and concurrency test.
>
> Have you checked CPU usage during these tests?  EC is pretty intense,
> and running rados bench on the same host doesn't help.  Having multiple
> copies of rados bench going potentially may help for a couple of
> reasons, but moving it off to a separate host may be important too
> (assuming you've got the network throughput to make it work).
>
>>
>> Your write results look very compatible to mine. Since I run client and server together and all in-memory all this tests are probably limited by the memory bandwidth and the CPU speed. The machine is a 2x6 core 2GHz Xeon with 1.3GHz ECC RAM.
>
> The machine in my test has 2x E5-2630L CPUs (6 core at 2GHz each).
>
>>
>> I will test with single replica front- and back-end case and post the result. I was more surprised by the IOPS I can reach in this setup. Not more then ~2.5k IOPS when writing even for tiny blocks. On your setup you seem not to get much more in the 4M case.
>
> As time goes on we'll be able to optimize small IO performance with EC,
> but there's a lot more processing involved so I suspect it's always
> going to be slower (especially for reads!) than simple replication when
> latency is critical.
>
>>
>> Cheers Andreas.
>>
>> ________________________________________
>> From: Mark Nelson [mark.nelson@inktank.com]
>> Sent: 20 March 2014 14:09
>> To: Andreas Joachim Peters
>> Cc: ceph-devel@vger.kernel.org
>> Subject: Re: ceph-0.77-900.gce9bfb8 Testing Rados EC/Tiering & CephFS ...
>>
>> On 03/20/2014 05:49 AM, Andreas Joachim Peters wrote:
>>> Hi,
>>>
>>> I did some Firefly ceph-0.77-900.gce9bfb8 testing of EC/Tiering deploying 64 OSD with in-memory filesystems (RapidDisk with ext4) on a single 256 GB box. The raw write performance of this box is ~3 GB/s for all and ~450 MB/s per OSD. It provides 250k IOPS per OSD.
>>>
>>> I compared several algorithms and configurations ...
>>>
>>> Here are the results (there is no significant difference between 64 or 10 OSDS for the performance, tried both but not for 24+8 !) with 4M objects, 32 client threads ....
>>>
>>> 1 rep: 1.1 GB/s
>>> 2 rep: 886 MB/s
>>> 3 rep: 750 MB/s
>>> cauchy 4+2: 880 MB/s
>>> liber8tion: 4+2: 875 MB/s
>>> cauchy 6+3: 780 MB/s
>>> cauchy 16+8: 520 MB/s
>>> cauchy 24+8: 450 MB/s
>>
>> How many copies of rados bench are you using and how much concurrency?
>> Also, were the rados bench processes on an external host?
>>
>> Here's what I'm seeing internally for 4MB writes on a box with 30
>> spinning disks, 6 Intel 520 SSDs for journals, and bonded 10GbE.  This
>> was using 4 copies of rados bench and 32 concurrent IOs per process.
>> Note that 4MB writes is probably the best case scenario for EC compared
>> to replication as far as performance goes right now.
>>
>> 4MB READ
>>
>> ec-4k1m:        1546.21 MB/s
>> ec-6k2m:        1308.52 MB/s
>> ec-10k3m:       1060.73 MB/s
>> 2X rep:         2110.92 MB/s
>> 3X rep:         2131.03 MB/s
>>
>> 4MB WRITE
>>
>> ec-4k1m:        1063.72 MB/s
>> ec-6k2m:        750.381 MB/s
>> ec-10k3m:       493.615 MB/s
>> 2X rep:         1171.35 MB/s
>> 3X rep:         755.148 MB/s
>>
>>
>>
>>
>>>
>>> Then I added a single replica cache pool in front of cauchy 4+2.
>>>
>>> The write performance is now 1.1 GB/s as expected when the cache is not full. If I shrink the cache pool in front forcing continuous eviction during the benchmark it degrades to stable 140 MB/s.
>>
>> Mind trying that same test with just a simple 1x rep pool to another 1x
>> rep pool and see what happens?
>>
>>>
>>> The single threaded client reduces from 260 MB/s to 165 MB/s.
>>>
>>> What is strange to me is that after a "rados bench" there are objects left in the cache and the back-end tier. They only disappear if I set the "forward" and force the eviction. Is that by design the desired behaviour to not apply the deletion?
>>>
>>> Some observations:
>>> - I think it is important to document the alignment requirements for appends (e.g. if you do rados put it needs aligned appends and the 4M blocks are not aligned for every combination of (k,m) ).
>>>
>>> - another observation is that seems difficult to run 64 OSDs on a box. I have no obvious memory limitation but it requires ~30k threads and it was difficult to create several pools with many PGs without having OSDs core dumping because resources are not available.
>>>
>>> - when OSD get 100% full they core dump most of the time. In my case all OSDs become full at the same time and when this happended there is no way to get the cluster up again without manually deleting objects in the OSD directories and make some space.
>>>
>>> - I get a syntax error in the CEPH CENTOS(RHEL6) startup script:
>>>
>>> awk: { d=$2/1073741824 ; r = sprintf(\"%.2f\", d); print r }
>>> awk:                                 ^ backslash not last character on line
>>>
>>> - I have run several times into a situation where the only way out was to delete the whole cluster and set it up from scratch
>>>
>>> - I got this reproducable stack trace with a EC pool and a front end tier:
>>> osd/ReplicatedPG.cc: 5554: FAILED assert(cop->data.length() + cop->temp_cursor.data_offset == cop->cursor.data_offset)
>>>
>>>     ceph version 0.77-900-gce9bfb8 (ce9bfb879c32690d030db6b2a349b7b6f6e6a468)
>>>     1: (ReplicatedPG::_write_copy_chunk(boost::shared_ptr<ReplicatedPG::CopyOp>, PGBackend::PGTransaction*)+0x7dd) [0x8a376d]
>>>     2: (ReplicatedPG::_build_finish_copy_transaction(boost::shared_ptr<ReplicatedPG::CopyOp>, PGBackend::PGTransaction*)+0x114) [0x8a3954]
>>>     3: (ReplicatedPG::process_copy_chunk(hobject_t, unsigned long, int)+0x507) [0x8f1097]
>>>     4: (C_Copyfrom::finish(int)+0xb7) [0x93fa67]
>>>     5: (Context::complete(int)+0x9) [0x65d4b9]
>>>     6: (Finisher::finisher_thread_entry()+0x1d8) [0xa9a528]
>>>     7: /lib64/libpthread.so.0() [0x3386a079d1]
>>>     8: (clone()+0x6d) [0x33866e8b6d]
>>>     NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>
>> Haven't seen that one yet!  Can you create an urgent bug with the full
>> stack trace and the pool settings both for the cache and the tier pool?
>>     Thanks!
>>
>>>
>>> Moreover I did some trivial testing of the meta data part of CephFS and ceph-fuse:
>>>
>>> - I created a directory hierarchy with like 10/1000/100 = 1 Mio directories. After creation the MDS uses 5.5 GB of memory, ceph-fuse 1.8 GB. It takes 33 minutes to do "find /ceph" on this hierarchy. If I restart the MDS and do the same it takes 18 minutes. After this operation the MDS uses ~10 GB of memory (10k per directory for one entry).
>>>
>>> If I do "ls -laRt /ceph" I get "no such file or directory" after some time. When this happened one can pick one of the directory and do a single "ls -la <dir>". The first time one get's again "no such file or directory", the second time it eventually works and shows the contents.
>>>
>>> Cheers Andreas.
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>
>



* Re: ceph-0.77-900.gce9bfb8 Testing Rados EC/Tiering & CephFS ...
  2014-03-20 10:49 ceph-0.77-900.gce9bfb8 Testing Rados EC/Tiering & CephFS Andreas Joachim Peters
  2014-03-20 13:09 ` Mark Nelson
@ 2014-03-25 18:04 ` Gregory Farnum
  2014-03-26  2:41   ` Yan, Zheng
  1 sibling, 1 reply; 8+ messages in thread
From: Gregory Farnum @ 2014-03-25 18:04 UTC (permalink / raw)
  To: Andreas Joachim Peters; +Cc: ceph-devel

On Thu, Mar 20, 2014 at 3:49 AM, Andreas Joachim Peters
<Andreas.Joachim.Peters@cern.ch> wrote:
> Hi,
>
> I did some Firefly ceph-0.77-900.gce9bfb8 testing of EC/Tiering deploying 64 OSD with in-memory filesystems (RapidDisk with ext4) on a single 256 GB box. The raw write performance of this box is ~3 GB/s for all and ~450 MB/s per OSD. It provides 250k IOPS per OSD.
>
> I compared several algorithms and configurations ...
>
> Here are the results (there is no significant difference between 64 or 10 OSDS for the performance, tried both but not for 24+8 !) with 4M objects, 32 client threads ....
>
> 1 rep: 1.1 GB/s
> 2 rep: 886 MB/s
> 3 rep: 750 MB/s
> cauchy 4+2: 880 MB/s
> liber8tion: 4+2: 875 MB/s
> cauchy 6+3: 780 MB/s
> cauchy 16+8: 520 MB/s
> cauchy 24+8: 450 MB/s
>
> Then I added a single replica cache pool in front of cauchy 4+2.
>
> The write performance is now 1.1 GB/s as expected when the cache is not full. If I shrink the cache pool in front forcing continuous eviction during the benchmark it degrades to stable 140 MB/s.
>
> The single threaded client reduces from 260 MB/s to 165 MB/s.
>
> What is strange to me is that after a "rados bench" there are objects left in the cache and the back-end tier. They only disappear if I set the "forward" and force the eviction. Is that by design the desired behaviour to not apply the deletion?

That's not too surprising -- you probably put enough data into the
cluster that some of the bench objects got evicted into the cold
storage pool, and then they were deleted by rados bench. The cache
pool needs to keep the objects around with "deleted" and "dirty" flags
to make sure they eventually get cleaned up from the backing cold pool
-- as happened when you set the cache mode to forward and forced an eviction.

>
> Some observations:
> - I think it is important to document the alignment requirements for appends (e.g. if you do rados put it needs aligned appends and the 4M blocks are not aligned for every combination of (k,m) ).
>
> - another observation is that seems difficult to run 64 OSDs on a box. I have no obvious memory limitation but it requires ~30k threads and it was difficult to create several pools with many PGs without having OSDs core dumping because resources are not available.
>
> - when OSD get 100% full they core dump most of the time. In my case all OSDs become full at the same time and when this happended there is no way to get the cluster up again without manually deleting objects in the OSD directories and make some space.
>
> - I get a syntax error in the CEPH CENTOS(RHEL6) startup script:
>
> awk: { d=$2/1073741824 ; r = sprintf(\"%.2f\", d); print r }
> awk:                                 ^ backslash not last character on line
>
> - I have run several times into a situation where the only way out was to delete the whole cluster and set it up from scratch
>
> - I got this reproducable stack trace with a EC pool and a front end tier:
> osd/ReplicatedPG.cc: 5554: FAILED assert(cop->data.length() + cop->temp_cursor.data_offset == cop->cursor.data_offset)
>
>  ceph version 0.77-900-gce9bfb8 (ce9bfb879c32690d030db6b2a349b7b6f6e6a468)
>  1: (ReplicatedPG::_write_copy_chunk(boost::shared_ptr<ReplicatedPG::CopyOp>, PGBackend::PGTransaction*)+0x7dd) [0x8a376d]
>  2: (ReplicatedPG::_build_finish_copy_transaction(boost::shared_ptr<ReplicatedPG::CopyOp>, PGBackend::PGTransaction*)+0x114) [0x8a3954]
>  3: (ReplicatedPG::process_copy_chunk(hobject_t, unsigned long, int)+0x507) [0x8f1097]
>  4: (C_Copyfrom::finish(int)+0xb7) [0x93fa67]
>  5: (Context::complete(int)+0x9) [0x65d4b9]
>  6: (Finisher::finisher_thread_entry()+0x1d8) [0xa9a528]
>  7: /lib64/libpthread.so.0() [0x3386a079d1]
>  8: (clone()+0x6d) [0x33866e8b6d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Hmm, we've had a lot of bug fixes going in lately (and I know some
were around that copy infrastructure), so I bet that's fixed now.

>
> Moreover I did some trivial testing of the meta data part of CephFS and ceph-fuse:
>
> - I created a directory hierarchy with like 10/1000/100 = 1 Mio directories. After creation the MDS uses 5.5 GB of memory, ceph-fuse 1.8 GB. It takes 33 minutes to do "find /ceph" on this hierarchy. If I restart the MDS and do the same it takes 18 minutes. After this operation the MDS uses ~10 GB of memory (10k per directory for one entry).

Hmm. That's more than I would expect, but not impossibly so if the MDS
was having trouble keeping the relevant directories in-memory. We have
not done any optimization around that sort of scenario right now and
it's a pretty hard workload for a distributed storage system. :/
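
(If you have the RAM for it, one thing worth trying is raising the MDS inode cache so the whole tree fits; the default "mds cache size" is 100000 inodes. An example ceph.conf snippet, with an arbitrary value:)

  [mds]
      mds cache size = 2000000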

>
> If I do "ls -laRt /ceph" I get "no such file or directory" after some time. When this happened one can pick one of the directory and do a single "ls -la <dir>". The first time one get's again "no such file or directory", the second time it eventually works and shows the contents.

Can you expand on that a bit? What is "after some time"?

-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


* Re: ceph-0.77-900.gce9bfb8 Testing Rados EC/Tiering & CephFS ...
  2014-03-25 18:04 ` Gregory Farnum
@ 2014-03-26  2:41   ` Yan, Zheng
  0 siblings, 0 replies; 8+ messages in thread
From: Yan, Zheng @ 2014-03-26  2:41 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Andreas Joachim Peters, ceph-devel

On Wed, Mar 26, 2014 at 2:04 AM, Gregory Farnum <greg@inktank.com> wrote:
> On Thu, Mar 20, 2014 at 3:49 AM, Andreas Joachim Peters
> <Andreas.Joachim.Peters@cern.ch> wrote:
>> Hi,
>>
>> I did some Firefly ceph-0.77-900.gce9bfb8 testing of EC/Tiering deploying 64 OSD with in-memory filesystems (RapidDisk with ext4) on a single 256 GB box. The raw write performance of this box is ~3 GB/s for all and ~450 MB/s per OSD. It provides 250k IOPS per OSD.
>>
>> I compared several algorithms and configurations ...
>>
>> Here are the results (there is no significant difference between 64 or 10 OSDS for the performance, tried both but not for 24+8 !) with 4M objects, 32 client threads ....
>>
>> 1 rep: 1.1 GB/s
>> 2 rep: 886 MB/s
>> 3 rep: 750 MB/s
>> cauchy 4+2: 880 MB/s
>> liber8tion: 4+2: 875 MB/s
>> cauchy 6+3: 780 MB/s
>> cauchy 16+8: 520 MB/s
>> cauchy 24+8: 450 MB/s
>>
>> Then I added a single replica cache pool in front of cauchy 4+2.
>>
>> The write performance is now 1.1 GB/s as expected when the cache is not full. If I shrink the cache pool in front forcing continuous eviction during the benchmark it degrades to stable 140 MB/s.
>>
>> The single threaded client reduces from 260 MB/s to 165 MB/s.
>>
>> What is strange to me is that after a "rados bench" there are objects left in the cache and the back-end tier. They only disappear if I set the "forward" and force the eviction. Is that by design the desired behaviour to not apply the deletion?
>
> That's not too surprising -- you probably put enough data into the
> cluster that some of the bench objects got evicted into the cold
> storage pool, and then they were deleted by rados bench. The cache
> pool needs to keep the object around with a "deleted" and "dirty" flag
> to make sure it eventually gets cleaned up from the backing cold pool
> -- as happened when you set to forward and forced an eviction.
>
>>
>> Some observations:
>> - I think it is important to document the alignment requirements for appends (e.g. if you do rados put it needs aligned appends and the 4M blocks are not aligned for every combination of (k,m) ).
>>
>> - another observation is that seems difficult to run 64 OSDs on a box. I have no obvious memory limitation but it requires ~30k threads and it was difficult to create several pools with many PGs without having OSDs core dumping because resources are not available.
>>
>> - when OSD get 100% full they core dump most of the time. In my case all OSDs become full at the same time and when this happended there is no way to get the cluster up again without manually deleting objects in the OSD directories and make some space.
>>
>> - I get a syntax error in the CEPH CENTOS(RHEL6) startup script:
>>
>> awk: { d=$2/1073741824 ; r = sprintf(\"%.2f\", d); print r }
>> awk:                                 ^ backslash not last character on line
>>
>> - I have run several times into a situation where the only way out was to delete the whole cluster and set it up from scratch
>>
>> - I got this reproducable stack trace with a EC pool and a front end tier:
>> osd/ReplicatedPG.cc: 5554: FAILED assert(cop->data.length() + cop->temp_cursor.data_offset == cop->cursor.data_offset)
>>
>>  ceph version 0.77-900-gce9bfb8 (ce9bfb879c32690d030db6b2a349b7b6f6e6a468)
>>  1: (ReplicatedPG::_write_copy_chunk(boost::shared_ptr<ReplicatedPG::CopyOp>, PGBackend::PGTransaction*)+0x7dd) [0x8a376d]
>>  2: (ReplicatedPG::_build_finish_copy_transaction(boost::shared_ptr<ReplicatedPG::CopyOp>, PGBackend::PGTransaction*)+0x114) [0x8a3954]
>>  3: (ReplicatedPG::process_copy_chunk(hobject_t, unsigned long, int)+0x507) [0x8f1097]
>>  4: (C_Copyfrom::finish(int)+0xb7) [0x93fa67]
>>  5: (Context::complete(int)+0x9) [0x65d4b9]
>>  6: (Finisher::finisher_thread_entry()+0x1d8) [0xa9a528]
>>  7: /lib64/libpthread.so.0() [0x3386a079d1]
>>  8: (clone()+0x6d) [0x33866e8b6d]
>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> Hmm, we've had a lot of bug fixes going in lately (and I know some
> were around that copy infrastructure), so I bet that's fixed now.
>
>>
>> Moreover I did some trivial testing of the meta data part of CephFS and ceph-fuse:
>>
>> - I created a directory hierarchy with like 10/1000/100 = 1 Mio directories. After creation the MDS uses 5.5 GB of memory, ceph-fuse 1.8 GB. It takes 33 minutes to do "find /ceph" on this hierarchy. If I restart the MDS and do the same it takes 18 minutes. After this operation the MDS uses ~10 GB of memory (10k per directory for one entry).
>
> Hmm. That's more than I would expect, but not impossibly so if the MDS
> was having trouble keeping the relevant directories in-memory. We have
> not done any optimization around that sort of scenario right now and
> it's a pretty hard workload for a distributed storage system. :/
>
>>
>> If I do "ls -laRt /ceph" I get "no such file or directory" after some time. When this happened one can pick one of the directory and do a single "ls -la <dir>". The first time one get's again "no such file or directory", the second time it eventually works and shows the contents.

It's a symptom of the dir-complete bug (it exists in kernels < 3.12).

Yan, Zheng

>
> Can you expand on that a bit? What is "after some time"?
>
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

