* OSD Hardware questions
@ 2012-06-27 13:04 Stefan Priebe - Profihost AG
  2012-06-27 13:55 ` Mark Nelson
  0 siblings, 1 reply; 26+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-06-27 13:04 UTC (permalink / raw)
  To: ceph-devel

Hello list,

I'm still thinking about optimal OSD hardware, and while reading through 
the mailing list and wiki I had some questions.

I want to use SSDs, so my idea was to use a fast single-socket CPU with 
8-10 SSD disks per OSD server.

I got the following recommendation through the mailing list:
"Dual socket servers will be overkill given the setup you're describing. 
Our WAG rule of thumb is 1GHz of modern CPU per OSD daemon. You might 
consider it if you decided you wanted to do an OSD per disk instead 
(that's a more common configuration, but it requires more CPU and RAM 
per disk and we don't know yet which is the better choice)."

but in my tests I see a CPU usage of 160% + 15% kworker per OSD daemon 
on a 3.6GHz Intel Xeon CPU. That's far from 1GHz per OSD; it's 
around 6.3GHz per OSD. Is anything wrong here?

If I want to use 8-10 SSD disks, I need around 20 cores at 3.6GHz. 
But there is no single socket with 20 cores at 3.6GHz.

Or should I consider using RAID 5 or 6?

Anything wrong?

Stefan



* Re: OSD Hardware questions
  2012-06-27 13:04 OSD Hardware questions Stefan Priebe - Profihost AG
@ 2012-06-27 13:55 ` Mark Nelson
  2012-06-27 14:55   ` Jim Schutt
  2012-06-27 15:13   ` Stefan Priebe
  0 siblings, 2 replies; 26+ messages in thread
From: Mark Nelson @ 2012-06-27 13:55 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG; +Cc: ceph-devel

On 6/27/12 8:04 AM, Stefan Priebe - Profihost AG wrote:
> Hello list,
>
> i'm still thinking about optimal OSD hardware and while reading through
> the mailinglist and wiki had some questions.
>
> I want to use SSD so my idea was to use a fast single socket cpu with
> 8-10 SSD disks per OSD.
>
> I got the following recommandations through the mailinglist:
> "Dual socket servers will be overkill given the setup you're describing.
> Our WAG rule of thumb is 1GHz of modern CPU per OSD daemon. You might
> consider it if you decided you wanted to do an OSD per disk instead
> (that's a more common configuration, but it requires more CPU and RAM
> per disk and we don't know yet which is the better choice)."
>
> but in my tests i see a CPU usage of 160% + 15% kworker per OSD Daemon
> on a 3,6ghz intel xeon CPU. That's far away of 1GHz per OSD. That's
> around 6,3Ghz per OSD. Is anything wrong here?
>
> When i want to use 8-10 SSD Disks i need around 20 cores with 3,6Ghz.
> But there is no single socket with 20 cores with 3,6Ghz.
>
> Or should i consider using a Raid 5 or 6?
>
> Anything wrong?
>
> Stefan

Hi Stefan,

I'm not entirely clear how you are coming to the conclusion regarding 
the CPU requirements.  If we go by the "1Ghz per OSD" suggestion, does 
that mean you plan to have 3.6GHz*20/1GHz = 72 OSDs per server?

Having said that, not all CPU cores are created equal.  Intel CPUs tend 
to be faster per clock than AMD CPUs, though AMD systems can potentially 
have more cores per node (16 per socket).  If you are really planning on 
having 72 OSDs per node, other things are going to come into play 
including CPU interconnect, RAID controller performance, PCI bus, 
network throughput, etc.  I'd strongly recommend sticking with smaller 
nodes unless you have the time/budget to test such large systems.  I 
haven't gotten a chance to really dig into CPU utilization yet, but I'd 
say if you are going to go for big nodes you might try putting a single 
Xeon E5 into a dual socket motherboard and see how it works.  If it's 
not fast enough stick the second CPU (and associated memory) in.

For what it's worth, I've got a pair of Dell R515s set up with a single 
2.8GHz 6-core 4184 Opteron, 16GB of RAM, and 10 SSDs that are capable of 
about 200MB/s each.  Currently I'm topping out at about 600MB/s with 
rados bench using half of the drives for data and half for journals (at 
2x replication).  Putting journals on the same drive and doing 10 OSDs 
on each node is slower.  Still working on figuring out why.
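
(For reference, the kind of rados bench run I mean is along these lines; the 
pool name and concurrency below are placeholders, not my exact invocation:

	rados -p rbd bench 60 write -t 16

i.e. 60 seconds of object writes against one pool with 16 ops in flight.)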

In terms of RAID, the big consideration between RAID 5 and RAID 6 is the 
potential for drive failure during a rebuild.  RAID 6 gives you extra 
protection at the cost of reduced capacity and performance.  If you have 
a small array with small, fast drives, RAID 5 might be fine.  If you have 
a large array with many high-capacity drives, RAID 6 may be better.

Mark


* Re: OSD Hardware questions
  2012-06-27 13:55 ` Mark Nelson
@ 2012-06-27 14:55   ` Jim Schutt
  2012-06-27 15:19     ` Stefan Priebe
  2012-06-27 15:53     ` Mark Nelson
  2012-06-27 15:13   ` Stefan Priebe
  1 sibling, 2 replies; 26+ messages in thread
From: Jim Schutt @ 2012-06-27 14:55 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Stefan Priebe - Profihost AG, ceph-devel

Hi Mark,

On 06/27/2012 07:55 AM, Mark Nelson wrote:
>
> For what it's worth, I've got a pair of Dell R515 setup with a single 2.8GHz 6-core 4184 Opteron, 16GB of RAM, and 10 SSDs that are capable of about 200MB/s each.  Currently I'm topping out at about 600MB/s with rados bench using half of the drives for data and half for journals (at 2x replication).  Putting journals on the same drive and doing 10 OSDs on each node is slower.  Still working on figuring out why.

Just for fun, try the following tunings to see if they make
a difference for you.

This is my current best tuning for my hardware, which uses
24 SAS drives/server, and 1 OSD/drive with a journal partition
on the outer tracks and btrfs for the data store.

	journal dio = true
	osd op threads = 24
	osd disk threads = 24
	filestore op threads = 6
	filestore queue max ops = 24

	osd client message size cap = 14000000
	ms dispatch throttle bytes =  17500000
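
These all go in the [osd] section of ceph.conf and need an OSD restart to 
take effect; a minimal sketch of the placement:

	[osd]
	        journal dio = true
	        osd op threads = 24
	        # ...and so on for the rest of the settings above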

I'd be very curious to hear how these work for you.
My current testing load is streaming writes from
166 linux clients, and the above tunings let me
sustain ~2 GB/s on each server (2x replication,
so 500 MB/s per server aggregate client bandwidth).

I have dual-port 10 GbE NICs, and use one port
for the cluster and one for the clients.  I use
jumbo frames because it freed up ~10% CPU cycles over
the default config of 1500-byte frames + GRO/GSO/etc
on the load I'm currently testing with.
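
Enabling jumbo frames is just an MTU change on the NICs (and the switch); 
a sketch, with the interface name being whatever your cluster/client ports 
are called:

	ip link set dev eth2 mtu 9000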

FWIW these servers are dual-socket Intel 5675 Xeons,
so total 12 cores at 3.0 GHz.  On the above load I
usually see 15-30% idle.

FWIW, "perf top" has this to say about where time is being spent
under the above load under normal conditions.

    PerfTop:   19134 irqs/sec  kernel:79.2%  exact:  0.0% [1000Hz cycles],  (all, 24 CPUs)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------

              samples  pcnt function                                       DSO
              _______ _____ ______________________________________________ ________________________________________________________________________________________

             37656.00 15.3% ceph_crc32c_le                                 /usr/bin/ceph-osd
             23221.00  9.5% copy_user_generic_string                       [kernel.kallsyms]
             16857.00  6.9% btrfs_end_transaction_dmeta                    /lib/modules/3.5.0-rc4-00011-g15d0694/kernel/fs/btrfs/btrfs.ko
             16787.00  6.8% __crc32c_le                                    [kernel.kallsyms]


But, sometimes I see this:

    PerfTop:    4930 irqs/sec  kernel:97.8%  exact:  0.0% [1000Hz cycles],  (all, 24 CPUs)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------

              samples  pcnt function                                       DSO
              _______ _____ ______________________________________________ ________________________________________________________________________________________

            147565.00 45.8% _raw_spin_lock_irqsave                         [kernel.kallsyms]
             24427.00  7.6% isolate_freepages_block                        [kernel.kallsyms]
             23759.00  7.4% ceph_crc32c_le                                 /usr/bin/ceph-osd
             16521.00  5.1% copy_user_generic_string                       [kernel.kallsyms]
             10549.00  3.3% __crc32c_le                                    [kernel.kallsyms]
              8901.00  2.8% btrfs_end_transaction_dmeta                    /lib/modules/3.5.0-rc4-00011-g15d0694/kernel/fs/btrfs/btrfs.ko

When this happens, OSDs cannot process heartbeats in a timely fashion,
get wrongly marked down, thrashing ensues, clients stall.  I'm still
trying to  learn how to get perf to tell me more....

-- Jim



* Re: OSD Hardware questions
  2012-06-27 13:55 ` Mark Nelson
  2012-06-27 14:55   ` Jim Schutt
@ 2012-06-27 15:13   ` Stefan Priebe
       [not found]     ` <CAPYLRzj916kW=KLy3dMTVPJRoNtPMP_Ejz+YAxRUJ5jZc+HeMg@mail.gmail.com>
  1 sibling, 1 reply; 26+ messages in thread
From: Stefan Priebe @ 2012-06-27 15:13 UTC (permalink / raw)
  To: Mark Nelson; +Cc: ceph-devel

Hi Mark,

On 27.06.2012 15:55, Mark Nelson wrote:
> On 6/27/12 8:04 AM, Stefan Priebe - Profihost AG wrote:
> Hi Stefan,
>
> I'm not entirely clear how you are coming to the conclusion regarding
> the CPU requirements.  If we go by the "1Ghz per OSD" suggestion, does
> that mean you plan to have 3.6GHz*20/1GHz = 72 OSDs per server?

Oh, I'm sorry, it seems you got me wrong. I have a test setup of 4 OSDs with 
one disk each in ONE server with an Intel Xeon at 3.6GHz. Each of the 4 
ceph-osd processes takes up to 170% CPU usage right now, so with these 4 
OSD disks and daemons I go up to 800% load. So EACH OSD process takes up 
to 2 cores at 3.6GHz.

So right now I need 2 cores at 3.6GHz per OSD drive, and if I want to 
use 8-10 drives per OSD server, I need 20 cores at 3.6GHz.

Anything wrong?

Greets
Stefan


* Re: OSD Hardware questions
  2012-06-27 14:55   ` Jim Schutt
@ 2012-06-27 15:19     ` Stefan Priebe
  2012-06-27 17:23       ` Jim Schutt
  2012-06-27 15:53     ` Mark Nelson
  1 sibling, 1 reply; 26+ messages in thread
From: Stefan Priebe @ 2012-06-27 15:19 UTC (permalink / raw)
  To: Jim Schutt; +Cc: Mark Nelson, ceph-devel

On 27.06.2012 16:55, Jim Schutt wrote:
> This is my current best tuning for my hardware, which uses
> 24 SAS drives/server, and 1 OSD/drive with a journal partition
> on the outer tracks and btrfs for the data store.

Which RAID level do you use?

> I'd be very curious to hear how these work for you.
> My current testing load is streaming writes from
> 166 linux clients, and the above tunings let me
> sustain ~2 GB/s on each server (2x replication,
> so 500 MB/s per server aggregate client bandwidth).
10GbE max speed should be around 1 GB/s. Am I missing something?

> I have dual-port 10 GbE NICs, and use one port
> for the cluster and one for the clients.  I use
> jumbo frames because it freed up ~10% CPU cycles over
> the default config of 1500-byte frames + GRO/GSO/etc
> on the load I'm currently testing with.
Do you have ntuple and lro on or off? Which kernel version do you use 
and which driver version? Intel cards?

Stefan


* Re: OSD Hardware questions
       [not found]     ` <CAPYLRzj916kW=KLy3dMTVPJRoNtPMP_Ejz+YAxRUJ5jZc+HeMg@mail.gmail.com>
@ 2012-06-27 15:28       ` Stefan Priebe
  2012-06-27 16:00         ` Mark Nelson
  0 siblings, 1 reply; 26+ messages in thread
From: Stefan Priebe @ 2012-06-27 15:28 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Mark Nelson, ceph-devel

On 27.06.2012 17:21, Gregory Farnum wrote:
>
>
> Well, as we said, 1GHz/OSD was a WAG (wild-ass guess), but 3.6GHz+/OSD
> is farther outside of that range than I would have expected. It might
> just be a consequence of using SSDs, since they can sustain so much more
> throughput.

Sure, it was just so far from 1GHz that I wanted to ask.

> What is the cluster doing when you see those CPU usage numbers?
Random write I/O from one KVM guest: 14k IOPS with random 4k writes.

Stefan


* Re: OSD Hardware questions
  2012-06-27 14:55   ` Jim Schutt
  2012-06-27 15:19     ` Stefan Priebe
@ 2012-06-27 15:53     ` Mark Nelson
  2012-06-27 17:59       ` Jim Schutt
  1 sibling, 1 reply; 26+ messages in thread
From: Mark Nelson @ 2012-06-27 15:53 UTC (permalink / raw)
  To: Jim Schutt; +Cc: Stefan Priebe - Profihost AG, ceph-devel

On 06/27/2012 09:55 AM, Jim Schutt wrote:
> Hi Mark,
>
> On 06/27/2012 07:55 AM, Mark Nelson wrote:
>>
>> For what it's worth, I've got a pair of Dell R515 setup with a single
>> 2.8GHz 6-core 4184 Opteron, 16GB of RAM, and 10 SSDs that are capable
>> of about 200MB/s each. Currently I'm topping out at about 600MB/s with
>> rados bench using half of the drives for data and half for journals
>> (at 2x replication). Putting journals on the same drive and doing 10
>> OSDs on each node is slower. Still working on figuring out why.
>
> Just for fun, try the following tunings to see if they make
> a difference for you.
>
> This is my current best tuning for my hardware, which uses
> 24 SAS drives/server, and 1 OSD/drive with a journal partition
> on the outer tracks and btrfs for the data store.
>
> journal dio = true
> osd op threads = 24
> osd disk threads = 24
> filestore op threads = 6
> filestore queue max ops = 24
>
> osd client message size cap = 14000000
> ms dispatch throttle bytes = 17500000

I will definitely give this a try when I can get back to it.  I seem to 
remember getting a bit better performance when increasing filestore op 
threads, but I haven't tried fiddling with osd op/disk threads yet.

>
> I'd be very curious to hear how these work for you.
> My current testing load is streaming writes from
> 166 linux clients, and the above tunings let me
> sustain ~2 GB/s on each server (2x replication,
> so 500 MB/s per server aggregate client bandwidth).
>
> I have dual-port 10 GbE NICs, and use one port
> for the cluster and one for the clients. I use
> jumbo frames because it freed up ~10% CPU cycles over
> the default config of 1500-byte frames + GRO/GSO/etc
> on the load I'm currently testing with.
>
> FWIW these servers are dual-socket Intel 5675 Xeons,
> so total 12 cores at 3.0 GHz. On the above load I
> usually see 15-30% idle.

Yeah, you definitely have more horsepower in your nodes than the ones 
I've got.

>
> FWIW, "perf top" has this to say about where time is being spent
> under the above load under normal conditions.
>
>    PerfTop:   19134 irqs/sec  kernel:79.2%  exact:  0.0% [1000Hz cycles],  (all, 24 CPUs)
> --------------------------------------------------------------------------------
>
>     samples  pcnt function                     DSO
>     _______ _____ ____________________________ _________________________________
>
>    37656.00 15.3% ceph_crc32c_le               /usr/bin/ceph-osd
>    23221.00  9.5% copy_user_generic_string     [kernel.kallsyms]
>    16857.00  6.9% btrfs_end_transaction_dmeta  /lib/modules/3.5.0-rc4-00011-g15d0694/kernel/fs/btrfs/btrfs.ko
>    16787.00  6.8% __crc32c_le                  [kernel.kallsyms]
>
> But, sometimes I see this:
>
>    PerfTop:    4930 irqs/sec  kernel:97.8%  exact:  0.0% [1000Hz cycles],  (all, 24 CPUs)
> --------------------------------------------------------------------------------
>
>     samples  pcnt function                     DSO
>     _______ _____ ____________________________ _________________________________
>
>   147565.00 45.8% _raw_spin_lock_irqsave       [kernel.kallsyms]
>    24427.00  7.6% isolate_freepages_block      [kernel.kallsyms]
>    23759.00  7.4% ceph_crc32c_le               /usr/bin/ceph-osd
>    16521.00  5.1% copy_user_generic_string     [kernel.kallsyms]
>    10549.00  3.3% __crc32c_le                  [kernel.kallsyms]
>     8901.00  2.8% btrfs_end_transaction_dmeta  /lib/modules/3.5.0-rc4-00011-g15d0694/kernel/fs/btrfs/btrfs.ko
>
> When this happens, OSDs cannot process heartbeats in a timely fashion,
> get wrongly marked down, thrashing ensues, clients stall. I'm still
> trying to learn how to get perf to tell me more....
>
> -- Jim
>

Thanks for doing this!  I've been wanting to get perf going on our test 
boxes for ages but haven't had time to get the packages built yet for 
our gitbuilder kernels.

Try generating a call-graph ala: http://lwn.net/Articles/340010/

Mark


* Re: OSD Hardware questions
  2012-06-27 15:28       ` Stefan Priebe
@ 2012-06-27 16:00         ` Mark Nelson
  2012-06-28 13:21           ` Stefan Priebe - Profihost AG
  0 siblings, 1 reply; 26+ messages in thread
From: Mark Nelson @ 2012-06-27 16:00 UTC (permalink / raw)
  To: Stefan Priebe; +Cc: Gregory Farnum, ceph-devel

On 06/27/2012 10:28 AM, Stefan Priebe wrote:
>  > Am 27.06.2012 17:21, schrieb Gregory Farnum:
>>
>>
>> Well, as we said, 1GHz/OSD was a WAG (wild-ass guess), but 3.6GHz+/OSD
>> is farther outside of that range than I would have expected. It might
>> just be a consequence of using SSDs, since they can sustain so much more
>> throughput.
>
> Sure it was just so much away from 1Ghz that i wanted to ask.
>
>> What is the cluster doing when you see those CPU usage numbers?
> random write I/O from one KVM. 14k I/Ops with random 4k writes.
>
> Stefan

I think I was seeing about 80-85% CPU utilization with 5 SSD OSDs on our 
6-core AMD nodes, but I was just doing sequential writes with rados bench.

Mark


* Re: OSD Hardware questions
  2012-06-27 15:19     ` Stefan Priebe
@ 2012-06-27 17:23       ` Jim Schutt
  2012-06-27 17:54         ` Stefan Priebe
  0 siblings, 1 reply; 26+ messages in thread
From: Jim Schutt @ 2012-06-27 17:23 UTC (permalink / raw)
  To: Stefan Priebe; +Cc: Mark Nelson, ceph-devel

On 06/27/2012 09:19 AM, Stefan Priebe wrote:
> Am 27.06.2012 16:55, schrieb Jim Schutt:
>> This is my current best tuning for my hardware, which uses
>> 24 SAS drives/server, and 1 OSD/drive with a journal partition
>> on the outer tracks and btrfs for the data store.
>
> Which raid level do you use?

No RAID.  Each OSD directly accesses a single
disk, via a partition for the journal and a partition
for the btrfs file store for that OSD.

I've got my 24 drives spread across three 6 Gb/s SAS HBAs,
so I can sustain ~90 MB/s per drive with all drives active,
when writing to the outer tracks using dd.

I want to rely on Ceph for data protection via replication.
At some point I expect to play around with the RAID0
support in btrfs to explore the performance relationship
between number of OSDs and size of each OSD, but haven't yet.
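
To make the per-drive layout concrete, it's roughly the following sketch 
per disk; device names, partition sizes, and the OSD id here are illustrative 
(the journal partition comes first so it lands on the fast outer tracks):

	parted -s /dev/sdX mklabel gpt
	parted -s /dev/sdX mkpart journal 1MiB 10GiB
	parted -s /dev/sdX mkpart data 10GiB 100%
	mkfs.btrfs /dev/sdX2
	mount /dev/sdX2 /data/osd.N

with the corresponding ceph.conf entries pointing "osd journal" at the first 
partition and "osd data" at the mount point.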

>
>> I'd be very curious to hear how these work for you.
>> My current testing load is streaming writes from
>> 166 linux clients, and the above tunings let me
>> sustain ~2 GB/s on each server (2x replication,
>> so 500 MB/s per server aggregate client bandwidth).
> 10GBe max speed shoudl be around 1Gbit/s. Do i miss something?

Hmmm, not sure.  My servers are limited by the bandwidth
of the SAS drives and HBAs.  So 2 GB/s aggregate disk
bandwidth is 1 GB/s for journals and 1 GB/s for data.
At 2x replication, that's 500 MB/s client data bandwidth.

>
>> I have dual-port 10 GbE NICs, and use one port
>> for the cluster and one for the clients. I use
>> jumbo frames because it freed up ~10% CPU cycles over
>> the default config of 1500-byte frames + GRO/GSO/etc
>> on the load I'm currently testing with.
> Do you have ntuple and lro on or off? Which kernel version do you use and which driver version? Intel cards?

# ethtool -k eth2
Offload parameters for eth2:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off

The NICs are Chelsio T4, but I'm not using any of the
TCP stateful offload features for this testing.
I don't know if they have ntuple support, but the
ethtool version I'm using (2.6.33) doesn't mention it.
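
If you want to experiment with the individual offloads, they can be toggled 
at runtime; a sketch, and which flags are actually settable depends on the 
driver and ethtool version:

	ethtool -K eth2 gro off
	ethtool -K eth2 gso off
	ethtool -K eth2 lro off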

For kernels I switch back and forth between latest development
kernel from Linus's tree, or latest stable kernel, depending
on where the kernel development cycle is.  I usually switch
to the development kernel around -rc4 or so.

-- Jim

>
> Stefan
>
>




* Re: OSD Hardware questions
  2012-06-27 17:23       ` Jim Schutt
@ 2012-06-27 17:54         ` Stefan Priebe
  2012-06-27 18:38           ` Jim Schutt
  0 siblings, 1 reply; 26+ messages in thread
From: Stefan Priebe @ 2012-06-27 17:54 UTC (permalink / raw)
  To: Jim Schutt; +Cc: Mark Nelson, ceph-devel

On 27.06.2012 at 19:23, "Jim Schutt" <jaschut@sandia.gov> wrote:

> On 06/27/2012 09:19 AM, Stefan Priebe wrote:
>> Am 27.06.2012 16:55, schrieb Jim Schutt:
>>> This is my current best tuning for my hardware, which uses
>>> 24 SAS drives/server, and 1 OSD/drive with a journal partition
>>> on the outer tracks and btrfs for the data store.
>> 
>> Which raid level do you use?
> 
> No RAID.  Each OSD directly accesses a single
> disk, via a partition for the journal and a partition
> for the btrfs file store for that OSD.
So you have 24 threads x 24 OSDs = 576 threads running?

> The NICs are Chelsio T4, but I'm not using any of the
> TCP stateful offload features for this testing.
> I don't know if they have ntuple support, but the
> ethtool version I'm using (2.6.33) doesn't mention it.
> 
> For kernels I switch back and forth between latest development
> kernel from Linus's tree, or latest stable kernel, depending
> on where the kernel development cycle is.  I usually switch
> to the development kernel around -rc4 or so.
> 

Crazy that this works for you. Btrfs crashes for me within 20s when running at full speed on SSD.

Stefan


* Re: OSD Hardware questions
  2012-06-27 15:53     ` Mark Nelson
@ 2012-06-27 17:59       ` Jim Schutt
  0 siblings, 0 replies; 26+ messages in thread
From: Jim Schutt @ 2012-06-27 17:59 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Stefan Priebe - Profihost AG, ceph-devel

On 06/27/2012 09:53 AM, Mark Nelson wrote:
>> I'm still
>> trying to learn how to get perf to tell me more....
>>
>> -- Jim
>>
>
> Thanks for doing this!  I've been wanting to get perf going on our test boxes for ages but haven't had time to get the packages built yet for our gitbuilder kernels.
>
> Try generating a call-graph ala: http://lwn.net/Articles/340010/

That looks like exactly what I need.

Thanks!

I'll post when I've learned more.

-- Jim

>
> Mark




* Re: OSD Hardware questions
  2012-06-27 17:54         ` Stefan Priebe
@ 2012-06-27 18:38           ` Jim Schutt
  2012-06-27 18:48             ` Stefan Priebe
  0 siblings, 1 reply; 26+ messages in thread
From: Jim Schutt @ 2012-06-27 18:38 UTC (permalink / raw)
  To: Stefan Priebe; +Cc: Mark Nelson, ceph-devel

On 06/27/2012 11:54 AM, Stefan Priebe wrote:
> Am 27.06.2012 um 19:23 schrieb "Jim Schutt"<jaschut@sandia.gov>:
>
>> On 06/27/2012 09:19 AM, Stefan Priebe wrote:
>>> Am 27.06.2012 16:55, schrieb Jim Schutt:
>>>> This is my current best tuning for my hardware, which uses
>>>> 24 SAS drives/server, and 1 OSD/drive with a journal partition
>>>> on the outer tracks and btrfs for the data store.
>>>
>>> Which raid level do you use?
>>
>> No RAID.  Each OSD directly accesses a single
>> disk, via a partition for the journal and a partition
>> for the btrfs file store for that OSD.
> So you have 24 threads x 24 osds 576 threads running?

Actually, when my 166-client test is running,
"ps -o pid,nlwp,args -C ceph-osd"
tells me that I typically have ~1200 threads/OSD.

Since NPTL uses a 1:1 threading model, I recently had
to increase /proc/sys/kernel/pid_max from the default
32768 to get them all to fit.....
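
(That's a one-liner; the value below is just an illustration, anything 
comfortably above your expected thread count works:

	sysctl -w kernel.pid_max=65536

or the equivalent kernel.pid_max line in /etc/sysctl.conf to make it stick 
across reboots.)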

>
>> The NICs are Chelsio T4, but I'm not using any of the
>> TCP stateful offload features for this testing.
>> I don't know if they have ntuple support, but the
>> ethtool version I'm using (2.6.33) doesn't mention it.
>>
>> For kernels I switch back and forth between latest development
>> kernel from Linus's tree, or latest stable kernel, depending
>> on where the kernel development cycle is.  I usually switch
>> to the development kernel around -rc4 or so.
>>
>
> Crazy that this works for you. Btrfs is crashing to me in 20s while running full speed on ssd.

Hmmm.  The only other obvious difference, based on
what I remember from your other posts, is that you're
testing against RBD, right?  I've been testing exclusively
with the Linux kernel client.

???

-- Jim

>
> Stefan
>
>




* Re: OSD Hardware questions
  2012-06-27 18:38           ` Jim Schutt
@ 2012-06-27 18:48             ` Stefan Priebe
  2012-06-27 19:10               ` Jim Schutt
  0 siblings, 1 reply; 26+ messages in thread
From: Stefan Priebe @ 2012-06-27 18:48 UTC (permalink / raw)
  To: Jim Schutt; +Cc: Mark Nelson, ceph-devel

On 27.06.2012 20:38, Jim Schutt wrote:
> Actually, when my 166-client test is running,
> "ps -o pid,nlwp,args -C ceph-osd"
> tells me that I typically have ~1200 threads/OSD.

Huh, I see only 124 threads per OSD, even with your settings.

> Hmmm.  The only other obvious difference, based on
> what I remember from your other posts, is that you're
> testing against RBD, right?  I've been testing exclusively
> with the Linux kernel client.

Right, and SSD. So it might be some timing issue.

Stefan


* Re: OSD Hardware questions
  2012-06-27 18:48             ` Stefan Priebe
@ 2012-06-27 19:10               ` Jim Schutt
  2012-06-27 19:14                 ` Jim Schutt
  0 siblings, 1 reply; 26+ messages in thread
From: Jim Schutt @ 2012-06-27 19:10 UTC (permalink / raw)
  To: Stefan Priebe; +Cc: Mark Nelson, ceph-devel

On 06/27/2012 12:48 PM, Stefan Priebe wrote:
> Am 27.06.2012 20:38, schrieb Jim Schutt:
>> Actually, when my 166-client test is running,
>> "ps -o pid,nlwp,args -C ceph-osd"
>> tells me that I typically have ~1200 threads/OSD.
>
> huh i see only 124 threads per OSD even with your settings.

FWIW:

2 threads/messenger (reader+writer):
   166 clients
  ~200 OSD data peers (this depends on PG distribution/number)
  ~200 OSD heartbeat peers (ditto)
plus
    24 OSD op threads (my tuning)
    24 OSD disk threads (my tuning)
     6 OSD filestore op threads (my tuning)

So, 2*566 + 54 = 1186 threads/OSD

Plus, there's various other worker threads, such as the
timer, message dispatch threads, monitor/MDS messenger, etc.

>
>> Hmmm. The only other obvious difference, based on
>> what I remember from your other posts, is that you're
>> testing against RBD, right? I've been testing exclusively
>> with the Linux kernel client.
>
> right and SSD. So it might be some timing issues.

I guess so.

-- Jim

>
> Stefan
>
>




* Re: OSD Hardware questions
  2012-06-27 19:10               ` Jim Schutt
@ 2012-06-27 19:14                 ` Jim Schutt
  0 siblings, 0 replies; 26+ messages in thread
From: Jim Schutt @ 2012-06-27 19:14 UTC (permalink / raw)
  To: Stefan Priebe; +Cc: Mark Nelson, ceph-devel

On 06/27/2012 01:10 PM, Jim Schutt wrote:
> On 06/27/2012 12:48 PM, Stefan Priebe wrote:
>> Am 27.06.2012 20:38, schrieb Jim Schutt:
>>> Actually, when my 166-client test is running,
>>> "ps -o pid,nlwp,args -C ceph-osd"
>>> tells me that I typically have ~1200 threads/OSD.
>>
>> huh i see only 124 threads per OSD even with your settings.
>
> FWIW:
>
> 2 threads/messenger (reader+writer):
> 166 clients
> ~200 OSD data peers (this depends on PG distribution/number)
> ~200 OSD heartbeat peers (ditto)

~200 OSD peers because I have 12 such servers
with 24 OSDs each.

-- Jim

> plus
> 24 OSD op threads (my tuning)
> 24 OSD disk threads (my tuning)
> 6 OSD filestore op threads (my tuning)
>
> So, 2*566 + 54 = 1186 threads/OSD
>
> Plus, there's various other worker threads, such as the
> timer, message dispatch threads, monitor/MDS messenger, etc.
>
>>
>>> Hmmm. The only other obvious difference, based on
>>> what I remember from your other posts, is that you're
>>> testing against RBD, right? I've been testing exclusively
>>> with the Linux kernel client.
>>
>> right and SSD. So it might be some timing issues.
>
> I guess so.
>
> -- Jim
>
>>
>> Stefan
>>
>>
>




* Re: OSD Hardware questions
  2012-06-27 16:00         ` Mark Nelson
@ 2012-06-28 13:21           ` Stefan Priebe - Profihost AG
  2012-06-28 14:38             ` Mark Nelson
  0 siblings, 1 reply; 26+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-06-28 13:21 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Gregory Farnum, ceph-devel

On 27.06.2012 18:00, Mark Nelson wrote:
> On 06/27/2012 10:28 AM, Stefan Priebe wrote:
>>> Well, as we said, 1GHz/OSD was a WAG (wild-ass guess), but 3.6GHz+/OSD
>>> is farther outside of that range than I would have expected. It might
>>> just be a consequence of using SSDs, since they can sustain so much more
>>> throughput.
>>
>> Sure it was just so much away from 1Ghz that i wanted to ask.
>>
>>> What is the cluster doing when you see those CPU usage numbers?
>> random write I/O from one KVM. 14k I/Ops with random 4k writes.
>>
>> Stefan
>
> I think I was seeing about 80-85% CPU utilization with 5 SSD OSDs on our
> 6-core AMD nodes, but I was just doing sequential writes with rados bench.

While doing sequential writes I see pretty low CPU usage. Random writes 
are the problem.

Stefan


* Re: OSD Hardware questions
  2012-06-28 13:21           ` Stefan Priebe - Profihost AG
@ 2012-06-28 14:38             ` Mark Nelson
  2012-06-28 15:18               ` Alexandre DERUMIER
  2012-06-28 16:00               ` Stefan Priebe
  0 siblings, 2 replies; 26+ messages in thread
From: Mark Nelson @ 2012-06-28 14:38 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG; +Cc: Gregory Farnum, ceph-devel

On 06/28/2012 08:21 AM, Stefan Priebe - Profihost AG wrote:
> Am 27.06.2012 18:00, schrieb Mark Nelson:
>> On 06/27/2012 10:28 AM, Stefan Priebe wrote:
>>>> Well, as we said, 1GHz/OSD was a WAG (wild-ass guess), but 3.6GHz+/OSD
>>>> is farther outside of that range than I would have expected. It might
>>>> just be a consequence of using SSDs, since they can sustain so much
>>>> more
>>>> throughput.
>>>
>>> Sure it was just so much away from 1Ghz that i wanted to ask.
>>>
>>>> What is the cluster doing when you see those CPU usage numbers?
>>> random write I/O from one KVM. 14k I/Ops with random 4k writes.
>>>
>>> Stefan
>>
>> I think I was seeing about 80-85% CPU utilization with 5 SSD OSDs on our
>> 6-core AMD nodes, but I was just doing sequential writes with rados
>> bench.
>
> While doing sequential writes i see pretty low CPU usage. Random writes
> is the problem.
>
> Stefan

It would be interesting to see where all your CPU time is being spent. 
What benchmark are you using to do the random writes?

Mark


* Re: OSD Hardware questions
  2012-06-28 14:38             ` Mark Nelson
@ 2012-06-28 15:18               ` Alexandre DERUMIER
  2012-06-28 15:33                 ` Sage Weil
  2012-06-28 16:01                 ` Stefan Priebe
  2012-06-28 16:00               ` Stefan Priebe
  1 sibling, 2 replies; 26+ messages in thread
From: Alexandre DERUMIER @ 2012-06-28 15:18 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Gregory Farnum, ceph-devel, Stefan Priebe - Profihost AG

Hi,
maybe this can help: I'm doing the same tests as Stefan, 
random 4K writes with 3 nodes with 5 OSDs (15K drives):

fio --filename=[disk] --direct=1 --rw=randwrite --bs=4k --size=100M --numjobs=50 --runtime=30 --group_reporting --name=file1 

I can achieve 5500 IOPS (the disks are not at 100%).

CPU is around 30% idle (8 cores, E5420 @ 2.50GHz).


----- Original Message ----- 

From: "Mark Nelson" <mark.nelson@inktank.com> 
To: "Stefan Priebe - Profihost AG" <s.priebe@profihost.ag> 
Cc: "Gregory Farnum" <greg@inktank.com>, ceph-devel@vger.kernel.org 
Sent: Thursday, 28 June 2012 16:38:32 
Subject: Re: OSD Hardware questions 

On 06/28/2012 08:21 AM, Stefan Priebe - Profihost AG wrote: 
> Am 27.06.2012 18:00, schrieb Mark Nelson: 
>> On 06/27/2012 10:28 AM, Stefan Priebe wrote: 
>>>> Well, as we said, 1GHz/OSD was a WAG (wild-ass guess), but 3.6GHz+/OSD 
>>>> is farther outside of that range than I would have expected. It might 
>>>> just be a consequence of using SSDs, since they can sustain so much 
>>>> more 
>>>> throughput. 
>>> 
>>> Sure it was just so much away from 1Ghz that i wanted to ask. 
>>> 
>>>> What is the cluster doing when you see those CPU usage numbers? 
>>> random write I/O from one KVM. 14k I/Ops with random 4k writes. 
>>> 
>>> Stefan 
>> 
>> I think I was seeing about 80-85% CPU utilization with 5 SSD OSDs on our 
>> 6-core AMD nodes, but I was just doing sequential writes with rados 
>> bench. 
> 
> While doing sequential writes i see pretty low CPU usage. Random writes 
> is the problem. 
> 
> Stefan 

It would be interesting to see where all your CPU time is being spent. 
What benchmark are you using to do the random writes? 

Mark 



-- 
Alexandre Derumier 
Systems and Network Engineer 
Phone: 03 20 68 88 85 / Fax: 03 20 68 90 88 
45 Bvd du Général Leclerc, 59100 Roubaix 
12 rue Marivaux, 75002 Paris 


* Re: OSD Hardware questions
  2012-06-28 15:18               ` Alexandre DERUMIER
@ 2012-06-28 15:33                 ` Sage Weil
  2012-06-28 15:45                   ` Alexandre DERUMIER
  2012-06-28 21:25                   ` Stefan Priebe
  2012-06-28 16:01                 ` Stefan Priebe
  1 sibling, 2 replies; 26+ messages in thread
From: Sage Weil @ 2012-06-28 15:33 UTC (permalink / raw)
  To: Alexandre DERUMIER
  Cc: Mark Nelson, Gregory Farnum, ceph-devel, Stefan Priebe - Profihost AG

On Thu, 28 Jun 2012, Alexandre DERUMIER wrote:
> Hi,
> maybe it can help, I'm doing same tests that stefan, 
> random write  with 4K with 3 nodes with 5 osds (with 15K drives)
> fio --filename=[disk] --direct=1 --rw=randwrite --bs=4k --size=100M --numjobs=50 --runtime=30 --group_reporting --name=file1 
> 
> I can achieve 5500io/s (disks are not at 100%)
> 
> cpu is around 30%idle ( 8 cores E5420  @ 2.50GHz)

Have you tried adjusting 'osd op threads'?  The default is 2, but bumping 
that to, say, 8, might give you better concurrency and throughput.
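
(In ceph.conf that would look something like the sketch below, followed by 
an OSD restart; the value 8 is just a starting point to experiment with:

	[osd]
	        osd op threads = 8
)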

"Mark Nelson" <mark.nelson@inktank.com> wrote:
> It would be interesting to see where all your CPU time is being spent. 
> What benchmark are you using to do the random writes? 

Definitely.  Seeing perf/oprofile/whatever results for the osd under that 
workload would be very interesting!  We need to get perf going in our 
testing environment...

sage


* Re: OSD Hardware questions
  2012-06-28 15:33                 ` Sage Weil
@ 2012-06-28 15:45                   ` Alexandre DERUMIER
  2012-06-28 15:48                     ` Jim Schutt
  2012-06-28 21:25                   ` Stefan Priebe
  1 sibling, 1 reply; 26+ messages in thread
From: Alexandre DERUMIER @ 2012-06-28 15:45 UTC (permalink / raw)
  To: Sage Weil
  Cc: Mark Nelson, Gregory Farnum, ceph-devel, Stefan Priebe - Profihost AG

>>Have you tried adjusting 'osd op threads'? The default is 2, but bumping 
>>that to, say, 8, might give you better concurrency and throughput. 
Not yet, I'll try that tomorrow.

>>Definitely. Seeing perf/oprofile/whatever results for the osd under that 
>>workload would be very interesting! We need to get perf going in our 
>>testing environment... 

I'm not an expert, but if you give me the command line, I'll do it ;)


BTW: wip-flushmin improves random-write performance a lot. I jump from 1500-2000 IOPS (with spikes/slowdowns) to a constant 5500 IOPS.
      With btrfs, I see the random writes flushed sequentially every x seconds (so it really helps with seeks).
      But with xfs, I see constant random writes... (maybe it's an xfs bug...)



----- Original Message ----- 

From: "Sage Weil" <sage@inktank.com> 
To: "Alexandre DERUMIER" <aderumier@odiso.com> 
Cc: "Mark Nelson" <mark.nelson@inktank.com>, "Gregory Farnum" <greg@inktank.com>, ceph-devel@vger.kernel.org, "Stefan Priebe - Profihost AG" <s.priebe@profihost.ag> 
Sent: Thursday, 28 June 2012 17:33:31 
Subject: Re: OSD Hardware questions 

On Thu, 28 Jun 2012, Alexandre DERUMIER wrote: 
> Hi, 
> maybe it can help, I'm doing same tests that stefan, 
> random write with 4K with 3 nodes with 5 osds (with 15K drives) 
> fio --filename=[disk] --direct=1 --rw=randwrite --bs=4k --size=100M --numjobs=50 --runtime=30 --group_reporting --name=file1 
> 
> I can achieve 5500io/s (disks are not at 100%) 
> 
> cpu is around 30%idle ( 8 cores E5420 @ 2.50GHz) 

Have you tried adjusting 'osd op threads'? The default is 2, but bumping 
that to, say, 8, might give you better concurrency and throughput. 

"Mark Nelson" <mark.nelson@inktank.com> wrote: 
> It would be interesting to see where all your CPU time is being spent. 
> What benchmark are you using to do the random writes? 

Definitely. Seeing perf/oprofile/whatever results for the osd under that 
workload would be very interesting! We need to get perf going in our 
testing environment... 

sage 



-- 
Alexandre Derumier 
Systems and Network Engineer 
Phone: 03 20 68 88 85 / Fax: 03 20 68 90 88 
45 Bvd du Général Leclerc, 59100 Roubaix 
12 rue Marivaux, 75002 Paris 


* Re: OSD Hardware questions
  2012-06-28 15:45                   ` Alexandre DERUMIER
@ 2012-06-28 15:48                     ` Jim Schutt
  0 siblings, 0 replies; 26+ messages in thread
From: Jim Schutt @ 2012-06-28 15:48 UTC (permalink / raw)
  To: Alexandre DERUMIER
  Cc: Sage Weil, Mark Nelson, Gregory Farnum, ceph-devel,
	Stefan Priebe - Profihost AG

On 06/28/2012 09:45 AM, Alexandre DERUMIER wrote:
>>> >>Definitely. Seeing perf/oprofile/whatever results for the osd under that
>>> >>workload would be very interesting! We need to get perf going in our
>>> >>testing environment...
> I'm not an expert, but if you give me command line, I'll do it ;)

Thanks to Mark's help, I'm now using:

perf record -g -a sleep 10
perf report --sort symbol --call-graph fractal,5 | more
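
If the system-wide profile is too noisy, limiting it to the OSD processes 
also works; a sketch, assuming a perf new enough to take a PID list and 
comm filters:

	perf record -g -p $(pgrep -d, ceph-osd) sleep 10
	perf report --sort symbol --comms ceph-osd | head -100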

-- Jim



* Re: OSD Hardware questions
  2012-06-28 14:38             ` Mark Nelson
  2012-06-28 15:18               ` Alexandre DERUMIER
@ 2012-06-28 16:00               ` Stefan Priebe
  1 sibling, 0 replies; 26+ messages in thread
From: Stefan Priebe @ 2012-06-28 16:00 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Gregory Farnum, ceph-devel

On 28.06.2012 16:38, Mark Nelson wrote:
> On 06/28/2012 08:21 AM, Stefan Priebe - Profihost AG wrote:
>> While doing sequential writes i see pretty low CPU usage. Random writes
>> is the problem.
>>
>> Stefan
>
> It would be interesting to see where all your CPU time is being spent.

Correct - what do you need?

> What benchmark are you using to do the random writes?
in KVM VM:
export DISK=/dev/vda

fio --filename=$DISK --direct=1 --rw=randwrite --bs=4k --size=200G 
--numjobs=50 --runtime=90 --group_reporting --name=file1

Stefan


* Re: OSD Hardware questions
  2012-06-28 15:18               ` Alexandre DERUMIER
  2012-06-28 15:33                 ` Sage Weil
@ 2012-06-28 16:01                 ` Stefan Priebe
  1 sibling, 0 replies; 26+ messages in thread
From: Stefan Priebe @ 2012-06-28 16:01 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: Mark Nelson, Gregory Farnum, ceph-devel

On 28.06.2012 17:18, Alexandre DERUMIER wrote:
> Hi,
> maybe it can help, I'm doing same tests that stefan,
> random write  with 4K with 3 nodes with 5 osds (with 15K drives)
> fio --filename=[disk] --direct=1 --rw=randwrite --bs=4k --size=100M --numjobs=50 --runtime=30 --group_reporting --name=file1
>
> I can achieve 5500io/s (disks are not at 100%)
>
> cpu is around 30%idle ( 8 cores E5420  @ 2.50GHz)

Mine is 25-30% idle too. Maybe this is the limit? So our OSDs are the problem?

Stefan


* Re: OSD Hardware questions
  2012-06-28 15:33                 ` Sage Weil
  2012-06-28 15:45                   ` Alexandre DERUMIER
@ 2012-06-28 21:25                   ` Stefan Priebe
  2012-06-29 11:37                     ` Mark Nelson
  1 sibling, 1 reply; 26+ messages in thread
From: Stefan Priebe @ 2012-06-28 21:25 UTC (permalink / raw)
  To: Sage Weil; +Cc: Alexandre DERUMIER, Mark Nelson, Gregory Farnum, ceph-devel

On 28.06.2012 17:33, Sage Weil wrote:
> Have you tried adjusting 'osd op threads'?  The default is 2, but bumping
> that to, say, 8, might give you better concurrency and throughput.
For me this doesn't change anything. I believe the ceph-osd processes 
are the problem. I mean, I have 8 cores x 3.6GHz, and 4 ceph-osd processes 
use around 80%.

> "Mark Nelson" <mark.nelson@inktank.com> wrote:
>> It would be interesting to see where all your CPU time is being spent.
>> What benchmark are you using to do the random writes?
>
> Definitely.  Seeing perf/oprofile/whatever results for the osd under that
> workload would be very interesting!  We need to get perf going in our
> testing environment...
I have it working. But even a call graph of 10s is around 120,000 lines 
long?!

Stefan


* Re: OSD Hardware questions
  2012-06-28 21:25                   ` Stefan Priebe
@ 2012-06-29 11:37                     ` Mark Nelson
  2012-06-29 12:35                       ` Stefan Priebe - Profihost AG
  0 siblings, 1 reply; 26+ messages in thread
From: Mark Nelson @ 2012-06-29 11:37 UTC (permalink / raw)
  To: Stefan Priebe; +Cc: Sage Weil, Alexandre DERUMIER, Gregory Farnum, ceph-devel

On 6/28/12 4:25 PM, Stefan Priebe wrote:
> Am 28.06.2012 17:33, schrieb Sage Weil:
>> Have you tried adjusting 'osd op threads'? The default is 2, but bumping
>> that to, say, 8, might give you better concurrency and throughput.
> For me this doesn't change anything. I believe the ceph-osd processes
> are the problem. I mean i've 8 cores x 3,6Ghz and 4 ceph-osd processses
> use around 80%.
>
>> "Mark Nelson" <mark.nelson@inktank.com> wrote:
>>> It would be interesting to see where all your CPU time is being spent.
>>> What benchmark are you using to do the random writes?
>>
>> Definitely. Seeing perf/oprofile/whatever results for the osd under that
>> workload would be very interesting! We need to get perf going in our
>> testing environment...
> I have it working. But even a call graph of 10s is around 120 000 lines
> long ?!
>
> Stefan

What tool did you use to do the profiling?

Mark


* Re: OSD Hardware questions
  2012-06-29 11:37                     ` Mark Nelson
@ 2012-06-29 12:35                       ` Stefan Priebe - Profihost AG
  0 siblings, 0 replies; 26+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-06-29 12:35 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Sage Weil, Alexandre DERUMIER, Gregory Farnum, ceph-devel


On 29.06.2012 13:37, Mark Nelson wrote:
> On 6/28/12 4:25 PM, Stefan Priebe wrote:
>> Am 28.06.2012 17:33, schrieb Sage Weil:
>>> Have you tried adjusting 'osd op threads'? The default is 2, but bumping
>>> that to, say, 8, might give you better concurrency and throughput.
>> For me this doesn't change anything. I believe the ceph-osd processes
>> are the problem. I mean i've 8 cores x 3,6Ghz and 4 ceph-osd processses
>> use around 80%.
>>
>>> "Mark Nelson" <mark.nelson@inktank.com> wrote:
>>>> It would be interesting to see where all your CPU time is being spent.
>>>> What benchmark are you using to do the random writes?
>>>
>>> Definitely. Seeing perf/oprofile/whatever results for the osd under that
>>> workload would be very interesting! We need to get perf going in our
>>> testing environment...
>> I have it working. But even a call graph of 10s is around 120 000 lines
>> long ?!
>>
>> Stefan
>
> What tool did you use to do the profiling?

I used the perf tool bundled with the kernel source. I can send you the 
perf.data file if you want, but I don't see anything interesting. Maybe 
I've just used perf the wrong way.

Stefan


End of thread.

Thread overview: 26+ messages
2012-06-27 13:04 OSD Hardware questions Stefan Priebe - Profihost AG
2012-06-27 13:55 ` Mark Nelson
2012-06-27 14:55   ` Jim Schutt
2012-06-27 15:19     ` Stefan Priebe
2012-06-27 17:23       ` Jim Schutt
2012-06-27 17:54         ` Stefan Priebe
2012-06-27 18:38           ` Jim Schutt
2012-06-27 18:48             ` Stefan Priebe
2012-06-27 19:10               ` Jim Schutt
2012-06-27 19:14                 ` Jim Schutt
2012-06-27 15:53     ` Mark Nelson
2012-06-27 17:59       ` Jim Schutt
2012-06-27 15:13   ` Stefan Priebe
     [not found]     ` <CAPYLRzj916kW=KLy3dMTVPJRoNtPMP_Ejz+YAxRUJ5jZc+HeMg@mail.gmail.com>
2012-06-27 15:28       ` Stefan Priebe
2012-06-27 16:00         ` Mark Nelson
2012-06-28 13:21           ` Stefan Priebe - Profihost AG
2012-06-28 14:38             ` Mark Nelson
2012-06-28 15:18               ` Alexandre DERUMIER
2012-06-28 15:33                 ` Sage Weil
2012-06-28 15:45                   ` Alexandre DERUMIER
2012-06-28 15:48                     ` Jim Schutt
2012-06-28 21:25                   ` Stefan Priebe
2012-06-29 11:37                     ` Mark Nelson
2012-06-29 12:35                       ` Stefan Priebe - Profihost AG
2012-06-28 16:01                 ` Stefan Priebe
2012-06-28 16:00               ` Stefan Priebe
