* Infiniband 40GB
@ 2012-06-03  8:10 Stefan Priebe
  2012-06-03 12:56 ` Mark Nelson
  0 siblings, 1 reply; 36+ messages in thread
From: Stefan Priebe @ 2012-06-03  8:10 UTC (permalink / raw)
  To: ceph-devel

Hi List,

has anybody already tried CEPH over Infiniband 40GB?

Stefan

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Infiniband 40GB
  2012-06-03  8:10 Infiniband 40GB Stefan Priebe
@ 2012-06-03 12:56 ` Mark Nelson
  2012-06-04  6:22   ` Hannes Reinecke
  0 siblings, 1 reply; 36+ messages in thread
From: Mark Nelson @ 2012-06-03 12:56 UTC (permalink / raw)
  To: Stefan Priebe; +Cc: ceph-devel

On 6/3/12 3:10 AM, Stefan Priebe wrote:
> Hi List,
>
> has anybody already tried CEPH over Infiniband 40GB?
>
> Stefan
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

Hi Stefan,

A couple of folks have done DDR IB.  For now you are limited to ipoib 
though.  If you have the hardware available I'd be really curious what 
kind of throughput/latencies you see.

Mark


* Re: Infiniband 40GB
  2012-06-03 12:56 ` Mark Nelson
@ 2012-06-04  6:22   ` Hannes Reinecke
  2012-06-04  7:26     ` Stefan Priebe - Profihost AG
  2012-06-04 12:28     ` Mark Nelson
  0 siblings, 2 replies; 36+ messages in thread
From: Hannes Reinecke @ 2012-06-04  6:22 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Stefan Priebe, ceph-devel

On 06/03/2012 02:56 PM, Mark Nelson wrote:
> On 6/3/12 3:10 AM, Stefan Priebe wrote:
>> Hi List,
>>
>> has anybody already tried CEPH over Infiniband 40GB?
>>
>> Stefan
>> -- 
>> To unsubscribe from this list: send the line "unsubscribe
>> ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
> 
> Hi Stefan,
> 
> A couple of folks have done DDR IB.  For now you are limited to
> ipoib though.  If you have the hardware available I'd be really
> curious what kind of throughput/latencies you see.
> 
Hehe.

Good luck with that.

We've tried on 10GigE with _disastrous_ results.
Up to the point where 1GigE was actually _faster_.

So far we've uncovered two issues:
- intel_idle was/is seriously broken (we've tried on 3.0-stable,
  so might've been fixed by now)
- osd-server is calling 'fsync' on each and every write request.
  Does wonders for performance ...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)


* Re: Infiniband 40GB
  2012-06-04  6:22   ` Hannes Reinecke
@ 2012-06-04  7:26     ` Stefan Priebe - Profihost AG
  2012-06-04  7:39       ` Hannes Reinecke
  2012-06-04 12:28     ` Mark Nelson
  1 sibling, 1 reply; 36+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-06-04  7:26 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: Mark Nelson, ceph-devel

On 04.06.2012 08:22, Hannes Reinecke wrote:
> Hehe.
> Good luck with that.
>
> We've tried on 10GigE with _disastrous_ results.
> Up to the point where 1GigE was actually _faster_.

So do you mean you've tried 10GbE, or 10Gb IPoIB over Infiniband?

> - osd-server is calling 'fsync' on each and every write request.
>    Does wonders for performance ...
Already talked to the ceph guys?

Stefan


* Re: Infiniband 40GB
  2012-06-04  7:26     ` Stefan Priebe - Profihost AG
@ 2012-06-04  7:39       ` Hannes Reinecke
  2012-06-04  7:53         ` Stefan Priebe - Profihost AG
  0 siblings, 1 reply; 36+ messages in thread
From: Hannes Reinecke @ 2012-06-04  7:39 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG; +Cc: Mark Nelson, ceph-devel

On 06/04/2012 09:26 AM, Stefan Priebe - Profihost AG wrote:
> On 04.06.2012 08:22, Hannes Reinecke wrote:
>> Hehe.
>> Good luck with that.
>>
>> We've tried on 10GigE with _disastrous_ results.
>> Up to the point where 1GigE was actually _faster_.
> 
> So you mean you've tried 10GBE or 10GB ipoib with Infiniband?
> 
>> - osd-server is calling 'fsync' on each and every write request.
>>    Does wonders for performance ...
> Already talked to the ceph guys?
> 
Still not there yet. Still need to figure out the exact details;
performance regressions are notoriously hard to track.

But yeah, rumours have it we are in contact.
Project management on our side could be improved, though ;)

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)


* Re: Infiniband 40GB
  2012-06-04  7:39       ` Hannes Reinecke
@ 2012-06-04  7:53         ` Stefan Priebe - Profihost AG
  2012-06-04  8:02           ` Hannes Reinecke
  0 siblings, 1 reply; 36+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-06-04  7:53 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: Mark Nelson, ceph-devel

On 04.06.2012 09:39, Hannes Reinecke wrote:
> On 06/04/2012 09:26 AM, Stefan Priebe - Profihost AG wrote:
>> On 04.06.2012 08:22, Hannes Reinecke wrote:
>>> Hehe.
>>> Good luck with that.
>>>
>>> We've tried on 10GigE with _disastrous_ results.
>>> Up to the point where 1GigE was actually _faster_.
>>
>> So you mean you've tried 10GBE or 10GB ipoib with Infiniband?

Could you please answer this question too? Thx.

Cheers,
Stefan


* Re: Infiniband 40GB
  2012-06-04  7:53         ` Stefan Priebe - Profihost AG
@ 2012-06-04  8:02           ` Hannes Reinecke
  2012-06-04  8:23             ` Stefan Majer
  0 siblings, 1 reply; 36+ messages in thread
From: Hannes Reinecke @ 2012-06-04  8:02 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG; +Cc: Mark Nelson, ceph-devel

On 06/04/2012 09:53 AM, Stefan Priebe - Profihost AG wrote:
> On 04.06.2012 09:39, Hannes Reinecke wrote:
>> On 06/04/2012 09:26 AM, Stefan Priebe - Profihost AG wrote:
>>> On 04.06.2012 08:22, Hannes Reinecke wrote:
>>>> Hehe.
>>>> Good luck with that.
>>>>
>>>> We've tried on 10GigE with _disastrous_ results.
>>>> Up to the point where 1GigE was actually _faster_.
>>>
>>> So you mean you've tried 10GBE or 10GB ipoib with Infiniband?
> 
> Could you please answer this question too? Thx.
> 
This was plain 10GigE, i.e. TCP/IP. Not Infiniband, I'm afraid.

However, given that our problems have not been related to the actual
transport I'd be very much surprised if they would not occur on
Infiniband.

And I would _definitely_ like to hear if someone managed to get any
decent speed (notably write speed) on fast interconnects.
There's always a chance we've messed things up and were just
measuring our crap setup ...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)


* Re: Infiniband 40GB
  2012-06-04  8:02           ` Hannes Reinecke
@ 2012-06-04  8:23             ` Stefan Majer
  2012-06-04  9:21               ` Yann Dupont
  2012-06-05  8:54               ` Stefan Priebe - Profihost AG
  0 siblings, 2 replies; 36+ messages in thread
From: Stefan Majer @ 2012-06-04  8:23 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: Stefan Priebe - Profihost AG, Mark Nelson, ceph-devel

Hi Hannes,

our production environment is running on a 10GbE infrastructure. We had a
lot of trouble until we got to where we are today.
We use Intel X520 D2 cards on our OSDs and Nexus switch
infrastructure. All other cards we tested failed horribly.

Some of the problems we encountered have been:
- page allocation failures in the ixgbe driver --> fixed upstream
- problems with jumbo frames; we had to disable tso, gro, lro -->
this is the most obscure thing
- various tuning via sysctl in the net.tcp and net.ipv4 area --> this
was also the outcome of Stefan's benchmarking odyssey.

But after all this we are actually quite happy, and are only limited by
the speed of the drives (2TB SATA).
The fsync is in fact an fdatasync, which is available in newer glibc. If
you don't use btrfs (we use xfs) you need to use a recent glibc with
fdatasync support.

hope this helps

Greetings
Stefan



-- 
Stefan Majer


* Re: Infiniband 40GB
  2012-06-04  8:23             ` Stefan Majer
@ 2012-06-04  9:21               ` Yann Dupont
  2012-06-04  9:35                 ` Alexandre DERUMIER
  2012-06-04  9:47                 ` Amon Ott
  2012-06-05  8:54               ` Stefan Priebe - Profihost AG
  1 sibling, 2 replies; 36+ messages in thread
From: Yann Dupont @ 2012-06-04  9:21 UTC (permalink / raw)
  To: Stefan Majer
  Cc: Hannes Reinecke, Stefan Priebe - Profihost AG, Mark Nelson, ceph-devel

On 04/06/2012 10:23, Stefan Majer wrote:
> Hi Hannes,
>
> our production environment is running on 10GB infrastructure. We had a
> lot of troubles till we got to where we are today.
> We use Intel X520 D2 cards on our OSD´s and nexus switch
> infrastructure. All other cards we where testing failed horrible.
>

we have Intel Corporation 82599EB 10 Gigabit Dual Port Backplane 
Connection (rev 01)... Don't know the 'commercial name'. ixgbe driver.


> Some of the problems we encountered have been:
> - page allocation failures in the ixgbe driver --> fixed in upstream
> - problems with jumbo frames, we had to disable tso, gro, lro -- >
> this is the most obscure thing
> - various tuning via sysctl in the net.tcp and net.ipv4 area --> this
> was also the outcome of stefan´s benchmarking odysee.

Some tuning we made:

-> Turning off the virtualisation extension in the BIOS. Don't know why, but it
gave us crappy performance. We usually turn it on, because we use KVM a
lot. In our case, the OSDs are on bare metal, and disabling the virtualisation
extension gives us a very big boost.
It may be a BIOS bug in our machines (DELL M610).

-> One of my colleagues played with receive flow steering; the Intel
card supports multi-queue, so it seems we can gain a little with it:

#!/bin/sh

for x in $(seq 0 23); do
    echo FFFFFFFF > /sys/class/net/eth2/queues/rx-${x}/rps_cpus
done
echo 16384 > /proc/sys/net/core/rps_sock_flow_entries
for x in $(seq 0 23); do
    echo 16384 > /sys/class/net/eth2/queues/rx-${x}/rps_flow_cnt
done


>
> But after all this we a quite happy actully and are only limited by
> the speed of the drives (2TB SATA).
> The fsync is a fdatasync in fact which is available in newer glibc. If
> you dont use btrfs (we use xfs) you need to use a recent glibc with
> fdatasync support.

Might that explain why we see lousy performance with xfs right now?
That's the main reason we're stuck with btrfs for the moment.

We're using Debian 'stable'; libc is
libc6                                   2.11.3-3
Probably too old?

Cheers,
-- 
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr




* Re: Infiniband 40GB
  2012-06-04  9:21               ` Yann Dupont
@ 2012-06-04  9:35                 ` Alexandre DERUMIER
  2012-06-04  9:53                   ` Yann Dupont
  2012-06-04  9:47                 ` Amon Ott
  1 sibling, 1 reply; 36+ messages in thread
From: Alexandre DERUMIER @ 2012-06-04  9:35 UTC (permalink / raw)
  To: Yann Dupont
  Cc: Hannes Reinecke, Stefan Priebe - Profihost AG, Mark Nelson,
	ceph-devel, Stefan Majer

Hi,
about this:
>> Turning off Virtualisation extension in BIOS. Don't know why, but it 
>>gaves us crappy performance. We usually put it on, because we use KVM a 
>>lot. In our case, OSD are in bare metal and disabling virtualisation 
>>extension gives us a very big boost. 
>>It may be a BIOS bug in our machines (DELL M610). 

It could be related to the IOMMU, if you pass intel_iommu=on in grub.
I have already had this kind of problem.

With intel_iommu=on, Linux (completely unrelated to KVM) adds a new level
of protection which didn't exist without an IOMMU: the network card,
which without an IOMMU could write (via DMA) to any memory location, is
now only allowed to write to the memory locations the OS wants it to
write to. Theoretically, this can protect the OS against various kinds
of attacks. But what happens now is that every time Linux passes a new
buffer to the card, it needs to change the IOMMU mappings. This
noticeably slows down I/O, unfortunately.




-- 
Alexandre Derumier
Ingénieur Système
Fixe : 03 20 68 88 90
Fax : 03 20 68 90 81
45 Bvd du Général Leclerc 59100 Roubaix - France
12 rue Marivaux 75002 Paris - France


* Re: Infiniband 40GB
  2012-06-04  9:21               ` Yann Dupont
  2012-06-04  9:35                 ` Alexandre DERUMIER
@ 2012-06-04  9:47                 ` Amon Ott
  2012-06-04  9:58                   ` Yann Dupont
                                     ` (3 more replies)
  1 sibling, 4 replies; 36+ messages in thread
From: Amon Ott @ 2012-06-04  9:47 UTC (permalink / raw)
  To: Yann Dupont; +Cc: ceph-devel

[-- Attachment #1: Type: text/plain, Size: 3097 bytes --]

On Monday 04 June 2012 you wrote:
> On 04/06/2012 10:23, Stefan Majer wrote:
> > Hi Hannes,
> >
> > our production environment is running on 10GB infrastructure. We had a
> > lot of troubles till we got to where we are today.
> > We use Intel X520 D2 cards on our OSD´s and nexus switch
> > infrastructure. All other cards we where testing failed horrible.
>
> we have Intel Corporation 82599EB 10 Gigabit Dual Port Backplane
> Connection (rev 01)... Don't know the 'commercial name'. ixgbe driver.
>
> > Some of the problems we encountered have been:
> > - page allocation failures in the ixgbe driver --> fixed in upstream
> > - problems with jumbo frames, we had to disable tso, gro, lro -- >
> > this is the most obscure thing
> > - various tuning via sysctl in the net.tcp and net.ipv4 area --> this
> > was also the outcome of stefan´s benchmarking odysee.
>
> some tuning we made :
>
> -> Turning off Virtualisation extension in BIOS. Don't know why, but it
> gaves us crappy performance. We usually put it on, because we use KVM a
> lot. In our case, OSD are in bare metal and disabling virtualisation
> extension gives us a very big boost.
> It may be a BIOS bug in our machines (DELL M610).
>
> -> One of my colleague played with receive flow steeting ; the intel
> card supports multi queue, so it seems we can gain a little with it :
>
> !/bin/sh
>
> for x in $(seq 0 23); do echo FFFFFFFF >
> /sys/class/net/eth2/queues/rx-${x}/rps_cpus; done
> echo 16384 > /proc/sys/net/core/rps_sock_flow_entries
> for x in $(seq 0 23); do echo 16384 >
> /sys/class/net/eth2/queues/rx-${x}/rps_flow_cnt; done
>
> > But after all this we a quite happy actully and are only limited by
> > the speed of the drives (2TB SATA).
> > The fsync is a fdatasync in fact which is available in newer glibc. If
> > you dont use btrfs (we use xfs) you need to use a recent glibc with
> > fdatasync support.
>
> Does it may explain why we see loosy performance with xfs right now ?
> That the main reason we're stuck with btrfs for the moment.
>
> we're using debian 'stable' : libc is
> libc6                                   2.11.3-3
> probably too old ?

One reason for performance problems with that libc6 version is missing 
syncfs() support. I backported a patch for 2.13, originally by Andreas 
Schwab, schwab@redhat.com, to the Debian stable code. The patch is attached.

Copy the patch to eglibc's debian/patches/, add it to debian/patches/series, 
rebuild the eglibc packages (including libc6) with dpkg-buildpackage, install 
the new libc6-dev, rebuild the ceph packages against it, install and retry. 
AFAIK, not even libc6 in Debian experimental has syncfs() support.

Also see thread "OSD deadlock with cephfs client and OSD on same machine"

Amon Ott
-- 
Dr. Amon Ott
m-privacy GmbH           Tel: +49 30 24342334
Am Köllnischen Park 1    Fax: +49 30 24342336
10179 Berlin             http://www.m-privacy.de

Amtsgericht Charlottenburg, HRB 84946

Geschäftsführer:
 Dipl.-Kfm. Holger Maczkowsky,
 Roman Maczkowsky

GnuPG-Key-ID: 0x2DD3A649

[-- Attachment #2: syncfs.diff --]
[-- Type: text/x-diff, Size: 4110 bytes --]

 Versions.def               |    1 +
 misc/Makefile              |    4 ++--
 misc/Versions              |    3 +++
 misc/syncfs.c              |   33 +++++++++++++++++++++++++++++++++
 posix/unistd.h             |    9 ++++++++-
 sysdeps/unix/syscalls.list |    1 +
 6 files changed, 48 insertions(+), 3 deletions(-)
 create mode 100644 misc/syncfs.c

diff --git a/Versions.def b/Versions.def
index 0ccda50..e478fdd 100644
--- a/Versions.def
+++ b/Versions.def
@@ -30,5 +30,6 @@ libc {
   GLIBC_2.11
   GLIBC_2.12
+  GLIBC_2.14
 %ifdef USE_IN_LIBIO
   HURD_CTHREADS_0.3
 %endif
diff --git a/misc/Makefile b/misc/Makefile
index ee69361..52b13da 100644
--- a/misc/Makefile
+++ b/misc/Makefile
@@ -1,4 +1,4 @@
-# Copyright (C) 1991-2006, 2007, 2009 Free Software Foundation, Inc.
+# Copyright (C) 1991-2006, 2007, 2009, 2011 Free Software Foundation, Inc.
 # This file is part of the GNU C Library.
 
 # The GNU C Library is free software; you can redistribute it and/or
@@ -45,7 +45,7 @@ routines := brk sbrk sstk ioctl \
 	    getdtsz \
 	    gethostname sethostname getdomain setdomain \
 	    select pselect \
-	    acct chroot fsync sync fdatasync reboot \
+	    acct chroot fsync sync fdatasync syncfs reboot \
 	    gethostid sethostid \
 	    vhangup \
 	    swapon swapoff mktemp mkstemp mkstemp64 mkdtemp \
diff --git a/misc/Versions b/misc/Versions
index 3ffe3d1..3a31c7f 100644
--- a/misc/Versions
+++ b/misc/Versions
@@ -143,4 +143,7 @@ libc {
   GLIBC_2.11 {
     mkstemps; mkstemps64; mkostemps; mkostemps64;
   }
+  GLIBC_2.14 {
+    syncfs;
+  }
 }
diff --git a/misc/syncfs.c b/misc/syncfs.c
new file mode 100644
index 0000000..bd7328c
--- /dev/null
+++ b/misc/syncfs.c
@@ -0,0 +1,33 @@
+/* Copyright (C) 2011 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, write to the Free
+   Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA
+   02111-1307 USA.  */
+
+#include <errno.h>
+#include <unistd.h>
+
+/* Make all changes done to all files on the file system associated
+   with FD actually appear on disk.  */
+int
+syncfs (int fd)
+{
+  __set_errno (ENOSYS);
+  return -1;
+}
+
+
+stub_warning (syncfs)
+#include <stub-tag.h>
diff --git a/posix/unistd.h b/posix/unistd.h
index 5ebcaf1..aa11860 100644
--- a/posix/unistd.h
+++ b/posix/unistd.h
@@ -1,4 +1,4 @@
-/* Copyright (C) 1991-2006, 2007, 2008, 2009 Free Software Foundation, Inc.
+/* Copyright (C) 1991-2009, 2010, 2011 Free Software Foundation, Inc.
    This file is part of the GNU C Library.
 
    The GNU C Library is free software; you can redistribute it and/or
@@ -974,6 +974,13 @@ extern int fsync (int __fd);
 #endif /* Use BSD || X/Open || Unix98.  */
 
 
+#ifdef __USE_GNU
+/* Make all changes done to all files on the file system associated
+   with FD actually appear on disk.  */
+extern int syncfs (int __fd) __THROW;
+#endif
+
+
 #if defined __USE_BSD || defined __USE_XOPEN_EXTENDED
 
 /* Return identifier for the current host.  */
diff --git a/sysdeps/unix/syscalls.list b/sysdeps/unix/syscalls.list
index 04ed63c..ad49170 100644
--- a/sysdeps/unix/syscalls.list
+++ b/sysdeps/unix/syscalls.list
@@ -55,6 +55,7 @@ swapoff		-	swapoff		i:s	swapoff
 swapon		-	swapon		i:s	swapon
 symlink		-	symlink		i:ss	__symlink	symlink
 sync		-	sync		i:	sync
+syncfs		-	syncfs		i:i	syncfs
 sys_fstat	fxstat	fstat		i:ip	__syscall_fstat
 sys_mknod	xmknod	mknod		i:sii	__syscall_mknod
 sys_stat	xstat	stat		i:sp	__syscall_stat
-- 
1.7.4




* Re: Infiniband 40GB
  2012-06-04  9:35                 ` Alexandre DERUMIER
@ 2012-06-04  9:53                   ` Yann Dupont
  0 siblings, 0 replies; 36+ messages in thread
From: Yann Dupont @ 2012-06-04  9:53 UTC (permalink / raw)
  To: Alexandre DERUMIER
  Cc: Hannes Reinecke, Stefan Priebe - Profihost AG, Mark Nelson,
	ceph-devel, Stefan Majer

On 04/06/2012 11:35, Alexandre DERUMIER wrote:
> Hi,
> about this:
>>> Turning off Virtualisation extension in BIOS. Don't know why, but it
>>> gaves us crappy performance. We usually put it on, because we use KVM a
>>> lot. In our case, OSD are in bare metal and disabling virtualisation
>>> extension gives us a very big boost.
>>> It may be a BIOS bug in our machines (DELL M610).
>
> It could be related to iommu, if you pass intel_iommu=on in grub.
> I have already had this kind of problem.
>
> When intel_iommu=on, Linux (completely unrelated to KVM) adds a new level
> of protection which didn't exist without an IOMMU - the network card, which
> without an IOMMU could write (via DMA) to any memory location, now is
> not allowed - the card can only write to memory locates which the OS
> wanted it to write. Theoretically, this can protect the OS against
> various kinds of attacks. But what happens now is that every time that
> Linux passes a new buffer to the card, it needs to change the IOMMU
> mappings. This noticably slows down I/O, unfortunately.
>
>

Unfortunately, this is not the case. The Intel card supports it, but 
the DELL M610 doesn't.
And I just checked: our Linux command line doesn't include intel_iommu=on.

BTW, it seems that turning on virtualisation in the BIOS kills performance 
with the in-tree ixgbe driver. The SourceForge one seems less affected. Our 
tests were circa kernel 3.2; it may have changed since.

Cheers,



-- 
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr




* Re: Infiniband 40GB
  2012-06-04  9:47                 ` Amon Ott
@ 2012-06-04  9:58                   ` Yann Dupont
  2012-06-04 11:40                   ` Alexandre DERUMIER
                                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 36+ messages in thread
From: Yann Dupont @ 2012-06-04  9:58 UTC (permalink / raw)
  To: Amon Ott; +Cc: ceph-devel

On 04/06/2012 11:47, Amon Ott wrote:
> even libc6 in Debian experimental has syncfs() support.
>
> Also see thread "OSD deadlock with cephfs client and OSD on same machine"

Great, thanks for the explanation.

... lots of tests to do this afternoon :) I need to convert my OSDs to 
xfs, benchmark with the standard libc, then rebuild libc with your patch & 
retest.

Thanks,
cheers,
-- 
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr




* Re: Infiniband 40GB
  2012-06-04  9:47                 ` Amon Ott
  2012-06-04  9:58                   ` Yann Dupont
@ 2012-06-04 11:40                   ` Alexandre DERUMIER
  2012-06-04 12:59                     ` Mark Nelson
  2012-06-04 15:42                   ` Stefan Priebe
  2012-06-06 10:48                   ` Stefan Priebe - Profihost AG
  3 siblings, 1 reply; 36+ messages in thread
From: Alexandre DERUMIER @ 2012-06-04 11:40 UTC (permalink / raw)
  To: Amon Ott; +Cc: ceph-devel, Yann Dupont

Hi, 

I'm currently doing some tests with xfs on Debian wheezy, with the standard libc6 (2.11.3-3) and a 3.2 kernel.

I'm watching iostat (3 nodes with 5 OSDs), and I see constant writes to the disks (as if the data were flushed from journal to disk every second).

The journal is big enough (20GB tmpfs) to handle 30s of writes.

Do you think it's related to the missing syncfs() support?

-Alexandre


----- Mail original ----- 

De: "Amon Ott" <a.ott@m-privacy.de> 
À: "Yann Dupont" <Yann.Dupont@univ-nantes.fr> 
Cc: ceph-devel@vger.kernel.org 
Envoyé: Lundi 4 Juin 2012 11:47:22 
Objet: Re: Infiniband 40GB 

On Monday 04 June 2012 you wrote: 
> Le 04/06/2012 10:23, Stefan Majer a écrit : 
> > Hi Hannes, 
> > 
> > our production environment is running on 10GB infrastructure. We had a 
> > lot of troubles till we got to where we are today. 
> > We use Intel X520 D2 cards on our OSD´s and nexus switch 
> > infrastructure. All other cards we where testing failed horrible. 
> 
> we have Intel Corporation 82599EB 10 Gigabit Dual Port Backplane 
> Connection (rev 01)... Don't know the 'commercial name'. ixgbe driver. 
> 
> > Some of the problems we encountered have been: 
> > - page allocation failures in the ixgbe driver --> fixed in upstream 
> > - problems with jumbo frames, we had to disable tso, gro, lro -- > 
> > this is the most obscure thing 
> > - various tuning via sysctl in the net.tcp and net.ipv4 area --> this 
> > was also the outcome of stefan´s benchmarking odysee. 
> 
> some tuning we made : 
> 
> -> Turning off Virtualisation extension in BIOS. Don't know why, but it 
> gaves us crappy performance. We usually put it on, because we use KVM a 
> lot. In our case, OSD are in bare metal and disabling virtualisation 
> extension gives us a very big boost. 
> It may be a BIOS bug in our machines (DELL M610). 
> 
> -> One of my colleagues played with receive flow steering; the Intel 
> card supports multi-queue, so it seems we can gain a little with it: 
> 
> #!/bin/sh 
> 
> for x in $(seq 0 23); do echo FFFFFFFF > \
>     /sys/class/net/eth2/queues/rx-${x}/rps_cpus; done 
> echo 16384 > /proc/sys/net/core/rps_sock_flow_entries 
> for x in $(seq 0 23); do echo 16384 > \
>     /sys/class/net/eth2/queues/rx-${x}/rps_flow_cnt; done 
> 
> > But after all this we are quite happy actually and are only limited by 
> > the speed of the drives (2TB SATA). 
> > The fsync is in fact an fdatasync, which is available in newer glibc. If 
> > you don't use btrfs (we use xfs) you need to use a recent glibc with 
> > fdatasync support. 
> 
> Might that explain why we see lousy performance with xfs right now? 
> That's the main reason we're stuck with btrfs for the moment. 
> 
> we're using debian 'stable': libc is 
> libc6 2.11.3-3 
> probably too old? 
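
The receive-flow-steering snippet quoted above, rewritten as a runnable sketch. The interface name (eth2), queue count (24), and the FFFFFFFF/16384 values come from the quoted setup; the extra root parameter is my addition so the function can be dry-run against a scratch directory instead of the live /sys and /proc:

```shell
# rps_tune: write RPS/RFS settings for every RX queue of a NIC.
# Pass "" as root to touch the real /sys and /proc (requires root).
rps_tune() {
    root=$1; iface=$2; nqueues=$3
    i=0
    while [ "$i" -lt "$nqueues" ]; do
        # allow all CPUs to process packets for this queue
        echo FFFFFFFF > "$root/sys/class/net/$iface/queues/rx-$i/rps_cpus"
        # per-queue flow table size for receive flow steering
        echo 16384 > "$root/sys/class/net/$iface/queues/rx-$i/rps_flow_cnt"
        i=$((i + 1))
    done
    # global flow table size
    echo 16384 > "$root/proc/sys/net/core/rps_sock_flow_entries"
}

# Dry-run against a scratch directory instead of the live system:
scratch=$(mktemp -d)
for q in 0 1; do mkdir -p "$scratch/sys/class/net/eth2/queues/rx-$q"; done
mkdir -p "$scratch/proc/sys/net/core"
rps_tune "$scratch" eth2 2
cat "$scratch/sys/class/net/eth2/queues/rx-0/rps_cpus"   # FFFFFFFF
```

On a real machine you would call `rps_tune "" eth2 24` as root, matching the quoted loop over rx-0..rx-23.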

One reason for performance problems with that libc6 version is missing 
syncfs() support. I backported a patch for 2.13, originally by Andreas 
Schwab, schwab@redhat.com, to Debian stable code. Patch is attached. 

Copy the patch to eglibc's debian/patches/, add to debian/patches/series, 
rebuild eglibc packages (including libc6) with dpkg-buildpackage, install new 
libc6-dev, rebuild ceph packages against it, install and retry. AFAIK, not 
even libc6 in Debian experimental has syncfs() support. 

Also see thread "OSD deadlock with cephfs client and OSD on same machine" 

Amon Ott 
-- 
Dr. Amon Ott 
m-privacy GmbH Tel: +49 30 24342334 
Am Köllnischen Park 1 Fax: +49 30 24342336 
10179 Berlin http://www.m-privacy.de 

Amtsgericht Charlottenburg, HRB 84946 

Geschäftsführer: 
Dipl.-Kfm. Holger Maczkowsky, 
Roman Maczkowsky 

GnuPG-Key-ID: 0x2DD3A649 



-- 
Alexandre Derumier 
Ingénieur Système 
Fixe : 03 20 68 88 90 
Fax : 03 20 68 90 81 
45 Bvd du Général Leclerc 59100 Roubaix - France 
12 rue Marivaux 75002 Paris - France 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Infiniband 40GB
  2012-06-04  6:22   ` Hannes Reinecke
  2012-06-04  7:26     ` Stefan Priebe - Profihost AG
@ 2012-06-04 12:28     ` Mark Nelson
  2012-06-04 12:34       ` Tomasz Paszkowski
  1 sibling, 1 reply; 36+ messages in thread
From: Mark Nelson @ 2012-06-04 12:28 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: Stefan Priebe, ceph-devel

On 6/4/12 1:22 AM, Hannes Reinecke wrote:
> On 06/03/2012 02:56 PM, Mark Nelson wrote:
>> On 6/3/12 3:10 AM, Stefan Priebe wrote:
>>> Hi List,
>>>
>>> has anybody already tried CEPH over Infiniband 40GB?
>>>
>>> Stefan
>>
>> Hi Stefan,
>>
>> A couple of folks have done DDR IB.  For now you are limited to
>> ipoib though.  If you have the hardware available I'd be really
>> curious what kind of throughput/latencies you see.
>>
> Hehe.
>
> Good luck with that.
>
> We've tried on 10GigE with _disastrous_ results.
> Up to the point where 1GigE was actually _faster_.

Strange!  Do you see good results with something like iperf?  Internally 
we have 10GE on some of our test nodes and I can get up to around 
600MB/s per node during rados bench testing.

> So far we've uncovered two issues:
> - intel_idle was/is seriously broken (we've tried on 3.0-stable,
>    so might've been fixed by now)
> - osd-server is calling 'fsync' on each and every write request.
>    Does wonders for performance ...

For syncfs support, upgrade to a distro with glibc 2.13+ (i.e. precise). 
I've noticed a significant improvement in our spinning-disk performance 
going from oneiric with kernel 3.3 to precise with kernel 3.4.  I think 
part of this is related to the RAID drivers for the cards we have in our 
test boxes, though.  I'm actually recording blktrace and seekwatcher 
results for all of our tests specifically to look at syncs and disk-seek 
behavior...

>
> Cheers,
>
> Hannes

Mark

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Infiniband 40GB
  2012-06-04 12:28     ` Mark Nelson
@ 2012-06-04 12:34       ` Tomasz Paszkowski
  2012-06-04 12:40         ` Mark Nelson
  0 siblings, 1 reply; 36+ messages in thread
From: Tomasz Paszkowski @ 2012-06-04 12:34 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Hannes Reinecke, Stefan Priebe, ceph-devel

On Mon, Jun 4, 2012 at 2:28 PM, Mark Nelson <mark.nelson@inktank.com> wrote:
>
> For syncfs support, upgrade to a distro with glibc 2.13+ (ie precise). I've
> noticed a significant improvement in our spinning disk performance going
> from oneiric and kernel 3.3 to precise and kernel 3.4.  I think part of this
> is related to the raid drivers for the cards we have in our test boxes
> though.  I'm actually recording blktrace and seekwatcher results for all of
> our tests to specifically look at syncs and disk seek behavior...
>

Correct me if I'm wrong, but AFAIR precise is running a 3.2 kernel.
-- 
Tomasz Paszkowski
SS7, Asterisk, SAN, Datacenter, Cloud Computing
+48500166299

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Infiniband 40GB
  2012-06-04 12:34       ` Tomasz Paszkowski
@ 2012-06-04 12:40         ` Mark Nelson
  0 siblings, 0 replies; 36+ messages in thread
From: Mark Nelson @ 2012-06-04 12:40 UTC (permalink / raw)
  To: Tomasz Paszkowski; +Cc: Hannes Reinecke, Stefan Priebe, ceph-devel

On 6/4/12 7:34 AM, Tomasz Paszkowski wrote:
> On Mon, Jun 4, 2012 at 2:28 PM, Mark Nelson<mark.nelson@inktank.com>  wrote:
>>
>> For syncfs support, upgrade to a distro with glibc 2.13+ (ie precise). I've
>> noticed a significant improvement in our spinning disk performance going
>> from oneiric and kernel 3.3 to precise and kernel 3.4.  I think part of this
>> is related to the raid drivers for the cards we have in our test boxes
>> though.  I'm actually recording blktrace and seekwatcher results for all of
>> our tests to specifically look at syncs and disk seek behavior...
>>
>
> Correct me if I'm wrong, but AFAIR precise is running a 3.2 kernel.

Sorry, I should have been more clear.  We were running oneiric with our 
own kernel 3.3 build and are now running precise with our own kernel 3.4 
build (available on gitbuilder.ceph.com).

Mark

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Infiniband 40GB
  2012-06-04 11:40                   ` Alexandre DERUMIER
@ 2012-06-04 12:59                     ` Mark Nelson
  2012-06-04 13:07                       ` Alexandre DERUMIER
  2012-06-06 16:05                       ` Alexandre DERUMIER
  0 siblings, 2 replies; 36+ messages in thread
From: Mark Nelson @ 2012-06-04 12:59 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: Amon Ott, ceph-devel, Yann Dupont

On 6/4/12 6:40 AM, Alexandre DERUMIER wrote:
> Hi,
>
> I'm currently doing some tests with xfs, debian wheezy with standard libc6 (2.11.3-3) and 3.2 kernel.
>
> I'm doing some iostats(3 nodes with 5 osd), and I see constant writes to disks.(as the datas are flushed each second from journal to disk).
>
> Journal is big enough (20GB tmpfs) to handle 30s of write.
>
> Do you think it's related to the missing syncfs() support ?
>
> -Alexandre

Hi Alexandre,

I've included some seekwatcher results for rados bench tests using 16 
concurrent 4MB writes on an XFS OSD.  One shows ubuntu oneiric and the 
other precise (i.e. no syncfs support vs. syncfs support in libc). 
Unfortunately the original test was on 0.46 and the second test was on 
0.47.2, so multiple things changed between the tests.  Both were tested 
with kernel 3.4.  Interestingly, the seeks/second don't seem to drop much 
but the overall performance has roughly doubled.  This was using a single 
7200rpm disk for the OSD data and a separate 7200rpm disk for the 
journal in both cases.  I'd definitely try 0.47.2 with a new libc though 
and see how that works for you.

ceph 0.46/oneiric:
http://nhm.ceph.com/movies/mailinglist-tests/xfs-osd0-oneiric-3.4.mpg

ceph 0.47.2/precise:
http://nhm.ceph.com/movies/mailinglist-tests/xfs-osd0-precise-3.4.mpg

Mark

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Infiniband 40GB
  2012-06-04 12:59                     ` Mark Nelson
@ 2012-06-04 13:07                       ` Alexandre DERUMIER
  2012-06-04 13:28                         ` Mark Nelson
  2012-06-06 16:05                       ` Alexandre DERUMIER
  1 sibling, 1 reply; 36+ messages in thread
From: Alexandre DERUMIER @ 2012-06-04 13:07 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Amon Ott, ceph-devel, Yann Dupont

Thanks Mark,
I'll rebuild my cluster with ubuntu precise tomorrow. (Don't have time to backport/maintain libc6 ;)


BTW, do you mainly use Ubuntu at Inktank for your tests?

I'd like to have a setup as close as possible to the Inktank setup.


----- Original message ----- 

From: "Mark Nelson" <mark.nelson@inktank.com> 
To: "Alexandre DERUMIER" <aderumier@odiso.com> 
Cc: "Amon Ott" <a.ott@m-privacy.de>, ceph-devel@vger.kernel.org, "Yann Dupont" <Yann.Dupont@univ-nantes.fr> 
Sent: Monday, 4 June 2012 14:59:58 
Subject: Re: Infiniband 40GB 

On 6/4/12 6:40 AM, Alexandre DERUMIER wrote: 
> Hi, 
> 
> I'm currently doing some tests with xfs, debian wheezy with standard libc6 (2.11.3-3) and 3.2 kernel. 
> 
> I'm doing some iostats(3 nodes with 5 osd), and I see constant writes to disks.(as the datas are flushed each second from journal to disk). 
> 
> Journal is big enough (20GB tmpfs) to handle 30s of write. 
> 
> Do you think it's related to the missing syncfs() support ? 
> 
> -Alexandre 

Hi Alexandre, 

I've included some seekwatcher results for rados bench tests using 16 
concurrent 4MB writes on XFS OSD. One shows ubuntu oneiric and the 
other precise (ie no syncfs support vs syncfs support in libc). 
Unfortunately the original test was on 0.46 and the second test was on 
0.47.2, so multiple things changed between the tests. Both were tested 
with kernel 3.4. Interestingly the seeks/second don't seem to drop much 
but the overall performance has about doubled. This was using a single 
7200rpm disk for the OSD data disk and a seperate 7200rpm disk for the 
journal in both cases. I'd definitely try 0.47.2 with a new libc though 
and see how that works for you. 

ceph 0.46/oneiric: 
http://nhm.ceph.com/movies/mailinglist-tests/xfs-osd0-oneiric-3.4.mpg 

ceph 0.47.2/precise: 
http://nhm.ceph.com/movies/mailinglist-tests/xfs-osd0-precise-3.4.mpg 

Mark 




^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Infiniband 40GB
  2012-06-04 13:07                       ` Alexandre DERUMIER
@ 2012-06-04 13:28                         ` Mark Nelson
  2012-06-04 15:11                           ` Gregory Farnum
  0 siblings, 1 reply; 36+ messages in thread
From: Mark Nelson @ 2012-06-04 13:28 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: Amon Ott, ceph-devel, Yann Dupont

Hi Alexandre,

A lot of our testing is on Ubuntu right now.  I'm using the ceph and 
kernel debs from ceph.gitbuilder.com for my tests.  Post some results to 
the list once you get your cluster set up!

Thanks,
Mark


On 6/4/12 8:07 AM, Alexandre DERUMIER wrote:
> Thanks Mark,
> I'll rebuild my cluster with ubuntu precise tomorrow. (Don't have time to backport/maintain libc6 ;)
>
>
> BTW, do you use mainly ubuntu at intank for your tests ?
>
> I'd like to have a setup as close as possible of intank setup.
>
>
> ----- Mail original -----
>
> De: "Mark Nelson"<mark.nelson@inktank.com>
> À: "Alexandre DERUMIER"<aderumier@odiso.com>
> Cc: "Amon Ott"<a.ott@m-privacy.de>, ceph-devel@vger.kernel.org, "Yann Dupont"<Yann.Dupont@univ-nantes.fr>
> Envoyé: Lundi 4 Juin 2012 14:59:58
> Objet: Re: Infiniband 40GB
>
> On 6/4/12 6:40 AM, Alexandre DERUMIER wrote:
>> Hi,
>>
>> I'm currently doing some tests with xfs, debian wheezy with standard libc6 (2.11.3-3) and 3.2 kernel.
>>
>> I'm doing some iostats(3 nodes with 5 osd), and I see constant writes to disks.(as the datas are flushed each second from journal to disk).
>>
>> Journal is big enough (20GB tmpfs) to handle 30s of write.
>>
>> Do you think it's related to the missing syncfs() support ?
>>
>> -Alexandre
>
> Hi Alexandre,
>
> I've included some seekwatcher results for rados bench tests using 16
> concurrent 4MB writes on XFS OSD. One shows ubuntu oneiric and the
> other precise (ie no syncfs support vs syncfs support in libc).
> Unfortunately the original test was on 0.46 and the second test was on
> 0.47.2, so multiple things changed between the tests. Both were tested
> with kernel 3.4. Interestingly the seeks/second don't seem to drop much
> but the overall performance has about doubled. This was using a single
> 7200rpm disk for the OSD data disk and a seperate 7200rpm disk for the
> journal in both cases. I'd definitely try 0.47.2 with a new libc though
> and see how that works for you.
>
> ceph 0.46/oneiric:
> http://nhm.ceph.com/movies/mailinglist-tests/xfs-osd0-oneiric-3.4.mpg
>
> ceph 0.47.2/precise:
> http://nhm.ceph.com/movies/mailinglist-tests/xfs-osd0-precise-3.4.mpg
>
> Mark
>
>
>


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Infiniband 40GB
  2012-06-04 13:28                         ` Mark Nelson
@ 2012-06-04 15:11                           ` Gregory Farnum
  2012-06-04 15:34                             ` Mark Nelson
  0 siblings, 1 reply; 36+ messages in thread
From: Gregory Farnum @ 2012-06-04 15:11 UTC (permalink / raw)
  To: ceph-devel; +Cc: Alexandre DERUMIER, Amon Ott, Yann Dupont, Mark Nelson

On Monday, June 4, 2012 at 6:28 AM, Mark Nelson wrote:
> Hi Alexandre,
> 
> A lot of our testing is on Ubuntu right now. I'm using the ceph and 
> kernel debs from ceph.gitbuilder.com for my tests. Post some results to 
> the list once you get your cluster setup!
> 

I think he means gitbuilder.ceph.com. ;)
-Greg


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Infiniband 40GB
  2012-06-04 15:11                           ` Gregory Farnum
@ 2012-06-04 15:34                             ` Mark Nelson
  0 siblings, 0 replies; 36+ messages in thread
From: Mark Nelson @ 2012-06-04 15:34 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel, Alexandre DERUMIER, Amon Ott, Yann Dupont

On 06/04/2012 10:11 AM, Gregory Farnum wrote:
> On Monday, June 4, 2012 at 6:28 AM, Mark Nelson wrote:
>> Hi Alexandre,
>>
>> A lot of our testing is on Ubuntu right now. I'm using the ceph and
>> kernel debs from ceph.gitbuilder.com for my tests. Post some results to
>> the list once you get your cluster setup!
>>
>
> I think he means gitbuilder.ceph.com. ;)
> -Greg
>

Doh!  This is why I need caffeine before writing emails.  Thanks Greg. :)

Mark

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Infiniband 40GB
  2012-06-04  9:47                 ` Amon Ott
  2012-06-04  9:58                   ` Yann Dupont
  2012-06-04 11:40                   ` Alexandre DERUMIER
@ 2012-06-04 15:42                   ` Stefan Priebe
  2012-06-05  7:08                     ` Amon Ott
  2012-06-06 10:48                   ` Stefan Priebe - Profihost AG
  3 siblings, 1 reply; 36+ messages in thread
From: Stefan Priebe @ 2012-06-04 15:42 UTC (permalink / raw)
  To: Amon Ott; +Cc: Yann Dupont, ceph-devel

Hi Amon,

thanks for your backported patch. Unfortunately it doesn't apply cleanly to 
Debian squeeze stable, as it wants glibc 2.12 in Versions.def but Debian 
is only at 2.11. Do you use another patch too?

Stefan

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Infiniband 40GB
  2012-06-04 15:42                   ` Stefan Priebe
@ 2012-06-05  7:08                     ` Amon Ott
  2012-06-05  7:46                       ` Stefan Priebe - Profihost AG
  0 siblings, 1 reply; 36+ messages in thread
From: Amon Ott @ 2012-06-05  7:08 UTC (permalink / raw)
  To: Stefan Priebe; +Cc: Yann Dupont, ceph-devel

On Monday 04 June 2012 Stefan Priebe wrote:
> Hi Amon,
>
> thanks for your backported patch. At least it doesn't cleanly apply to
> debian squeeze stable as it wants a glic 2.12 in Versions.def but Debian
> is only at 2.11? Do you use another patch too?

I ripped the patch right out of our previously built 2.11.3-3 source tree. It 
needs to be last in the series file, because several existing Debian patches 
modify the sources at various places. I could also make our compiled packages 
available to you for download.

Amon Ott

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Infiniband 40GB
  2012-06-05  7:08                     ` Amon Ott
@ 2012-06-05  7:46                       ` Stefan Priebe - Profihost AG
  0 siblings, 0 replies; 36+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-06-05  7:46 UTC (permalink / raw)
  To: Amon Ott; +Cc: Yann Dupont, ceph-devel

On 05.06.2012 09:08, Amon Ott wrote:
> On Monday 04 June 2012 wrote Stefan Priebe:
>> Hi Amon,
>>
>> thanks for your backported patch. At least it doesn't cleanly apply to
>> debian squeeze stable as it wants a glic 2.12 in Versions.def but Debian
>> is only at 2.11? Do you use another patch too?
>
> I ripped the patch right out of our previously built 2.11.3-3 source tree. It
> needs to be last in the series file, because several existing Debian patches
> modify the sources at various places. I could also make our compiled packages
> available to you for download.

Sorry, I had added the file at the front of the series file...

Thanks
Stefan

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Infiniband 40GB
  2012-06-04  8:23             ` Stefan Majer
  2012-06-04  9:21               ` Yann Dupont
@ 2012-06-05  8:54               ` Stefan Priebe - Profihost AG
  1 sibling, 0 replies; 36+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-06-05  8:54 UTC (permalink / raw)
  To: Stefan Majer; +Cc: Hannes Reinecke, Mark Nelson, ceph-devel

Hi Stefan,

On 04.06.2012 10:23, Stefan Majer wrote:
> our production environment is running on 10GB infrastructure. We had a
> lot of trouble until we got to where we are today.
> We use Intel X520 D2 cards on our OSDs and Nexus switch
> infrastructure. All other cards we tested failed horribly.

Have you also tried Emulex cards? (also used by HP)

Stefan

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Infiniband 40GB
  2012-06-04  9:47                 ` Amon Ott
                                     ` (2 preceding siblings ...)
  2012-06-04 15:42                   ` Stefan Priebe
@ 2012-06-06 10:48                   ` Stefan Priebe - Profihost AG
  2012-06-06 10:57                     ` Amon Ott
  3 siblings, 1 reply; 36+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-06-06 10:48 UTC (permalink / raw)
  To: Amon Ott; +Cc: Yann Dupont, ceph-devel

Hi Amon,

i've added your patch:
# strings /lib/libc-2.11.3.so |grep -i syncfs
syncfs

But configure of ceph still claims there is no syncfs support.

# ./configure |grep -i sync
checking for syncfs... no
checking for sync_file_range... yes

Any ideas?

Hint: I'm compiling my packages in an OpenVZ RHEL6-based virtual 
container, so the kernel I'm compiling on does not support syncfs. 
Could this be the reason?

Stefan

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Infiniband 40GB
  2012-06-06 10:48                   ` Stefan Priebe - Profihost AG
@ 2012-06-06 10:57                     ` Amon Ott
  2012-06-06 11:02                       ` Stefan Priebe - Profihost AG
  0 siblings, 1 reply; 36+ messages in thread
From: Amon Ott @ 2012-06-06 10:57 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG; +Cc: ceph-devel

On Wednesday 06 June 2012 Stefan Priebe - Profihost AG wrote:
> Hi Amon,
>
> i've added your patch:
> # strings /lib/libc-2.11.3.so |grep -i syncfs
> syncfs
>
> But configure of ceph still claims there is no syncfs support.
>
> # ./configure |grep -i sync
> checking for syncfs... no
> checking for sync_file_range... yes
>
> Any ideas?

Did you also install the new libc6-dev, which contains the new header files?

Amon Ott

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Infiniband 40GB
  2012-06-06 10:57                     ` Amon Ott
@ 2012-06-06 11:02                       ` Stefan Priebe - Profihost AG
  2012-06-07 11:33                         ` Amon Ott
  0 siblings, 1 reply; 36+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-06-06 11:02 UTC (permalink / raw)
  To: Amon Ott; +Cc: ceph-devel

On 06.06.2012 12:57, Amon Ott wrote:
> On Wednesday 06 June 2012 wrote Stefan Priebe - Profihost AG:
>> Hi Amon,
>>
>> i've added your patch:
>> # strings /lib/libc-2.11.3.so |grep -i syncfs
>> syncfs
>>
>> But configure of ceph still claims there is no syncfs support.
>>
>> # ./configure |grep -i sync
>> checking for syncfs... no
>> checking for sync_file_range... yes
>>
>> Any ideas?
>
> Did you also install the new libc6-dev, which contains the new header files?

Yes.

/usr/include/unistd.h:
extern int syncfs (int __fd) __THROW;

/usr/include/gnu/stubs-64.h:
#define __stub_syncfs
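
That `__stub_syncfs` define is the likely culprit: glibc's per-architecture stubs header (gnu/stubs-64.h here, which is also why the 32-bit vs 64-bit question matters) marks wrappers the glibc build considered unimplemented, and autoconf-generated function checks honor those defines, so configure answers "no" even though `strings` finds the symbol in libc. A rough by-hand replica of the probe (a sketch; assumes a C compiler is reachable as `cc`, paths are illustrative):

```shell
# Reproduce (roughly) what ceph's "checking for syncfs..." does under the hood.
workdir=$(mktemp -d)
cat > "$workdir/conftest.c" <<'EOF'
#define _GNU_SOURCE
#include <unistd.h>
/* autoconf-style stub guard: refuse to build if glibc marks syncfs
 * as an unimplemented stub in gnu/stubs*.h */
#if defined __stub_syncfs || defined __stub___syncfs
#error "syncfs is declared but stubbed out in gnu/stubs*.h"
#endif
int main(void) { return syncfs(0) < -1; }
EOF
if cc "$workdir/conftest.c" -o "$workdir/conftest" 2>/dev/null; then
    result=yes
else
    result=no
fi
echo "checking for syncfs... $result"
```

So a backport patch has to remove the `__stub_syncfs` line from the stubs header for the architecture actually being built, not just add the symbol to libc.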

Stefan

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Infiniband 40GB
  2012-06-04 12:59                     ` Mark Nelson
  2012-06-04 13:07                       ` Alexandre DERUMIER
@ 2012-06-06 16:05                       ` Alexandre DERUMIER
  2012-06-06 16:43                         ` Mark Nelson
  1 sibling, 1 reply; 36+ messages in thread
From: Alexandre DERUMIER @ 2012-06-06 16:05 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Amon Ott, ceph-devel, Yann Dupont

Hi, I have rebuilt my cluster with ubuntu precise:

- kernel 3.2
- ceph 0.47.2
- libc6 2.15
- 3 nodes, 5 OSDs (xfs) per node, and 1 tmpfs with 5 journal files.

I launched rados bench,
and again I see constant writes to xfs....

Maybe this is related to tmpfs?


I'll retry with kernel 3.4 from Inktank tomorrow.
I'll also try with the journal on a physical disk with an xfs partition.

I'll keep you posted.


----- Original message ----- 

From: "Mark Nelson" <mark.nelson@inktank.com> 
To: "Alexandre DERUMIER" <aderumier@odiso.com> 
Cc: "Amon Ott" <a.ott@m-privacy.de>, ceph-devel@vger.kernel.org, "Yann Dupont" <Yann.Dupont@univ-nantes.fr> 
Sent: Monday, 4 June 2012 14:59:58 
Subject: Re: Infiniband 40GB 

On 6/4/12 6:40 AM, Alexandre DERUMIER wrote: 
> Hi, 
> 
> I'm currently doing some tests with xfs, debian wheezy with standard libc6 (2.11.3-3) and 3.2 kernel. 
> 
> I'm doing some iostats(3 nodes with 5 osd), and I see constant writes to disks.(as the datas are flushed each second from journal to disk). 
> 
> Journal is big enough (20GB tmpfs) to handle 30s of write. 
> 
> Do you think it's related to the missing syncfs() support ? 
> 
> -Alexandre 

Hi Alexandre, 

I've included some seekwatcher results for rados bench tests using 16 
concurrent 4MB writes on XFS OSD. One shows ubuntu oneiric and the 
other precise (ie no syncfs support vs syncfs support in libc). 
Unfortunately the original test was on 0.46 and the second test was on 
0.47.2, so multiple things changed between the tests. Both were tested 
with kernel 3.4. Interestingly the seeks/second don't seem to drop much 
but the overall performance has about doubled. This was using a single 
7200rpm disk for the OSD data disk and a seperate 7200rpm disk for the 
journal in both cases. I'd definitely try 0.47.2 with a new libc though 
and see how that works for you. 

ceph 0.46/oneiric: 
http://nhm.ceph.com/movies/mailinglist-tests/xfs-osd0-oneiric-3.4.mpg 

ceph 0.47.2/precise: 
http://nhm.ceph.com/movies/mailinglist-tests/xfs-osd0-precise-3.4.mpg 

Mark 




^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Infiniband 40GB
  2012-06-06 16:05                       ` Alexandre DERUMIER
@ 2012-06-06 16:43                         ` Mark Nelson
  0 siblings, 0 replies; 36+ messages in thread
From: Mark Nelson @ 2012-06-06 16:43 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: Amon Ott, ceph-devel, Yann Dupont

Hi Alexandre,

If you can run blktrace during your test on one of the OSD data disks 
and send me the results I can take a look at them.  Also, the rados 
bench settings and output would be useful too.
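
The capture Mark is asking for can be sketched as a small helper (device, pool, and output prefix are placeholders; blktrace requires root and must run on the OSD node):

```shell
# trace_osd_bench: run blktrace on an OSD data disk for the duration of
# a rados bench write test, then post-process the trace with blkparse.
trace_osd_bench() {
    dev=$1; pool=$2; out=$3
    blktrace -d "$dev" -o "$out" &    # capture block-layer events in background
    bt=$!
    rados -p "$pool" bench 60 write -t 16
    kill -INT "$bt"                   # blktrace flushes its buffers on SIGINT
    wait "$bt" 2>/dev/null
    blkparse -i "$out" > "$out.txt"   # human-readable trace for analysis
}

# Example (as root, on an OSD node):
#   trace_osd_bench /dev/sdb pool3 osd0-trace
```

The resulting per-CPU `osd0-trace.blktrace.*` files are also what seekwatcher consumes to produce the throughput/seek movies referenced earlier in the thread.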

Thanks,
Mark

On 6/6/12 11:05 AM, Alexandre DERUMIER wrote:
> Hi, I have rebuild my cluster with ubuntu precise,
>
> -kernel 3.2
> -ceph 0.47.2
> -libc6 2.15
> -3 nodes - 5 osd (xfs) by node and 1 tmpfs with 5 journal file.
>
> I had launch rados bench,
> and I see again constant writes to xfs....
>
> Maybe this is related to tmpfs ?
>
>
> I'll retry with kernel 3.4 from intank tomorrow.
> I'll also try with journal on a physical disk with xfs partition.
>
> I'll keep you in touch.
>
>
> ----- Mail original -----
>
> De: "Mark Nelson"<mark.nelson@inktank.com>
> À: "Alexandre DERUMIER"<aderumier@odiso.com>
> Cc: "Amon Ott"<a.ott@m-privacy.de>, ceph-devel@vger.kernel.org, "Yann Dupont"<Yann.Dupont@univ-nantes.fr>
> Envoyé: Lundi 4 Juin 2012 14:59:58
> Objet: Re: Infiniband 40GB
>
> On 6/4/12 6:40 AM, Alexandre DERUMIER wrote:
>> Hi,
>>
>> I'm currently doing some tests with xfs, debian wheezy with standard libc6 (2.11.3-3) and 3.2 kernel.
>>
>> I'm doing some iostats(3 nodes with 5 osd), and I see constant writes to disks.(as the datas are flushed each second from journal to disk).
>>
>> Journal is big enough (20GB tmpfs) to handle 30s of write.
>>
>> Do you think it's related to the missing syncfs() support ?
>>
>> -Alexandre
>
> Hi Alexandre,
>
> I've included some seekwatcher results for rados bench tests using 16
> concurrent 4MB writes on XFS OSD. One shows ubuntu oneiric and the
> other precise (ie no syncfs support vs syncfs support in libc).
> Unfortunately the original test was on 0.46 and the second test was on
> 0.47.2, so multiple things changed between the tests. Both were tested
> with kernel 3.4. Interestingly the seeks/second don't seem to drop much
> but the overall performance has about doubled. This was using a single
> 7200rpm disk for the OSD data disk and a seperate 7200rpm disk for the
> journal in both cases. I'd definitely try 0.47.2 with a new libc though
> and see how that works for you.
>
> ceph 0.46/oneiric:
> http://nhm.ceph.com/movies/mailinglist-tests/xfs-osd0-oneiric-3.4.mpg
>
> ceph 0.47.2/precise:
> http://nhm.ceph.com/movies/mailinglist-tests/xfs-osd0-precise-3.4.mpg
>
> Mark
>
>
>


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Infiniband 40GB
  2012-06-06 11:02                       ` Stefan Priebe - Profihost AG
@ 2012-06-07 11:33                         ` Amon Ott
  2012-06-07 12:44                           ` Stefan Priebe - Profihost AG
  0 siblings, 1 reply; 36+ messages in thread
From: Amon Ott @ 2012-06-07 11:33 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG; +Cc: ceph-devel

On Wednesday 06 June 2012 Stefan Priebe - Profihost AG wrote:
> Am 06.06.2012 12:57, schrieb Amon Ott:
> > On Wednesday 06 June 2012 wrote Stefan Priebe - Profihost AG:
> >> Hi Amon,
> >>
> >> i've added your patch:
> >> # strings /lib/libc-2.11.3.so |grep -i syncfs
> >> syncfs
> >>
> >> But configure of ceph still claims there is no syncfs support.
> >>
> >> # ./configure |grep -i sync
> >> checking for syncfs... no
> >> checking for sync_file_range... yes
> >>
> >> Any ideas?
> >
> > Did you also install the new libc6-dev, which contains the new header
> > files?
>
> Yes.
>
> /usr/include/unistd.h:
> extern int syncfs (int __fd) __THROW;
>
> /usr/include/gnu/stubs-64.h:
> #define __stub_syncfs

Are you building on 32 or 64 Bit? We have 32 here.

Amon Ott

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Infiniband 40GB
  2012-06-07 11:33                         ` Amon Ott
@ 2012-06-07 12:44                           ` Stefan Priebe - Profihost AG
  0 siblings, 0 replies; 36+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-06-07 12:44 UTC (permalink / raw)
  To: Amon Ott; +Cc: ceph-devel

On 07.06.2012 13:33, Amon Ott wrote:
> On Wednesday 06 June 2012 wrote Stefan Priebe - Profihost AG:
>> Am 06.06.2012 12:57, schrieb Amon Ott:
>>> On Wednesday 06 June 2012 wrote Stefan Priebe - Profihost AG:
>> /usr/include/unistd.h:
>> extern int syncfs (int __fd) __THROW;
>>
>> /usr/include/gnu/stubs-64.h:
>> #define __stub_syncfs
>
> Are you building on 32 or 64 Bit? We have 32 here.

64 bit, but does this make a difference?

Stefan

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Infiniband 40GB
  2012-06-07 11:25   ` Alexandre DERUMIER
@ 2012-06-07 17:15     ` Mark Nelson
  0 siblings, 0 replies; 36+ messages in thread
From: Mark Nelson @ 2012-06-07 17:15 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: Amon Ott, ceph-devel, Yann Dupont

On 6/7/12 6:25 AM, Alexandre DERUMIER wrote:
> other tests done today (kernel 3.4 - ubuntu precise):
>
> 3 nodes with 5 osd with btrfs, 1GB journal in tmpfs forced to writeahead
> 3 nodes with 1 osd with xfs, 8GB journal in tmpfs
> 3 nodes with 1 osd with btrfs, 8GB journal in tmpfs forced to writeahead
>
> 3 nodes with 5 osd with btrfs, 20GB journal on disk forced to writeahead
> 3 nodes with 1 osd with xfs, 20GB journal on disk
> 3 nodes with 1 osd with btrfs, 20GB journal on disk forced to writeahead
>
>
> same behaviour in all cases: writes to disk are constant.
>
> benched with:
> rados -p pool3 bench 60 write -t 16
>
> also with
> fio, bonnie, and random/seq writes from a guest VM with different block sizes.
>
>
>
>
> ----- Original message -----
>
> From: "Alexandre DERUMIER"<aderumier@odiso.com>
> To: "Mark Nelson"<mark.nelson@inktank.com>
> Cc: "Amon Ott"<a.ott@m-privacy.de>, ceph-devel@vger.kernel.org, "Yann Dupont"<Yann.Dupont@univ-nantes.fr>
> Sent: Thursday 7 June 2012 05:31:15
> Subject: Re: Infiniband 40GB
>
> Hi again,
> I have done some tests with journals on a real disk; I see the same behaviour.
>
> iostat shows constant writes to the journal and to the disks at the same time, from the beginning of the benchmark.
>
> Maybe I can try to use a different partition for each journal? (Currently I have 1 partition holding the 5 journal files, one per OSD.)
>
> -Alexandre
>
>
>
> ----- Original message -----
>
> From: "Alexandre DERUMIER"<aderumier@odiso.com>
> To: "Mark Nelson"<mark.nelson@inktank.com>
> Cc: "Amon Ott"<a.ott@m-privacy.de>, ceph-devel@vger.kernel.org, "Yann Dupont"<Yann.Dupont@univ-nantes.fr>
> Sent: Thursday 7 June 2012 05:11:15
> Subject: Re: Infiniband 40GB
>
> Hi Mark,
> I have attached a blktrace of /dev/sdb1 of node1 (osd.0)
>
> and also iostat (showing constant writes)
>
> bench used:
>
> rados -p pool3 bench 60 write -t 16
>
>
> kernel used: 3.4 from Inktank
>
> I'll do tests with the journal on an xfs partition today
>

Hi Alexandre,

I'll try to take a look at the data you sent me later today.

Thanks!
Mark


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Infiniband 40GB
  2012-06-07  3:31 ` Alexandre DERUMIER
@ 2012-06-07 11:25   ` Alexandre DERUMIER
  2012-06-07 17:15     ` Mark Nelson
  0 siblings, 1 reply; 36+ messages in thread
From: Alexandre DERUMIER @ 2012-06-07 11:25 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Amon Ott, ceph-devel, Yann Dupont

Other tests done today: (kernel 3.4 - ubuntu precise)

3 nodes with 5 osd with btrfs, 1GB journal in tmpfs forced to writeahead
3 nodes with 1 osd with xfs, 8GB journal in tmpfs
3 nodes with 1 osd with btrfs, 8GB journal in tmpfs forced to writeahead

3 nodes with 5 osd with btrfs, 20GB journal on disk forced to writeahead
3 nodes with 1 osd with xfs, 20GB journal on disk
3 nodes with 1 osd with btrfs, 20GB journal on disk forced to writeahead

Same behaviour in all cases: writes to disk are constant.

Benched with:
rados -p pool3 bench 60 write -t 16

Also with
fio, bonnie, random/seq writes from a guest VM with different block sizes.




----- Original message -----

From: "Alexandre DERUMIER" <aderumier@odiso.com>
To: "Mark Nelson" <mark.nelson@inktank.com>
Cc: "Amon Ott" <a.ott@m-privacy.de>, ceph-devel@vger.kernel.org, "Yann Dupont" <Yann.Dupont@univ-nantes.fr>
Sent: Thursday 7 June 2012 05:31:15
Subject: Re: Infiniband 40GB

Hi again,
I have done some tests with journals on a real disk; I see the same behaviour.

iostat shows constant writes to the journal and to the disks at the same time, from the beginning of the benchmark.

Maybe I can try to use a different partition for each journal? (Currently I have 1 partition holding the 5 journal files, one per OSD.)

-Alexandre 



----- Original message -----

From: "Alexandre DERUMIER" <aderumier@odiso.com>
To: "Mark Nelson" <mark.nelson@inktank.com>
Cc: "Amon Ott" <a.ott@m-privacy.de>, ceph-devel@vger.kernel.org, "Yann Dupont" <Yann.Dupont@univ-nantes.fr>
Sent: Thursday 7 June 2012 05:11:15
Subject: Re: Infiniband 40GB

Hi Mark,
I have attached a blktrace of /dev/sdb1 of node1 (osd.0) 

and also iostat (showing constant writes) 

bench used: 

rados -p pool3 bench 60 write -t 16 


kernel used: 3.4 from Inktank

I'll do tests with the journal on an xfs partition today

----- Original message -----

From: "Mark Nelson" <mark.nelson@inktank.com>
To: "Alexandre DERUMIER" <aderumier@odiso.com>
Cc: "Amon Ott" <a.ott@m-privacy.de>, ceph-devel@vger.kernel.org, "Yann Dupont" <Yann.Dupont@univ-nantes.fr>
Sent: Wednesday 6 June 2012 18:43:50
Subject: Re: Infiniband 40GB

Hi Alexandre, 

If you can run blktrace during your test on one of the OSD data disks 
and send me the results, I can take a look at them. Also, the rados 
bench settings and output would be useful too. 

Thanks, 
Mark 

On 6/6/12 11:05 AM, Alexandre DERUMIER wrote: 
> Hi, I have rebuilt my cluster with ubuntu precise,
>
> -kernel 3.2
> -ceph 0.47.2
> -libc6 2.15
> -3 nodes - 5 osd (xfs) per node and 1 tmpfs with 5 journal files.
>
> I launched rados bench,
> and again I see constant writes to xfs....
>
> Maybe this is related to tmpfs ?
>
>
> I'll retry with kernel 3.4 from Inktank tomorrow.
> I'll also try with the journal on a physical disk with an xfs partition.
> 
> I'll keep you in touch. 
> 
> 
> ----- Original message -----
>
> From: "Mark Nelson"<mark.nelson@inktank.com>
> To: "Alexandre DERUMIER"<aderumier@odiso.com>
> Cc: "Amon Ott"<a.ott@m-privacy.de>, ceph-devel@vger.kernel.org, "Yann Dupont"<Yann.Dupont@univ-nantes.fr>
> Sent: Monday 4 June 2012 14:59:58
> Subject: Re: Infiniband 40GB
> 
> On 6/4/12 6:40 AM, Alexandre DERUMIER wrote: 
>> Hi,
>>
>> I'm currently doing some tests with xfs, debian wheezy with the standard libc6 (2.11.3-3) and a 3.2 kernel.
>>
>> I'm running iostat (3 nodes with 5 osd), and I see constant writes to the disks (as the data are flushed each second from journal to disk).
>>
>> The journal is big enough (20GB tmpfs) to handle 30s of writes.
>>
>> Do you think it's related to the missing syncfs() support?
>>
>> -Alexandre
> 
> Hi Alexandre, 
> 
> I've included some seekwatcher results for rados bench tests using 16
> concurrent 4MB writes on an XFS OSD. One shows ubuntu oneiric and the
> other precise (i.e. no syncfs support vs. syncfs support in libc).
> Unfortunately the original test was on 0.46 and the second test was on
> 0.47.2, so multiple things changed between the tests. Both were tested
> with kernel 3.4. Interestingly, the seeks/second don't seem to drop much,
> but the overall performance has roughly doubled. This was using a single
> 7200rpm disk for the OSD data disk and a separate 7200rpm disk for the
> journal in both cases. I'd definitely try 0.47.2 with a new libc though
> and see how that works for you.
> 
> ceph 0.46/oneiric: 
> http://nhm.ceph.com/movies/mailinglist-tests/xfs-osd0-oneiric-3.4.mpg 
> 
> ceph 0.47.2/precise: 
> http://nhm.ceph.com/movies/mailinglist-tests/xfs-osd0-precise-3.4.mpg 
> 
> Mark 
> 
> 
> 




-- 
Alexandre Derumier
Systems Engineer
Tel : 03 20 68 88 90
Fax : 03 20 68 90 81
45 Bvd du Général Leclerc 59100 Roubaix - France
12 rue Marivaux 75002 Paris - France



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Infiniband 40GB
       [not found] <a81f3855-1c7d-447b-9bbf-6a891e372909@mailpro>
@ 2012-06-07  3:31 ` Alexandre DERUMIER
  2012-06-07 11:25   ` Alexandre DERUMIER
  0 siblings, 1 reply; 36+ messages in thread
From: Alexandre DERUMIER @ 2012-06-07  3:31 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Amon Ott, ceph-devel, Yann Dupont

Hi again,
I have done some tests with journals on a real disk; I see the same behaviour.

iostat shows constant writes to the journal and to the disks at the same time, from the beginning of the benchmark.

Maybe I can try to use a different partition for each journal? (Currently I have 1 partition holding the 5 journal files, one per OSD.)

-Alexandre
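
A hypothetical ceph.conf fragment for one raw partition per journal, instead of several journal files on a shared partition, could look like this (device names and the 20GB size are invented for illustration):

```ini
; hypothetical fragment: one journal partition per OSD
; (device names are assumptions, not from this thread)
[osd.0]
        osd journal = /dev/sdc1
        osd journal size = 20480    ; MB
[osd.1]
        osd journal = /dev/sdc2
        osd journal size = 20480
```

Separate partitions would at least rule out contention between the five journals on one device.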



----- Original message -----

From: "Alexandre DERUMIER" <aderumier@odiso.com>
To: "Mark Nelson" <mark.nelson@inktank.com>
Cc: "Amon Ott" <a.ott@m-privacy.de>, ceph-devel@vger.kernel.org, "Yann Dupont" <Yann.Dupont@univ-nantes.fr>
Sent: Thursday 7 June 2012 05:11:15
Subject: Re: Infiniband 40GB

Hi Mark,
I have attached a blktrace of /dev/sdb1 of node1 (osd.0) 

and also iostat (showing constant writes) 

bench used: 

rados -p pool3 bench 60 write -t 16 


kernel used: 3.4 from Inktank

I'll do tests with the journal on an xfs partition today

----- Original message -----

From: "Mark Nelson" <mark.nelson@inktank.com>
To: "Alexandre DERUMIER" <aderumier@odiso.com>
Cc: "Amon Ott" <a.ott@m-privacy.de>, ceph-devel@vger.kernel.org, "Yann Dupont" <Yann.Dupont@univ-nantes.fr>
Sent: Wednesday 6 June 2012 18:43:50
Subject: Re: Infiniband 40GB

Hi Alexandre, 

If you can run blktrace during your test on one of the OSD data disks 
and send me the results, I can take a look at them. Also, the rados 
bench settings and output would be useful too. 

Thanks, 
Mark 

On 6/6/12 11:05 AM, Alexandre DERUMIER wrote: 
> Hi, I have rebuilt my cluster with ubuntu precise,
>
> -kernel 3.2
> -ceph 0.47.2
> -libc6 2.15
> -3 nodes - 5 osd (xfs) per node and 1 tmpfs with 5 journal files.
>
> I launched rados bench,
> and again I see constant writes to xfs....
>
> Maybe this is related to tmpfs ?
>
>
> I'll retry with kernel 3.4 from Inktank tomorrow.
> I'll also try with the journal on a physical disk with an xfs partition.
> 
> I'll keep you in touch. 
> 
> 
> ----- Original message -----
>
> From: "Mark Nelson"<mark.nelson@inktank.com>
> To: "Alexandre DERUMIER"<aderumier@odiso.com>
> Cc: "Amon Ott"<a.ott@m-privacy.de>, ceph-devel@vger.kernel.org, "Yann Dupont"<Yann.Dupont@univ-nantes.fr>
> Sent: Monday 4 June 2012 14:59:58
> Subject: Re: Infiniband 40GB
> 
> On 6/4/12 6:40 AM, Alexandre DERUMIER wrote: 
>> Hi,
>>
>> I'm currently doing some tests with xfs, debian wheezy with the standard libc6 (2.11.3-3) and a 3.2 kernel.
>>
>> I'm running iostat (3 nodes with 5 osd), and I see constant writes to the disks (as the data are flushed each second from journal to disk).
>>
>> The journal is big enough (20GB tmpfs) to handle 30s of writes.
>>
>> Do you think it's related to the missing syncfs() support?
>>
>> -Alexandre
> 
> Hi Alexandre, 
> 
> I've included some seekwatcher results for rados bench tests using 16
> concurrent 4MB writes on an XFS OSD. One shows ubuntu oneiric and the
> other precise (i.e. no syncfs support vs. syncfs support in libc).
> Unfortunately the original test was on 0.46 and the second test was on
> 0.47.2, so multiple things changed between the tests. Both were tested
> with kernel 3.4. Interestingly, the seeks/second don't seem to drop much,
> but the overall performance has roughly doubled. This was using a single
> 7200rpm disk for the OSD data disk and a separate 7200rpm disk for the
> journal in both cases. I'd definitely try 0.47.2 with a new libc though
> and see how that works for you.
> 
> ceph 0.46/oneiric: 
> http://nhm.ceph.com/movies/mailinglist-tests/xfs-osd0-oneiric-3.4.mpg 
> 
> ceph 0.47.2/precise: 
> http://nhm.ceph.com/movies/mailinglist-tests/xfs-osd0-precise-3.4.mpg 
> 
> Mark 
> 
> 
> 




-- 
Alexandre Derumier
Systems Engineer
Tel : 03 20 68 88 90
Fax : 03 20 68 90 81
45 Bvd du Général Leclerc 59100 Roubaix - France
12 rue Marivaux 75002 Paris - France




^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2012-06-07 17:15 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-06-03  8:10 Infiniband 40GB Stefan Priebe
2012-06-03 12:56 ` Mark Nelson
2012-06-04  6:22   ` Hannes Reinecke
2012-06-04  7:26     ` Stefan Priebe - Profihost AG
2012-06-04  7:39       ` Hannes Reinecke
2012-06-04  7:53         ` Stefan Priebe - Profihost AG
2012-06-04  8:02           ` Hannes Reinecke
2012-06-04  8:23             ` Stefan Majer
2012-06-04  9:21               ` Yann Dupont
2012-06-04  9:35                 ` Alexandre DERUMIER
2012-06-04  9:53                   ` Yann Dupont
2012-06-04  9:47                 ` Amon Ott
2012-06-04  9:58                   ` Yann Dupont
2012-06-04 11:40                   ` Alexandre DERUMIER
2012-06-04 12:59                     ` Mark Nelson
2012-06-04 13:07                       ` Alexandre DERUMIER
2012-06-04 13:28                         ` Mark Nelson
2012-06-04 15:11                           ` Gregory Farnum
2012-06-04 15:34                             ` Mark Nelson
2012-06-06 16:05                       ` Alexandre DERUMIER
2012-06-06 16:43                         ` Mark Nelson
2012-06-04 15:42                   ` Stefan Priebe
2012-06-05  7:08                     ` Amon Ott
2012-06-05  7:46                       ` Stefan Priebe - Profihost AG
2012-06-06 10:48                   ` Stefan Priebe - Profihost AG
2012-06-06 10:57                     ` Amon Ott
2012-06-06 11:02                       ` Stefan Priebe - Profihost AG
2012-06-07 11:33                         ` Amon Ott
2012-06-07 12:44                           ` Stefan Priebe - Profihost AG
2012-06-05  8:54               ` Stefan Priebe - Profihost AG
2012-06-04 12:28     ` Mark Nelson
2012-06-04 12:34       ` Tomasz Paszkowski
2012-06-04 12:40         ` Mark Nelson
     [not found] <a81f3855-1c7d-447b-9bbf-6a891e372909@mailpro>
2012-06-07  3:31 ` Alexandre DERUMIER
2012-06-07 11:25   ` Alexandre DERUMIER
2012-06-07 17:15     ` Mark Nelson

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.