* flashcache
@ 2013-01-16 21:22 Gandalf Corvotempesta
  2013-01-16 21:29 ` flashcache Sage Weil
  0 siblings, 1 reply; 34+ messages in thread
From: Gandalf Corvotempesta @ 2013-01-16 21:22 UTC (permalink / raw)
  To: ceph-devel

In a Ceph cluster, is flashcache with writeback considered safe?
In case of an SSD failure, the flashcache contents should already have been
replicated (by Ceph) to other servers, right?

I'm planning to use this configuration: a Supermicro chassis with 12 spinning
disks and 2 SSDs.
6 spinning disks will have their Ceph journal on SSD1, the other 6 disks
will have their Ceph journal on SSD2.

One OSD for each spinning disk (a single XFS filesystem for the whole disk).
XFS metadata to a partition of SSD1
XFS flashcache to another partition of SSD1

So, 3 partitions for each OSD on the SSD.
How big should these partitions be? Any advice?

No RAID at all, except for one RAID-1 volume made from a 10 GB partition
on each SSD, for the OS. Log files will be replicated to a remote
server, so writes on the OS partitions are very low.

Any hints? Advice? Criticism?
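
For what it's worth, here is a rough sizing sketch in Python for one SSD in
this layout; the journal, XFS metadata and OS partition sizes below are
assumptions on my part, not recommendations:

# Back-of-envelope layout for one SSD serving 6 OSDs (all sizes are assumptions).
ssd_gb = 240            # assumed usable SSD capacity
os_raid1_gb = 10        # RAID-1 member partition for the OS
osds_per_ssd = 6
journal_gb = 5          # Ceph journal per OSD (assumed)
xfs_meta_gb = 1         # XFS metadata/log partition per OSD (assumed)

reserved = os_raid1_gb + osds_per_ssd * (journal_gb + xfs_meta_gb)
flashcache_gb = (ssd_gb - reserved) / osds_per_ssd   # split what is left evenly

print(f"partitions on this SSD: {1 + osds_per_ssd * 3}")
print(f"per OSD: journal {journal_gb} GB, XFS metadata {xfs_meta_gb} GB, "
      f"flashcache {flashcache_gb:.1f} GB")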


* Re: flashcache
  2013-01-16 21:22 flashcache Gandalf Corvotempesta
@ 2013-01-16 21:29 ` Sage Weil
  2013-01-16 21:42   ` flashcache Gandalf Corvotempesta
  2013-01-16 21:53   ` flashcache Mark Nelson
  0 siblings, 2 replies; 34+ messages in thread
From: Sage Weil @ 2013-01-16 21:29 UTC (permalink / raw)
  To: Gandalf Corvotempesta; +Cc: ceph-devel

On Wed, 16 Jan 2013, Gandalf Corvotempesta wrote:
> In a Ceph cluster, is flashcache with writeback considered safe?
> In case of an SSD failure, the flashcache contents should already have been
> replicated (by Ceph) to other servers, right?

This sort of configuration effectively bundles the disk and SSD into a 
single unit, where the failure of either results in the loss of both.  
From Ceph's perspective, it doesn't matter if the thing it is sitting on 
is a single disk, an SSD+disk flashcache thing, or a big RAID array.  All 
that changes is the probability of failure.

The thing to watch out for is *knowing* that the whole is lost when one part
fails (vs plowing ahead with a corrupt fs).

> I'm planning to use this configuration: a Supermicro chassis with 12 spinning
> disks and 2 SSDs.
> 6 spinning disks will have their Ceph journal on SSD1, the other 6 disks
> will have their Ceph journal on SSD2.
> 
> One OSD for each spinning disk (a single XFS filesystem for the whole disk).
> XFS metadata to a partition of SSD1
> XFS flashcache to another partition of SSD1
> 
> So, 3 partitions for each OSD on the SSD.
> How big should these partitions be? Any advice?
> 
> No RAID at all, except for one RAID-1 volume made from a 10 GB partition
> on each SSD, for the OS. Log files will be replicated to a remote
> server, so writes on the OS partitions are very low.
> 
> Any hints? Advice? Criticism?

I would worry that there is a lot of stuff piling onto the SSD and it may
become your bottleneck.  My guess is that another 1-2 SSDs will be a
better 'balance', but only experimentation will really tell us that.

Otherwise, those seem to all be good things to put on the SSD!

sage


* Re: flashcache
  2013-01-16 21:29 ` flashcache Sage Weil
@ 2013-01-16 21:42   ` Gandalf Corvotempesta
  2013-01-16 21:46     ` flashcache Sage Weil
  2013-01-16 21:53   ` flashcache Mark Nelson
  1 sibling, 1 reply; 34+ messages in thread
From: Gandalf Corvotempesta @ 2013-01-16 21:42 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

2013/1/16 Sage Weil <sage@inktank.com>:
> This sort of configuration effectively bundles the disk and SSD into a
> single unit, where the failure of either results in the loss of both.
> From Ceph's perspective, it doesn't matter if the thing it is sitting on
> is a single disk, an SSD+disk flashcache thing, or a big RAID array.  All
> that changes is the probability of failure.

Ok, it will fail, but this should not be an issue in a cluster like
Ceph, right?
With or without flashcache or an SSD, Ceph should be able to handle
disk/node/OSD failures on its own by replicating in real time to
multiple servers.

Should I worry about losing data in case of a failure? It should
rebalance automatically with no data loss.

> I would worry that there is a lot of stuff piling onto the SSD and it may
> become your bottleneck.  My guess is that another 1-2 SSDs will be a
> better 'balance', but only experimentation will really tell us that.
>
> Otherwise, those seem to all be good things to put on the SSD!

I can't add more than 2 SSDs; I don't have enough space.
I can move the OS to the first 2 spinning disks in software RAID-1, if this
will improve SSD performance.

What about swap? I'm thinking of using no swap at all and starting with 16/32 GB of RAM.
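
For what it's worth, a rough RAM sketch; the per-OSD figure below is an
assumption for recovery/backfill peaks, steady state is usually lower:

# Rough memory budget for one node with 12 OSDs (all figures are assumptions).
osds = 12
ram_per_osd_gb = 1.0      # assumed worst case during recovery/backfill
os_overhead_gb = 2.0      # OS, monitoring agents, etc. (assumed)

needed = osds * ram_per_osd_gb + os_overhead_gb
for total_gb in (16, 32):
    print(f"{total_gb} GB RAM -> ~{total_gb - needed:.0f} GB left for page cache")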


* Re: flashcache
  2013-01-16 21:42   ` flashcache Gandalf Corvotempesta
@ 2013-01-16 21:46     ` Sage Weil
  2013-01-16 21:55       ` flashcache Mark Nelson
  2013-01-16 21:57       ` flashcache Gandalf Corvotempesta
  0 siblings, 2 replies; 34+ messages in thread
From: Sage Weil @ 2013-01-16 21:46 UTC (permalink / raw)
  To: Gandalf Corvotempesta; +Cc: ceph-devel

On Wed, 16 Jan 2013, Gandalf Corvotempesta wrote:
> 2013/1/16 Sage Weil <sage@inktank.com>:
> > This sort of configuration effectively bundles the disk and SSD into a
> > single unit, where the failure of either results in the loss of both.
> > From Ceph's perspective, it doesn't matter if the thing it is sitting on
> > is a single disk, an SSD+disk flashcache thing, or a big RAID array.  All
> > that changes is the probability of failure.
> 
> Ok, it will fail, but this should not be an issue in a cluster like
> Ceph, right?
> With or without flashcache or an SSD, Ceph should be able to handle
> disk/node/OSD failures on its own by replicating in real time to
> multiple servers.

Exactly.

> Should I worry about losing data in case of a failure? It should
> rebalance automatically with no data loss.

You should not worry, except to the extent that 2 might fail 
simultaneously, and failures in general are not good things.

> > I would worry that there is a lot of stuff piling onto the SSD and it may
> > become your bottleneck.  My guess is that another 1-2 SSDs will be a
> > better 'balance', but only experimentation will really tell us that.
> >
> > Otherwise, those seem to all be good things to put on the SSD!
> 
> I can't add more than 2 SSDs; I don't have enough space.
> I can move the OS to the first 2 spinning disks in software RAID-1, if this
> will improve SSD performance.
> 
> What about swap? I'm thinking of using no swap at all and starting with
> 16/32 GB of RAM.

You could use the first (single) disk for the OS and logs.  You might not even
bother with RAID-1, since you will presumably be replicating across hosts.
When the OS disk dies, you can re-run your chef/juju/puppet rule or
whatever provisioning tool is at work to reinstall/configure the OS disk.
The data on the SSDs and data disks will all be intact.

sage


* Re: flashcache
  2013-01-16 21:29 ` flashcache Sage Weil
  2013-01-16 21:42   ` flashcache Gandalf Corvotempesta
@ 2013-01-16 21:53   ` Mark Nelson
  2013-01-16 22:04     ` flashcache Gandalf Corvotempesta
                       ` (2 more replies)
  1 sibling, 3 replies; 34+ messages in thread
From: Mark Nelson @ 2013-01-16 21:53 UTC (permalink / raw)
  To: Sage Weil; +Cc: Gandalf Corvotempesta, ceph-devel

On 01/16/2013 03:29 PM, Sage Weil wrote:
> On Wed, 16 Jan 2013, Gandalf Corvotempesta wrote:
>> In a ceph cluster, flashcache with writeback is considered safe?
>> In case of SSD failure, the flashcache contents should be already been
>> replicated (by ceph) in other servers, right?
>
> This sort of configuration effectively bundles the disk and SSD into a
> single unit, where the failure of either results in the loss of both.
>  From Ceph's perspective, it doesn't matter if the thing it is sitting on
> is a single disk, an SSD+disk flashcache thing, or a big RAID array.  All
> that changes is the probability of failure.
>
> The thing to watch out for is *knowing* that the whole is lost when one part
> fails (vs plowing ahead with a corrupt fs).
>
>> I'm planning to use this configuration: Supermicro with 12 spinning
>> disks e 2 SSD.
>> 6 spinning disks will have ceph journal on SSD1, the other 6 disks
>> will have ceph journal on disks2.
>>
>> One OSD for each spinning disk (a single XFS filesystem for the whole disk).
>> XFS metadata to a parition of SSD1
>> XFS flashcache to another partition of SSD1
>>
>> So, 3 partitions for each OSD on the SSD.
>> How big should be these partitions? Any advice?
>>
>> No raid at all, except for 1 RAID-1 volume made with a 10GB partitions
>> on each SSD, for the OS. Log files will be replicated to a remote
>> server, so writes on OS partitions are very very low.
>>
>> Any hint? Adivice? Critics?

Looks like a fun configuration to test!  Having said that, I have no
idea how stable flashcache is.  It's certainly not something we've used
in production before!  Keep that in mind.

With only 2 SSDs for 12 spinning disks, you'll need to make sure the
SSDs are really fast.  I use Intel 520s for testing, which are great, but
I wouldn't use them in production.  The S3700 might be a good bet at
larger sizes, but it looks like the 100GB version is a lot slower than
the 200GB version, and that's still a bit slower than the 400GB version.
Assuming you have 10GbE, you'll probably be capped by the SSDs for
large block sequential workloads.  Having said that, I still think this
has potential to be a nice setup.  Just be aware that we usually don't
stick that much stuff on the SSDs!
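
A back-of-envelope check on where that cap lands; the drive write speeds
below are rough assumptions on my part, not datasheet values:

# Rough write-path ceiling per node: every client write hits an SSD journal
# and a data disk, so 6 journals share each SSD's write bandwidth
# (flashcache writeback adds yet more SSD writes, so this is optimistic).
ssd_seq_write_mb = {"small SSD (assumed)": 200, "mid SSD (assumed)": 365,
                    "large SSD (assumed)": 460}
hdd_seq_write_mb = 120
osds_per_ssd, ssds = 6, 2
net_10gbe_mb = 1200                       # ~usable 10GbE bandwidth

for name, ssd_mb in ssd_seq_write_mb.items():
    journal_cap = ssds * ssd_mb
    disk_cap = ssds * osds_per_ssd * hdd_seq_write_mb
    cap = min(journal_cap, disk_cap, net_10gbe_mb)
    print(f"{name}: node capped at ~{cap} MB/s "
          f"(journals {journal_cap}, disks {disk_cap}, 10GbE {net_10gbe_mb})")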

>
> I would worry that there is a lot of stuff piling onto the SSD and it may
> become your bottleneck.  My guess is that another 1-2 SSDs will be a
> better 'balance', but only experimentation will really tell us that.
>

It'd be amazing if Supermicro could cram another 2 SSD slots in the
back.  Maybe by that time we'll all be using PCIe flash storage though. :)

> Otherwise, those seem to all be good things to put on the SSD!
>
> sage



* Re: flashcache
  2013-01-16 21:46     ` flashcache Sage Weil
@ 2013-01-16 21:55       ` Mark Nelson
  2013-01-16 21:59         ` flashcache Gandalf Corvotempesta
  2013-01-16 21:57       ` flashcache Gandalf Corvotempesta
  1 sibling, 1 reply; 34+ messages in thread
From: Mark Nelson @ 2013-01-16 21:55 UTC (permalink / raw)
  To: Sage Weil; +Cc: Gandalf Corvotempesta, ceph-devel

On 01/16/2013 03:46 PM, Sage Weil wrote:
> On Wed, 16 Jan 2013, Gandalf Corvotempesta wrote:
>> 2013/1/16 Sage Weil <sage@inktank.com>:
>>> This sort of configuration effectively bundles the disk and SSD into a
>>> single unit, where the failure of either results in the loss of both.
>>>  From Ceph's perspective, it doesn't matter if the thing it is sitting on
>>> is a single disk, an SSD+disk flashcache thing, or a big RAID array.  All
>>> that changes is the probability of failure.
>>
>> Ok, it will fail, but this should not be an issue, in a cluster like
>> ceph, right?
>> With or without flashcache or SSD, ceph should be able to handle
>> disks/nodes/osds failures on its own by replicating in real time to
>> multiple server.
>
> Exactly.
>
>> Should I worry about loosing data in case of failure? It should
>> rebalance automatically in case of failure with no data loss.
>
> You should not worry, except to the extent that 2 might fail
> simultaneously, and failures in general are not good things.
>
>>> I would worry that there is a lot of stuff piling onto the SSD and it may
>>> become your bottleneck.  My guess is that another 1-2 SSDs will be a
>>> better 'balance', but only experiementation will really tell us that.
>>>
>>> Otherwise, those seem to all be good things to put on teh SSD!
>>
>> I can't add more than 2 SSD, I don't have enough space.
>> I can move OS to the first 2 spinning disks in raid1 software, if this
>> will improve performance of SSD
>>
>> What about swap? I'm thinking to no use swap at all and start with
>> 16/32GB RAM
>
> You could use the first (single) disk for os and logs.  You might not even
> bother with raid1, since you will presumably be replicating across hosts.
> When the OSD disk dies, you can re-run your chef/juju/puppet rule or
> whatever provisioning tool is at work to reinstall/configure the OS disk.
> The data on the SSDs and data disks will all be intact.

Other options might be network boot or even USB stick boot.

>
> sage



* Re: flashcache
  2013-01-16 21:46     ` flashcache Sage Weil
  2013-01-16 21:55       ` flashcache Mark Nelson
@ 2013-01-16 21:57       ` Gandalf Corvotempesta
  1 sibling, 0 replies; 34+ messages in thread
From: Gandalf Corvotempesta @ 2013-01-16 21:57 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

2013/1/16 Sage Weil <sage@inktank.com>:
> You should not worry, except to the extent that 2 might fail
> simultaneously, and failures in general are not good things.

Are you talking about 2 OSDs holding the same replicated data failing together?
Ok, but that's very, very unlikely.

> You could use the first (single) disk for os and logs.  You might not even
> bother with raid1, since you will presumably be replicating across hosts.
> When the OSD disk dies, you can re-run your chef/juju/puppet rule or
> whatever provisioning tool is at work to reinstall/configure the OS disk.
> The data on the SSDs and data disks will all be intact.

An OS failure will result in the whole node going down. This will cause a
rebalance of many TB of data that I prefer to avoid, if possible.
I also prefer to avoid dedicated disks for the OS; I'd lose too much space.
With the OS on a RAID-1 partition of the first and second spinning disks
(the same disks will also act as OSDs with another, bigger, partition),
an OS disk failure will not bring anything down and I can hot-replace the
failed disk.

RAID-1 will handle the OS reconstruction, Ceph will handle the cluster rebalance.
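
To put a rough number on the rebalance concern (disk size, fill level and
link speed below are assumptions):

# How much data moves if one whole node drops out (all inputs are assumptions).
disks_per_node = 12
disk_tb = 4.0
fill = 0.5              # cluster 50% full
link_mb_s = 250         # ~2 Gb/s bonded, treated as the effective bottleneck

rebalance_tb = disks_per_node * disk_tb * fill   # the lost node's share of data
hours = rebalance_tb * 1e6 / link_mb_s / 3600    # single-link worst case
print(f"~{rebalance_tb:.0f} TB to re-replicate, ~{hours:.0f} h at {link_mb_s} MB/s")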


* Re: flashcache
  2013-01-16 21:55       ` flashcache Mark Nelson
@ 2013-01-16 21:59         ` Gandalf Corvotempesta
  0 siblings, 0 replies; 34+ messages in thread
From: Gandalf Corvotempesta @ 2013-01-16 21:59 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Sage Weil, ceph-devel

2013/1/16 Mark Nelson <mark.nelson@inktank.com>:
> Other options might be network boot or even usb stick boot.

GREAT IDEA!
I'll go with a network boot.


* Re: flashcache
  2013-01-16 21:53   ` flashcache Mark Nelson
@ 2013-01-16 22:04     ` Gandalf Corvotempesta
  2013-01-17  5:47     ` flashcache Stefan Priebe - Profihost AG
  2013-01-17  9:46     ` flashcache Gandalf Corvotempesta
  2 siblings, 0 replies; 34+ messages in thread
From: Gandalf Corvotempesta @ 2013-01-16 22:04 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Sage Weil, ceph-devel

2013/1/16 Mark Nelson <mark.nelson@inktank.com>:
> Looks like a fun configuration to test!  Having said that, I have no idea
> how stable flashcache is.  It's certainly not something we've used in
> production before!  Keep that in mind.

As I wrote before, this should not be an issue; Ceph should handle failures.
But is Ceph able to detect badly written blocks?

> With only 2 SSDs for 12 spinning disks, you'll need to make sure the SSDs
> are really fast.  I use Intel 520s for testing which are great, but I
> wouldn't use them in  production.  The S3700 might be a good bet at larger
> sizes, but it looks like the 100GB version is a lot slower than the 200GB
> version, and that's still a bit slower than the 400GB version.  Assuming you
> have 10GbE, you'll probably be capped by the SSDs for large block sequential
> workloads.  Having said that, I still think this has potential to be a nice
> setup.  Just be aware that we usually don't stick that much stuff on the
> SSDs!

12 spinning disks will be the worst-case scenario.
When in production I'll start with 5 servers, with 6 disks each (3+3
per SSD).
Then I'll try to add hosts instead of adding disks.
I don't currently have 10GbE, only 2x GbE bonded. I can evaluate 4x GbE.
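
A quick sanity check on bonded GbE versus the spindles (link counts and
per-disk speeds are assumptions):

# Compare the network ceiling with aggregate disk bandwidth per node (assumptions).
hdd_seq_mb = 120                 # per spinning disk, assumed
usable_mb_per_gbe = 118          # ~usable MB/s per bonded 1GbE link, assumed

for links, disks in ((2, 6), (4, 6), (4, 12)):
    net_mb = links * usable_mb_per_gbe
    disk_mb = disks * hdd_seq_mb
    bound = "network" if net_mb < disk_mb else "disk"
    print(f"{links}x GbE, {disks} disks: net ~{net_mb} MB/s vs "
          f"disks ~{disk_mb} MB/s -> {bound}-bound")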

> It'd be amazing if supermicro could cram another 2 SSD slots in the back.
> Maybe by that time we'll all be using PCIE flash storage though. :)

How much does PCIe storage cost?


* Re: flashcache
  2013-01-16 21:53   ` flashcache Mark Nelson
  2013-01-16 22:04     ` flashcache Gandalf Corvotempesta
@ 2013-01-17  5:47     ` Stefan Priebe - Profihost AG
  2013-01-17 13:34       ` flashcache Mark Nelson
  2013-01-17  9:46     ` flashcache Gandalf Corvotempesta
  2 siblings, 1 reply; 34+ messages in thread
From: Stefan Priebe - Profihost AG @ 2013-01-17  5:47 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Sage Weil, Gandalf Corvotempesta, ceph-devel

Hi Mark,

On 16.01.2013 at 22:53, Mark Nelson wrote:
> With only 2 SSDs for 12 spinning disks, you'll need to make sure the SSDs are really fast.  I use Intel 520s for testing which are great, but I wouldn't use them in  production.

Why not? I use them for an SSD-only Ceph cluster.

Stefan


* Re: flashcache
  2013-01-16 21:53   ` flashcache Mark Nelson
  2013-01-16 22:04     ` flashcache Gandalf Corvotempesta
  2013-01-17  5:47     ` flashcache Stefan Priebe - Profihost AG
@ 2013-01-17  9:46     ` Gandalf Corvotempesta
  2013-01-17 13:32       ` flashcache Joseph Glanville
  2 siblings, 1 reply; 34+ messages in thread
From: Gandalf Corvotempesta @ 2013-01-17  9:46 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Sage Weil, ceph-devel

2013/1/16 Mark Nelson <mark.nelson@inktank.com>:
> Assuming you have 10GbE, you'll probably be capped by the SSDs for
> large block sequential workloads.  Having said that, I still think this has
> potential to be a nice setup.  Just be aware that we usually don't stick
> that much stuff on the SSDs!

10GbE for which network, the OSD network or the client network? I think you
are talking about the OSD network.

I'm evaluating a small InfiniBand 10Gb or 20Gb network, but I'm still not
sure whether I should build the OSD network in each node using 2 IB cards.
What happens when the OSD network fails on a single node but that node is
still reachable from the client network?

Will Ceph remove it from the cluster?

I don't know whether to use a single two-port IB card (switch redundancy
but no card redundancy), two single-port cards, or just a single one-port card.


* Re: flashcache
  2013-01-17  9:46     ` flashcache Gandalf Corvotempesta
@ 2013-01-17 13:32       ` Joseph Glanville
  2013-01-17 13:37         ` flashcache Mark Nelson
  0 siblings, 1 reply; 34+ messages in thread
From: Joseph Glanville @ 2013-01-17 13:32 UTC (permalink / raw)
  To: Gandalf Corvotempesta; +Cc: Mark Nelson, Sage Weil, ceph-devel

On 17 January 2013 20:46, Gandalf Corvotempesta
<gandalf.corvotempesta@gmail.com> wrote:
> 2013/1/16 Mark Nelson <mark.nelson@inktank.com>:

> I don't know if I have to use a single two port IB card (switch
> redundancy and no card redundancy) or
> I have to use two single port cards. (or a single one port IB?)

On the topic of IB..

But slightly off-topic all the same: I would love to attempt getting
Ceph running on rsockets if I could find the time (alas, we don't run
Ceph).
rsockets is a fully userland implementation of BSD sockets over RDMA,
supporting fork and all the usual goodies. In theory, unless you are
using the kernel RBD module (or the kernel FS module, etc.) you should
be able to run it on rsockets and enjoy a considerable performance
increase.

rsockets is available in the librdmacm git up on OpenFabrics, and dev
+ support happens on the linux-rdma list.

-- 
CTO | Orion Virtualisation Solutions | www.orionvm.com.au
Phone: 1300 56 99 52 | Mobile: 0428 754 846


* Re: flashcache
  2013-01-17  5:47     ` flashcache Stefan Priebe - Profihost AG
@ 2013-01-17 13:34       ` Mark Nelson
  0 siblings, 0 replies; 34+ messages in thread
From: Mark Nelson @ 2013-01-17 13:34 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG; +Cc: Sage Weil, Gandalf Corvotempesta, ceph-devel

On 01/16/2013 11:47 PM, Stefan Priebe - Profihost AG wrote:
> Hi Mark,
>
> On 16.01.2013 at 22:53, Mark Nelson wrote:
>> With only 2 SSDs for 12 spinning disks, you'll need to make sure the SSDs are really fast.  I use Intel 520s for testing which are great, but I wouldn't use them in  production.
>
> Why not? I use them for a ssd only ceph cluster.
>
> Stefan

It's pretty tough to get an apples-to-apples comparison of endurance
when looking at the Intel 520 vs something like the DC S3700.  If I were
actually building out a production deployment I'd probably stick with
the DC S3700 (especially if sticking journals, flashcache, and XFS
logs for 6 OSDs on 1 drive!).  There's probably a reasonable
endurance-per-cost argument for a severely under-subscribed 520 (or
other similar drive) as well.  It'd be an interesting study to look at
how long it takes a small enterprise drive to die vs a larger
under-subscribed consumer drive.
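
A purely illustrative endurance comparison; the TBW ratings and daily write
volume below are assumptions, so check the actual datasheets:

# Years of life ~= rated endurance / daily write volume (all numbers assumed).
drives_tbw = {"consumer-class drive (assumed)": 36,       # total TB written rating
              "enterprise-class drive (assumed)": 1800}   # ~10 drive-writes/day class
daily_write_tb = 0.5    # journals + flashcache + XFS metadata for 6 OSDs, assumed

for name, tbw in drives_tbw.items():
    years = tbw / daily_write_tb / 365
    print(f"{name}: ~{years:.1f} years at {daily_write_tb} TB/day")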

-- 
Mark Nelson
Performance Engineer
Inktank


* Re: flashcache
  2013-01-17 13:32       ` flashcache Joseph Glanville
@ 2013-01-17 13:37         ` Mark Nelson
  2013-01-17 13:44           ` flashcache Gandalf Corvotempesta
  2013-01-17 14:30           ` flashcache Atchley, Scott
  0 siblings, 2 replies; 34+ messages in thread
From: Mark Nelson @ 2013-01-17 13:37 UTC (permalink / raw)
  To: Joseph Glanville; +Cc: Gandalf Corvotempesta, Sage Weil, ceph-devel

On 01/17/2013 07:32 AM, Joseph Glanville wrote:
> On 17 January 2013 20:46, Gandalf Corvotempesta
> <gandalf.corvotempesta@gmail.com>  wrote:
>> 2013/1/16 Mark Nelson<mark.nelson@inktank.com>:
>
>> I don't know if I have to use a single two port IB card (switch
>> redundancy and no card redundancy) or
>> I have to use two single port cards. (or a single one port IB?)
>
> On the topic of IB..
>
> But slightly off-topic all the same..  I would love to attempt getting
> Ceph running on rsockets if I could find the time (alas we don't run
> Ceph).
> rsockets is a fully userland implementation of BSD sockets over RDMA,
> supporting fork and all the usual goodies, in theory unless you are
> using the kernel RBD module (of the kernel FS module etc) you should
> be able to run it on rsockets and enjoy a considerable performance
> increase.
>
> rsockets is available in the librdmacm git up on Open Fabrics and dev
> + support happens on the linux-rdma list.
>

There's been some talk about rsockets on the list before.  I think there
are a couple of different folks that have tried (succeeded?) in getting
it working.  Barring that, it sounds like if you tune interrupt affinity
settings and various other bits you can get IPoIB up into the 2GB/s+
range, which, while not RDMA speed, is at least better than 10GbE.

-- 
Mark Nelson
Performance Engineer
Inktank


* Re: flashcache
  2013-01-17 13:37         ` flashcache Mark Nelson
@ 2013-01-17 13:44           ` Gandalf Corvotempesta
  2013-01-17 14:30           ` flashcache Atchley, Scott
  1 sibling, 0 replies; 34+ messages in thread
From: Gandalf Corvotempesta @ 2013-01-17 13:44 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Joseph Glanville, Sage Weil, ceph-devel

2013/1/17 Mark Nelson <mark.nelson@inktank.com>:
> There's been some talk about rsockets on the list before.  I think there are
> a couple of different folks that have tried (succeeded?) in getting it
> working.  barring that, it sounds like if you tune interrupt affinity
> settings and various other bits you can get IPoIB up into the 2GB/s+ range
> which while not RDMA speed, is at least better than 10GbE.

Moreover, an SDR/DDR IB switch is much cheaper than a full 10GbE switch.


* Re: flashcache
  2013-01-17 13:37         ` flashcache Mark Nelson
  2013-01-17 13:44           ` flashcache Gandalf Corvotempesta
@ 2013-01-17 14:30           ` Atchley, Scott
  2013-01-17 14:48             ` flashcache Gandalf Corvotempesta
  1 sibling, 1 reply; 34+ messages in thread
From: Atchley, Scott @ 2013-01-17 14:30 UTC (permalink / raw)
  To: Mark Nelson
  Cc: Joseph Glanville, Gandalf Corvotempesta, Sage Weil, ceph-devel

On Jan 17, 2013, at 8:37 AM, Mark Nelson <mark.nelson@inktank.com> wrote:

> On 01/17/2013 07:32 AM, Joseph Glanville wrote:
>> On 17 January 2013 20:46, Gandalf Corvotempesta
>> <gandalf.corvotempesta@gmail.com>  wrote:
>>> 2013/1/16 Mark Nelson<mark.nelson@inktank.com>:
>> 
>>> I don't know if I have to use a single two port IB card (switch
>>> redundancy and no card redundancy) or
>>> I have to use two single port cards. (or a single one port IB?)
>> 
>> On the topic of IB..
>> 
>> But slightly off-topic all the same..  I would love to attempt getting
>> Ceph running on rsockets if I could find the time (alas we don't run
>> Ceph).
>> rsockets is a fully userland implementation of BSD sockets over RDMA,
>> supporting fork and all the usual goodies, in theory unless you are
>> using the kernel RBD module (of the kernel FS module etc) you should
>> be able to run it on rsockets and enjoy a considerable performance
>> increase.
>> 
>> rsockets is available in the librdmacm git up on Open Fabrics and dev
>> + support happens on the linux-rdma list.
>> 
> 
> There's been some talk about rsockets on the list before.  I think there 
> are a couple of different folks that have tried (succeeded?) in getting 
> it working.  barring that, it sounds like if you tune interrupt affinity 
> settings and various other bits you can get IPoIB up into the 2GB/s+ 
> range which while not RDMA speed, is at least better than 10GbE.

IB DDR should get you close to 2 GB/s with IPoIB. I have gotten our IB QDR PCI-E Gen. 2 up to 2.8 GB/s measured via netperf with lots of tuning. Since it uses the traditional socket stack through the kernel, CPU usage will be as high as (or, with QDR, higher than) 10GbE.

I would be interested in seeing if rsockets helps ceph. I have questions though.

By default, rsockets still has to copy data in and out. It has extensions for zero-copy, but they do not work with non-blocking sockets. Does ceph use non-blocking sockets?

How many simultaneous connections does it support before falling over into the non-rsockets path?

If rsockets has a ceiling on connections and falls over to the non-socket path, can the application determine if it is safe to use the zero-copy extensions for a specific connection or do they fail gracefully?

Scott


* Re: flashcache
  2013-01-17 14:30           ` flashcache Atchley, Scott
@ 2013-01-17 14:48             ` Gandalf Corvotempesta
  2013-01-17 15:00               ` flashcache Atchley, Scott
  0 siblings, 1 reply; 34+ messages in thread
From: Gandalf Corvotempesta @ 2013-01-17 14:48 UTC (permalink / raw)
  To: Atchley, Scott; +Cc: Mark Nelson, Joseph Glanville, Sage Weil, ceph-devel

2013/1/17 Atchley, Scott <atchleyes@ornl.gov>:
> IB DDR should get you close to 2 GB/s with IPoIB. I have gotten our IB QDR PCI-E Gen. 2 up to 2.8 GB/s measured via netperf with lots of tuning. Since it uses the traditional socket stack through the kernel, CPU usage will be as high (or higher if QDR) than 10GbE.

What kind of tuning? Do you have a paper about this?

But is it actually possible to use Ceph with IPoIB in a stable way, or
is it experimental?
I don't know whether rsockets support is experimental/untested
and IPoIB is the stable workaround, or what else.

And is a dual controller needed on each OSD node? Is Ceph able to
handle OSD network failures? This is really important to know; it
changes the whole network topology.


* Re: flashcache
  2013-01-17 14:48             ` flashcache Gandalf Corvotempesta
@ 2013-01-17 15:00               ` Atchley, Scott
  2013-01-17 15:07                 ` flashcache Andrey Korolyov
  2013-01-17 15:14                 ` flashcache Gandalf Corvotempesta
  0 siblings, 2 replies; 34+ messages in thread
From: Atchley, Scott @ 2013-01-17 15:00 UTC (permalink / raw)
  To: Gandalf Corvotempesta
  Cc: Mark Nelson, Joseph Glanville, Sage Weil, ceph-devel

On Jan 17, 2013, at 9:48 AM, Gandalf Corvotempesta <gandalf.corvotempesta@gmail.com> wrote:

> 2013/1/17 Atchley, Scott <atchleyes@ornl.gov>:
>> IB DDR should get you close to 2 GB/s with IPoIB. I have gotten our IB QDR PCI-E Gen. 2 up to 2.8 GB/s measured via netperf with lots of tuning. Since it uses the traditional socket stack through the kernel, CPU usage will be as high (or higher if QDR) than 10GbE.
> 
> Which kind of tuning? Do you have a paper about this?

No, I followed the Mellanox tuning guide and modified their interrupt affinity scripts.

> But, actually, is possible to use ceph with IPoIB in a stable way or
> is this experimental ?

IPoIB appears as a traditional Ethernet device to Linux and can be used as such. Ceph has no idea that it is not Ethernet.

> I don't know if i support for rsocket that is experimental/untested
> and IPoIB is a stable workaroud or what else.

IPoIB is much more used and pretty stable, while rsockets is new with limited testing. That said, more people using it will help Sean improve it.

Ideally, we would like support for zero-copy and reduced CPU usage (via OS-bypass) and with more interconnects than just InfiniBand. :-)

> And is a dual controller needed on each OSD node? Ceph is able to
> handle OSD network failures? This is really important to know. It
> change the whole network topology.

I will let others answer this.

Scott


* Re: flashcache
  2013-01-17 15:00               ` flashcache Atchley, Scott
@ 2013-01-17 15:07                 ` Andrey Korolyov
  2013-01-17 15:47                   ` flashcache Atchley, Scott
  2013-01-17 15:14                 ` flashcache Gandalf Corvotempesta
  1 sibling, 1 reply; 34+ messages in thread
From: Andrey Korolyov @ 2013-01-17 15:07 UTC (permalink / raw)
  To: Atchley, Scott
  Cc: Gandalf Corvotempesta, Mark Nelson, Joseph Glanville, Sage Weil,
	ceph-devel

On Thu, Jan 17, 2013 at 7:00 PM, Atchley, Scott <atchleyes@ornl.gov> wrote:
> On Jan 17, 2013, at 9:48 AM, Gandalf Corvotempesta <gandalf.corvotempesta@gmail.com> wrote:
>
>> 2013/1/17 Atchley, Scott <atchleyes@ornl.gov>:
>>> IB DDR should get you close to 2 GB/s with IPoIB. I have gotten our IB QDR PCI-E Gen. 2 up to 2.8 GB/s measured via netperf with lots of tuning. Since it uses the traditional socket stack through the kernel, CPU usage will be as high (or higher if QDR) than 10GbE.
>>
>> Which kind of tuning? Do you have a paper about this?
>
> No, I followed the Mellanox tuning guide and modified their interrupt affinity scripts.

Did you try binding the interrupts only to the core that the QPI link
actually belongs to, and measure the difference against spreading them
over all cores?

>
>> But, actually, is possible to use ceph with IPoIB in a stable way or
>> is this experimental ?
>
> IPoIB appears as a traditional Ethernet device to Linux and can be used as such.

Not exactly; this summer the kernel added an additional driver for a fully
featured L2 device (an IB Ethernet driver); before that it was quite painful
to do any kind of failover using IPoIB.

>
>> I don't know if i support for rsocket that is experimental/untested
>> and IPoIB is a stable workaroud or what else.
>
> IPoIB is much more used and pretty stable, while rsockets is new with limited testing. That said, more people using it will help Sean improve it.
>
> Ideally, we would like support for zero-copy and reduced CPU usage (via OS-bypass) and with more interconnects than just InfiniBand. :-)
>
>> And is a dual controller needed on each OSD node? Ceph is able to
>> handle OSD network failures? This is really important to know. It
>> change the whole network topology.
>
> I will let others answer this.
>
> Scott


* Re: flashcache
  2013-01-17 15:00               ` flashcache Atchley, Scott
  2013-01-17 15:07                 ` flashcache Andrey Korolyov
@ 2013-01-17 15:14                 ` Gandalf Corvotempesta
  2013-01-17 15:50                   ` flashcache Atchley, Scott
  1 sibling, 1 reply; 34+ messages in thread
From: Gandalf Corvotempesta @ 2013-01-17 15:14 UTC (permalink / raw)
  To: Atchley, Scott; +Cc: Mark Nelson, Joseph Glanville, Sage Weil, ceph-devel

2013/1/17 Atchley, Scott <atchleyes@ornl.gov>:
> IPoIB appears as a traditional Ethernet device to Linux and can be used as such. Ceph has no idea that it is not Ethernet.

Ok. Now it's clear.
AFAIK, a standard SDR IB card should give us more speed than GbE
(less overhead?) and lower latency, I think.


* Re: flashcache
  2013-01-17 15:07                 ` flashcache Andrey Korolyov
@ 2013-01-17 15:47                   ` Atchley, Scott
  2013-01-17 16:39                     ` flashcache Andrey Korolyov
  0 siblings, 1 reply; 34+ messages in thread
From: Atchley, Scott @ 2013-01-17 15:47 UTC (permalink / raw)
  To: Andrey Korolyov
  Cc: Gandalf Corvotempesta, Mark Nelson, Joseph Glanville, Sage Weil,
	ceph-devel

On Jan 17, 2013, at 10:07 AM, Andrey Korolyov <andrey@xdel.ru> wrote:

> On Thu, Jan 17, 2013 at 7:00 PM, Atchley, Scott <atchleyes@ornl.gov> wrote:
>> On Jan 17, 2013, at 9:48 AM, Gandalf Corvotempesta <gandalf.corvotempesta@gmail.com> wrote:
>> 
>>> 2013/1/17 Atchley, Scott <atchleyes@ornl.gov>:
>>>> IB DDR should get you close to 2 GB/s with IPoIB. I have gotten our IB QDR PCI-E Gen. 2 up to 2.8 GB/s measured via netperf with lots of tuning. Since it uses the traditional socket stack through the kernel, CPU usage will be as high (or higher if QDR) than 10GbE.
>>> 
>>> Which kind of tuning? Do you have a paper about this?
>> 
>> No, I followed the Mellanox tuning guide and modified their interrupt affinity scripts.
> 
> Did you tried to bind interrupts only to core to which QPI link
> belongs in reality and measure difference with spread-over-all-cores
> binding?

This is the modified part. I bound the mlx4-async handler to core 0 and the mlx4-ib-1-0 handler to core 1 for our machines.
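
For reference, a small sketch of that kind of pinning, writing CPU bitmasks
to /proc/irq/<n>/smp_affinity; the IRQ numbers below are placeholders, so
look up the real ones for your mlx4 device in /proc/interrupts and run as root:

# Pin specific IRQs to specific cores via /proc/irq/<irq>/smp_affinity.
# The IRQ numbers below are placeholders; read /proc/interrupts to find the
# real ones for the mlx4 async and completion vectors on your host.
irq_to_core = {64: 0,   # e.g. mlx4 async events -> core 0 (assumed IRQ number)
               65: 1}   # e.g. mlx4-ib-1-0 completion vector -> core 1 (assumed)

for irq, core in irq_to_core.items():
    mask = 1 << core                      # smp_affinity takes a hex CPU bitmask
    with open(f"/proc/irq/{irq}/smp_affinity", "w") as f:
        f.write(f"{mask:x}\n")
    print(f"IRQ {irq} -> core {core} (mask {mask:x})")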

>>> But, actually, is possible to use ceph with IPoIB in a stable way or
>>> is this experimental ?
>> 
>> IPoIB appears as a traditional Ethernet device to Linux and can be used as such.
> 
> Not exactly, this summer kernel added additional driver for fully
> featured L2(ib ethernet driver), before that it was quite painful to
> do any possible failover using ipoib.

I assume it is now an EoIB driver. Does it replace the IPoIB driver?

>>> I don't know if i support for rsocket that is experimental/untested
>>> and IPoIB is a stable workaroud or what else.
>> 
>> IPoIB is much more used and pretty stable, while rsockets is new with limited testing. That said, more people using it will help Sean improve it.
>> 
>> Ideally, we would like support for zero-copy and reduced CPU usage (via OS-bypass) and with more interconnects than just InfiniBand. :-)
>> 
>>> And is a dual controller needed on each OSD node? Ceph is able to
>>> handle OSD network failures? This is really important to know. It
>>> change the whole network topology.
>> 
>> I will let others answer this.
>> 
>> Scott



* Re: flashcache
  2013-01-17 15:14                 ` flashcache Gandalf Corvotempesta
@ 2013-01-17 15:50                   ` Atchley, Scott
  2013-01-17 16:01                     ` flashcache Gandalf Corvotempesta
  0 siblings, 1 reply; 34+ messages in thread
From: Atchley, Scott @ 2013-01-17 15:50 UTC (permalink / raw)
  To: Gandalf Corvotempesta
  Cc: Mark Nelson, Joseph Glanville, Sage Weil, ceph-devel

On Jan 17, 2013, at 10:14 AM, Gandalf Corvotempesta <gandalf.corvotempesta@gmail.com> wrote:

> 2013/1/17 Atchley, Scott <atchleyes@ornl.gov>:
>> IPoIB appears as a traditional Ethernet device to Linux and can be used as such. Ceph has no idea that it is not Ethernet.
> 
> Ok. Now it's clear.
> AFAIK, a standard SDR IB card should give use more speed than GbE
> (less overhead?) and lower latency, I think.

Yes. It should get close to 1 GB/s where 1GbE is limited to about 125 MB/s. Lower latency? Probably, since most Ethernet drivers set interrupt coalescing by default. The Intel e1000 driver, for example, has a cluster mode that reduces (or turns off) interrupt coalescing. I don't know if Ceph is latency sensitive or not.

Scott


* Re: flashcache
  2013-01-17 15:50                   ` flashcache Atchley, Scott
@ 2013-01-17 16:01                     ` Gandalf Corvotempesta
  2013-01-17 16:12                       ` flashcache Atchley, Scott
  0 siblings, 1 reply; 34+ messages in thread
From: Gandalf Corvotempesta @ 2013-01-17 16:01 UTC (permalink / raw)
  To: Atchley, Scott; +Cc: Mark Nelson, Joseph Glanville, Sage Weil, ceph-devel

2013/1/17 Atchley, Scott <atchleyes@ornl.gov>:
> Yes. It should get close to 1 GB/s where 1GbE is limited to about 125 MB/s. Lower latency? Probably since most Ethernet drivers set interrupt coalescing by default. Intel e1000 driver, for example, have a cluster mode that reduces (or turns off) interrupt coalescing. I don't know if ceph is latency sensitive or not.

Sorry, I meant 10GbE.


* Re: flashcache
  2013-01-17 16:01                     ` flashcache Gandalf Corvotempesta
@ 2013-01-17 16:12                       ` Atchley, Scott
  2013-01-17 16:19                         ` flashcache Gandalf Corvotempesta
  2013-01-17 16:20                         ` flashcache Stefan Priebe
  0 siblings, 2 replies; 34+ messages in thread
From: Atchley, Scott @ 2013-01-17 16:12 UTC (permalink / raw)
  To: Gandalf Corvotempesta
  Cc: Mark Nelson, Joseph Glanville, Sage Weil, ceph-devel

On Jan 17, 2013, at 11:01 AM, Gandalf Corvotempesta <gandalf.corvotempesta@gmail.com> wrote:

> 2013/1/17 Atchley, Scott <atchleyes@ornl.gov>:
>> Yes. It should get close to 1 GB/s where 1GbE is limited to about 125 MB/s. Lower latency? Probably since most Ethernet drivers set interrupt coalescing by default. Intel e1000 driver, for example, have a cluster mode that reduces (or turns off) interrupt coalescing. I don't know if ceph is latency sensitive or not.
> 
> Sorry, I meant 10GbE.

10GbE should get close to 1.2 GB/s compared to 1 GB/s for IB SDR. Latency again depends on the Ethernet driver.
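
For reference, the rough arithmetic behind those figures (the protocol
overhead factor below is an assumption; real numbers depend on the stack and MTU):

# Convert link signalling rates into approximate usable payload bandwidth.
def usable_mb_s(gbit_signal, encoding=1.0, protocol=0.94):
    # protocol ~ Ethernet/IP/TCP framing overhead (assumed); encoding is the
    # physical-layer coding efficiency (8b/10b for IB SDR/DDR/QDR).
    return gbit_signal * 1e9 / 8 * encoding * protocol / 1e6

print(f"1 GbE  : ~{usable_mb_s(1):.0f} MB/s")
print(f"10 GbE : ~{usable_mb_s(10):.0f} MB/s")
print(f"IB SDR : ~{usable_mb_s(10, encoding=0.8, protocol=1.0):.0f} MB/s")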

Scott


* Re: flashcache
  2013-01-17 16:12                       ` flashcache Atchley, Scott
@ 2013-01-17 16:19                         ` Gandalf Corvotempesta
  2013-01-22 21:06                           ` flashcache Atchley, Scott
  2013-01-17 16:20                         ` flashcache Stefan Priebe
  1 sibling, 1 reply; 34+ messages in thread
From: Gandalf Corvotempesta @ 2013-01-17 16:19 UTC (permalink / raw)
  To: Atchley, Scott; +Cc: Mark Nelson, Joseph Glanville, Sage Weil, ceph-devel

2013/1/17 Atchley, Scott <atchleyes@ornl.gov>:
> 10GbE should get close to 1.2 GB/s compared to 1 GB/s for IB SDR. Latency again depends on the Ethernet driver.

10GbE faster than IB SDR? Really?


* Re: flashcache
  2013-01-17 16:12                       ` flashcache Atchley, Scott
  2013-01-17 16:19                         ` flashcache Gandalf Corvotempesta
@ 2013-01-17 16:20                         ` Stefan Priebe
  2013-01-17 16:21                           ` flashcache Gandalf Corvotempesta
  1 sibling, 1 reply; 34+ messages in thread
From: Stefan Priebe @ 2013-01-17 16:20 UTC (permalink / raw)
  To: Atchley, Scott
  Cc: Gandalf Corvotempesta, Mark Nelson, Joseph Glanville, Sage Weil,
	ceph-devel

Hi,

On 17.01.2013 at 17:12, Atchley, Scott wrote:
> On Jan 17, 2013, at 11:01 AM, Gandalf Corvotempesta <gandalf.corvotempesta@gmail.com> wrote:
>
>> 2013/1/17 Atchley, Scott <atchleyes@ornl.gov>:
>>> Yes. It should get close to 1 GB/s where 1GbE is limited to about 125 MB/s. Lower latency? Probably since most Ethernet drivers set interrupt coalescing by default. Intel e1000 driver, for example, have a cluster mode that reduces (or turns off) interrupt coalescing. I don't know if ceph is latency sensitive or not.
>>
>> Sorry, I meant 10GbE.
>
> 10GbE should get close to 1.2 GB/s compared to 1 GB/s for IB SDR. Latency again depends on the Ethernet driver.

We're using bonded active/active 2x10GbE with Intel ixgbe and I'm able
to get 2.3 GB/s.

Not sure how to measure latency effectively.
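
One simple way to get a rough number is a TCP ping-pong between two hosts; a
minimal sketch (the port and iteration count are arbitrary), run as 'server'
on one node and 'client <server-ip>' on the other:

# Minimal TCP ping-pong latency test; average RTT/2 approximates one-way latency.
import socket, sys, time

PORT, ITERS = 5001, 10000        # arbitrary port and iteration count

def server():
    s = socket.socket()
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(("", PORT)); s.listen(1)
    c, _ = s.accept()
    c.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    while c.recv(1):
        c.sendall(b"x")          # echo one byte back until the client closes

def client(host):
    c = socket.create_connection((host, PORT))
    c.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    t0 = time.perf_counter()
    for _ in range(ITERS):
        c.sendall(b"x"); c.recv(1)
    rtt_us = (time.perf_counter() - t0) / ITERS * 1e6
    print(f"avg RTT {rtt_us:.1f} us")

if __name__ == "__main__":
    server() if sys.argv[1] == "server" else client(sys.argv[2])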

Stefan


* Re: flashcache
  2013-01-17 16:20                         ` flashcache Stefan Priebe
@ 2013-01-17 16:21                           ` Gandalf Corvotempesta
  2013-01-17 16:24                             ` flashcache Stefan Priebe
  0 siblings, 1 reply; 34+ messages in thread
From: Gandalf Corvotempesta @ 2013-01-17 16:21 UTC (permalink / raw)
  To: Stefan Priebe
  Cc: Atchley, Scott, Mark Nelson, Joseph Glanville, Sage Weil, ceph-devel

2013/1/17 Stefan Priebe <s.priebe@profihost.ag>:
> We're using bonded active/active 2x10GbE with Intel ixgbe and i'm able to
> get 2.3GB/s.

Which kind of switch do you use?


* Re: flashcache
  2013-01-17 16:21                           ` flashcache Gandalf Corvotempesta
@ 2013-01-17 16:24                             ` Stefan Priebe
  0 siblings, 0 replies; 34+ messages in thread
From: Stefan Priebe @ 2013-01-17 16:24 UTC (permalink / raw)
  To: Gandalf Corvotempesta
  Cc: Atchley, Scott, Mark Nelson, Joseph Glanville, Sage Weil, ceph-devel

Hi,

On 17.01.2013 at 17:21, Gandalf Corvotempesta wrote:
> 2013/1/17 Stefan Priebe <s.priebe@profihost.ag>:
>> We're using bonded active/active 2x10GbE with Intel ixgbe and i'm able to
>> get 2.3GB/s.
>
> Which kind of switch do you use?

HP 5920

Stefan


* Re: flashcache
  2013-01-17 15:47                   ` flashcache Atchley, Scott
@ 2013-01-17 16:39                     ` Andrey Korolyov
  2013-01-20  2:56                       ` flashcache Joseph Glanville
  0 siblings, 1 reply; 34+ messages in thread
From: Andrey Korolyov @ 2013-01-17 16:39 UTC (permalink / raw)
  To: Atchley, Scott
  Cc: Gandalf Corvotempesta, Mark Nelson, Joseph Glanville, Sage Weil,
	ceph-devel

On Thu, Jan 17, 2013 at 7:47 PM, Atchley, Scott <atchleyes@ornl.gov> wrote:
> On Jan 17, 2013, at 10:07 AM, Andrey Korolyov <andrey@xdel.ru> wrote:
>
>> On Thu, Jan 17, 2013 at 7:00 PM, Atchley, Scott <atchleyes@ornl.gov> wrote:
>>> On Jan 17, 2013, at 9:48 AM, Gandalf Corvotempesta <gandalf.corvotempesta@gmail.com> wrote:
>>>
>>>> 2013/1/17 Atchley, Scott <atchleyes@ornl.gov>:
>>>>> IB DDR should get you close to 2 GB/s with IPoIB. I have gotten our IB QDR PCI-E Gen. 2 up to 2.8 GB/s measured via netperf with lots of tuning. Since it uses the traditional socket stack through the kernel, CPU usage will be as high (or higher if QDR) than 10GbE.
>>>>
>>>> Which kind of tuning? Do you have a paper about this?
>>>
>>> No, I followed the Mellanox tuning guide and modified their interrupt affinity scripts.
>>
>> Did you tried to bind interrupts only to core to which QPI link
>> belongs in reality and measure difference with spread-over-all-cores
>> binding?
>
> This is the modified part. I bound the mlx4-async handler to core 0 and the mlx4-ib-1-0 handle to core 1 for our machines.
>
>>>> But, actually, is possible to use ceph with IPoIB in a stable way or
>>>> is this experimental ?
>>>
>>> IPoIB appears as a traditional Ethernet device to Linux and can be used as such.
>>
>> Not exactly, this summer kernel added additional driver for fully
>> featured L2(ib ethernet driver), before that it was quite painful to
>> do any possible failover using ipoib.
>
> I assume it is now an EoIB driver. Does it replace the IPoIB driver?
>
Nope, it is an upper-layer thing: https://lwn.net/Articles/509448/

>>>> I don't know if i support for rsocket that is experimental/untested
>>>> and IPoIB is a stable workaroud or what else.
>>>
>>> IPoIB is much more used and pretty stable, while rsockets is new with limited testing. That said, more people using it will help Sean improve it.
>>>
>>> Ideally, we would like support for zero-copy and reduced CPU usage (via OS-bypass) and with more interconnects than just InfiniBand. :-)
>>>
>>>> And is a dual controller needed on each OSD node? Ceph is able to
>>>> handle OSD network failures? This is really important to know. It
>>>> change the whole network topology.
>>>
>>> I will let others answer this.
>>>
>>> Scott
>


* Re: flashcache
  2013-01-17 16:39                     ` flashcache Andrey Korolyov
@ 2013-01-20  2:56                       ` Joseph Glanville
  2013-01-21 23:57                         ` flashcache John Nielsen
  0 siblings, 1 reply; 34+ messages in thread
From: Joseph Glanville @ 2013-01-20  2:56 UTC (permalink / raw)
  To: Andrey Korolyov
  Cc: Atchley, Scott, Gandalf Corvotempesta, Mark Nelson, Sage Weil,
	ceph-devel

>> I assume it is now an EoIB driver. Does it replace the IPoIB driver?
>>
> Nope, it is upper-layer thing: https://lwn.net/Articles/509448/

Aye, it's effectively a NAT translation layer that strips Ethernet
headers and grafts on IPoIB headers, thus using the same wire protocol
and allowing communication from EoIB to IPoIB.

However, this approach is a little dirty and has been NAKed by the
netdev community, so we aren't likely to see it in the mainline
kernel... basically ever.

Joseph.


-- 
CTO | Orion Virtualisation Solutions | www.orionvm.com.au
Phone: 1300 56 99 52 | Mobile: 0428 754 846


* Re: flashcache
  2013-01-20  2:56                       ` flashcache Joseph Glanville
@ 2013-01-21 23:57                         ` John Nielsen
  2013-01-30 20:10                           ` flashcache Joseph Glanville
  0 siblings, 1 reply; 34+ messages in thread
From: John Nielsen @ 2013-01-21 23:57 UTC (permalink / raw)
  To: Joseph Glanville
  Cc: Andrey Korolyov, Atchley, Scott, Gandalf Corvotempesta,
	Mark Nelson, Sage Weil, ceph-devel

On Jan 19, 2013, at 7:56 PM, Joseph Glanville <joseph.glanville@orionvm.com.au> wrote:

>>> I assume it is now an EoIB driver. Does it replace the IPoIB driver?
>>> 
>> Nope, it is upper-layer thing: https://lwn.net/Articles/509448/
> 
> Aye, its effectively a NAT translation layer that strips Ethernet
> headers and grafts on IPoIB headers, thus using the same wire protocol
> and allowing communication from EoIB to IPoIB.
> 
> However this approach is a little dirty and has been nacked by the
> netdev community so we aren't likely to see it in the mainline
> kernel.. basically ever.

Just to clarify:

EoIB has been around for a while (at least in the Mellanox software, not sure about mainline). It uses the mlx4_vnic module and is a true Ethernet encapsulation over InfiniBand. Unfortunately the newer Mellanox switches won't support it any more, and the ones that do have entered "Limited Support." (Not to be confused with mlx4_en, which just turns a ConnectX card into a 10G Ethernet NIC.)

IPoIB is IP over InfiniBand without Ethernet (the data link layer is straight InfiniBand).

eIPoIB is (or will be, maybe) Ethernet over IP over InfiniBand. It is intended to work with both Linux bridging and regular IB switches that support IPoIB. (Allowing e.g. unmodified KVM guests on hypervisors connected to an IPoIB fabric.) Both Joseph's comments and the LWN link above are referring to eIPoIB. Last I heard (from a pretty direct source in the last couple of weeks) Mellanox is still working on this but doesn't have anything generally available yet. Here's hoping.. but the feedback on netdev was quite negative, to put it mildly.

JN



* Re: flashcache
  2013-01-17 16:19                         ` flashcache Gandalf Corvotempesta
@ 2013-01-22 21:06                           ` Atchley, Scott
  2013-01-22 21:08                             ` flashcache Atchley, Scott
  0 siblings, 1 reply; 34+ messages in thread
From: Atchley, Scott @ 2013-01-22 21:06 UTC (permalink / raw)
  To: Gandalf Corvotempesta
  Cc: Mark Nelson, Joseph Glanville, Sage Weil, ceph-devel

On Jan 17, 2013, at 11:19 AM, Gandalf Corvotempesta <gandalf.corvotempesta@gmail.com> wrote:

> 2013/1/17 Atchley, Scott <atchleyes@ornl.gov>:
>> 10GbE should get close to 1.2 GB/s compared to 1 GB/s for IB SDR. Latency again depends on the Ethernet driver.
> 
> 10GbE faster than IB SDR? Really ?

Define faster. Throughput or latency?

Throughput, yes. You can easily measure 1.2 GB/s using many 10GbE NICs.

Latency, (mostly) no. IB SDR is about 4 us while the best TCP performance I have measured over 10GbE is about 16 us (interrupt coalescing turned off, NAGLE off, etc). Some vendors (e.g. Myricom and SolarFlare) have userspace, OS-bypass socket libraries for their NICs that get this down to 3-6 us.

Scott


* Re: flashcache
  2013-01-22 21:06                           ` flashcache Atchley, Scott
@ 2013-01-22 21:08                             ` Atchley, Scott
  0 siblings, 0 replies; 34+ messages in thread
From: Atchley, Scott @ 2013-01-22 21:08 UTC (permalink / raw)
  To: Gandalf Corvotempesta
  Cc: Mark Nelson, Joseph Glanville, Sage Weil, ceph-devel

On Jan 22, 2013, at 4:06 PM, "Atchley, Scott" <atchleyes@ornl.gov> wrote:

> On Jan 17, 2013, at 11:19 AM, Gandalf Corvotempesta <gandalf.corvotempesta@gmail.com> wrote:
> 
>> 2013/1/17 Atchley, Scott <atchleyes@ornl.gov>:
>>> 10GbE should get close to 1.2 GB/s compared to 1 GB/s for IB SDR. Latency again depends on the Ethernet driver.
>> 
>> 10GbE faster than IB SDR? Really ?
> 
> Define faster. Throughput or latency?
> 
> Throughput, yes. You can easily measure 1.2 GB/s using many 10GbE NICs.
> 
> Latency, (mostly) no. IB SDR is about 4 us while the best TCP performance I have measured over 10GbE is about 16 us (interrupt coalescing turned off, NAGLE off, etc). Some vendors (e.g. Myricom and SolarFlare) have userspace, OS-bypass socket libraries for their NICs that get this down to 3-6 us.

Argh, that is the native IB latency for SDR. The latency of sockets on top of IB SDR will be closer to 16-20 us.

Scott


* Re: flashcache
  2013-01-21 23:57                         ` flashcache John Nielsen
@ 2013-01-30 20:10                           ` Joseph Glanville
  0 siblings, 0 replies; 34+ messages in thread
From: Joseph Glanville @ 2013-01-30 20:10 UTC (permalink / raw)
  To: John Nielsen
  Cc: Andrey Korolyov, Atchley, Scott, Gandalf Corvotempesta,
	Mark Nelson, Sage Weil, ceph-devel

On 22 January 2013 10:57, John Nielsen <lists@jnielsen.net> wrote:
> On Jan 19, 2013, at 7:56 PM, Joseph Glanville <joseph.glanville@orionvm.com.au> wrote:
>
>>>> I assume it is now an EoIB driver. Does it replace the IPoIB driver?
>>>>
>>> Nope, it is upper-layer thing: https://lwn.net/Articles/509448/
>>
>> Aye, its effectively a NAT translation layer that strips Ethernet
>> headers and grafts on IPoIB headers, thus using the same wire protocol
>> and allowing communication from EoIB to IPoIB.
>>
>> However this approach is a little dirty and has been nacked by the
>> netdev community so we aren't likely to see it in the mainline
>> kernel.. basically ever.
>
> Just to clarify:
>
> EoIB has been around for a while (at least in the Mellanox software, not sure about mainline). It uses the mlx4_vnic module and is a true Ethernet encapsulation over InfiniBand. Unfortunately the newer Mellanox switches won't support it any more and the ones that to have entered "Limited Support." (Not to be confused with mlx4_en, which just turns a ConnectX card into a 10G Ethernet NIC.)
>
> IPoIB is IP over InfiniBand without Ethernet (the data link layer is straight InfiniBand).
>
> eIPoIB is (or will be, maybe) Ethernet over IP over InfiniBand. It is intended to work with both Linux bridging and regular IB switches that support IPoIB. (Allowing e.g. unmodified KVM guests on hypervisors connected to an IPoIB fabric.) Both Joseph's comments and the LWN link above are referring to eIPoIB. Last I heard (from a pretty direct source in the last couple of weeks) Mellanox is still working on this but doesn't have anything generally available yet. Here's hoping.. but the feedback on netdev was quite negative, to put it mildly.
>
> JN
>

Apologies, I misread EoIB as eIPoIB, since others were discussing IPoIB
appearing as an Ethernet device.

Personally I would like to see a pure software implementation of EoIB,
a la IPoIB, using the subnet manager to manage addressing etc. rather
than trying to implement a NAT-style solution.
The usefulness of communicating with IPoIB devices via the same
interface as IPoIB doesn't sufficiently offset the dirtiness of NAT,
IMO.

-- 
CTO | Orion Virtualisation Solutions | www.orionvm.com.au
Phone: 1300 56 99 52 | Mobile: 0428 754 846
