* OSD public / cluster network isolation using VRF:s
@ 2015-12-03 20:13 Martin Millnert
  2015-12-03 21:03 ` wido
  2015-12-03 21:25 ` Gregory Farnum
  0 siblings, 2 replies; 14+ messages in thread
From: Martin Millnert @ 2015-12-03 20:13 UTC (permalink / raw)
  To: Ceph Development

Hi,

we're deploying Ceph on Linux for multiple purposes.
We want to build network isolation in our L3 DC network using VRFs.

For Ceph this means separating the Ceph public network from the Ceph
cluster network into separate network routing domains (which is what a
VRF provides, for those who don't know the term).

Furthermore, we're running (per-VRF) dynamically routed L3 all the way
to the hosts (OSPF from the ToR switch), and we need to keep the route
tables on the hosts properly separated. This is done using "ip rule"
today. We use VLANs to separate the VRFs from each other between ToR
and hosts, so there is no problem determining which VRF an incoming
packet to a host belongs to (iif $dev).

The problem is selecting the proper route table for outbound packets
from the host.

There is work in progress on a redesign [1] of the old VRF [2] design
in the Linux kernel. At least in the new design, there is an intended
way of placing processes within a VRF such that, similar to network
namespaces, the processes are unaware that they are in fact living
within a VRF.

This would work for a process such as the 'mon', which only lives in the
public network.

But it doesn't work for the OSD, which uses separate sockets for public
and cluster networks.

There is, however, a really simple solution:
1. Use something similar to
   setsockopt(sockfd, SOL_SOCKET, SO_MARK, &puborclust_val, sizeof(puborclust_val))
   (untested; see the sketch below)
2. Set up "ip rule" for outbound traffic to select an appropriate route
   table based on the mark value "puborclust_val" set above.
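
For concreteness, a minimal sketch of what 1) and 2) could look like
(untested; the mark values and table names are examples only, nothing
Ceph defines today):

    #include <sys/socket.h>

    /* Tag a socket with an fwmark so policy routing can pick the right
     * table. Mark values are arbitrary examples: 1 = public, 2 = cluster.
     * Setting SO_MARK requires CAP_NET_ADMIN.
     *
     * The matching policy routing would then be something like:
     *   ip rule add fwmark 1 lookup public
     *   ip rule add fwmark 2 lookup cluster
     */
    static int set_socket_mark(int sockfd, unsigned int mark)
    {
        return setsockopt(sockfd, SOL_SOCKET, SO_MARK, &mark, sizeof(mark));
    }

The OSD would call something like this right after creating each socket,
with the mark chosen according to which network (public or cluster) the
socket belongs to.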

AFAIK BSD doesn't have SO_MARK specifically, but this is a quite simple
option that adds a lot of utility for us and, I imagine, for others.

I'm willing to write it and test it too. But before doing that, I'm
interested in feedback. I would obviously prefer it to be merged.

Regards,
Martin Millnert

[1] https://lwn.net/Articles/632522/
[2] https://www.kernel.org/doc/Documentation/networking/vrf.txt



* Re: OSD public / cluster network isolation using VRF:s
  2015-12-03 20:13 OSD public / cluster network isolation using VRF:s Martin Millnert
@ 2015-12-03 21:03 ` wido
  2015-12-03 21:30   ` Sage Weil
  2015-12-07 13:31   ` Martin Millnert
  2015-12-03 21:25 ` Gregory Farnum
  1 sibling, 2 replies; 14+ messages in thread
From: wido @ 2015-12-03 21:03 UTC (permalink / raw)
  To: Martin Millnert; +Cc: Ceph Development



> On 3 Dec 2015, at 21:14, Martin Millnert <martin@millnert.se> wrote:
> 
> Hi,
> 
> we're deploying Ceph on Linux for multiple purposes.
> We want to build network isolation in our L3 DC network using VRF:s.
> 
> In the case of Ceph this means that we are separating the Ceph public
> network from the Ceph cluster network, in this manner, into separate
> network routing domains (for those who do not know what a VRF is).
> 
> Furthermore, we're also running (per-VRF) dynamically routed L3 all the
> way to the hosts (OSPF from ToR switch), and need to separate route
> tables on the hosts properly. This is done using "ip rule" today.
> We use VLANs to separate the VRF:s from each other between ToR and
> hosts, so there is no problem to determine which VRF an incoming packet
> to a host belongs to (iif $dev).
> 
> The problem is selecting the proper route table for outbound packets
> from the host.
> 
> There is current work in progress for a redesign [1] of the old VRF [2]
> design in the Linux Kernel. At least in the new design, there is an
> intended way of placing processes within a VRF such that, similar to
> network namespaces, the processes are unaware that they are in fact
> living within a VRF.
> 

Why all the trouble and complexity? I personally always try to avoid the two networks and run with just one, also in large L3 environments.

I like the idea that one machine has one IP I have to monitor.

I would rethink what a cluster network really adds. IMHO it only adds complexity.


> This would work for a process such as the 'mon', which only lives in the
> public network.
> 
> But it doesn't work for the OSD, which uses separate sockets for public
> and cluster networks.
> 
> There is however a real simple solution:
> 1. Use something similar to 
>   setsockopt(sockfd, SOL_SOCKET, SO_MARK, puborclust_val, sizeof(one))
>   (untested)
> 2. set up "ip rule" for outbound traffic to select an appropriate route
> table based on the MARK value of "puborclust_val" above.
> 
> AFAIK BSD doesn't have SO_MARK specifically, but this is a quite simple
> option that adds a lot of utility for us, and, I imagine others.
> 
> I'm willing to write it and test it too. But before doing that, I'm
> interested in feedback. Would obviously prefer it to be merged.
> 
> Regards,
> Martin Millnert
> 
> [1] https://lwn.net/Articles/632522/
> [2] https://www.kernel.org/doc/Documentation/networking/vrf.txt
> 


* Re: OSD public / cluster network isolation using VRF:s
  2015-12-03 20:13 OSD public / cluster network isolation using VRF:s Martin Millnert
  2015-12-03 21:03 ` wido
@ 2015-12-03 21:25 ` Gregory Farnum
  2015-12-07 13:36   ` Martin Millnert
  1 sibling, 1 reply; 14+ messages in thread
From: Gregory Farnum @ 2015-12-03 21:25 UTC (permalink / raw)
  To: Martin Millnert; +Cc: Ceph Development

On Thu, Dec 3, 2015 at 12:13 PM, Martin Millnert <martin@millnert.se> wrote:
> Hi,
>
> we're deploying Ceph on Linux for multiple purposes.
> We want to build network isolation in our L3 DC network using VRF:s.
>
> In the case of Ceph this means that we are separating the Ceph public
> network from the Ceph cluster network, in this manner, into separate
> network routing domains (for those who do not know what a VRF is).
>
> Furthermore, we're also running (per-VRF) dynamically routed L3 all the
> way to the hosts (OSPF from ToR switch), and need to separate route
> tables on the hosts properly. This is done using "ip rule" today.
> We use VLANs to separate the VRF:s from each other between ToR and
> hosts, so there is no problem to determine which VRF an incoming packet
> to a host belongs to (iif $dev).
>
> The problem is selecting the proper route table for outbound packets
> from the host.
>
> There is current work in progress for a redesign [1] of the old VRF [2]
> design in the Linux Kernel. At least in the new design, there is an
> intended way of placing processes within a VRF such that, similar to
> network namespaces, the processes are unaware that they are in fact
> living within a VRF.
>
> This would work for a process such as the 'mon', which only lives in the
> public network.
>
> But it doesn't work for the OSD, which uses separate sockets for public
> and cluster networks.
>
> There is however a real simple solution:
> 1. Use something similar to
>    setsockopt(sockfd, SOL_SOCKET, SO_MARK, puborclust_val, sizeof(one))
>    (untested)
> 2. set up "ip rule" for outbound traffic to select an appropriate route
> table based on the MARK value of "puborclust_val" above.
>
> AFAIK BSD doesn't have SO_MARK specifically, but this is a quite simple
> option that adds a lot of utility for us, and, I imagine others.
>
> I'm willing to write it and test it too. But before doing that, I'm
> interested in feedback. Would obviously prefer it to be merged.

I'm probably just being dense here, but I don't quite understand what
all this is trying to accomplish. It looks like it's essentially
trying to set up VLANs (with different rules) over a single physical
network interface, which is still represented to userspace as a single
device with a single IP. Is that right?

What's the point of doing that with Ceph?
-Greg


* Re: OSD public / cluster network isolation using VRF:s
  2015-12-03 21:03 ` wido
@ 2015-12-03 21:30   ` Sage Weil
  2015-12-07 13:41     ` Martin Millnert
  2015-12-07 13:31   ` Martin Millnert
  1 sibling, 1 reply; 14+ messages in thread
From: Sage Weil @ 2015-12-03 21:30 UTC (permalink / raw)
  To: wido; +Cc: Martin Millnert, Ceph Development

On Thu, 3 Dec 2015, wido@42on.com wrote:
> Why all the trouble and complexity? I personally always try to avoid the 
> two networks and run with one. Also in large L3 envs.
> 
> I like the idea that one machine has one IP I have to monitor.
> 
> I would rethink about what a cluster network really adds. Imho it only 
> adds complexity.

FWIW I tend to agree.  There are probably some network deployments where 
it makes sense, but for most people I think it just adds complexity.  
Maybe it makes it easy to utilize dual interfaces, but my guess is you're 
better off bonding them if you can.

Note that on a largish cluster the public/client traffic is all 
north-south, while the backend traffic is also mostly north-south to the 
top-of-rack and then east-west.  I.e., within the rack, almost everything 
is north-south, and client and replication traffic don't look that 
different.

sage


* Re: OSD public / cluster network isolation using VRF:s
  2015-12-03 21:03 ` wido
  2015-12-03 21:30   ` Sage Weil
@ 2015-12-07 13:31   ` Martin Millnert
  1 sibling, 0 replies; 14+ messages in thread
From: Martin Millnert @ 2015-12-07 13:31 UTC (permalink / raw)
  To: wido; +Cc: Ceph Development

Wido, 

thanks for your feedback.

On Thu, 2015-12-03 at 22:03 +0100, wido@42on.com wrote:
> 
> > On 3 Dec 2015, at 21:14, Martin Millnert <martin@millnert.se> wrote:
> > 
> > Hi,
> > 
> > we're deploying Ceph on Linux for multiple purposes.
> > We want to build network isolation in our L3 DC network using VRF:s.
<snip>
> Why all the trouble and complexity? I personally always try to avoid
> the two networks and run with one. Also in large L3 envs.
> 
> I like the idea that one machine has one IP I have to monitor.
> 
> I would rethink about what a cluster network really adds. Imho it only
> adds complexity.

There is one main reason behind the separation, i.e. using a cluster
network: simple network-level traffic classification.

We have machines where we need to be able to guarantee a minimum amount
of osd-osd replication traffic on the network links (CoS). And it seems
like a "nice-to-have" feature in general.
An assumption here is that osd-osd "pinging" would happen on the cluster
network if configured.

A possible workaround, I imagine, would be for replication (osd-osd) and
osd-mon traffic to receive different values in the ToS field than client
traffic does. It is not immediately obvious to me how one listening
socket would manage the distinction.
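
As a rough sketch of the kind of per-socket ToS/DSCP marking I have in
mind (untested; the helper name and shift-based DSCP handling are just
an illustration):

    #include <netinet/in.h>
    #include <netinet/ip.h>
    #include <sys/socket.h>

    /* Set the DSCP code point on an individual (accepted or connected)
     * socket. DSCP occupies the upper six bits of the ToS byte, hence
     * the shift. For IPv6 sockets the equivalent option is IPV6_TCLASS. */
    static int set_socket_dscp(int fd, int dscp)
    {
        int tos = dscp << 2;
        return setsockopt(fd, IPPROTO_IP, IP_TOS, &tos, sizeof(tos));
    }

The open question above remains, though: the marking would have to be
applied per accepted connection, once the OSD knows which kind of peer
it is talking to.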

/Martin



* Re: OSD public / cluster network isolation using VRF:s
  2015-12-03 21:25 ` Gregory Farnum
@ 2015-12-07 13:36   ` Martin Millnert
  2015-12-07 14:48     ` Gregory Farnum
  0 siblings, 1 reply; 14+ messages in thread
From: Martin Millnert @ 2015-12-07 13:36 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Ceph Development

Greg,

see below.

On Thu, 2015-12-03 at 13:25 -0800, Gregory Farnum wrote:
> On Thu, Dec 3, 2015 at 12:13 PM, Martin Millnert <martin@millnert.se> wrote:
> > Hi,
> >
> > we're deploying Ceph on Linux for multiple purposes.
> > We want to build network isolation in our L3 DC network using VRF:s.
> >
> > In the case of Ceph this means that we are separating the Ceph public
> > network from the Ceph cluster network, in this manner, into separate
> > network routing domains (for those who do not know what a VRF is).
> >
> > Furthermore, we're also running (per-VRF) dynamically routed L3 all the
> > way to the hosts (OSPF from ToR switch), and need to separate route
> > tables on the hosts properly. This is done using "ip rule" today.
> > We use VLANs to separate the VRF:s from each other between ToR and
> > hosts, so there is no problem to determine which VRF an incoming packet
> > to a host belongs to (iif $dev).
> >
> > The problem is selecting the proper route table for outbound packets
> > from the host.
> >
> > There is current work in progress for a redesign [1] of the old VRF [2]
> > design in the Linux Kernel. At least in the new design, there is an
> > intended way of placing processes within a VRF such that, similar to
> > network namespaces, the processes are unaware that they are in fact
> > living within a VRF.
> >
> > This would work for a process such as the 'mon', which only lives in the
> > public network.
> >
> > But it doesn't work for the OSD, which uses separate sockets for public
> > and cluster networks.
> >
> > There is however a real simple solution:
> > 1. Use something similar to
> >    setsockopt(sockfd, SOL_SOCKET, SO_MARK, puborclust_val, sizeof(one))
> >    (untested)
> > 2. set up "ip rule" for outbound traffic to select an appropriate route
> > table based on the MARK value of "puborclust_val" above.
> >
> > AFAIK BSD doesn't have SO_MARK specifically, but this is a quite simple
> > option that adds a lot of utility for us, and, I imagine others.
> >
> > I'm willing to write it and test it too. But before doing that, I'm
> > interested in feedback. Would obviously prefer it to be merged.
> 
> I'm probably just being dense here, but I don't quite understand what
> all this is trying to accomplish. It looks like it's essentially
> trying to set up VLANs (with different rules) over a single physical
> network interface, that is still represented to userspace as a single
> device with a single IP. Is that right?

That's almost what it is, with two differences:
 1) there are separate route tables per VLAN,
 2) each VLAN interface (public, cluster) has its own address.

With separate route tables, there's a general problem of picking the
correct table for outbound connections.

> What's the point of doing that with Ceph?

Classification and prioritization of Ceph network traffic; in our case,
prioritization of cluster traffic over client traffic. See my email to
Wido.

/Martin



* Re: OSD public / cluster network isolation using VRF:s
  2015-12-03 21:30   ` Sage Weil
@ 2015-12-07 13:41     ` Martin Millnert
  2015-12-07 14:10       ` Sage Weil
  0 siblings, 1 reply; 14+ messages in thread
From: Martin Millnert @ 2015-12-07 13:41 UTC (permalink / raw)
  To: Sage Weil; +Cc: wido, Ceph Development

Sage,

thanks for your feedback, please see below:

On Thu, 2015-12-03 at 13:30 -0800, Sage Weil wrote:
> On Thu, 3 Dec 2015, wido@42on.com wrote:
> > Why all the trouble and complexity? I personally always try to avoid the 
> > two networks and run with one. Also in large L3 envs.
> > 
> > I like the idea that one machine has one IP I have to monitor.
> > 
> > I would rethink about what a cluster network really adds. Imho it only 
> > adds complexity.
> 
> FWIW I tend to agree.  There are probably some network deployments where 
> it makes sense, but for most people I think it just adds complexity.  
> Maybe it makes it easy to utilize dual interfaces, but my guess is you're 
> better off bonding them if you can.

I'll add to my response to Wido that in our case it's not separate
physical interfaces (while that would provide good QoS, you would lose
statistical multiplexing gains and redundancy). ToR <-> host in our case
is bonded or equivalent.

> Note that on a largish cluster the public/client traffic is all 
> north-south, while the backend traffic is also mostly north-south to the 
> top-of-rack and then east-west.  I.e., within the rack, almost everything 
> is north-south, and client and replication traffic don't look that 
> different.

This problem domain is one of the larger challenges. I worry about
network timeouts for critical cluster traffic in one of the clusters due
to hosts having 2x1GbE. I.e. in our case I want to
prioritize/guarantee/reserve a minimum amount of bandwidth for cluster
health traffic primarily, and secondarily cluster replication. Client
write replication should then be least prioritized.

To support this I need our network equipment to perform the CoS job, and
to do that I need to be able to classify traffic at some level in the
stack. Furthermore, I'd like to do this with as little added state as
possible.

/Martin



* Re: OSD public / cluster network isolation using VRF:s
  2015-12-07 13:41     ` Martin Millnert
@ 2015-12-07 14:10       ` Sage Weil
  2015-12-07 15:50         ` Martin Millnert
  2015-12-14 18:31         ` Kyle Bader
  0 siblings, 2 replies; 14+ messages in thread
From: Sage Weil @ 2015-12-07 14:10 UTC (permalink / raw)
  To: Martin Millnert; +Cc: wido, Ceph Development

On Mon, 7 Dec 2015, Martin Millnert wrote:
> > Note that on a largish cluster the public/client traffic is all 
> > north-south, while the backend traffic is also mostly north-south to the 
> > top-of-rack and then east-west.  I.e., within the rack, almost everything 
> > is north-south, and client and replication traffic don't look that 
> > different.
> 
> This problem domain is one of the larger challenges. I worry about
> network timeouts for critical cluster traffic in one of the clusters due
> to hosts having 2x1GbE. I.e. in our case I want to
> prioritize/guarantee/reserve a minimum amount of bandwidth for cluster
> health traffic primarily, and secondarily cluster replication. Client
> write replication should then be least prioritized.

One word of caution here: the health traffic should really use the
same path and class of service as the inter-osd traffic, or else it
will not identify failures.  E.g., if the health traffic is prioritized
and lower-priority traffic is starved/dropped, we won't notice.
 
> To support this I need our network equipment to perform the CoS job, and
> in order to do that at some level in the stack I need to be able to
> classify traffic. And furthermore, I'd like to do this with as little
> added state as possible.

I seem to recall a conversation a year or so ago about tagging 
stream/sockets so that the network layer could do this.  I don't think 
we got anywhere, though...

sage



* Re: OSD public / cluster network isolation using VRF:s
  2015-12-07 13:36   ` Martin Millnert
@ 2015-12-07 14:48     ` Gregory Farnum
  2015-12-07 15:31       ` Martin Millnert
  0 siblings, 1 reply; 14+ messages in thread
From: Gregory Farnum @ 2015-12-07 14:48 UTC (permalink / raw)
  To: Martin Millnert; +Cc: Ceph Development

On Mon, Dec 7, 2015 at 5:36 AM, Martin Millnert <martin@millnert.se> wrote:
> Greg,
>
> see below.
>
> On Thu, 2015-12-03 at 13:25 -0800, Gregory Farnum wrote:
>> On Thu, Dec 3, 2015 at 12:13 PM, Martin Millnert <martin@millnert.se> wrote:
>> > Hi,
>> >
>> > we're deploying Ceph on Linux for multiple purposes.
>> > We want to build network isolation in our L3 DC network using VRF:s.
>> >
>> > In the case of Ceph this means that we are separating the Ceph public
>> > network from the Ceph cluster network, in this manner, into separate
>> > network routing domains (for those who do not know what a VRF is).
>> >
>> > Furthermore, we're also running (per-VRF) dynamically routed L3 all the
>> > way to the hosts (OSPF from ToR switch), and need to separate route
>> > tables on the hosts properly. This is done using "ip rule" today.
>> > We use VLANs to separate the VRF:s from each other between ToR and
>> > hosts, so there is no problem to determine which VRF an incoming packet
>> > to a host belongs to (iif $dev).
>> >
>> > The problem is selecting the proper route table for outbound packets
>> > from the host.
>> >
>> > There is current work in progress for a redesign [1] of the old VRF [2]
>> > design in the Linux Kernel. At least in the new design, there is an
>> > intended way of placing processes within a VRF such that, similar to
>> > network namespaces, the processes are unaware that they are in fact
>> > living within a VRF.
>> >
>> > This would work for a process such as the 'mon', which only lives in the
>> > public network.
>> >
>> > But it doesn't work for the OSD, which uses separate sockets for public
>> > and cluster networks.
>> >
>> > There is however a real simple solution:
>> > 1. Use something similar to
>> >    setsockopt(sockfd, SOL_SOCKET, SO_MARK, puborclust_val, sizeof(one))
>> >    (untested)
>> > 2. set up "ip rule" for outbound traffic to select an appropriate route
>> > table based on the MARK value of "puborclust_val" above.
>> >
>> > AFAIK BSD doesn't have SO_MARK specifically, but this is a quite simple
>> > option that adds a lot of utility for us, and, I imagine others.
>> >
>> > I'm willing to write it and test it too. But before doing that, I'm
>> > interested in feedback. Would obviously prefer it to be merged.
>>
>> I'm probably just being dense here, but I don't quite understand what
>> all this is trying to accomplish. It looks like it's essentially
>> trying to set up VLANs (with different rules) over a single physical
>> network interface, that is still represented to userspace as a single
>> device with a single IP. Is that right?
>
> That's almost what it is, with two differences:
>  1) there are separated route tables per VLAN,
>  2) Each VLAN interface (public, cluster) has its own address.

Okay, but if each interface has its own address, why do you need
Ceph to do anything at all? You can specify the public and cluster
addresses, they'll bind to the appropriate interface, and then you can
do stuff based on the interface/VLAN it's part of. Right?
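
I.e., something like this in ceph.conf (the subnets and addresses below
are examples only):

    [global]
        public network  = 192.168.10.0/24
        cluster network = 192.168.20.0/24

    [osd.0]
        # optional per-daemon pinning
        public addr  = 192.168.10.11
        cluster addr = 192.168.20.11
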
-Greg


* Re: OSD public / cluster network isolation using VRF:s
  2015-12-07 14:48     ` Gregory Farnum
@ 2015-12-07 15:31       ` Martin Millnert
  2015-12-11  2:38         ` Gregory Farnum
  0 siblings, 1 reply; 14+ messages in thread
From: Martin Millnert @ 2015-12-07 15:31 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Ceph Development

On Mon, 2015-12-07 at 06:48 -0800, Gregory Farnum wrote:
<snip>
> >> I'm probably just being dense here, but I don't quite understand what
> >> all this is trying to accomplish. It looks like it's essentially
> >> trying to set up VLANs (with different rules) over a single physical
> >> network interface, that is still represented to userspace as a single
> >> device with a single IP. Is that right?
> >
> > That's almost what it is, with two differences:
> >  1) there are separated route tables per VLAN,
> >  2) Each VLAN interface (public, cluster) has its own address.
> 
> Okay, but if each interface has its own interface, why do you need
> Ceph to do anything at all? You can specify the public and cluster
> addresses, they'll bind to the appropriate interface, and then you can
> do stuff based on the interface/VLAN it's part of. Right?
> -Greg

Yes. And in the generic case: almost good enough.

In the case I'm discussing, with separate Linux kernel routing tables as
well, we need to steer the route lookup that happens once the TCP stack
has performed its packetization into the correct table.

Depending on how Ceph behaves with interface/IP binding for outbound
connections, this may be easy!
I.e. if Ceph binds to the specific address, not only on the listening
socket but also when creating outbound sockets, we can create "ip rule"s
that use the source address, and AFAIU "we're home" - at this level.
Do you know if this is how Ceph manages the sockets in this case?

But if we instead end up with the kernel trying to figure out which
address to use ( https://tools.ietf.org/html/rfc6724 ), it gets a whole
lot trickier.
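
To illustrate, a minimal (untested) sketch of an outbound connection
pinned to a source address, which a source-based rule could then steer;
the addresses and table name are examples only, and I don't know yet
whether Ceph actually does its outbound binding this way:

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Connect out from a specific source address so that a rule such as
     *   ip rule add from 10.2.0.11/32 lookup cluster
     * selects the cluster routing table for this socket's traffic. */
    static int connect_from(const char *src_ip, const char *dst_ip,
                            unsigned short dst_port)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
            return -1;

        struct sockaddr_in src, dst;
        memset(&src, 0, sizeof(src));
        src.sin_family = AF_INET;
        inet_pton(AF_INET, src_ip, &src.sin_addr);   /* pin source address */

        memset(&dst, 0, sizeof(dst));
        dst.sin_family = AF_INET;
        dst.sin_port = htons(dst_port);
        inet_pton(AF_INET, dst_ip, &dst.sin_addr);

        if (bind(fd, (struct sockaddr *)&src, sizeof(src)) < 0 ||
            connect(fd, (struct sockaddr *)&dst, sizeof(dst)) < 0) {
            close(fd);
            return -1;
        }
        return fd;
    }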

For monitors, which live only on the public network (as per the
documentation), the situation is simpler; we can mark traffic outside of
Ceph using e.g. iptables.

/Martin



* Re: OSD public / cluster network isolation using VRF:s
  2015-12-07 14:10       ` Sage Weil
@ 2015-12-07 15:50         ` Martin Millnert
  2015-12-07 17:01           ` Robert LeBlanc
  2015-12-14 18:31         ` Kyle Bader
  1 sibling, 1 reply; 14+ messages in thread
From: Martin Millnert @ 2015-12-07 15:50 UTC (permalink / raw)
  To: Sage Weil; +Cc: wido, Ceph Development

On Mon, 2015-12-07 at 06:10 -0800, Sage Weil wrote:
> On Mon, 7 Dec 2015, Martin Millnert wrote:
> > > Note that on a largish cluster the public/client traffic is all 
> > > north-south, while the backend traffic is also mostly north-south to the 
> > > top-of-rack and then east-west.  I.e., within the rack, almost everything 
> > > is north-south, and client and replication traffic don't look that 
> > > different.
> > 
> > This problem domain is one of the larger challenges. I worry about
> > network timeouts for critical cluster traffic in one of the clusters due
> > to hosts having 2x1GbE. I.e. in our case I want to
> > prioritize/guarantee/reserve a minimum amount of bandwidth for cluster
> > health traffic primarily, and secondarily cluster replication. Client
> > write replication should then be least prioritized.
> 
> One word of caution here: the health traffic should really be the 
> same path and class of service as the inter-osd traffic, or else it 
> will not identify failures. 

Indeed - complete starvation is never good. We're considering reserving
parts of the bandwidth (where the class-of-service implementation in the
networking gear does the job of spending unallocated bandwidth, etc., as
per the usual packet scheduling logic: TX time slots never go idle as
long as there are non-empty queues).

Something like:
 1) "Reserve 5% bandwidth for 'osd-mon' traffic"
 2) "Reserve 40% bandwidth for 'osd-osd' (repairs when unhealthy)"
 3) "Reserve 30% bandwidth for 'osd-osd' (other)"
 4) "Reserve 25% bandwidth for 'client-osd' traffic"

Our goal is that client traffic *should* lose some packets here and
there when there is more load towards a host than it has bandwidth for,
a little more often than that happens to more critical traffic. Health
takes precedence over function, but not on an "all or nothing" basis. I
suppose 2 and 3 may be impossible to distinguish.

But most important of all, the way I understand Ceph under stress, is
that we want to actively avoid OSDs flipping up/down and ending up with
an oscillating/unstable cluster that starts to move data around simply
because a host is under pressure (e.g. 100 nodes writing to 1, and
similar scenarios).

> e.g., if the health traffic is prioritized, 
> and lower-priority traffic is starved/dropped, we won't notice.

To truly notice drops we need information from the network layer, either
from the host stack side (where we can have it per socket) or from the
network side, i.e. the switches etc., right?
We'll monitor the different hardware queues in our network devices.
Socket statistics can be collected host-wide from the Linux network
stack, and per socket given some modifications to Ceph, I suppose (I
push netstat's statistics into InfluxDB).
(I'm rusty on which per-socket metrics can be logged today in a vanilla
kernel and assume we need application support.)

The bigger overarching issue for us is what happens under stress in
different situations, and how to maximize the time the cluster spends in
its "normal" state.

> > To support this I need our network equipment to perform the CoS job, and
> > in order to do that at some level in the stack I need to be able to
> > classify traffic. And furthermore, I'd like to do this with as little
> > added state as possible.
> 
> I seem to recall a conversation a year or so ago about tagging 
> stream/sockets so that the network layer could do this.  I don't think 
> we got anywhere, though...

It'd be interesting to look into what the ideas were back then - I'll
take a look through the archives.

Thanks,
Martin



* Re: OSD public / cluster network isolation using VRF:s
  2015-12-07 15:50         ` Martin Millnert
@ 2015-12-07 17:01           ` Robert LeBlanc
  0 siblings, 0 replies; 14+ messages in thread
From: Robert LeBlanc @ 2015-12-07 17:01 UTC (permalink / raw)
  To: Martin Millnert; +Cc: Ceph Development


We did some work on prioritizing Ceph traffic and this is what I came up with.

#!/bin/sh

#set -x

if [ "$1" = "bond0" ]; then

        INTERFACES="enp7s0f0 enp7s0f1"

        for i in $INTERFACES; do
                # Clear what might be there
                tc qdisc del dev $i root

                # Add priority queue at the root of the interface
                tc qdisc add dev $i root handle 1: prio

                # Add sfq to each priority band to give each destination
                # a chance to get traffic
                tc qdisc add dev $i parent 1:1 handle 10: sfq
                tc qdisc add dev $i parent 1:2 handle 20: sfq
                tc qdisc add dev $i parent 1:3 handle 30: sfq
        done

        # Flush the POSTROUTING chain
        iptables -t mangle -F POSTROUTING

        # Don't mess with the loopback device
        iptables -t mangle -A POSTROUTING -o lo -j ACCEPT

        # Remark the Ceph heartbeat packets
        iptables -t mangle -A POSTROUTING -m dscp --dscp 0x30 \
                -j DSCP --set-dscp 0x2e

        # Traffic destined for the monitors should get priority
        iptables -t mangle -A POSTROUTING -p tcp --dport 6789 \
                -j DSCP --set-dscp 0x2e

        # All traffic going out the management interface is high priority
        iptables -t mangle -A POSTROUTING -o bond0.202 -j DSCP --set-dscp 0x2e

        # Send the high priority traffic to the tc 1:1 queue of the adapter
        iptables -t mangle -A POSTROUTING -m dscp --dscp 0x2e \
                -j CLASSIFY --set-class 0001:0001

        # Stop processing high priority traffic so it doesn't get messed up
        iptables -t mangle -A POSTROUTING -m dscp --dscp 0x2e -j ACCEPT

        # Set the replication traffic to low priority, it will only be on the
        # cluster network VLAN 401. Heartbeats were taken care of already
        iptables -t mangle -A POSTROUTING -o bond0.401 -j DSCP --set-dscp 0x08

        # Send the replication traffic to the tc 1:3 queue of the adapter
        iptables -t mangle -A POSTROUTING -m dscp --dscp 0x08 \
                -j CLASSIFY --set-class 0001:0003

        # Stop processing low priority traffic
        iptables -t mangle -A POSTROUTING -m dscp --dscp 0x08 -j ACCEPT

        # Whatever is left is best effort or storage traffic. We don't need
        # to mark it because it will get the default DSCP of 0. Just send it
        # to the middle tc class 1:2
        iptables -t mangle -A POSTROUTING -j CLASSIFY --set-class 0001:0002
fi

In the switches, we mark CoS based on the DSCP tag since we are not
able to easily mark the L2 CoS in Linux. Even though we are using the
scavenger class here on the Linux box for replication, I believe we
are mapping DSCP 0x08 to the same class as DSCP 0x00 in the switches.
All Ceph traffic has a higher CoS priority than the VM traffic
(different VLANs but the same physical switches). We didn't have much
luck with replication traffic prioritized lower than client traffic,
and it works well enough for what we wanted with client and replication
traffic at the same priority. We have created saturation tests, and
even though the cluster performance degrades a lot, we did not have the
flapping of OSD nodes that others have mentioned in similar situations.
We have also configured 12 reporters for 10 OSDs per host, so I'm sure
that helps as well.

Newer versions of Ceph will automatically set the DSCP of heartbeat
packets, but we wanted a different DSCP value so we just remark them.
I was going to test setting client traffic lower than replication, but
after our testing we didn't have a pressing need to; the cluster would
have degraded about the same either way. We just wanted to prevent the
OSD flapping, and this got us there.
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Mon, Dec 7, 2015 at 8:50 AM, Martin Millnert <martin@millnert.se> wrote:
<snip>


* Re: OSD public / cluster network isolation using VRF:s
  2015-12-07 15:31       ` Martin Millnert
@ 2015-12-11  2:38         ` Gregory Farnum
  0 siblings, 0 replies; 14+ messages in thread
From: Gregory Farnum @ 2015-12-11  2:38 UTC (permalink / raw)
  To: Martin Millnert; +Cc: Ceph Development

On Mon, Dec 7, 2015 at 7:31 AM, Martin Millnert <martin@millnert.se> wrote:
> On Mon, 2015-12-07 at 06:48 -0800, Gregory Farnum wrote:
> <snip>
>> >> I'm probably just being dense here, but I don't quite understand what
>> >> all this is trying to accomplish. It looks like it's essentially
>> >> trying to set up VLANs (with different rules) over a single physical
>> >> network interface, that is still represented to userspace as a single
>> >> device with a single IP. Is that right?
>> >
>> > That's almost what it is, with two differences:
>> >  1) there are separated route tables per VLAN,
>> >  2) Each VLAN interface (public, cluster) has its own address.
>>
>> Okay, but if each interface has its own interface, why do you need
>> Ceph to do anything at all? You can specify the public and cluster
>> addresses, they'll bind to the appropriate interface, and then you can
>> do stuff based on the interface/VLAN it's part of. Right?
>> -Greg
>
> Yes. And in the generic case: almost good enough.
>
> In the case I'm discussing, with also separate Linux kernel routing
> tables, we need to steer the route lookups that happens once the tcp
> stack has performed its packetization, into the correct table.
>
> Depending on how Ceph behaves with interface/IP binding for outbound
> connections, this may be easy!
> I.e. if Ceph binds to the specific address, not only on the listening
> socket, but also when creating outbound sockets, we can create "ip
> rule"'s that uses the source address and AFAIU "we're home" - at this
> level.
> Do you know if this is how Ceph manages the sockets in this case?
>
> But if we instead end up with the kernel trying to figure which address
> to use ( https://tools.ietf.org/html/rfc6724 ), it gets a whole lot
> trickier.

Mmm, yeah, I think it makes the kernel pick each time. But if you've
got the same IP on both I would hope that's not a problem. I haven't
dug into it though. :/
-Greg



>
> For monitors that live only on the public network (as per
> documentation), the situation is simpler; we can mark traffic outside of
> Ceph using e.g. iptables.
>
> /Martin
>


* Re: OSD public / cluster network isolation using VRF:s
  2015-12-07 14:10       ` Sage Weil
  2015-12-07 15:50         ` Martin Millnert
@ 2015-12-14 18:31         ` Kyle Bader
  1 sibling, 0 replies; 14+ messages in thread
From: Kyle Bader @ 2015-12-14 18:31 UTC (permalink / raw)
  To: Sage Weil; +Cc: Martin Millnert, wido, Ceph Development

On Mon, Dec 7, 2015 at 6:10 AM, Sage Weil <sage@newdream.net> wrote:
> On Mon, 7 Dec 2015, Martin Millnert wrote:
>> > Note that on a largish cluster the public/client traffic is all
>> > north-south, while the backend traffic is also mostly north-south to the
>> > top-of-rack and then east-west.  I.e., within the rack, almost everything
>> > is north-south, and client and replication traffic don't look that
>> > different.
>>
>> This problem domain is one of the larger challenges. I worry about
>> network timeouts for critical cluster traffic in one of the clusters due
>> to hosts having 2x1GbE. I.e. in our case I want to
>> prioritize/guarantee/reserve a minimum amount of bandwidth for cluster
>> health traffic primarily, and secondarily cluster replication. Client
>> write replication should then be least prioritized.
>
> One word of caution here: the health traffic should really be the
> same path and class of service as the inter-osd traffic, or else it
> will not identify failures.  e.g., if the health traffic is prioritized,
> and lower-priority traffic is starved/dropped, we won't notice.
>
>> To support this I need our network equipment to perform the CoS job, and
>> in order to do that at some level in the stack I need to be able to
>> classify traffic. And furthermore, I'd like to do this with as little
>> added state as possible.
>
> I seem to recall a conversation a year or so ago about tagging
> stream/sockets so that the network layer could do this.  I don't think
> we got anywhere, though...

We talked about it, I think this was the resulting issue that was opened:

http://tracker.ceph.com/issues/12260

-- 

Kyle Bader


Thread overview: 14+ messages
2015-12-03 20:13 OSD public / cluster network isolation using VRF:s Martin Millnert
2015-12-03 21:03 ` wido
2015-12-03 21:30   ` Sage Weil
2015-12-07 13:41     ` Martin Millnert
2015-12-07 14:10       ` Sage Weil
2015-12-07 15:50         ` Martin Millnert
2015-12-07 17:01           ` Robert LeBlanc
2015-12-14 18:31         ` Kyle Bader
2015-12-07 13:31   ` Martin Millnert
2015-12-03 21:25 ` Gregory Farnum
2015-12-07 13:36   ` Martin Millnert
2015-12-07 14:48     ` Gregory Farnum
2015-12-07 15:31       ` Martin Millnert
2015-12-11  2:38         ` Gregory Farnum
