netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [v4 PATCH 0/2] NETFILTER new target module, HMARK
@ 2011-11-25  9:36 Hans Schillstrom
  2011-11-25  9:36 ` [v4 PATCH 1/2] NETFILTER module xt_hmark, new target for HASH based fwmark Hans Schillstrom
  2011-11-25  9:36 ` [v4 PATCH 2/2] NETFILTER userspace part for target HMARK Hans Schillstrom
  0 siblings, 2 replies; 19+ messages in thread
From: Hans Schillstrom @ 2011-11-25  9:36 UTC (permalink / raw)
  To: kaber, pablo, jengelh, netfilter-devel, netdev; +Cc: hans.schillstrom

From: Hans Schillstrom <hans.schillstrom@ericsson.com>

The target allows you to create rules in the "raw" and "mangle" tables
which alter the netfilter mark (nfmark) field within a given range.
First a 32 bit hash value is generated then modulus by <limit> and
finally an offset is added before it's written to nfmark.
Prior to routing, the nfmark can influence the routing method (see
"Use netfilter MARK value as routing key") and can also be used by
other subsystems to change their behavior.

The mark match can also be used to match nfmark produced by this module.
See the kernel module for more info.

REVISION
Version 4
	Split of IPv6 and IPv4, use IP_CT_IS_REPLY, as Pablo suggested.
	removed one pskb_may_pull()
	xtoption parse used in the user space part.

Version 3
        Handling of SCTP for IPv6 added.

Version 2
	NAT Added for IPv4
	IPv6 ICMP handling enhanced.
	Usage example added

Version 1
	Initial RFC


We (Ericsson) use hmark in-front of ipvs as a pre-loadbalancer and
handles up to 70 ipvs running in parallel in clusters.
However hmark is not restricted to run in front of IPVS it can also be used as
"poor mans" load balancer.
With this version is also NAT supported as an option, with very high flows
you might not want to use conntrack.

The idea is to generate a direction independent fw mark range to use as input to
the routing (i.e. ip rule add fwmark ...).
Pretty straight forward and simple.


Example:
                                      App Server (Real Server)

                                           +---------+
                                        -->| Service |
     Gateway A                             +---------+
                          /
            +----------+ /     +----+      +---------+
--- if -A---| selector |---->  |ipvs|  --->| Service |
            +----------+ \     +----+      +---------+
                          \
                               +----+      +---------+
                               |ipvs|   -->| Service |
                               +----+      +---------+
      Gateway C
            +----------+ /     +----+
--- if-B ---| selector | --->  |ipvs|
            +----------+ \     +----+      +---------+
                                           | Service |
                                           +---------+
                          /
            +----------+ /     +----+     ..
--- if-B ---| selector | --->  |ipvs|      +---------+
            +----------+ \     +----+      | Service |
                          \                +---------+
#
# Example with four ipvs loadbalancers
#
iptables -t mangle -I PREROUTING -d $IPADDR -j HMARK --hmark-mod 4 --hmark-offs 100

ip rule add fwmark 100 table 100
ip rule add fwmark 101 table 101
ip rule add fwmark 102 table 102
ip rule add fwmark 103 table 103

ip ro ad table 100 default via x.y.z.1 dev bond1
ip ro ad table 101 default via x.y.z.2 dev bond1
ip ro ad table 102 default via x.y.z.3 dev bond1
ip ro ad table 103 default via x.y.z.4 dev bond1


If conntrack doesn't handle the return path,
do the oposite with HMARK and send it back right to ipvs.

Another exmaple of usage could be if you have cluster originated connections
and want to spread the connections over a number of interfaces
(NAT will complpicate things for you in this case)



                     \  Blade 1
                      \ +----------+      +---------+
                    <-- | selector | <--- | Service |
                      / +----------+      +---------+
                     /
   +------+
-- | Gw-A |          \  Blade 2
   +------+           \ +----------+      +---------+
   +------+         <-- | selector | <--- | Service |
-- | Gw-B |           / +----------+      +---------+
   +------+          /
   +------+
-- | Gw-C |          \
   +------+           \ +----------+      +---------+
                    <-- | selector | <--- | Service |
                      / +----------+      +---------+
                     /

                     \  Blande -n
                      \ +----------+      +---------+
                    <-- | selector | <--- | Service |
                      / +----------+      +---------+
                     /

Regards
Hans Schillstrom <hans.schillstrom@ericsson.com>


^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re[2]:  [v4 PATCH 1/2] NETFILTER module xt_hmark, new target for HASH based fwmark
@ 2011-11-28  9:36 Hans Schillstrom
  2011-11-30 15:27 ` Patrick McHardy
  0 siblings, 1 reply; 19+ messages in thread
From: Hans Schillstrom @ 2011-11-28  9:36 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: kaber, jengelh, netfilter-devel, netdev, hans.schillstrom

>
>On Fri, Nov 25, 2011 at 10:36:26AM +0100, Hans Schillstrom wrote:
>> diff --git a/include/net/ipv6.h b/include/net/ipv6.h
>> index 3f0258d..9e4d4f9 100644
>> --- a/include/net/ipv6.h
>> +++ b/include/net/ipv6.h
>> @@ -39,6 +39,7 @@
>>  #define NEXTHDR_ICMP		58	/* ICMP for IPv6. */
>>  #define NEXTHDR_NONE		59	/* No next header */
>>  #define NEXTHDR_DEST		60	/* Destination options header. */
>> +#define NEXTHDR_SCTP		132	/* Stream Control Transport Protocol */
>>  #define NEXTHDR_MOBILITY	135	/* Mobility header. */
>>  
>>  #define NEXTHDR_MAX		255
>
>This has to go in a separated patch. Please, send it to netdev. I
>think davem can pick that for 3.2-rc

I will send this tiny one to Dave

[snip]

>> +static int get_inner_hdr(struct sk_buff *skb, int iphsz, int nhoff)
>> +{
>> +	const struct icmphdr *icmph;
>> +	struct icmphdr _ih;
>> +	struct iphdr *iph = NULL;
>> +
>> +	/* Not enough header? */
>> +	icmph = skb_header_pointer(skb, nhoff + iphsz, sizeof(_ih), &_ih);
>> +	if (icmph == NULL)
>> +		return nhoff;
>> +
>> +	if (icmph->type > NR_ICMP_TYPES)
>> +		return nhoff;
>> +
>> +	/* Error message? */
>> +	if (icmph->type != ICMP_DEST_UNREACH &&
>> +	    icmph->type != ICMP_SOURCE_QUENCH &&
>> +	    icmph->type != ICMP_TIME_EXCEEDED &&
>> +	    icmph->type != ICMP_PARAMETERPROB &&
>> +	    icmph->type != ICMP_REDIRECT)
>> +		return nhoff;
>> +	/* Checkin full IP header plus 8 bytes of protocol to
>> +	 * avoid additional coding at protocol handlers.
>> +	 */
>> +	if (!pskb_may_pull(skb, nhoff + iphsz + sizeof(_ih) + 8))
>> +		return nhoff;
>
>skb_header_pointer again here, if conntrack is enabled, we can benefit
>from handling fragments.
>
Yes, but I can't asume that conntrack is there.

[snip]

>> +/*
>> + * Calc hash value, special casre is taken on icmp and fragmented messages
>> + * i.e. fragmented messages don't use ports.
>> + */
>> +__u32 hmark_v6(struct sk_buff *skb, const struct xt_action_param *par)
>> +{
>> +	struct xt_hmark_info *info = (struct xt_hmark_info *)par->targinfo;
>> +	int nhoff, poff, hdrlen;
>> +	u32 addr1, addr2, hash;
>> +	struct ipv6hdr *ip6;
>> +	u8 nexthdr;
>> +	int frag = 0, ip6hdrlvl = 0;	/* Header level */
>> +	struct ipv6_opt_hdr _hdr, *hp;
>> +	union {
>> +		u32 v32;
>> +		u16 v16[2];
>> +	} ports;
>> +
>> +	ports.v32 = 0;
>> +	nhoff = skb_network_offset(skb);
>> +
>> +hdr_new:
>> +	/* Get header info */
>> +	ip6 = (struct ipv6hdr *) (skb->data + nhoff);
>> +	nexthdr = ip6->nexthdr;
>> +	hdrlen = sizeof(struct ipv6hdr);
>> +	hp = skb_header_pointer(skb, nhoff + hdrlen, sizeof(_hdr), &_hdr);
>
>you have to check return value of skb_header_pointer here.

This  is actually where next header will start if any, not the current one that is processed. 
It's not time for checking it now. The check is done before using it further down.

>
>> +	while (nexthdr) {
>> +		switch (nexthdr) {
>> +		case IPPROTO_ICMPV6:
>> +			/* ICMP Error then move ptr to inner header */
>> +			if (get_inner6_hdr(skb, &nhoff, hdrlen)) {
>> +				ip6hdrlvl++;
>> +				if (!pskb_may_pull(skb, sizeof(_hdr) + nhoff))
>> +					return XT_CONTINUE;
>> +				goto hdr_new;
>> +			}
>> +			nhoff += hdrlen;
>> +			goto hdr_rdy;
>> +
>> +		case NEXTHDR_FRAGMENT:
>> +			if (!ip6hdrlvl) /* Do not use ports if fragmented */
>> +				frag = 1;
>> +			break;
>> +
>> +		/* End of hdr traversing cont. with ports and hash calc. */
>> +		case NEXTHDR_IPV6:	/* Do not process tunnels */
>> +		case NEXTHDR_TCP:
>> +		case NEXTHDR_UDP:
>> +		case NEXTHDR_ESP:
>> +		case NEXTHDR_AUTH:
>> +		case NEXTHDR_SCTP:
>> +		case NEXTHDR_NONE:	/* Last hdr of something unknown */
>> +			nhoff += hdrlen;
>> +			goto hdr_rdy;
>> +		default:
>> +			return XT_CONTINUE;
>> +		}
>> +		if (!hp)
>> +			return XT_CONTINUE;
>> +		nhoff += hdrlen;	/* eat current header */
>> +		nexthdr =  hp->nexthdr;	/* Next header */
>> +		hdrlen = ipv6_optlen(hp);
>> +		hp = skb_header_pointer(skb, nhoff + hdrlen, sizeof(_hdr),
>> +					&_hdr);
>
>same here.
>
>> +		if (!pskb_may_pull(skb, nhoff))
>
>why this after skb_header_pointer?

hp is used for next turn, but you got a point here  I'll swap them 

>
>[... trimmed off ...]
>>       poff = proto_ports_offset(ip_proto);
>>       nhoff += ip->ihl * 4 + poff;
>>       if (frag || poff < 0 || !pskb_may_pull(skb, nhoff + 4))
>>               goto noports;
>>
>>       ports.v32 = * (__force u32 *) (skb->data + nhoff);
>>       if (ip_proto == IPPROTO_ESP || ip_proto == IPPROTO_AH) {
>>               ports.v32 = (ports.v32 & info->spimask) | nfo->spiset;
>>       } else {
>>               if (snatport)   /* Replace nat'ed port(s) */
>>                       ports.v16[1] = snatport;
>>               if (dnatport)
>>                       ports.v16[0] = dnatport;
>>               ports.v32 = (ports.v32 & info->pmask.v32) |
>>                               info->pset.v32;
>>               if (ports.v16[1] < ports.v16[0])
>>                       swap(ports.v16[0], ports.v16[1]);
>>       }
>>
>>noports:
>>       ip_proto &= info->prmask;
>>       /* get a consistent hash (same value on both flow directions)/
>>       if (addr2 < addr1)
>>               swap(addr1, addr2);
>>
>>       hash = jhash_3words(addr1, addr2, ports.v32, info->hashrnd) ^ p_proto;
>>       if (info->hmod)
>>               skb->mark = (hash % info->hmod) + info->hoffs;
>>       return XT_CONTINUE;
>> }
>
>Hm, I think the fragmentation handling is broken.
>
>Say that the first fragment contains the transport header
>header, then the mark is calculated based on the address and ports.
>Then, later on fragments will receive the mark based on the network
>header only. They may have different marks.

I think the documentation need improvments :-)
Fragments  never try to use ports.
"    if (frag)
         goto noports;"

>
>If you don't want to use conntrack in your setup and you want to handle
>fragments, then you have to configure HMARK to calculate the hashing
>based on the network addresses. If you want to fully support fragments,
>then enable conntrack and you can configure HMARK to calculate the
>hashing based on network address + transport bits.
>
>Fix this by removing the fragmentation handling, then assume that
>people can select between two hashing configuration for HMARK. One
>based for network address which is fragment-safe, one that uses the
>transport layer information, that requires conntrack. Otherwise, I
>don't see a sane way to handle this situation.

Correct me if I'm wrong here, 
If conntrack is enabled hmark don't see the packet until it is reassembled and 
in that case the fragmentation header is removed.

So, with conntrack HMARK will operate on full packets not fragments
without conntrack ports will not be used on any fragment.

>
>I think this has to be documented in the iptables manpage for HMARK.

The documentation needs update here, this is not crystal clear.

Thanks
Hans


^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re[2]:  [v4 PATCH 1/2] NETFILTER module xt_hmark, new target for HASH based fwmark
@ 2011-12-01 11:05 Hans Schillstrom
  2011-12-01 11:24 ` Patrick McHardy
  0 siblings, 1 reply; 19+ messages in thread
From: Hans Schillstrom @ 2011-12-01 11:05 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: pablo, jengelh, netfilter-devel, netdev, hans.schillstrom

>On 12/01/2011 01:25 AM, Hans Schillstrom wrote:
>> On Wednesday, November 30, 2011 16:51:35 Patrick McHardy wrote:
>>> On 11/25/2011 10:36 AM, Hans Schillstrom wrote:
>>>> +
>>>> +hdr_new:
>>>> +	/* Get header info */
>>>> +	ip6 = (struct ipv6hdr *) (skb->data + nhoff);
>>>> +	nexthdr = ip6->nexthdr;
>>>> +	hdrlen = sizeof(struct ipv6hdr);
>>>> +	hp = skb_header_pointer(skb, nhoff + hdrlen, sizeof(_hdr),&_hdr);
>>>> +
>>>> +	while (nexthdr) {
>>>> +		switch (nexthdr) {
>>>> +		case IPPROTO_ICMPV6:
>>>> +			/* ICMP Error then move ptr to inner header */
>>>> +			if (get_inner6_hdr(skb,&nhoff, hdrlen)) {
>>> This doesn't look right. You assume the ICMPv6 header is following
>>> the IPv6 header with any other headers in between. If there are
>>> other headers, hdrlen will contain the length of the last header.
>>
>> RFC-4443  "Every ICMPv6 message is preceded by an IPv6 header and zero or more IPv6 extension headers."
>> hdrlen is actually previous header length in bytes, to be correct.
>> nhoff is the sum of processed headers.
>> So in case of an icmp the nhoff will be updated, and hdrlen preset to ipv6hdr size
>
>Right, I missed that you're using nhoff + hdrlen in
>get_inner6_hdr().
>
>>>> +				ip6hdrlvl++;
>>>> +				if (!pskb_may_pull(skb, sizeof(_hdr) + nhoff))
>>>> +					return XT_CONTINUE;
>>>> +				goto hdr_new;
>>>> +			}
>>>> +			nhoff += hdrlen;
>>>> +			goto hdr_rdy;
>>>> +
>>>> +		case NEXTHDR_FRAGMENT:
>>>> +			if (!ip6hdrlvl) /* Do not use ports if fragmented */
>>>> +				frag = 1;
>>> Shouldn't you also check for fragment offset == 0 here?
>> According to the RFC "Initialized to zero for transmission; ignored on reception"
>
>No, what I meant is that for the first fragment, you do
>have the upper layer header available. But as we already
>discussed for a stable identifier you want to ignore it
>anyways.
>
>>>> +		case NEXTHDR_TCP:
>>>> +		case NEXTHDR_UDP:
>>>> +		case NEXTHDR_ESP:
>>>> +		case NEXTHDR_AUTH:
>>> Don't you want to use the port numbers if only authentication
>>> without encryption is used?
>> with esp or ah the SPI will be used instead of ports.
>> Useful or not I don't know since they are asymmetric in terms of a flow.
>
>Yes, but with AH you could either use the ESP SPI or if no ESP
>is used the port numbers of the upper layer protocol.
>

The intention was to treat ESP & AH in the same way,
but as you say why not use the upper layer 
 
>>> And final question, why not simply use ipv6_skip_exthdr()?
>> problems with fragments...
>
>So the probem is that it will return the transport layer protocol
>header for fragments with frag_off == 0? We also have ipv6_find_hdr()
>which we could modify to indicate this in the frag_off pointer.

ipv6_find_hdr() will do the trick with a light modification
What about a wrapper like:

int __ipv6_find_hdr(const struct sk_buff *skb, unsigned int *offset,
		  int target, unsigned short *fragoff,  int *fragflg)
{
...
		if (nexthdr == NEXTHDR_FRAGMENT) {
			unsigned short _frag_off;
			__be16 *fp;

			if (fragflg)
			        fragflg = 1;
			fp = skb_header_pointer(skb,
						start+offsetof(struct frag_hdr,
							       frag_off),
						sizeof(_frag_off),
						&_frag_off);

...
}

int ipv6_find_hdr(const struct sk_buff *skb, unsigned int *offset,
		  int target, unsigned short *fragoff) 
{
        return __ipv6_find_hdr(skb, offset, terget, fragoff, NULL);
}

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re[2]:  [v4 PATCH 1/2] NETFILTER module xt_hmark, new target for HASH based fwmark
@ 2011-12-01 11:39 Hans Schillstrom
  2011-12-01 11:46 ` Patrick McHardy
  0 siblings, 1 reply; 19+ messages in thread
From: Hans Schillstrom @ 2011-12-01 11:39 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: pablo, jengelh, netfilter-devel, netdev, hans.schillstrom

t: Re: [v4 PATCH 1/2] NETFILTER module xt_hmark, new target for HASH based fwmark
>
>On 12/01/2011 12:05 PM, Hans Schillstrom wrote:
>>>>> And final question, why not simply use ipv6_skip_exthdr()?
>>>> problems with fragments...
>>> So the probem is that it will return the transport layer protocol
>>> header for fragments with frag_off == 0? We also have ipv6_find_hdr()
>>> which we could modify to indicate this in the frag_off pointer.
>> ipv6_find_hdr() will do the trick with a light modification
>> What about a wrapper like:
>>
>> int __ipv6_find_hdr(const struct sk_buff *skb, unsigned int *offset,
>> 		  int target, unsigned short *fragoff,  int *fragflg)
>> {
>> ...
>> 		if (nexthdr == NEXTHDR_FRAGMENT) {
>> 			unsigned short _frag_off;
>> 			__be16 *fp;
>>
>> 			if (fragflg)
>> 			        fragflg = 1;
>> 			fp = skb_header_pointer(skb,
>> 						start+offsetof(struct frag_hdr,
>> 							       frag_off),
>> 						sizeof(_frag_off),
>> 						&_frag_off);
>>
>> ...
>> }
>>
>> int ipv6_find_hdr(const struct sk_buff *skb, unsigned int *offset,
>> 		  int target, unsigned short *fragoff)
>> {
>>          return __ipv6_find_hdr(skb, offset, terget, fragoff, NULL);
>> }
>
>Hmm that would require to change all current callers. 

Nope, ipv6_find_hdr()  looks the same,
__ipv6_find_hdr() have an extra param.

> I was more thinking of unconditionally setting *frag_off in case of
>fragments, then you can initialize it to some impossible value
>like 0xffff and determine the presence of a fragment header
>based on its value after calling ipv6_find_hdr().

That's another way  :-)

Which one do you prefer ?

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2011-12-01 11:46 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-11-25  9:36 [v4 PATCH 0/2] NETFILTER new target module, HMARK Hans Schillstrom
2011-11-25  9:36 ` [v4 PATCH 1/2] NETFILTER module xt_hmark, new target for HASH based fwmark Hans Schillstrom
2011-11-25 14:19   ` David Laight
2011-11-25 14:36     ` Eric Dumazet
2011-11-25 14:43   ` Eric Dumazet
2011-11-25 17:36   ` Pablo Neira Ayuso
2011-11-25 18:31     ` Jan Engelhardt
2011-11-30 15:51   ` Patrick McHardy
2011-12-01  0:25     ` Hans Schillstrom
2011-12-01 10:05       ` Patrick McHardy
2011-11-25  9:36 ` [v4 PATCH 2/2] NETFILTER userspace part for target HMARK Hans Schillstrom
2011-11-25 13:20   ` Jan Engelhardt
2011-11-25 14:04     ` Hans Schillström
2011-11-25 15:44       ` Jan Engelhardt
2011-11-28  9:36 Re[2]: [v4 PATCH 1/2] NETFILTER module xt_hmark, new target for HASH based fwmark Hans Schillstrom
2011-11-30 15:27 ` Patrick McHardy
2011-11-30 18:28   ` Pablo Neira Ayuso
2011-12-01  0:52     ` Hans Schillstrom
2011-12-01 11:05 Re[2]: " Hans Schillstrom
2011-12-01 11:24 ` Patrick McHardy
2011-12-01 11:39 Re[2]: " Hans Schillstrom
2011-12-01 11:46 ` Patrick McHardy

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).