netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [v2 PATCH 0/2] NETFILTER new target module, HMARK
@ 2011-10-03 17:46 Hans Schillstrom
  2011-10-03 17:46 ` [v2 PATCH 1/2] NETFILTER module xt_hmark new target for HASH based fw Hans Schillstrom
  2011-10-03 17:46 ` [v2 PATCH 2/2] NETFILTER userspace part for target HMARK Hans Schillstrom
  0 siblings, 2 replies; 14+ messages in thread
From: Hans Schillstrom @ 2011-10-03 17:46 UTC (permalink / raw)
  To: kaber, pablo, jengelh, netfilter-devel, netdev; +Cc: hans, Hans Schillstrom

The target allows you to create rules in the "raw" and "mangle" tables
which alter the netfilter mark (nfmark) field within a given range.
First a 32 bit hash value is generated then modulus by <limit> and
finally an offset is added before it's written to nfmark.
Prior to routing, the nfmark can influence the routing method (see
"Use netfilter MARK value as routing key") and can also be used by
other subsystems to change their behaviour.

The mark match can also be used to match nfmark produced by this module.
See the kernel module for more info.

REVISION

Version 2
	NAT Added for IPv4
	IPv6 ICMP handling enhanced.
	Usage example added

Version 1
	Initial RFC


We (Ericsson) use hmark in-front of ipvs as a pre-loadbalancer and
handles up to 70 ipvs running in parallel in clusters.
However hmark is not restricted to run infront of IPVS it can also be used as
"poor mans" load balancer.
With this version is also NAT supported as an option, with very high flows
you might not want to use conntrack.

The idea is to generate a direction independent fw mark range to use as input to
the routing (i.e. ip rule add fwmark ...).
Pretty straight forward and simple.


Example:
                                      App Server (Real Server)

                                           +---------+
                                        -->| Service |
     Gateway A                             +---------+
                          /
            +----------+ /     +----+      +---------+
--- if -A---| selector |---->  |ipvs|  --->| Service |
            +----------+ \     +----+      +---------+
                          \
                               +----+      +---------+
                               |ipvs|   -->| Service |
                               +----+      +---------+
      Gateway C
            +----------+ /     +----+
--- if-B ---| selector | --->  |ipvs|
            +----------+ \     +----+      +---------+
                                           | Service |
                                           +---------+
                          /
            +----------+ /     +----+     ..
--- if-B ---| selector | --->  |ipvs|      +---------+
            +----------+ \     +----+      | Service |
                          \                +---------+
#
# Example with four ipvs loadbalancers
#
iptables -t mangle -I PREROUTING -d $IPADDR -j HMARK --hmark-mod 4 --hmark-offs 100

ip rule add fwmark 100 table 100
ip rule add fwmark 101 table 101
ip rule add fwmark 102 table 102
ip rule add fwmark 103 table 103

ip ro ad table 100 default via x.y.z.1 dev bond1
ip ro ad table 101 default via x.y.z.2 dev bond1
ip ro ad table 102 default via x.y.z.3 dev bond1
ip ro ad table 103 default via x.y.z.4 dev bond1


If conntrack doesn't handle the return path,
do the oposite with HMARK and send it back right to ipvs.

Another exmaple of usage could be if you have cluster originated connections
and want to spread the connections over a number of interfaces
(NAT will complpicate things for you in this case)



                     \  Blade 1
                      \ +----------+      +---------+
                    <-- | selector | <--- | Service |
                      / +----------+      +---------+
                     /
   +------+
-- | Gw-A |          \  Blade 2
   +------+           \ +----------+      +---------+
   +------+         <-- | selector | <--- | Service |
-- | Gw-B |           / +----------+      +---------+
   +------+          /
   +------+
-- | Gw-C |          \
   +------+           \ +----------+      +---------+
                    <-- | selector | <--- | Service |
                      / +----------+      +---------+
                     /

                     \  Blande -n
                      \ +----------+      +---------+
                    <-- | selector | <--- | Service |
                      / +----------+      +---------+
                     /

Regards
Hans Schillstrom <hans.schillstrom@ericsson.com>


^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re[2]:  [v2 PATCH 1/2] NETFILTER module xt_hmark new target for HASH based fw
@ 2011-11-07 23:29 Hans Schillstrom
  2011-11-08 10:51 ` Pablo Neira Ayuso
  0 siblings, 1 reply; 14+ messages in thread
From: Hans Schillstrom @ 2011-11-07 23:29 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: Hans Schillstrom, kaber, jengelh, netfilter-devel, netdev

Hello Pablo
>
>Hi Hans,
>
>On Mon, Oct 03, 2011 at 07:46:42PM +0200, Hans Schillstrom wrote:
>> diff --git a/include/linux/netfilter/xt_hmark.h b/include/linux/netfilter/xt_hmark.h
>> new file mode 100644
>> index 0000000..6c1436a
>> --- /dev/null
>> +++ b/include/linux/netfilter/xt_hmark.h
[snip]
>>  
>> +config NETFILTER_XT_TARGET_HMARK
>
>New config option has to go in alphabetic order (this one should go
>after NETFILTER_XT_TARGET_HL).

Oops, I'll fix that

[snip]
>> +/*
>> + * ICMP, get inner header so calc can be made on the source message
>> + *       not the icmp header, i.e. same hash mark must be produced
>> + *       on an icmp error message.
>> + */
>> +static int get_inner_hdr(struct sk_buff *skb, int iphsz, int nhoff)
>
>This looks very similar to icmp_error in nf_conntrack_proto_icmp.c.
>Yours lacks of checksumming validation btw.

Thanks I'll add that.

>
>I'm trying to find some place where we can put this function to make
>it available for both nf_conntrack_ipv4 and your module (to avoid code
>redundancy), but I didn't find any so far.
>
>It would be nice to find some way to avoid duplicating code with
>similar functionality.
>

I do agree, I've also been searching for some icmp "lib" to use...

[snip] 
>
>extra space not required.

OK
>
>> +	/* Error message? */
>> +	if (icmph->type != ICMP_DEST_UNREACH &&
>> +	    icmph->type != ICMP_SOURCE_QUENCH &&
>> +	    icmph->type != ICMP_TIME_EXCEEDED &&
>> +	    icmph->type != ICMP_PARAMETERPROB &&
>> +	    icmph->type != ICMP_REDIRECT)
>> +		goto out;
>> +	/* Checkin full IP header plus 8 bytes of protocol to
>> +	 * avoid additional coding at protocol handlers.
>> +	 */
>> +	if (!pskb_may_pull(skb, nhoff + iphsz + sizeof(_ih) + 8))
>> +		goto out;
>
>We prefer skb_header_pointer instead. If conntrack is enabled, we can
>benefit from defragmention. 

In  our case conntrack will not be there

>Please, replace all pskb_may_pull by skb_header_pointer in this code.
>
>We can assume that the IP header is linear (not fragmented).

I ran in to this issue in IPv6 testing so I got a little bit "paranoid".
Are you sure that the embedded IP and L4 header in the ICMP msg also is unfragmented.  
Is this true for both IPv6 & IPv4 ?

>From what I remember  when I was testing IPv6  icmp and digged into the original header (on a 2.6.32 kernel)  
pskb_may_pull was needed.

[snip]

>> +/*
>> + * Calc hash value, special casre is taken on icmp and fragmented messages
>> + * i.e. fragmented messages don't use ports.
>> + */
>> +static __u32 get_hash(struct sk_buff *skb, struct xt_hmark_info *info)
>
>This function seems to big to me, please, split it into smaller
>chunks, like get_hash_ipv4, get_hash_ipv6 and get_hash_ports.
>

Good catch I'll do that,

>> +{
>> +	int nhoff, hash = 0, poff, proto, frag = 0;
>> +	struct iphdr *ip;
>> +	u8 ip_proto;
>> +	u32 addr1, addr2, ihl;
>> +	u16 snatport = 0, dnatport = 0;
>> +	union {
>> +		u32 v32;
>> +		u16 v16[2];
>> +	} ports;
>> +
>> +	nhoff = skb_network_offset(skb);
>> +	proto = skb->protocol;
>> +
>> +	if (!proto && skb->sk) {
>> +		if (skb->sk->sk_family == AF_INET)
>> +			proto = __constant_htons(ETH_P_IP);
>> +		else if (skb->sk->sk_family == AF_INET6)
>> +			proto = __constant_htons(ETH_P_IPV6);
>
>You already have the layer3 protocol number in xt_action_param. No
>need to use the socket information then.

When splitting get_hash()  above  will  be removed,  xt_action_param ->family will be used for selection.

[snip]
>> +
>> +		if (!ct || !nf_ct_is_confirmed(ct))
>
>You seem to (ab)use nf_ct_is_confirmed to make sure you're not in the
>original direction. Better use the direction that you get by means of
>nf_ct_get.
>
I'm not sure I follow you here ?

>> +			break;
>> +		otuple = &ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple;
>> +		/* On the "return flow", to get the original address
>> +		 * i,e, replace the source address.
>> +		 */
>> +		if (ct->status & IPS_DST_NAT &&
>> +		    info->flags & XT_HMARK_USE_DNAT) {
>> +			rtuple = &ct->tuplehash[IP_CT_DIR_REPLY].tuple;
>> +			addr1 = (__force u32) otuple->dst.u3.in.s_addr;
>> +			dnatport = otuple->dst.u.udp.port;
>> +		}
>> +		/* On the "return flow", to get the original address
>> +		 * i,e, replace the destination address.
>> +		 */
>> +		if (ct->status & IPS_SRC_NAT &&
>> +		    info->flags & XT_HMARK_USE_SNAT) {
>> +			rtuple = &ct->tuplehash[IP_CT_DIR_REPLY].tuple;
>> +			addr2 = (__force u32) otuple->src.u3.in.s_addr;
>> +			snatport = otuple->src.u.udp.port;
>> +		}
>> +		break;
>> +	}

[snip]

>> +			case NEXTHDR_NONE:
>> +				nhoff += hdrlen;
>> +				goto hdr_rdy;
>> +			default:
>> +				goto done;
>
>This goto doesn't make too much sense to me, better return 0.

hmmm 
kind of left overs,  Actually all "goto done" can be replaced by return 0
  
[snip]

>> +done:
>> +	return 0;
>> +}
>
>I'll try to find more time to look into this. Specifically, I want to
>review the IPv6 bits more carefully.

The IPv6 header recursion is not obvious, and it's hard to test all cases :-)

I really appreciate you review

Thanks
Hans



^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re[2]:  [v2 PATCH 1/2] NETFILTER module xt_hmark new target for HASH based fw
@ 2011-11-08 15:12 Hans Schillstrom
  2011-11-09 14:39 ` Pablo Neira Ayuso
  0 siblings, 1 reply; 14+ messages in thread
From: Hans Schillstrom @ 2011-11-08 15:12 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: Hans Schillstrom, kaber, jengelh, netfilter-devel, netdev

>
>On Tue, Nov 08, 2011 at 12:29:53AM +0100, Hans Schillstrom wrote:
>> >We prefer skb_header_pointer instead. If conntrack is enabled, we can
>> >benefit from defragmention. 
>> 
>> In  our case conntrack will not be there
>
>Yes, but if conntrack is there, we benefit from fragment reassembly if
>you use skb_header_pointer.
>
>> >Please, replace all pskb_may_pull by skb_header_pointer in this code.
>> >
>> >We can assume that the IP header is linear (not fragmented).
>> 
>> I ran in to this issue in IPv6 testing so I got a little bit "paranoid".
>> Are you sure that the embedded IP and L4 header in the ICMP msg also is unfragmented.  
>> Is this true for both IPv6 & IPv4 ?
>
>No sorry. I was refering to normal IP header in one packet.
>
>> From what I remember  when I was testing IPv6  icmp and digged into the original header (on a 2.6.32 kernel)  
>> pskb_may_pull was needed.
>
>Yes, it is indeed needed.
>
>> [snip]

[snip]


>
>Welcome, let's see if we can get this into 3.3 since we cannot make it
>for 3.2.
>
>BTW, do you have some number of this running with and without
>conntrack? It would be interesting to have.

I didn't save them,  but I can make a new benchmark later on.

Regards
Hans





^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2011-11-16 10:50 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-10-03 17:46 [v2 PATCH 0/2] NETFILTER new target module, HMARK Hans Schillstrom
2011-10-03 17:46 ` [v2 PATCH 1/2] NETFILTER module xt_hmark new target for HASH based fw Hans Schillstrom
2011-11-07  0:52   ` Pablo Neira Ayuso
2011-11-07  3:36     ` Jan Engelhardt
2011-10-03 17:46 ` [v2 PATCH 2/2] NETFILTER userspace part for target HMARK Hans Schillstrom
2011-11-07  0:55   ` Pablo Neira Ayuso
2011-11-07 23:29 Re[2]: [v2 PATCH 1/2] NETFILTER module xt_hmark new target for HASH based fw Hans Schillstrom
2011-11-08 10:51 ` Pablo Neira Ayuso
2011-11-13 17:05   ` Pablo Neira Ayuso
2011-11-14  9:19     ` Hans Schillstrom
2011-11-14 11:38       ` Jan Engelhardt
2011-11-15 10:01         ` Pablo Neira Ayuso
2011-11-08 15:12 Re[2]: " Hans Schillstrom
2011-11-09 14:39 ` Pablo Neira Ayuso
2011-11-16  9:28   ` Hans Schillstrom
2011-11-16 10:50     ` Pablo Neira Ayuso

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).