* dummy as IMQ replacement
@ 2005-01-30 22:12 Jamal Hadi Salim
  2005-01-31  8:20 ` Hasso Tepper
                   ` (5 more replies)
  0 siblings, 6 replies; 126+ messages in thread
From: Jamal Hadi Salim @ 2005-01-30 22:12 UTC (permalink / raw)
  To: netdev
  Cc: Nguyen Dinh Nam, Remus, Andre Tomt, syrius.ml, Andy Furniss,
	Damion de Soto

[-- Attachment #1: Type: text/plain, Size: 7026 bytes --]


This is about providing the functionality that IMQ was intended to
provide, using the dummy device and tc actions. I've copied as many
people as I could dig up who I know may have an interest in this.
Please forward this to any other list which may have an interest
in the subject. It still needs some cleaning up; however, I don't wanna
sit on it for another year - and now that mirred is out there, this is a
good time.

Advantages over current IMQ: cleaner, in particular on SMP,
with a _lot_ less code.
Old dummy device functionality is preserved while the new one only
kicks in if you use actions. Didn't have to write a new device and finally
made a real dumb device a little smarter ;->

IMQ USES
--------
As far as I know, the reasons listed below are why people use IMQ.
It would be nice to know of anything else that I missed, because this
is the requirements list I used.

1) qdiscs/policies that are per device as opposed to system wide.
IMQ allows for sharing across multiple devices.

2) Allows for queueing incoming traffic for shaping instead of
dropping. I am not aware of any study that shows policing is 
worse than shaping in achieving the end goal of rate control.
I would be interested if anyone is experimenting. Nevertheless,
this is still an alternative as opposed to making a system wide
ingress change.

3) Very interesting use: if you are serving p2p you may wanna give 
preference to your own locally originated traffic (when responses come
back) vs someone using your system to do bittorrent. So QoSing based on
state comes in as the solution. What people did to achieve this was stick
the IMQ somewhere around the prelocal hook.
I think this is a pretty neat feature to have in Linux in general
(i.e. not just for IMQ).
But I won't go back to putting netfilter hooks in the device to satisfy
this. I also don't think it's worth hacking dummy some more to be
aware of, say, L3 info and play ip rule tricks to achieve this.
--> Instead the plan is to have a conntrack related action. This action
will selectively either query/create conntrack state on incoming packets.
Packets could then be redirected to dummy based on what happens -> e.g.
on incoming packets, if we find they are of known state we could send to
a different queue than one which didn't have existing state. This
all however is dependent on whatever rules the admin enters.

What you can do with dummy currently with actions
--------------------------------------------------

Let's say you are policing packets from alias 192.168.200.200/32 and
you don't want those to exceed 100kbps going out.

tc filter add dev eth0 parent 1: protocol ip prio 10 u32 \
match ip src 192.168.200.200/32 flowid 1:2 \
action police rate 100kbit burst 90k drop

If you run tcpdump on eth0 you will see all packets going out
with src 192.168.200.200/32, dropped or not.
Extend the rule a little to see only the ones that made it out:

tc filter add dev eth0 parent 1: protocol ip prio 10 u32 \
match ip src 192.168.200.200/32 flowid 1:2 \
action police rate 10kbit burst 90k drop \
action mirred egress mirror dev dummy0 

Now fire tcpdump on dummy0 to see only those packets ..
tcpdump -n -i dummy0 -x -e -t 

Essentially a good debugging/logging interface.

If you replace mirror with redirect, those packets will be
blackholed and will never make it out. This redirect behavior
changes with new patch (but not the mirror). 
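
For reference, the redirect variant might look like the sketch below.
Everything except the "redirect" keyword is copied from the mirror rule
above. The TC dry-run default and the redirect_rule function name are
assumptions added for illustration; they are not part of the original
mail.

```shell
# Dry run by default: the command is only printed.
# Set TC=/sbin/tc (as root) to apply it for real.
TC="${TC:-echo tc}"

# Same filter as the mirror rule, with "mirror" replaced by "redirect".
# Packets that pass the policer are handed to dummy0 instead of being
# copied to it, so they no longer leave through eth0.
redirect_rule() {
    $TC filter add dev eth0 parent 1: protocol ip prio 10 u32 \
        match ip src 192.168.200.200/32 flowid 1:2 \
        action police rate 10kbit burst 90k drop \
        action mirred egress redirect dev dummy0
}

redirect_rule
```

Keeping the command inside a function makes it easy to print it first and
only run it against a real tc binary once it looks right.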


What you can do with dummy and attached patch
----------------------------------------------

Essentially it provides the functionality that most people use IMQ for;
sample below:

--------
export TC="/sbin/tc"

$TC qdisc add dev dummy0 root handle 1: prio 
$TC qdisc add dev dummy0 parent 1:1 handle 10: sfq
$TC qdisc add dev dummy0 parent 1:2 handle 20: tbf rate 20kbit buffer 1600 limit 3000
$TC qdisc add dev dummy0 parent 1:3 handle 30: sfq
$TC filter add dev dummy0 protocol ip pref 1 parent 1: handle 1 fw classid 1:1
$TC filter add dev dummy0 protocol ip pref 2 parent 1: handle 2 fw classid 1:2

ifconfig dummy0 up

$TC qdisc add dev eth0 ingress

# redirect all IP packets arriving in eth0 to dummy0 
# use mark 1 --> puts them onto class 1:1
$TC filter add dev eth0 parent ffff: protocol ip prio 10 u32 \
match u32 0 0 flowid 1:1 \
action ipt -j MARK --set-mark 1 \
action mirred egress redirect dev dummy0

--------


Run a little test:

From another machine, ping so that you have packets going into the box:
-----
[root@jzny action-tests]# ping 10.22
PING 10.22 (10.0.0.22): 56 data bytes
64 bytes from 10.0.0.22: icmp_seq=0 ttl=64 time=2.8 ms
64 bytes from 10.0.0.22: icmp_seq=1 ttl=64 time=0.6 ms
64 bytes from 10.0.0.22: icmp_seq=2 ttl=64 time=0.6 ms

--- 10.22 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.6/1.3/2.8 ms
[root@jzny action-tests]# 
-----
Now look at some stats:

---
[root@jmandrake]:~# $TC -s filter show parent ffff: dev eth0
filter protocol ip pref 10 u32 
filter protocol ip pref 10 u32 fh 800: ht divisor 1 
filter protocol ip pref 10 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:1 
  match 00000000/00000000 at 0
        action order 1: tablename: mangle  hook: NF_IP_PRE_ROUTING 
        target MARK set 0x1  
        index 1 ref 1 bind 1 installed 4195sec  used 27sec 
         Sent 252 bytes 3 pkts (dropped 0, overlimits 0) 

        action order 2: mirred (Egress Redirect to device dummy0) stolen
        index 1 ref 1 bind 1 installed 165 sec used 27 sec
         Sent 252 bytes 3 pkts (dropped 0, overlimits 0) 

[root@jmandrake]:~# $TC -s qdisc
qdisc sfq 30: dev dummy0 limit 128p quantum 1514b 
 Sent 0 bytes 0 pkts (dropped 0, overlimits 0) 
qdisc tbf 20: dev dummy0 rate 20Kbit burst 1575b lat 2147.5s 
 Sent 210 bytes 3 pkts (dropped 0, overlimits 0) 
qdisc sfq 10: dev dummy0 limit 128p quantum 1514b 
 Sent 294 bytes 3 pkts (dropped 0, overlimits 0) 
qdisc prio 1: dev dummy0 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 504 bytes 6 pkts (dropped 0, overlimits 0) 
qdisc ingress ffff: dev eth0 ---------------- 
 Sent 308 bytes 5 pkts (dropped 0, overlimits 0) 

[root@jmandrake]:~# ifconfig dummy0
dummy0    Link encap:Ethernet  HWaddr 00:00:00:00:00:00  
          inet6 addr: fe80::200:ff:fe00:0/64 Scope:Link
          UP BROADCAST RUNNING NOARP  MTU:1500  Metric:1
          RX packets:6 errors:0 dropped:3 overruns:0 frame:0
          TX packets:3 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:32 
          RX bytes:504 (504.0 b)  TX bytes:252 (252.0 b)
-----

Dummy continues to behave like it always did.
If you send it any packet not originating from the actions, it will drop
it.
[In this case the three dropped packets were ipv6 ndisc].

My goal here is to start a discussion to see if people agree this is
a good replacement for IMQ or whether to go another path.
Clearly I would prefer to have this change in, but I am not religious 
and would listen to reason about how it should be done as long as no 
unnecessary clutter happens. 

Patch attached.

cheers,
jamal




[-- Attachment #2: dummy-act-2611rc1 --]
[-- Type: text/plain, Size: 7066 bytes --]

--- a/drivers/net/dummy.c.orig	2004-12-24 16:34:33.000000000 -0500
+++ b/drivers/net/dummy.c	2005-01-18 06:43:47.000000000 -0500
@@ -26,7 +26,14 @@
 			Nick Holloway, 27th May 1994
 	[I tweaked this explanation a little but that's all]
 			Alan Cox, 30th May 1994
+
 */
+/*
+	* This driver isnt abused enough ;->
+	* Here to add only _just_ a _feeew more_ features,
+	* 10 years after AC added comment above ;-> hehe - JHS
+*/
+
 
 #include <linux/config.h>
 #include <linux/module.h>
@@ -35,11 +42,128 @@
 #include <linux/etherdevice.h>
 #include <linux/init.h>
 #include <linux/moduleparam.h>
+#ifdef CONFIG_NET_CLS_ACT
+#include <net/pkt_sched.h> 
+#endif
+
+#define TX_TIMEOUT  (2*HZ)
+                                                                                
+#define TX_Q_LIMIT    32
+struct dummy_private {
+	struct net_device_stats stats;
+#ifdef CONFIG_NET_CLS_ACT
+	struct tasklet_struct   dummy_tasklet;
+	int     tasklet_pending;
+	/* mostly debug stats leave in for now */
+	unsigned long   stat_r1;
+	unsigned long   stat_r2;
+	unsigned long   stat_r3;
+	unsigned long   stat_r4;
+	unsigned long   stat_r5;
+	unsigned long   stat_r6;
+	unsigned long   stat_r7;
+	unsigned long   stat_r8;
+	struct sk_buff_head     rq;
+	struct sk_buff_head     tq;
+#endif
+};
+
+#ifdef CONFIG_NET_CLS_ACT
+static void ri_tasklet(unsigned long dev);
+#endif
+
 
 static int numdummies = 1;
 
 static int dummy_xmit(struct sk_buff *skb, struct net_device *dev);
 static struct net_device_stats *dummy_get_stats(struct net_device *dev);
+static void dummy_timeout(struct net_device *dev);
+static int dummy_open(struct net_device *dev);
+static int dummy_close(struct net_device *dev);
+
+static void dummy_timeout(struct net_device *dev) {
+
+	int cpu = smp_processor_id();
+
+	dev->trans_start = jiffies;
+	printk("%s: BUG tx timeout on CPU %d\n",dev->name,cpu);
+	if (spin_is_locked((&dev->xmit_lock)))
+		printk("xmit lock grabbed already\n");
+	if (spin_is_locked((&dev->queue_lock)))
+		printk("queue lock grabbed already\n");
+}
+
+#ifdef CONFIG_NET_CLS_ACT
+static void ri_tasklet(unsigned long dev) {
+
+	struct net_device *dv = (struct net_device *)dev;
+	struct dummy_private *dp = ((struct net_device *)dev)->priv;
+	struct net_device_stats *stats = &dp->stats;
+	struct sk_buff *skb = NULL;
+
+	dp->stat_r4 +=1;
+	if (NULL == (skb = skb_peek(&dp->tq))) {
+		dp->stat_r5 +=1;
+		if (spin_trylock(&dv->xmit_lock)) {
+			dp->stat_r8 +=1;
+			while (NULL != (skb = skb_dequeue(&dp->rq))) {
+				skb_queue_tail(&dp->tq, skb);
+			}
+			spin_unlock(&dv->xmit_lock);
+		} else {
+	/* reschedule */
+			dp->stat_r1 +=1;
+			goto resched;
+		}
+	}
+
+	while (NULL != (skb = skb_dequeue(&dp->tq))) {
+		__u32 from = G_TC_FROM(skb->tc_verd);
+
+		skb->tc_verd = 0;
+		skb->tc_verd = SET_TC_NCLS(skb->tc_verd);
+		stats->tx_packets++;
+		stats->tx_bytes+=skb->len;
+		if (from & AT_EGRESS) {
+			dp->stat_r6 +=1;
+			dev_queue_xmit(skb);
+		} else if (from & AT_INGRESS) {
+
+			dp->stat_r7 +=1;
+			netif_rx(skb);
+		} else {
+			/* if netfilt is compiled in and packet is
+			tagged, we could reinject the packet back
+			this would make it do remaining 10%
+			of what current IMQ does  
+			if someone really really insists then
+			this is the spot .. jhs */
+			dev_kfree_skb(skb);
+			stats->tx_dropped++;
+		}
+	}
+
+	if (spin_trylock(&dv->xmit_lock)) {
+		dp->stat_r3 +=1;
+		if (NULL == (skb = skb_peek(&dp->rq))) {
+			dp->tasklet_pending = 0;
+		if (netif_queue_stopped(dv))
+			//netif_start_queue(dv);
+			netif_wake_queue(dv);
+		} else {
+			dp->stat_r2 +=1;
+			spin_unlock(&dv->xmit_lock);
+			goto resched;
+		}
+		spin_unlock(&dv->xmit_lock);
+		} else {
+resched:
+			dp->tasklet_pending = 1;
+			tasklet_schedule(&dp->dummy_tasklet);
+		}
+
+}
+#endif
 
 static int dummy_set_address(struct net_device *dev, void *p)
 {
@@ -62,12 +186,17 @@
 	/* Initialize the device structure. */
 	dev->get_stats = dummy_get_stats;
 	dev->hard_start_xmit = dummy_xmit;
+	dev->tx_timeout = &dummy_timeout;
+	dev->watchdog_timeo = TX_TIMEOUT;
+	dev->open = &dummy_open;
+	dev->stop = &dummy_close;
+
 	dev->set_multicast_list = set_multicast_list;
 	dev->set_mac_address = dummy_set_address;
 
 	/* Fill in device structure with ethernet-generic values. */
 	ether_setup(dev);
-	dev->tx_queue_len = 0;
+	dev->tx_queue_len = TX_Q_LIMIT;
 	dev->change_mtu = NULL;
 	dev->flags |= IFF_NOARP;
 	dev->flags &= ~IFF_MULTICAST;
@@ -77,18 +206,64 @@
 
 static int dummy_xmit(struct sk_buff *skb, struct net_device *dev)
 {
-	struct net_device_stats *stats = netdev_priv(dev);
+	struct dummy_private *dp = ((struct net_device *)dev)->priv;
+	struct net_device_stats *stats = &dp->stats;
+	int ret = 0;
 
+	{
 	stats->tx_packets++;
 	stats->tx_bytes+=skb->len;
+	}
+#ifdef CONFIG_NET_CLS_ACT
+	__u32 from = G_TC_FROM(skb->tc_verd);
+	if (!from || !skb->input_dev ) {
+dropped:
+		 dev_kfree_skb(skb);
+		 stats->rx_dropped++;
+		 return ret;
+	} else {
+		if (skb->input_dev)
+			skb->dev = skb->input_dev;
+		else
+			printk("warning!!! no idev %s\n",skb->dev->name);
 
+		skb->input_dev = dev;
+		if (from & AT_INGRESS) {
+			skb_pull(skb, skb->dev->hard_header_len);
+		} else {
+			if (!(from & AT_EGRESS)) {
+				goto dropped;
+			}
+		}
+	}
+	if (skb_queue_len(&dp->rq) >= dev->tx_queue_len) {
+		netif_stop_queue(dev);
+	}
+	dev->trans_start = jiffies;
+	skb_queue_tail(&dp->rq, skb);
+	if (!dp->tasklet_pending) {
+		dp->tasklet_pending = 1;
+		tasklet_schedule(&dp->dummy_tasklet);
+	}
+
+#else
+	stats->rx_dropped++;
 	dev_kfree_skb(skb);
-	return 0;
+#endif
+	return ret;
 }
 
 static struct net_device_stats *dummy_get_stats(struct net_device *dev)
 {
-	return netdev_priv(dev);
+	struct dummy_private *dp = ((struct net_device *)dev)->priv;
+	struct net_device_stats *stats = &dp->stats;
+#ifdef CONFIG_NET_CLS_ACT_DEB
+	printk("tasklets stats %ld:%ld:%ld:%ld:%ld:%ld:%ld:%ld \n",
+		dp->stat_r1,dp->stat_r2,dp->stat_r3,dp->stat_r4,
+		dp->stat_r5,dp->stat_r6,dp->stat_r7,dp->stat_r8);
+#endif
+
+	return stats;
 }
 
 static struct net_device **dummies;
@@ -97,12 +272,41 @@
 module_param(numdummies, int, 0);
 MODULE_PARM_DESC(numdummies, "Number of dummy pseudo devices");
 
+static int dummy_close(struct net_device *dev)
+{
+
+#ifdef CONFIG_NET_CLS_ACT
+	struct dummy_private *dp = ((struct net_device *)dev)->priv;
+
+	tasklet_kill(&dp->dummy_tasklet);
+	skb_queue_purge(&dp->rq);
+	skb_queue_purge(&dp->tq);
+#endif
+	netif_stop_queue(dev);
+	return 0;
+}
+
+static int dummy_open(struct net_device *dev)
+{
+
+#ifdef CONFIG_NET_CLS_ACT
+	struct dummy_private *dp = ((struct net_device *)dev)->priv;
+
+	tasklet_init(&dp->dummy_tasklet, ri_tasklet, (unsigned long)dev);
+	skb_queue_head_init(&dp->rq);
+	skb_queue_head_init(&dp->tq);
+#endif
+	netif_start_queue(dev);
+	return 0;
+}
+
+
 static int __init dummy_init_one(int index)
 {
 	struct net_device *dev_dummy;
 	int err;
 
-	dev_dummy = alloc_netdev(sizeof(struct net_device_stats),
+	dev_dummy = alloc_netdev(sizeof(struct dummy_private),
 				 "dummy%d", dummy_setup);
 
 	if (!dev_dummy)

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: dummy as IMQ replacement
  2005-01-30 22:12 dummy as IMQ replacement Jamal Hadi Salim
@ 2005-01-31  8:20 ` Hasso Tepper
  2005-01-31 12:25   ` jamal
  2005-01-31 13:58 ` Thomas Graf
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 126+ messages in thread
From: Hasso Tepper @ 2005-01-31  8:20 UTC (permalink / raw)
  To: hadi
  Cc: netdev, Nguyen Dinh Nam, Remus, Andre Tomt, syrius.ml,
	Andy Furniss, Damion de Soto

Jamal Hadi Salim wrote:
> 2) Allows for queueing incoming traffic for shaping instead of
> dropping. I am not aware of any study that shows policing is
> worse than shaping in achieving the end goal of rate control.
> I would be interested if anyone is experimenting. Nevertheless,
> this is still an alternative as opposed to making a system wide
> ingress change.

Policing didn't work with IPv6 last time I checked.


-- 
Hasso Tepper
Elion Enterprises Ltd.
WAN administrator


* Re: dummy as IMQ replacement
  2005-01-31  8:20 ` Hasso Tepper
@ 2005-01-31 12:25   ` jamal
  2005-01-31 12:38     ` Hasso Tepper
  0 siblings, 1 reply; 126+ messages in thread
From: jamal @ 2005-01-31 12:25 UTC (permalink / raw)
  To: Hasso Tepper
  Cc: netdev, Nguyen Dinh Nam, Remus, Andre Tomt, syrius.ml,
	Andy Furniss, Damion de Soto

On Mon, 2005-01-31 at 03:20, Hasso Tepper wrote:
> Jamal Hadi Salim wrote:
> > 2) Allows for queueing incoming traffic for shaping instead of
> > dropping. I am not aware of any study that shows policing is
> > worse than shaping in achieving the end goal of rate control.
> > I would be interested if anyone is experimenting. Nevertheless,
> > this is still an alternative as opposed to making a system wide
> > ingress change.
> 
> Policing didn't work with IPv6 last time I checked.

Really? I take it this is using the u32 classifier?
What filter did you use?

cheers,
jamal


* Re: dummy as IMQ replacement
  2005-01-31 12:25   ` jamal
@ 2005-01-31 12:38     ` Hasso Tepper
  2005-01-31 12:47       ` jamal
  0 siblings, 1 reply; 126+ messages in thread
From: Hasso Tepper @ 2005-01-31 12:38 UTC (permalink / raw)
  To: hadi
  Cc: netdev, Nguyen Dinh Nam, Remus, Andre Tomt, syrius.ml,
	Andy Furniss, Damion de Soto

jamal wrote:
> On Mon, 2005-01-31 at 03:20, Hasso Tepper wrote:
> > Policing didn't work with IPv6 last time I checked.
>
> Really? I take it this is using the u32 classifier?
> What filter did you use?

http://mailman.ds9a.nl/pipermail/lartc/2004q2/012422.html

Got one answer to this in private that "AFAIK it isn't implemented yet".

-- 
Hasso Tepper
Elion Enterprises Ltd.
WAN administrator


* Re: dummy as IMQ replacement
  2005-01-31 12:38     ` Hasso Tepper
@ 2005-01-31 12:47       ` jamal
  2005-01-31 13:02         ` Hasso Tepper
  0 siblings, 1 reply; 126+ messages in thread
From: jamal @ 2005-01-31 12:47 UTC (permalink / raw)
  To: Hasso Tepper
  Cc: netdev, Nguyen Dinh Nam, Remus, Andre Tomt, syrius.ml,
	Andy Furniss, Damion de Soto

On Mon, 2005-01-31 at 07:38, Hasso Tepper wrote:
> jamal wrote:
> > On Mon, 2005-01-31 at 03:20, Hasso Tepper wrote:
> > > Policing didn't work with IPv6 last time I checked.
> >
> > Really? I take it this is using the u32 classifier?
> > What filter did you use?
> 
> http://mailman.ds9a.nl/pipermail/lartc/2004q2/012422.html
> 
> Got one answer to this in private that "AFAIK it isn't implemented yet".

This?

tc filter add dev eth1.101 parent ffff: protocol all prio 50 handle \
0x101 fw police rate 1024kbit burst 60k drop flowid :101

What are you trying to do? Are you also trying to rate limit ARPs etc
in one shot?

Does this even get hit at all? tc -s would show you stats. I suspect
for one it is not being hit.
Maybe you are trying to use iptables marks that happen
a long time after the ingress has seen the packets (which would 
explain why it is not being hit)? This would be true for kernels > 2.6.8 
but not before ..
In other words, it may be a config issue.
If you tell me what it is you are trying to do i could try and set it
up when i come back from work today.

cheers,
jamal


* Re: dummy as IMQ replacement
  2005-01-31 12:47       ` jamal
@ 2005-01-31 13:02         ` Hasso Tepper
  2005-01-31 13:28           ` Thomas Graf
  2005-01-31 13:39           ` jamal
  0 siblings, 2 replies; 126+ messages in thread
From: Hasso Tepper @ 2005-01-31 13:02 UTC (permalink / raw)
  To: hadi
  Cc: netdev, Nguyen Dinh Nam, Remus, Andre Tomt, syrius.ml,
	Andy Furniss, Damion de Soto

jamal wrote:
> On Mon, 2005-01-31 at 07:38, Hasso Tepper wrote:
> > jamal wrote:
> > > On Mon, 2005-01-31 at 03:20, Hasso Tepper wrote:
> > > > Policing didn't work with IPv6 last time I checked.
> > >
> > > Really? I take it this is using the u32 classifier?
> > > What filter did you use?
> >
> > http://mailman.ds9a.nl/pipermail/lartc/2004q2/012422.html
> >
> > Got one answer to this in private that "AFAIK it isn't implemented
> > yet".
>
> This?
>
> tc filter add dev eth1.101 parent ffff: protocol all prio 50 handle \
> 0x101 fw police rate 1024kbit burst 60k drop flowid :101
>
> What are you trying to do? Are you also trying to rate limit ARPs etc
> in one shot?

All traffic coming from eth1.101 interface.

> Does this even get hit at all? tc -s would show you stats. I suspect
> for one it is not being hit.

As far as I remember situation was exactly as I described. This worked for 
IPv4 traffic, but not for IPv6 traffic.

> Maybe you are trying to use iptables marks that happen
> a long time after the ingress has seen the packets (which would
> explain why it is not being hit)? This would be true for kernels > 2.6.8
> but not before ..

This test was done with 2.6.6.

> In other words, it may be a config issue.

Would be nice ;).

> If you tell me what it is you are trying to do i could try and set it
> up when i come back from work today.

I'd like to limit _all_ traffic coming in from one particular interface to 
the one common limit. No matter what traffic it is - IPv4 or IPv6. Sum of 
traffic should be the one I specify.


-- 
Hasso Tepper
Elion Enterprises Ltd.
WAN administrator


* Re: dummy as IMQ replacement
  2005-01-31 13:02         ` Hasso Tepper
@ 2005-01-31 13:28           ` Thomas Graf
  2005-01-31 13:45             ` jamal
  2005-01-31 13:39           ` jamal
  1 sibling, 1 reply; 126+ messages in thread
From: Thomas Graf @ 2005-01-31 13:28 UTC (permalink / raw)
  To: Hasso Tepper
  Cc: hadi, netdev, Nguyen Dinh Nam, Remus, Andre Tomt, syrius.ml,
	Andy Furniss, Damion de Soto

> > > http://mailman.ds9a.nl/pipermail/lartc/2004q2/012422.html

It depends on whether you have CONFIG_NET_CLS_ACT enabled or not.
If so, the ingress qdisc is hit before PREROUTING and thus can't
see the mark for a good reason. Simply removing the dependency on
the mark resolves the issue for you.
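
A mark-free variant of the LARTC filter could look like the sketch below.
It swaps the fw classifier for a catch-all "u32 match u32 0 0", the same
idiom used elsewhere in this thread, so nothing depends on iptables having
run first. The TC dry-run default and the function name are assumptions
for illustration, and the rule itself is untested.

```shell
# Dry run by default: the command is only printed.
# Set TC=/sbin/tc (as root) to apply it for real.
TC="${TC:-echo tc}"

police_all_ingress() {
    # "protocol all" plus a match-everything u32 key: no fw mark needed,
    # so IPv4 and IPv6 are both charged against the one policer.
    $TC filter add dev eth1.101 parent ffff: protocol all prio 50 u32 \
        match u32 0 0 \
        police rate 1024kbit burst 60k drop flowid :101
}

police_all_ingress
```

Running "tc -s filter show parent ffff: dev eth1.101" afterwards should show
whether the filter is being hit at all.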

If you don't have CONFIG_NET_CLS_ACT enabled you would see the
mark if the ingress qdisc registered on the IPv6 PREROUTING
hook but apparently it doesn't.

The patch below should fix it, it is completely untested though.

--- linux-2.6.11-rc2-bk8.orig/net/sched/sch_ingress.c	2005-01-30 21:19:51.000000000 +0100
+++ linux-2.6.11-rc2-bk8/net/sched/sch_ingress.c	2005-01-31 14:23:08.000000000 +0100
@@ -271,6 +271,14 @@
 	.priority       = NF_IP_PRI_FILTER + 1,
 };
 
+static struct nf_hook_ops ing6_ops = {
+	.hook           = ing_hook,
+	.owner		= THIS_MODULE,
+	.pf             = PF_INET6,
+	.hooknum        = NF_IP6_PRE_ROUTING,
+	.priority       = NF_IP6_PRI_FILTER + 1,
+};
+
 #endif
 #endif
 
@@ -296,6 +304,11 @@
 			printk("ingress qdisc registration error \n");
 			return -EINVAL;
 		}
+		if (nf_register_hook(&ing6_ops) < 0) {
+			nf_unregister_hook(&ing_ops);
+			printk("ingress ipv6 qdisc registration error \n");
+			return -EINVAL;
+		}
 		nf_registered++;
 	}
 #endif
@@ -408,8 +421,10 @@
 	unregister_qdisc(&ingress_qdisc_ops);
 #ifndef CONFIG_NET_CLS_ACT
 #ifdef CONFIG_NETFILTER
-	if (nf_registered)
+	if (nf_registered) {
 		nf_unregister_hook(&ing_ops);
+		nf_unregister_hook(&ing6_ops);
+	}
 #endif
 #endif
 }


* Re: dummy as IMQ replacement
  2005-01-31 13:02         ` Hasso Tepper
  2005-01-31 13:28           ` Thomas Graf
@ 2005-01-31 13:39           ` jamal
  2005-01-31 14:14             ` Hasso Tepper
  1 sibling, 1 reply; 126+ messages in thread
From: jamal @ 2005-01-31 13:39 UTC (permalink / raw)
  To: Hasso Tepper
  Cc: netdev, Nguyen Dinh Nam, Remus, Andre Tomt, syrius.ml,
	Andy Furniss, Damion de Soto

On Mon, 2005-01-31 at 08:02, Hasso Tepper wrote:
> jamal wrote:
[..]
> > What are you trying to do? Are you also trying to rate limit ARPs etc
> > in one shot?
> 
> All traffic coming from eth1.101 interface.
> 

eth1.101 is an alias? You may have issues there. Maybe not if the
attach to that interface worked.

>
> > tc filter add dev eth1.101 parent ffff: protocol all prio 50 handle \
> > 0x101 fw police rate 1024kbit burst 60k drop flowid :101
> >
> > Does this even get hit at all? tc -s would show you stats. I suspect
> > for one it is not being hit.
> 
> As far as I remember situation was exactly as I described. This worked for 
> IPv4 traffic, but not for IPv6 traffic.
>

Offhand I can't see why .. Unless the ipv6 packets didn't get marked
properly but the v4 ones did?

> > Maybe you are trying to use iptables marks that happen
> > a long time after the ingress has seen the packets (which would
> > explain why it is not being hit)? This would be true for kernels > 2.6.8
> > but not before ..
> 
> This test was done with 2.6.6.

Ok, in that case iptables prerouting would have come before ingress; so 
as long as you marked the packets with iptables it should work fine.

> > In other words, it may be a config issue.
> 
> Would be nice ;).

I am still thinking it is. What are your iptables v6 markers?

> > If you tell me what it is you are trying to do i could try and set it
> > up when i come back from work today.
> 
> I'd like to limit _all_ traffic coming in from one particular interface to 
> the one common limit. No matter what traffic it is - IPv4 or IPv6. Sum of 
> traffic should be the one I specify.

There are other ways to do it[1], but if there's a bug in this it needs
fixing.

cheers,
jamal

[1]
Example, you could do this:

tc filter add dev eth1 parent ffff: protocol ip prio 1 \
u32 match u32 0 0 flowid 1:15 \
action police index 1 rate 1024kbit burst 60k drop index 1

Note the use of "index 1" to select a policer.

Then repeat replacing ip with ip6; make sure that "index 1" for policer
stays. You could do this to share also across devices.
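
Spelled out, the IPv4/IPv6 pair sharing one policer might look like the
sketch below. Two details are assumptions on top of the text: iproute2
spells the protocol keyword "ipv6", and the second filter gets its own
prio because a priority is expected to be tied to a single protocol. The
TC dry-run default and the function name are illustrative only.

```shell
# Dry run by default: the commands are only printed.
# Set TC=/sbin/tc (as root) to apply them for real.
TC="${TC:-echo tc}"

shared_policer_v4_v6() {
    # IPv4 traffic arriving on eth1: first user of policer index 1.
    $TC filter add dev eth1 parent ffff: protocol ip prio 1 \
        u32 match u32 0 0 flowid 1:15 \
        action police index 1 rate 1024kbit burst 60k drop
    # IPv6 traffic on the same device: same command, same "index 1",
    # so both filters are charged against the same 1024kbit budget.
    $TC filter add dev eth1 parent ffff: protocol ipv6 prio 2 \
        u32 match u32 0 0 flowid 1:15 \
        action police index 1 rate 1024kbit burst 60k drop
}

shared_policer_v4_v6
```

The point of the shared index is that the limit applies to the sum of both
protocols rather than to each one separately.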

Example, on egress of eth0 also use the same 1Mbps

tc filter add dev eth0 parent 1:0 protocol ip prio 6 u32 \
match ip src 10.0.0.21/32 flowid 1:16 \
action police index 1 rate 1024kbit burst 60k drop index 1

Now with new action stuff you could instead just have said:
tc actions add \
action police index 1 rate 1024kbit burst 60k drop index 1

And then later just referenced it without having to repeat the rate
like so:
tc filter add dev eth0 parent ffff: protocol ip prio 6 u32 match ip src \
10.0.0.21/32 flowid 1:16 \
action police index 1

Again, this does not excuse a bug if it exists ...


* Re: dummy as IMQ replacement
  2005-01-31 13:28           ` Thomas Graf
@ 2005-01-31 13:45             ` jamal
  2005-01-31 14:06               ` Thomas Graf
  0 siblings, 1 reply; 126+ messages in thread
From: jamal @ 2005-01-31 13:45 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Hasso Tepper, netdev, Nguyen Dinh Nam, Remus, Andre Tomt,
	syrius.ml, Andy Furniss, Damion de Soto


Yeah, that would fix it. Note however, that I am trying to highly
discourage use of iptables and I would rather let people who use
iptables suffer ;-> (sounds rude, I know). At some point I plan to
remove the dependency on iptables altogether. So I am not sure whether I
should encourage pushing of this patch or not ;->
All this hooking in 100 hooks is one of the reasons I disliked IMQ as
well.

cheers,
jamal

PS:- also note in 2.6.6 tc action was not yet in, so 
On Mon, 2005-01-31 at 08:28, Thomas Graf wrote:
> > > > http://mailman.ds9a.nl/pipermail/lartc/2004q2/012422.html
> 
> It depends on whether you have CONFIG_NET_CLS_ACT enabled or not.
> If so, the ingress qdisc is hit before PREROUTING and thus can't
> see the mark for a good reason. Simply removing the dependency on
> the mark resolves the issue for you.
> 
> If you don't have CONFIG_NET_CLS_ACT enabled you would see the
> mark if the ingress qdisc registered on the IPv6 PREROUTING
> hook but apparently it doesn't.
> 
> The patch below should fix it, it is completely untested though.
> 
> --- linux-2.6.11-rc2-bk8.orig/net/sched/sch_ingress.c	2005-01-30 21:19:51.000000000 +0100
> +++ linux-2.6.11-rc2-bk8/net/sched/sch_ingress.c	2005-01-31 14:23:08.000000000 +0100
> @@ -271,6 +271,14 @@
>  	.priority       = NF_IP_PRI_FILTER + 1,
>  };
>  
> +static struct nf_hook_ops ing6_ops = {
> +	.hook           = ing_hook,
> +	.owner		= THIS_MODULE,
> +	.pf             = PF_INET6,
> +	.hooknum        = NF_IP6_PRE_ROUTING,
> +	.priority       = NF_IP6_PRI_FILTER + 1,
> +};
> +
>  #endif
>  #endif
>  
> @@ -296,6 +304,11 @@
>  			printk("ingress qdisc registration error \n");
>  			return -EINVAL;
>  		}
> +		if (nf_register_hook(&ing6_ops) < 0) {
> +			nf_unregister_hook(&ing_ops);
> +			printk("ingress ipv6 qdisc registration error \n");
> +			return -EINVAL;
> +		}
>  		nf_registered++;
>  	}
>  #endif
> @@ -408,8 +421,10 @@
>  	unregister_qdisc(&ingress_qdisc_ops);
>  #ifndef CONFIG_NET_CLS_ACT
>  #ifdef CONFIG_NETFILTER
> -	if (nf_registered)
> +	if (nf_registered) {
>  		nf_unregister_hook(&ing_ops);
> +		nf_unregister_hook(&ing6_ops);
> +	}
>  #endif
>  #endif
>  }
> 
> 


* Re: dummy as IMQ replacement
  2005-01-30 22:12 dummy as IMQ replacement Jamal Hadi Salim
  2005-01-31  8:20 ` Hasso Tepper
@ 2005-01-31 13:58 ` Thomas Graf
  2005-01-31 14:19   ` jamal
  2005-01-31 16:27 ` Andre Correa
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 126+ messages in thread
From: Thomas Graf @ 2005-01-31 13:58 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: netdev, Nguyen Dinh Nam, Remus, Andre Tomt, syrius.ml,
	Andy Furniss, Damion de Soto

> 2) Allows for queueing incoming traffic for shaping instead of
> dropping. I am not aware of any study that shows policing is 
> worse than shaping in achieving the end goal of rate control.
> I would be interested if anyone is experimenting. Nevertheless,
> this is still an alternative as opposed to making a system wide
> ingress change.

Agreed, the problem should be solved on egress by delaying ACKs
so the other side's congestion control slows down. I still don't
have a solution which works for all ip stacks and ended up tuning
parameters based on TTL numbers guessing the operating system.

For me, the purpose of ingress policing is to apply some policy for
control datagrams and other unwanted traffic. One example would be
dropping echo requests coming from nmap, which reduces egress
bandwidth consumption by 13% on my border routers.

tc filter add dev $DEV parent ffff: protocol ip prio 10  \
    u32 match u32 0x10000 0xff0000 at 8                  \
        match u32 0x1c 0xffff at 0                       \
        match u32 0x8000000 0xf000000 at 20              \
    police mtu 1 drop flowid :1

I should convert this to actions at some point ;->
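
To see what the three u32 keys above select, here is a small sketch that
applies the same value/mask/offset triples to a hand-built packet. It
assumes the usual u32 semantics (a key matches when the 32-bit big-endian
word at the given offset, ANDed with the mask, equals the value) and a
20-byte IP header without options. The helper names and the sample packet
bytes are made up for illustration.

```shell
# word_at OFFSET HEXPKT: print the big-endian 32-bit word at byte OFFSET.
word_at() {
    first=$(( $1 * 2 + 1 ))
    last=$(( first + 7 ))
    printf '0x%s\n' "$(printf '%s' "$2" | cut -c "${first}-${last}")"
}

# u32_match VALUE MASK WORD: succeed when (WORD & MASK) == VALUE.
u32_match() {
    [ $(( $3 & $2 )) -eq $(( $1 )) ]
}

# The three keys from the filter, in the same order:
#   at 8:  protocol byte == 1 (ICMP)
#   at 0:  IP total length == 0x1c (28 bytes: headers only, no payload)
#   at 20: low nibble of the ICMP type == 8 (echo request)
is_nmap_echo() {
    u32_match 0x10000   0xff0000  "$(word_at 8  "$1")" &&
    u32_match 0x1c      0xffff    "$(word_at 0  "$1")" &&
    u32_match 0x8000000 0xf000000 "$(word_at 20 "$1")"
}

# 28-byte echo request from 10.0.0.21 to 10.0.0.22, no ICMP payload.
probe=4500001c00000000400166b70a0000150a0000160800f7ff00000000
# Same bytes with the total length set to 0x54 (84), like a default ping.
# The payload is left out because no key looks past byte 23.
plain=4500005400000000400166b70a0000150a0000160800f7ff00000000

is_nmap_echo "$probe" && echo "probe: match"
is_nmap_echo "$plain" || echo "plain: no match"
```

The length key is what separates the payload-free probe from an ordinary
ping, which carries data after the ICMP header.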

> --> Instead the plan is to have a conntrack related action. This action
> will selectively either query/create conntrack state on incoming packets.
> Packets could then be redirected to dummy based on what happens -> e.g.
> on incoming packets, if we find they are of known state we could send to
> a different queue than one which didn't have existing state. This
> all however is dependent on whatever rules the admin enters.

We could also do it in the meta ematch but this relies on the packet
already having passed the conntrack code. How do you plan to do this
in ingress?


> tc filter add dev eth0 parent 1: protocol ip prio 10 u32 \
> match ip src 192.168.200.200/32 flowid 1:2 \
> action police rate 10kbit burst 90k drop \
> action mirred egress mirror dev dummy0 

This is extremely useful. I'm not sure but I think you also had plans
to allow mirroring to userspace?

> My goal here is to start a discussion to see if people agree this is
> a good replacement for IMQ or whether to go another path.

Sounds good to me. No complaints from my side. I'll have a closer look
at the patch later on.


* Re: dummy as IMQ replacement
  2005-01-31 13:45             ` jamal
@ 2005-01-31 14:06               ` Thomas Graf
  2005-01-31 14:29                 ` jamal
  0 siblings, 1 reply; 126+ messages in thread
From: Thomas Graf @ 2005-01-31 14:06 UTC (permalink / raw)
  To: jamal
  Cc: Hasso Tepper, netdev, Nguyen Dinh Nam, Remus, Andre Tomt,
	syrius.ml, Andy Furniss, Damion de Soto

> Yeah, that would fix it. Note however, that i am trying to highly
> discourage use of iptables and i would rather let people who use
> iptables to suffer;-> (sounds rude i know).  At some point i plan to
> remove the dependency on iptables altogether.

Heh, I think it isn't rude, giving people a little clap to join
the "good side" isn't that bad ;->

> So i am not sure whether i should encourage pushing of this patch or not ;->

I don't care that much, the patch is there, everyone can patch and
distributions can pick it up. I agree that we should remove the
dependency on iptables but I'd also like to see the dependency on the
action bits go away at the same time.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: dummy as IMQ replacement
  2005-01-31 13:39           ` jamal
@ 2005-01-31 14:14             ` Hasso Tepper
  2005-01-31 14:25               ` jamal
  0 siblings, 1 reply; 126+ messages in thread
From: Hasso Tepper @ 2005-01-31 14:14 UTC (permalink / raw)
  To: hadi
  Cc: netdev, Nguyen Dinh Nam, Remus, Andre Tomt, syrius.ml,
	Andy Furniss, Damion de Soto

jamal wrote:
> On Mon, 2005-01-31 at 08:02, Hasso Tepper wrote:
> > All traffic coming from eth1.101 interface.
>
> eth1.101 is an alias? You may have issues there. Maybe not if the
> attach to that interface worked.

No, vlan.

> Theres other ways to do it[1] but if theres a bug in this it needs
> fixing.
>
> cheers,
> jamal
>
> [1]
> Example, you could do this:
>
> tc filter add dev eth1 parent ffff: protocol ip prio 1 \
> u32 match u32 0 0 flowid 1:15 \
> action police index 1 rate 1024kbit burst 60k drop index 1
>
> Note the use of "index 1" to select a policer.
>
> Then repeat replacing ip with ip6; make sure that "index 1" for policer
> stays. You could do this to share also across devices.
>
> Example, on egress of eth0 also use the same 1Mbps
>
> tc filter add dev eth0 parent 1:0 protocol ip prio 6 u32 \
> match ip src 10.0.0.21/32 flowid 1:16 \
> action police index 1 rate 1024kbit burst 60k drop index 1
>
> Now with new action stuff you could instead just have said:
> tc actions add \
> action police index 1 rate 1024kbit burst 60k drop index 1
>
> And then later just referenced it without having to repeat the rate
> like so:
> filter add dev eth0 parent ffff: protocol ip prio 6 u32 match ip src \
> 10.0.0.21/32 flowid 1:16 \
> action police index 1

Hmmm ... I didn't even know about index. Yes, something like that would 
probably do as well. I'll do some tests later today with this. Actions don't 
help me though as I'm using a 2.4 kernel for production.


-- 
Hasso Tepper
Elion Enterprises Ltd.
WAN administrator

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: dummy as IMQ replacement
  2005-01-31 13:58 ` Thomas Graf
@ 2005-01-31 14:19   ` jamal
  2005-01-31 15:15     ` Thomas Graf
  0 siblings, 1 reply; 126+ messages in thread
From: jamal @ 2005-01-31 14:19 UTC (permalink / raw)
  To: Thomas Graf
  Cc: netdev, Nguyen Dinh Nam, Remus, Andre Tomt, syrius.ml,
	Andy Furniss, Damion de Soto

On Mon, 2005-01-31 at 08:58, Thomas Graf wrote:
> > 2) Allows for queueing incoming traffic for shaping instead of
> > dropping. I am not aware of any study that shows policing is 
> > worse than shaping in achieving the end goal of rate control.
> > I would be interested if anyone is experimenting. Nevertheless,
> > this is still an alternative as opposed to making a system wide
> > ingress change.
> 
> Agreed, the problem should be solved on egress by delaying ACKs
> so the other side's congestion control slows down. 

Or dropping packets. TCP will adjust itself either way; at least
thats true according to this formula [rfc3448] (originally derived from
Reno, but people are finding it works fine with all other variants of
TCP CC):

-----
The throughput equation is:

                                   s
   X =  ----------------------------------------------------------
        R*sqrt(2*b*p/3) + (t_RTO * (3*sqrt(3*b*p/8) * p * (1+32*p^2)))


Where:

      X is the transmit rate in bytes/second.
      s is the packet size in bytes.
      R is the round trip time in seconds.
      p is the loss event rate, between 0 and 1.0, of the number of loss
        events as a fraction of the number of packets transmitted.
      t_RTO is the TCP retransmission timeout value in seconds.
      b is the number of packets acknowledged by a single TCP
        acknowledgement.
----

dropping mucks with "p" and delaying ACKs (shaping) mucks with "R".
Plug into that formula either one and you see they affect the 
result for X the same way.
I am really hoping that someone will do experimental analysis - cant
believe no hungry students these days out there.
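
To make the point about "p" and "R" concrete, here is a minimal Python
sketch that evaluates the RFC 3448 equation quoted above. The function
name tfrc_rate and the sample numbers are illustrative only; setting
t_RTO = 4*R is the simplification RFC 3448 itself suggests, and b = 1
is an assumption.

```python
from math import sqrt

def tfrc_rate(s, R, p, t_RTO=None, b=1):
    """Evaluate the RFC 3448 throughput equation; returns X in bytes/second."""
    if t_RTO is None:
        t_RTO = 4 * R  # simplification suggested in RFC 3448
    denom = (R * sqrt(2 * b * p / 3)
             + t_RTO * (3 * sqrt(3 * b * p / 8) * p * (1 + 32 * p ** 2)))
    return s / denom

# Illustrative values: 1460-byte packets, 50 ms RTT, 1% loss event rate.
base = tfrc_rate(s=1460, R=0.05, p=0.01)
more_loss = tfrc_rate(s=1460, R=0.05, p=0.04)   # policing: raise p
more_delay = tfrc_rate(s=1460, R=0.10, p=0.01)  # shaping: raise R

print("baseline   X = %.0f bytes/s" % base)
print("4x loss    X = %.0f bytes/s" % more_loss)
print("2x RTT     X = %.0f bytes/s" % more_delay)
```

Both knobs push X down, though not by the same factor: with
t_RTO = 4*R, X is inversely proportional to R, while for small p it
falls off roughly as 1/sqrt(p).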

> I still don't
> have a solution which works for all ip stacks and ended up tuning
> parameters based on TTL numbers guessing the operating system.
> 
> For me, the purpose of ingress policing is to apply some policy for
> control datagrams and other unwanted traffic. One example would be
> dropping echo requests comming from nmap which reduces egress
> bandwidth consumption by 13% my border routers.
> 
> tc filter add dev $DEV parent ffff: protocol ip prio 10  \
>     u32 match u32 0x10000 0xff0000 at 8                  \
>         match u32 0x1c 0xffff at 0                       \
>         match u32 0x8000000 0xf000000 at 20              \
>     police mtu 1 drop flowid :1
> 
> I should convert this to actions at some point ;->
> 

You should ;->
And now you can actually _really_ drop; the above will let some packets
through. More interestingly, you can now drop randomly or
deterministically (e.g. drop every 10th packet).
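
The difference between the two drop policies can be sketched in a few
lines of Python. This is only a model of the idea, not the tc action
code; the function names and the 10% probability are assumptions made
for illustration.

```python
import random

def deterministic_drop(n_packets, every=10):
    """Return the 1-based indices of dropped packets when every
    `every`-th packet is dropped."""
    return [i for i in range(1, n_packets + 1) if i % every == 0]

def random_drop(n_packets, probability=0.1, seed=None):
    """Return the 1-based indices of dropped packets when each packet is
    dropped independently with the given probability."""
    rng = random.Random(seed)
    return [i for i in range(1, n_packets + 1) if rng.random() < probability]

print(deterministic_drop(30))
print(random_drop(30, probability=0.1, seed=1))
```

The deterministic variant spaces drops evenly; the random variant has
the same long-run drop rate but irregular gaps between drops.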

> > --> Instead the plan is to have a contrack related action. This action
> > will selectively either query/create contrack state on incoming packets.
> > Packets could then be redirected to dummy based on what happens -> eg 
> > on incoming packets; if we find they are of known state we could send to
> > a different queue than one which didnt have existing state. This
> > all however is dependent on whatever rules the admin enters.
> 
> We could also do it in the meta ematch but this relies on the packet
> already having passed the conntrack code. How do you plan to do this
> in ingress?
> 

Something along the lines of what OBSD firewall does but selectively (If
i understood those OBSD fanatics at SUCON;-> correctly)..they track
at ingress before ip stack. The difference is we can allow selective 
tracking; something along the lines of:

tc filter add dev $DEV parent ffff: protocol ip prio 10  \
 u32 match u32 0x10000 0xff0000 at 8               \
action track \
action metamark here depending on whether we found contrack etc

I have the layout scribbled on paper somewhere .. I will look it up
and provide more details

Track should just use iptables conntrack code instead of reinventing
it. 

> 
> > tc filter add dev eth0 parent 1: protocol ip prio 10 u32 \
> > match ip src 192.168.200.200/32 flowid 1:2 \
> > action police rate 10kbit burst 90k drop \
> > action mirred egress mirror dev dummy0 
> 
> This is extremely useful. I'm not sure but I think you also had plans
> to allow mirroring to userspace?
> 

Yes, via mmaped packet sockets. The other way (induced by laziness, so i
dont have to write a single line of code) is to redirect to the ring
device that someone posted a while back, since it provides a bridge
between an mmaped-packet-socket-like interface and the kernel.

> > My goal here is to start a discussion to see if people agree this is
> > a good replacement for IMQ or whether to go another path.
> 
> Sounds good to me. No complains from my side. I'll have a closer look
> at the patch later on.

Thanks for looking 

cheers,
jamal

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: dummy as IMQ replacement
  2005-01-31 14:14             ` Hasso Tepper
@ 2005-01-31 14:25               ` jamal
  2005-01-31 14:46                 ` Hasso Tepper
  0 siblings, 1 reply; 126+ messages in thread
From: jamal @ 2005-01-31 14:25 UTC (permalink / raw)
  To: Hasso Tepper
  Cc: netdev, Nguyen Dinh Nam, Remus, Andre Tomt, syrius.ml,
	Andy Furniss, Damion de Soto

On Mon, 2005-01-31 at 09:14, Hasso Tepper wrote:
> jamal wrote:
> > On Mon, 2005-01-31 at 08:02, Hasso Tepper wrote:
> > > All traffic coming from eth1.101 interface.
> >
> > eth1.101 is an alias? You may have issues there. Maybe not if the
> > attach to that interface worked.
> 
> No, vlan.

That should be fine then

> > Theres other ways to do it[1] but if theres a bug in this it needs
> > fixing.

[..]
> > And then later just referenced it without having to repeat the rate
> > like so:
> > filter add dev eth0 parent ffff: protocol ip prio 6 u32 match ip src \
> > 10.0.0.21/32 flowid 1:16 \
> > action police index 1
> 
> Hmmm ... I even didn't know about index. Yes, something like that would do 
> as well probably. I'll do some tests later today with this. Actions don't 
> help me though as I'm using 2.4 kernel for production.

Theres an extra "index 1" in all those examples. Remove the first one (I
am sure you will find out when experimenting what the correct syntax
is).
Unfortunately this "index" thing continues to be a big secret although i
have pointed it out a few times. Bart should probably add it to his HOWTO.

All actions also have indices and are therefore shareable. 
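
What sharing by index means can be modelled in Python as a table of
token buckets keyed by index: two filters that name the same index draw
from the same budget. This is a conceptual sketch only, not how the
kernel implements policers; TokenBucket, get_policer and the numbers
(roughly 1024kbit with a 60k burst) are assumptions for illustration.

```python
class TokenBucket:
    """Very small token-bucket policer model (sizes in bytes, time in seconds)."""

    def __init__(self, rate_bytes_per_s, burst_bytes):
        self.rate = rate_bytes_per_s
        self.burst = burst_bytes
        self.tokens = burst_bytes
        self.last = 0.0

    def allow(self, now, size):
        # Refill for the elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if size <= self.tokens:
            self.tokens -= size
            return True
        return False  # over the limit: the packet would be dropped

policers = {}  # index -> TokenBucket, shared by every filter that names it

def get_policer(index, rate_bytes_per_s=None, burst_bytes=None):
    """Create the policer on first use; later callers only pass the index."""
    if index not in policers:
        if rate_bytes_per_s is None or burst_bytes is None:
            raise KeyError("policer %d does not exist yet" % index)
        policers[index] = TokenBucket(rate_bytes_per_s, burst_bytes)
    return policers[index]

ingress_eth1 = get_policer(1, rate_bytes_per_s=128_000, burst_bytes=60_000)
egress_eth0 = get_policer(1)  # same index, no need to repeat the rate

print(ingress_eth1 is egress_eth0)
print(ingress_eth1.allow(now=0.0, size=40_000))
print(egress_eth0.allow(now=0.0, size=40_000))
```

The second 40,000-byte packet is refused even though it arrives through
a different filter, because both filters consume the same bucket.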

cheers,
jamal

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: dummy as IMQ replacement
  2005-01-31 14:06               ` Thomas Graf
@ 2005-01-31 14:29                 ` jamal
  0 siblings, 0 replies; 126+ messages in thread
From: jamal @ 2005-01-31 14:29 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Hasso Tepper, netdev, Nguyen Dinh Nam, Remus, Andre Tomt,
	syrius.ml, Andy Furniss, Damion de Soto

On Mon, 2005-01-31 at 09:06, Thomas Graf wrote:
> > Yeah, that would fix it. Note however, that i am trying to highly
> > discourage use of iptables and i would rather let people who use
> > iptables to suffer;-> (sounds rude i know).  At some point i plan to
> > remove the dependency on iptables altogether.
> 
> Heh, I think it isn't rude, giving people a little clap to join
> the "good side" isn't that bad ;->
> 

Unfortunately killing it totally would break all sorts of scripts.
Wish we could do this though ;->
Go ahead and push the patch to Dave even - I am just gonna look the
other way;->

> > So i am not sure whether i should encourage pushing of this patch or not ;->
> 
> I don't care that much, the patch is there, everyone can patch and
> distributions can pick it up. I agree that we should remove the
> dependency on iptables but I'd also like to see the dependency on the
> action bits to go away at the same time.

Agreed. 
The main reason not to use the iptables bits is performance; also its
just the wrong spot in the stack.

cheers,
jamal

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: dummy as IMQ replacement
  2005-01-31 14:25               ` jamal
@ 2005-01-31 14:46                 ` Hasso Tepper
  2005-01-31 15:34                   ` jamal
  2005-01-31 18:00                   ` Lennert Buytenhek
  0 siblings, 2 replies; 126+ messages in thread
From: Hasso Tepper @ 2005-01-31 14:46 UTC (permalink / raw)
  To: hadi
  Cc: netdev, Nguyen Dinh Nam, Remus, Andre Tomt, syrius.ml,
	Andy Furniss, Damion de Soto

jamal wrote:
> Unfortunately this "index" thing continues to be a big secret although i
> have pointed it a few times. Bart should probably add it to his HOWTO.

All this stuff deserves better documentation ;). Call me old-fashioned, but 
IMHO it would be a good start to have all keywords at least somewhat 
documented in the iproute2 man pages. I know, writing documentation is a 
boring task etc., but at least you should mention all features in the man 
pages. This gives someone at least a chance to kick you: "hey, what's this?" ;).

This is somewhat related to killing the chance to use iptables as well ... 
Iptables has better documentation and people use it just because of that.

Just checked ... "tc action help" in the newest iproute2 returns nothing as 
well :(.


-- 
Hasso Tepper
Elion Enterprises Ltd.
WAN administrator

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: dummy as IMQ replacement
  2005-01-31 14:19   ` jamal
@ 2005-01-31 15:15     ` Thomas Graf
  2005-01-31 15:40       ` jamal
  2005-02-01  1:02       ` Andy Furniss
  0 siblings, 2 replies; 126+ messages in thread
From: Thomas Graf @ 2005-01-31 15:15 UTC (permalink / raw)
  To: jamal
  Cc: netdev, Nguyen Dinh Nam, Remus, Andre Tomt, syrius.ml,
	Andy Furniss, Damion de Soto

> Or dropping packets. TCP will adjust itself either way; at least
> thats true according to this formula [rfc3448] (originally derived from
> Reno, but people are finding it works fine with all other variants of
> TCP CC):
> 
> -----
> The throughput equation is:
> 
>                                    s
>    X =  ----------------------------------------------------------
>         R*sqrt(2*b*p/3) + (t_RTO * (3*sqrt(3*b*p/8) * p * (1+32*p^2)))
> 
> 
> Where:
> 
>       X is the transmit rate in bytes/second.
>       s is the packet size in bytes.
>       R is the round trip time in seconds.
>       p is the loss event rate, between 0 and 1.0, of the number of loss
>         events as a fraction of the number of packets transmitted.
>       t_RTO is the TCP retransmission timeout value in seconds.
>       b is the number of packets acknowledged by a single TCP
>         acknowledgement.
> ----

Agreed, this was my first attempt and my current code is still based on
this. I'm trying to avoid a retransmit battle, therefore I try to
delay packets if possible with the hope that it's either just a peak
or the slow down is fast enough. I use a simplified RED and
tcp_xmit_retransmit_queue() input to avoid flick flack effects which
works pretty well for bulky streams. A burst buffer takes care
of interactive traffic with peaks but this doesn't work perfectly fine
yet. Overall, my attempt works pretty well if the other side uses
reno/bic and quite well for westwood and vegas. The problem is not that
it doesn't work at all but achieving a certain _stable_ rate is very
difficult, the delta of the requested and real rate is up to 25% depending
on the constancy of the rtt and whether they follow one of the proposed
tcp cc algorithms. The cc guessing code helps a bit but isn't very
accurate.

> Something along the lines of what OBSD firewall does but selectively (If
> i understood those OBSD fanatics at SUCON;-> correctly)..they track
> at ingress before ip stack. The difference is we can allow selective 
> tracking; something along the lines of:

This means we'd have to do the most important sanity checks ourselves,
like checksum and ip header consistency. Which basically means a
duplication of ip_rcv() and ipv6_rcv().

> tc filter add dev $DEV parent ffff: protocol ip prio 10  \
>  u32 match u32 0x10000 0xff0000 at 8               \
> action track \
> action metamark here depending on whether we found contrack etc
> 
> I have the layout scribbeled on paper somewhere .. I will look it up
> and provide more details
> 
> Track should just use iptables contracking code instead of reinventing
> it.

This is exactly my thinking as well but I'd do it as ematch. Given
we pass the netfilter conntrack code we'd then have access to the
meta data of it such as direction, state and other attributes.

tc filter add dev $DEV parent ffff: protocol ip prio 10  \
     u32 match u32 0x10000 0xff0000 at 8               \
         and conntrack \
	 and meta nf_state eq ESTABLISHED \
	 and meta nf_status eq SEEN_REPLY \
   action metamark here depending on whether we found contrack etc

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: dummy as IMQ replacement
  2005-01-31 14:46                 ` Hasso Tepper
@ 2005-01-31 15:34                   ` jamal
  2005-01-31 18:00                   ` Lennert Buytenhek
  1 sibling, 0 replies; 126+ messages in thread
From: jamal @ 2005-01-31 15:34 UTC (permalink / raw)
  To: Hasso Tepper
  Cc: netdev, Nguyen Dinh Nam, Remus, Andre Tomt, syrius.ml,
	Andy Furniss, Damion de Soto

On Mon, 2005-01-31 at 09:46, Hasso Tepper wrote:
> jamal wrote:
> > Unfortunately this "index" thing continues to be a big secret although i
> > have pointed it a few times. Bart should probably add it to his HOWTO.
> 
> All this stuff deserves better documentation ;). Call me oldfashioned, but 
> IMHO would be good start to have all keywords at least somewhat documented 
> in iproute2 man pages. I know, writing documentation is boring task etc, 
> but at least you should mention all features in man pages. This gives to 
> someone at least chance to kick you "hey, what's this?" ;).
> 

But 50 emails later no one documented anything .. And every summer
someone tries to write a qdisc to achieve this ;->

> This is somewhat related to killing the chance to use iptables as well ... 
> Iptables has better documentation and people use it just because of that.
> 
> Just checked ... "tc action help" in newest iproute2 returns nothing as 
> well :(.

Should work. Are you sure you got the latest one?
Also make sure you compile tc action in the scheduler part of kernel
config.

cheers,
jamal

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: dummy as IMQ replacement
  2005-01-31 15:15     ` Thomas Graf
@ 2005-01-31 15:40       ` jamal
  2005-01-31 15:59         ` Thomas Graf
  2005-01-31 20:28         ` David S. Miller
  2005-02-01  1:02       ` Andy Furniss
  1 sibling, 2 replies; 126+ messages in thread
From: jamal @ 2005-01-31 15:40 UTC (permalink / raw)
  To: Thomas Graf
  Cc: netdev, Nguyen Dinh Nam, Remus, Andre Tomt, syrius.ml,
	Andy Furniss, Damion de Soto

On Mon, 2005-01-31 at 10:15, Thomas Graf wrote:

> Agreed, this was my first attempt and my current code is still based on
> this. I'm trying to avoid a retransmit battle, therefore I try to
> delay packets if possible with the hope that it's either just a peak
> or the slow down is fast enough. I use a simplified RED and
> tcp_xmit_retransmit_queue() input to avoid flick flack effects which
> works pretty well for bulky streams. A burst buffer takes care
> of interactive traffic with peaks but this doesn't work perfectly fine
> yet. Overall, my attempt works pretty well if the other side uses
> reno/bic and quite well for westwood and vegas. The problem is not that
> it doesn't work at all but achieving a certain _stable_ rate is very
> difficult, the delta of the requested and real rate is up to 25% depending
> on the constancy of the rtt and wether they follow one of the proposed
> tcp cc algorithms. The cc guessing code helps a bit but isn't very
> accurate.

My experience is that you end up dropping no more than a packet in a
burst with policing before TCP adjusts. Also depending on the gap
between bursts, that may be the only packet you drop altogether.
In long flows such as file transfers, on average only one packet ever
gets dropped.

> > Something along the lines of what OBSD firewall does but selectively (If
> > i understood those OBSD fanatics at SUCON;-> correctly)..they track
> > at ingress before ip stack. The difference is we can allow selective 
> > tracking; something along the lines of:
> 
> This means we'd have to do the most important sanity cehcks ourselves
> like checksum and ip header consistencity. Which basically means a
> duplication of ip_rcv() and ipv6_rcv().
> 

checksum and other validity checks of the ip header will have to be
written as an action if needed. In fact csum is on my list of mini
actions. I could decide to change something on egress of an outgoing ip
packet in pedit and would therefore need to recompute the csum.
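
As an illustration of why an edit forces a checksum update, here is a
small Python sketch of the standard Internet checksum (RFC 1071)
applied to an IPv4 header after a field is rewritten. It is not the
proposed csum action; the helper names and the sample header bytes are
made up for the example.

```python
import struct

def ipv4_checksum(header: bytes) -> int:
    """One's-complement sum of 16-bit words, complemented (RFC 1071)."""
    if len(header) % 2:
        header += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(header) // 2), header))
    while total >> 16:                       # fold carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def rewrite_ttl(header: bytes, new_ttl: int) -> bytes:
    """Change the TTL byte (offset 8) and recompute the header checksum."""
    hdr = bytearray(header)
    hdr[8] = new_ttl
    hdr[10:12] = b"\x00\x00"                 # checksum field must be zero while summing
    struct.pack_into("!H", hdr, 10, ipv4_checksum(bytes(hdr)))
    return bytes(hdr)

# Sample 20-byte header (192.168.0.1 -> 192.168.0.199, UDP, TTL 64),
# checksum field zeroed.
sample = bytes.fromhex("450000730000400040110000c0a80001c0a800c7")
edited = rewrite_ttl(sample, 63)
print("checksum after edit: 0x%04x" % struct.unpack_from("!H", edited, 10)[0])
print("verifies:", ipv4_checksum(edited) == 0)
```

A header with a correct checksum sums to zero when the checksum field
is included, which is the check the last line performs.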

> > tc filter add dev $DEV parent ffff: protocol ip prio 10  \
> >  u32 match u32 0x10000 0xff0000 at 8               \
> > action track \
> > action metamark here depending on whether we found contrack etc
> > 
> > I have the layout scribbeled on paper somewhere .. I will look it up
> > and provide more details
> > 
> > Track should just use iptables contracking code instead of reinventing
> > it.
> 
> This is exactly my thinking as well but I'd do it as ematch. Given
> we pass the netfilter conntrack code we'd then have access to the
> meta data of it such as direction, state and other attributes.
> 
> tc filter add dev $DEV parent ffff: protocol ip prio 10  \
>      u32 match u32 0x10000 0xff0000 at 8               \
>          and conntrack \
> 	 and meta nf_state eq ESTABLISHED \
> 	 and meta nf_status eq SEEN_REPLY \
>    action metamark here depending on whether we found contrack etc

Ok, I think both approaches are correct. ematch does the check/get
essentially; and action will create the set/tracking if needed.
For the example i gave, you are absolutely correct, ematch is
sufficient.

cheers,
jamal

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: dummy as IMQ replacement
  2005-01-31 15:40       ` jamal
@ 2005-01-31 15:59         ` Thomas Graf
  2005-01-31 16:40           ` jamal
  2005-01-31 20:28         ` David S. Miller
  1 sibling, 1 reply; 126+ messages in thread
From: Thomas Graf @ 2005-01-31 15:59 UTC (permalink / raw)
  To: jamal
  Cc: netdev, Nguyen Dinh Nam, Remus, Andre Tomt, syrius.ml,
	Andy Furniss, Damion de Soto

> My experience is that you end up dropping no more than a packet in a
> burst with policing before TCP adjusts. Also depending on the gap
> between bursts, that may be the only packet you drop altogether.
> In long flows such as file transfers, avergae of one packet ever gets
> dropped.

I mostly agree but not completely. It's definitely true that most of
the problems I'm fighting today are caused by the attempt to be too
perfect in calculating. Going a step backwards solves most of the
problems and probably works just fine for most cases. One of the main
problems I'm facing here is big file transfers on low latency links with
modified ip stacks to allow for a "faster" slow start (those are the
reason why I'm trying to do this). An attempt to drop only a few
packets results in a stronger incremental growth. I'm not quite sure
why that happens yet but a more aggressive policing strategy helped a
lot. I agree that if we plan to put something like this into mainline
those problem domains should be separated to not overcomplicate the
whole thing.

> checksum and other validity of ip header will have to be written as an
> action if needed. Infact csum is on my list of mini actions. I could
> decide to change something on egress of outgoing ip packet in pedit
> and would therefore require to recompute csum.

Sounds good. We'll need to address this anyway, the classifiers rely
on the ip header being valid which is no longer assured.

> Ok, I think both approaches are correct. ematch does the check/get
> essentially; and action will create the set/tracking if needed.
> For the example i gave, you are absolutely correct, ematch is
> sufficient.

Right, so we can do something like the meta ematch/action split. What
attributes do you intend to be modifiable? A neat thing would be
to overwrite the state and thus assign a packet to another connection
which could be used to reimplement fast nat together with pedit.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: dummy as IMQ replacement
  2005-01-30 22:12 dummy as IMQ replacement Jamal Hadi Salim
  2005-01-31  8:20 ` Hasso Tepper
  2005-01-31 13:58 ` Thomas Graf
@ 2005-01-31 16:27 ` Andre Correa
  2005-01-31 16:51   ` Jamal Hadi Salim
  2005-01-31 22:39 ` Andy Furniss
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 126+ messages in thread
From: Andre Correa @ 2005-01-31 16:27 UTC (permalink / raw)
  To: hadi
  Cc: netdev, Nguyen Dinh Nam, Remus, Andre Tomt, syrius.ml,
	andre.correa, Andy Furniss, Damion de Soto


Hi all,

It has been a year since we (me and some cool folks) brought the original IMQ 
back from "death". During this year we updated kernel and iptables patches 
for every available version, created some new features (like hooking 
after and before NAT, multiple IMQ devices, solved module problems, 
etc.), and helped lots of users on our mailing list. The wish list grew, 
we created a site/FAQ/wiki. We are still missing "dumb device" 
functionality. Our site is www.linuximq.net

Complicated or not, clean or not, it's been working in some interesting 
scenarios with lots of load on it. I feel fine for being able to help 
the community somehow with it. I have found no time yet to check Jamal's 
new patches but we would use dummy as the base for "real device" 
functionality development.

At least it's nice to find we are discussing how to do it, no longer 
whether IMQ functionality is needed, because it really is.

Going one way or another, we should not leave users alone again with nobody 
taking care of this, as happened before. I plan to keep IMQ updated 
with new kernel versions as usual.

Jamal, when you say "to replace", do you mean it may get into the vanilla 
kernel? Do you plan to keep it updated from now on?

Either way, can we call this new thing something else? Current users 
may not want to migrate, so both should work together. A user 
should be able to patch a kernel with both.

We (at linuximq.net) would be more than happy to help with it.

Andre



Jamal Hadi Salim wrote:
> This is in relation to providing functionality that IMQ was intending
> to using the dummy device and tc actions. Ive copied as many people as i
> could dig who i know may have interest in this.
> Please forward this to any other list which may have interest
> in the subject. It still needs some cleaning up; however, i dont wanna
> sit on it for another year - and now that mirred is out there, this is a
> good time.
> 
> Advantage over current IMQ; cleaner in particular in in SMP;
> with a _lot_ less code.
> Old Dummy device functionality is preserved while new one only
> kicks in if you use actions. Didnt have to write a new device and finaly
> made a real dumb device to be a little smarter ;->
> 
> IMQ USES
> --------
> As far as i know the reasons listed below is why people use IMQ. 
> It would be nice to know of anything else that i missed because this
> is the requirements list i used.
> 
> 1) qdiscs/policies that are per device as opposed to system wide.
> IMQ allows for sharing across multiple devices.
> 
> 2) Allows for queueing incoming traffic for shaping instead of
> dropping. I am not aware of any study that shows policing is 
> worse than shaping in achieving the end goal of rate control.
> I would be interested if anyone is experimenting. Nevertheless,
> this is still an alternative as opposed to making a system wide
> ingress change.
> 
> 3) Very interesting use: if you are serving p2p you may wanna give 
> preference to your own localy originated traffic (when responses come
> back) vs someone using your system to do bittorent. So QoSing based on
> state comes in as the solution. What people did to achive this was stick
> the IMQ somewhere prelocal hook.
> I think this is a pretty neat feature to have in Linux in general.
> (i.e not just for IMQ).
> But i wont go back to putting netfilter hooks in the device to satisfy
> this.  I also dont think its worth it hacking dummy some more to be 
> aware of say L3 info and play ip rule tricks to achieve this.
> --> Instead the plan is to have a contrack related action. This action
> will selectively either query/create contrack state on incoming packets.
> Packets could then be redirected to dummy based on what happens -> eg 
> on incoming packets; if we find they are of known state we could send to
> a different queue than one which didnt have existing state. This
> all however is dependent on whatever rules the admin enters.
> 
> What you can do with dummy currently with actions
> --------------------------------------------------
> 
> Lets say you are policing packets from alias 192.168.200.200/32
> you dont want those to exceed 100kbps going out.
> 
> tc filter add dev eth0 parent 1: protocol ip prio 10 u32 \
> match ip src 192.168.200.200/32 flowid 1:2 \
> action police rate 100kbit burst 90k drop
> 
> If you run tcpdump on eth0 you will see all packets going out
> with src 192.168.200.200/32 dropped or not
> Extend the rule a little to see only the ones that made it out:
> 
> tc filter add dev eth0 parent 1: protocol ip prio 10 u32 \
> match ip src 192.168.200.200/32 flowid 1:2 \
> action police rate 10kbit burst 90k drop \
> action mirred egress mirror dev dummy0 
> 
> Now fire tcpdump on dummy0 to see only those packets ..
> tcpdump -n -i dummy0 -x -e -t 
> 
> Essentially a good debugging/logging interface.
> 
> If you replace mirror with redirect, those packets will be
> blackholed and will never make it out. This redirect behavior
> changes with new patch (but not the mirror). 
> 
> 
> What you can do with dummy and attached patch
> ----------------------------------------------
> 
> Essentially provide functionality that most people use IMQ;
> sample below:
> 
> --------
> export TC="/sbin/tc"
> 
> $TC qdisc add dev dummy0 root handle 1: prio 
> $TC qdisc add dev dummy0 parent 1:1 handle 10: sfq
> $TC qdisc add dev dummy0 parent 1:2 handle 20: tbf rate 20kbit buffer 1600 limit 3000
> $TC qdisc add dev dummy0 parent 1:3 handle 30: sfq
> $TC filter add dev dummy0 protocol ip pref 1 parent 1: handle 1 fw classid 1:1
> $TC filter add dev dummy0 protocol ip pref 2 parent 1: handle 2 fw classid 1:2
> 
> ifconfig dummy0 up
> 
> $TC qdisc add dev eth0 ingress
> 
> # redirect all IP packets arriving in eth0 to dummy0 
> # use mark 1 --> puts them onto class 1:1
> $TC filter add dev eth0 parent ffff: protocol ip prio 10 u32 \
> match u32 0 0 flowid 1:1 \
> action ipt -j MARK --set-mark 1 \
> action mirred egress redirect dev dummy0
> 
> --------
> 
> 
> Run A Little test:
> 
> from another machine ping so that you have packets going into the box:
> -----
> [root@jzny action-tests]# ping 10.22
> PING 10.22 (10.0.0.22): 56 data bytes
> 64 bytes from 10.0.0.22: icmp_seq=0 ttl=64 time=2.8 ms
> 64 bytes from 10.0.0.22: icmp_seq=1 ttl=64 time=0.6 ms
> 64 bytes from 10.0.0.22: icmp_seq=2 ttl=64 time=0.6 ms
> 
> --- 10.22 ping statistics ---
> 3 packets transmitted, 3 packets received, 0% packet loss
> round-trip min/avg/max = 0.6/1.3/2.8 ms
> [root@jzny action-tests]# 
> -----
> Now look at some stats:
> 
> ---
> [root@jmandrake]:~# $TC -s filter show parent ffff: dev eth0
> filter protocol ip pref 10 u32 
> filter protocol ip pref 10 u32 fh 800: ht divisor 1 
> filter protocol ip pref 10 u32 fh 800::800 order 2048 key ht 800 bkt 0
> flowid 1:1 
>   match 00000000/00000000 at 0
>         action order 1: tablename: mangle  hook: NF_IP_PRE_ROUTING 
>         target MARK set 0x1  
>         index 1 ref 1 bind 1 installed 4195sec  used 27sec 
>          Sent 252 bytes 3 pkts (dropped 0, overlimits 0) 
> 
>         action order 2: mirred (Egress Redirect to device dummy0) stolen
>         index 1 ref 1 bind 1 installed 165 sec used 27 sec
>          Sent 252 bytes 3 pkts (dropped 0, overlimits 0) 
> 
> [root@jmandrake]:~# $TC -s qdisc
> qdisc sfq 30: dev dummy0 limit 128p quantum 1514b 
>  Sent 0 bytes 0 pkts (dropped 0, overlimits 0) 
> qdisc tbf 20: dev dummy0 rate 20Kbit burst 1575b lat 2147.5s 
>  Sent 210 bytes 3 pkts (dropped 0, overlimits 0) 
> qdisc sfq 10: dev dummy0 limit 128p quantum 1514b 
>  Sent 294 bytes 3 pkts (dropped 0, overlimits 0) 
> qdisc prio 1: dev dummy0 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1
> 1
>  Sent 504 bytes 6 pkts (dropped 0, overlimits 0) 
> qdisc ingress ffff: dev eth0 ---------------- 
>  Sent 308 bytes 5 pkts (dropped 0, overlimits 0) 
> 
> [root@jmandrake]:~# ifconfig dummy0
> dummy0    Link encap:Ethernet  HWaddr 00:00:00:00:00:00  
>           inet6 addr: fe80::200:ff:fe00:0/64 Scope:Link
>           UP BROADCAST RUNNING NOARP  MTU:1500  Metric:1
>           RX packets:6 errors:0 dropped:3 overruns:0 frame:0
>           TX packets:3 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:32 
>           RX bytes:504 (504.0 b)  TX bytes:252 (252.0 b)
> -----
> 
> Dummy continues to behave like it always did.
> If you send it any packet that did not originate from the actions, it
> will drop it.
> [In this case the three dropped packets were ipv6 ndisc].
> 
> My goal here is to start a discussion to see if people agree this is
> a good replacement for IMQ or whether to go another path.
> Clearly i would prefer to have this change in, but I am not religious 
> and would listen to reason about how it should be done as long as no 
> unnecessary clutter happens. 
> 
> Patch attached.
> 
> cheers,
> jamal
> 
> 
> 
> 
> 
> ------------------------------------------------------------------------
> 
> --- a/drivers/net/dummy.c.orig	2004-12-24 16:34:33.000000000 -0500
> +++ b/drivers/net/dummy.c	2005-01-18 06:43:47.000000000 -0500
> @@ -26,7 +26,14 @@
>  			Nick Holloway, 27th May 1994
>  	[I tweaked this explanation a little but that's all]
>  			Alan Cox, 30th May 1994
> +
>  */
> +/*
> +	* This driver isnt abused enough ;->
> +	* Here to add only _just_ a _feeew more_ features,
> +	* 10 years after AC added comment above ;-> hehe - JHS
> +*/
> +
>  
>  #include <linux/config.h>
>  #include <linux/module.h>
> @@ -35,11 +42,128 @@
>  #include <linux/etherdevice.h>
>  #include <linux/init.h>
>  #include <linux/moduleparam.h>
> +#ifdef CONFIG_NET_CLS_ACT
> +#include <net/pkt_sched.h> 
> +#endif
> +
> +#define TX_TIMEOUT  (2*HZ)
> +                                                                                
> +#define TX_Q_LIMIT    32
> +struct dummy_private {
> +	struct net_device_stats stats;
> +#ifdef CONFIG_NET_CLS_ACT
> +	struct tasklet_struct   dummy_tasklet;
> +	int     tasklet_pending;
> +	/* mostly debug stats leave in for now */
> +	unsigned long   stat_r1;
> +	unsigned long   stat_r2;
> +	unsigned long   stat_r3;
> +	unsigned long   stat_r4;
> +	unsigned long   stat_r5;
> +	unsigned long   stat_r6;
> +	unsigned long   stat_r7;
> +	unsigned long   stat_r8;
> +	struct sk_buff_head     rq;
> +	struct sk_buff_head     tq;
> +#endif
> +};
> +
> +#ifdef CONFIG_NET_CLS_ACT
> +static void ri_tasklet(unsigned long dev);
> +#endif
> +
>  
>  static int numdummies = 1;
>  
>  static int dummy_xmit(struct sk_buff *skb, struct net_device *dev);
>  static struct net_device_stats *dummy_get_stats(struct net_device *dev);
> +static void dummy_timeout(struct net_device *dev);
> +static int dummy_open(struct net_device *dev);
> +static int dummy_close(struct net_device *dev);
> +
> +static void dummy_timeout(struct net_device *dev) {
> +
> +	int cpu = smp_processor_id();
> +
> +	dev->trans_start = jiffies;
> +	printk("%s: BUG tx timeout on CPU %d\n",dev->name,cpu);
> +	if (spin_is_locked((&dev->xmit_lock)))
> +		printk("xmit lock grabbed already\n");
> +	if (spin_is_locked((&dev->queue_lock)))
> +		printk("queue lock grabbed already\n");
> +}
> +
> +#ifdef CONFIG_NET_CLS_ACT
> +static void ri_tasklet(unsigned long dev) {
> +
> +	struct net_device *dv = (struct net_device *)dev;
> +	struct dummy_private *dp = ((struct net_device *)dev)->priv;
> +	struct net_device_stats *stats = &dp->stats;
> +	struct sk_buff *skb = NULL;
> +
> +	dp->stat_r4 +=1;
> +	if (NULL == (skb = skb_peek(&dp->tq))) {
> +		dp->stat_r5 +=1;
> +		if (spin_trylock(&dv->xmit_lock)) {
> +			dp->stat_r8 +=1;
> +			while (NULL != (skb = skb_dequeue(&dp->rq))) {
> +				skb_queue_tail(&dp->tq, skb);
> +			}
> +			spin_unlock(&dv->xmit_lock);
> +		} else {
> +	/* reschedule */
> +			dp->stat_r1 +=1;
> +			goto resched;
> +		}
> +	}
> +
> +	while (NULL != (skb = skb_dequeue(&dp->tq))) {
> +		__u32 from = G_TC_FROM(skb->tc_verd);
> +
> +		skb->tc_verd = 0;
> +		skb->tc_verd = SET_TC_NCLS(skb->tc_verd);
> +		stats->tx_packets++;
> +		stats->tx_bytes+=skb->len;
> +		if (from & AT_EGRESS) {
> +			dp->stat_r6 +=1;
> +			dev_queue_xmit(skb);
> +		} else if (from & AT_INGRESS) {
> +
> +			dp->stat_r7 +=1;
> +			netif_rx(skb);
> +		} else {
> +			/* if netfilt is compiled in and packet is
> +			tagged, we could reinject the packet back
> +			this would make it do remaining 10%
> +			of what current IMQ does  
> +			if someone really really insists then
> +			this is the spot .. jhs */
> +			dev_kfree_skb(skb);
> +			stats->tx_dropped++;
> +		}
> +	}
> +
> +	if (spin_trylock(&dv->xmit_lock)) {
> +		dp->stat_r3 +=1;
> +		if (NULL == (skb = skb_peek(&dp->rq))) {
> +			dp->tasklet_pending = 0;
> +			if (netif_queue_stopped(dv))
> +				//netif_start_queue(dv);
> +				netif_wake_queue(dv);
> +		} else {
> +			dp->stat_r2 +=1;
> +			spin_unlock(&dv->xmit_lock);
> +			goto resched;
> +		}
> +		spin_unlock(&dv->xmit_lock);
> +	} else {
> +resched:
> +		dp->tasklet_pending = 1;
> +		tasklet_schedule(&dp->dummy_tasklet);
> +	}
> +
> +}
> +#endif
>  
>  static int dummy_set_address(struct net_device *dev, void *p)
>  {
> @@ -62,12 +186,17 @@
>  	/* Initialize the device structure. */
>  	dev->get_stats = dummy_get_stats;
>  	dev->hard_start_xmit = dummy_xmit;
> +	dev->tx_timeout = &dummy_timeout;
> +	dev->watchdog_timeo = TX_TIMEOUT;
> +	dev->open = &dummy_open;
> +	dev->stop = &dummy_close;
> +
>  	dev->set_multicast_list = set_multicast_list;
>  	dev->set_mac_address = dummy_set_address;
>  
>  	/* Fill in device structure with ethernet-generic values. */
>  	ether_setup(dev);
> -	dev->tx_queue_len = 0;
> +	dev->tx_queue_len = TX_Q_LIMIT;
>  	dev->change_mtu = NULL;
>  	dev->flags |= IFF_NOARP;
>  	dev->flags &= ~IFF_MULTICAST;
> @@ -77,18 +206,64 @@
>  
>  static int dummy_xmit(struct sk_buff *skb, struct net_device *dev)
>  {
> -	struct net_device_stats *stats = netdev_priv(dev);
> +	struct dummy_private *dp = ((struct net_device *)dev)->priv;
> +	struct net_device_stats *stats = &dp->stats;
> +	int ret = 0;
>  
> +	{
>  	stats->tx_packets++;
>  	stats->tx_bytes+=skb->len;
> +	}
> +#ifdef CONFIG_NET_CLS_ACT
> +	__u32 from = G_TC_FROM(skb->tc_verd);
> +	if (!from || !skb->input_dev ) {
> +dropped:
> +		 dev_kfree_skb(skb);
> +		 stats->rx_dropped++;
> +		 return ret;
> +	} else {
> +		if (skb->input_dev)
> +			skb->dev = skb->input_dev;
> +		else
> +			printk("warning!!! no idev %s\n",skb->dev->name);
>  
> +		skb->input_dev = dev;
> +		if (from & AT_INGRESS) {
> +			skb_pull(skb, skb->dev->hard_header_len);
> +		} else {
> +			if (!(from & AT_EGRESS)) {
> +				goto dropped;
> +			}
> +		}
> +	}
> +	if (skb_queue_len(&dp->rq) >= dev->tx_queue_len) {
> +		netif_stop_queue(dev);
> +	}
> +	dev->trans_start = jiffies;
> +	skb_queue_tail(&dp->rq, skb);
> +	if (!dp->tasklet_pending) {
> +		dp->tasklet_pending = 1;
> +		tasklet_schedule(&dp->dummy_tasklet);
> +	}
> +
> +#else
> +	stats->rx_dropped++;
>  	dev_kfree_skb(skb);
> -	return 0;
> +#endif
> +	return ret;
>  }
>  
>  static struct net_device_stats *dummy_get_stats(struct net_device *dev)
>  {
> -	return netdev_priv(dev);
> +	struct dummy_private *dp = ((struct net_device *)dev)->priv;
> +	struct net_device_stats *stats = &dp->stats;
> +#ifdef CONFIG_NET_CLS_ACT_DEB
> +	printk("tasklets stats %ld:%ld:%ld:%ld:%ld:%ld:%ld:%ld \n",
> +		dp->stat_r1,dp->stat_r2,dp->stat_r3,dp->stat_r4,
> +		dp->stat_r5,dp->stat_r6,dp->stat_r7,dp->stat_r8);
> +#endif
> +
> +	return stats;
>  }
>  
>  static struct net_device **dummies;
> @@ -97,12 +272,41 @@
>  module_param(numdummies, int, 0);
>  MODULE_PARM_DESC(numdummies, "Number of dummy pseudo devices");
>  
> +static int dummy_close(struct net_device *dev)
> +{
> +
> +#ifdef CONFIG_NET_CLS_ACT
> +	struct dummy_private *dp = ((struct net_device *)dev)->priv;
> +
> +	tasklet_kill(&dp->dummy_tasklet);
> +	skb_queue_purge(&dp->rq);
> +	skb_queue_purge(&dp->tq);
> +#endif
> +	netif_stop_queue(dev);
> +	return 0;
> +}
> +
> +static int dummy_open(struct net_device *dev)
> +{
> +
> +#ifdef CONFIG_NET_CLS_ACT
> +	struct dummy_private *dp = ((struct net_device *)dev)->priv;
> +
> +	tasklet_init(&dp->dummy_tasklet, ri_tasklet, (unsigned long)dev);
> +	skb_queue_head_init(&dp->rq);
> +	skb_queue_head_init(&dp->tq);
> +#endif
> +	netif_start_queue(dev);
> +	return 0;
> +}
> +
> +
>  static int __init dummy_init_one(int index)
>  {
>  	struct net_device *dev_dummy;
>  	int err;
>  
> -	dev_dummy = alloc_netdev(sizeof(struct net_device_stats),
> +	dev_dummy = alloc_netdev(sizeof(struct dummy_private),
>  				 "dummy%d", dummy_setup);
>  
>  	if (!dev_dummy)
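
For readers following the quoted patch: it decides what happens to a packet in two places, first in dummy_xmit() (accept or drop) and later in ri_tasklet() (reinject on egress, reinject on ingress, or drop). The block below is only a user-space model of those two decisions, not kernel code. The enum values, struct model_pkt and both function names are illustrative stand-ins for G_TC_FROM(skb->tc_verd), AT_INGRESS/AT_EGRESS and skb->input_dev.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins for the origin tag carried in skb->tc_verd. */
enum origin { ORIGIN_NONE = 0, ORIGIN_INGRESS = 1, ORIGIN_EGRESS = 2 };

enum verdict { VERDICT_DROP, VERDICT_NETIF_RX, VERDICT_DEV_QUEUE_XMIT };

struct model_pkt {
	unsigned int from;	/* bitmask of enum origin values */
	int has_input_dev;	/* models skb->input_dev != NULL */
};

/*
 * Models the checks at the top of dummy_xmit(): a packet that was not
 * redirected by a tc action (no origin tag, or no input device) is
 * dropped, which is the "old" dummy behaviour.
 */
int model_xmit_accepts(const struct model_pkt *p)
{
	assert(p != NULL);
	if (!p->from || !p->has_input_dev)
		return 0;
	return (p->from & (ORIGIN_INGRESS | ORIGIN_EGRESS)) != 0;
}

/* Models the dispatch loop in ri_tasklet(). */
enum verdict model_tasklet_dispatch(const struct model_pkt *p)
{
	assert(p != NULL);
	if (p->from & ORIGIN_EGRESS)
		return VERDICT_DEV_QUEUE_XMIT;
	if (p->from & ORIGIN_INGRESS)
		return VERDICT_NETIF_RX;
	return VERDICT_DROP;
}
```

A packet sent to dummy0 directly by the stack (such as the IPv6 ndisc packets in the stats above) corresponds to ORIGIN_NONE and is rejected by model_xmit_accepts().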





* Re: dummy as IMQ replacement
  2005-01-31 15:59         ` Thomas Graf
@ 2005-01-31 16:40           ` jamal
  2005-01-31 18:15             ` Thomas Graf
  0 siblings, 1 reply; 126+ messages in thread
From: jamal @ 2005-01-31 16:40 UTC (permalink / raw)
  To: Thomas Graf
  Cc: netdev, Nguyen Dinh Nam, Remus, Andre Tomt, syrius.ml,
	Andy Furniss, Damion de Soto

On Mon, 2005-01-31 at 10:59, Thomas Graf wrote:
> > My experience is that you end up dropping no more than a packet in a
> > burst with policing before TCP adjusts. Also depending on the gap
> > between bursts, that may be the only packet you drop altogether.
> > In long flows such as file transfers, an average of one packet ever gets
> > dropped.
> 
> I mostly agree but not completely. It's definitely true that most of
> the problems I'm fighting today are caused by the attempt to be too
> perfect in calculating. Going a step backwards solves most of the
> problems and probably works just fine for most cases. One of the main
> problems I'm facing here is big file transfers on low latency links with
> modified IP stacks that allow for a "faster" slow start (those are the
> reason why I'm trying to do this). An attempt to drop only a few
> packets results in a stronger incremental growth. I'm not quite sure
> why that happens yet but a more aggressive policing strategy helped a
> lot. I agree that if we plan to put something like this into mainline
> those problem domains should be separated to not overcomplicate the
> whole thing.
> 

Would be interesting to combine policing and random dropping to see 
what happens. 
I think this is something you should be able to find a student to abuse
so they can write a paper ;-> Linux is probably not even the right
place to start - rather run simulations until you get it right,
then code it into Linux.
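
To make the idea of "policing plus random dropping" concrete, here is a minimal user-space sketch of one way the two could be combined. It is purely hypothetical and not code from this thread: struct policer, policer_init() and policer_admit() are made-up names, and the linear drop ramp below a half-full bucket is an arbitrary assumption. The caller supplies the random number so that the behaviour stays deterministic and easy to simulate.

```c
/* Hypothetical token-bucket policer with a random early-drop ramp. */
struct policer {
	double rate;	/* refill rate, bytes per second */
	double burst;	/* bucket depth, bytes */
	double tokens;	/* tokens currently available, bytes */
	double last;	/* time of the last update, seconds */
};

void policer_init(struct policer *p, double rate, double burst, double now)
{
	p->rate = rate;
	p->burst = burst;
	p->tokens = burst;	/* start with a full bucket */
	p->last = now;
}

/*
 * Returns 1 if the packet passes, 0 if it is dropped.
 * rnd is a caller-supplied uniform random number in [0, 1).
 */
int policer_admit(struct policer *p, double now, double len, double rnd)
{
	double fill;

	/* Refill for the elapsed time, capped at the bucket depth. */
	p->tokens += (now - p->last) * p->rate;
	if (p->tokens > p->burst)
		p->tokens = p->burst;
	p->last = now;

	/* Plain policing: not enough tokens means a hard drop. */
	if (p->tokens < len)
		return 0;

	/*
	 * Random early drop (assumption): once the bucket is less than
	 * half full, drop with a probability that grows linearly from
	 * 0 (half full) to 1 (empty).
	 */
	fill = p->tokens / p->burst;
	if (fill < 0.5 && rnd < (0.5 - fill) * 2.0)
		return 0;

	p->tokens -= len;
	return 1;
}
```

With rate 1000 bytes/s and burst 3000 bytes, three back-to-back 1000-byte packets can pass, a fourth is hard-dropped, and after one second of refill the next packet is subject to the random ramp.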

> Sounds good. We'll need to address this anyway, the classifiers rely
> on the ip header being valid which is no longer assured.

true - i was thinking of restoring stateless NAT at this level as well.
So csum would be needed. The csum could be programmed to either
validate only or recompute; those are the only two arguments to it that
i could think of. I suppose the first thing is to put out the eaction patch
then add this action. I will try to sneak in some time this week and
write the eaction.
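
As an illustration of the two modes mentioned above ("validate only" and "recompute"), the following is a user-space sketch of the standard Internet checksum (RFC 1071) applied to an IPv4 header. It is not the proposed csum action, which had not been written at this point; the function names are made up for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* One's-complement sum of a buffer, folded to 16 bits (RFC 1071). */
uint16_t csum_fold_sum(const uint8_t *data, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)data[i] << 8) | data[i + 1];
	if (len & 1)
		sum += (uint32_t)data[len - 1] << 8;	/* pad odd byte */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);	/* end-around carry */
	return (uint16_t)sum;
}

/* "validate only": a header with a correct checksum sums to 0xffff. */
int ip_csum_validate(const uint8_t *hdr, size_t hdr_len)
{
	return csum_fold_sum(hdr, hdr_len) == 0xffff;
}

/*
 * "recompute": zero the checksum field (bytes 10 and 11 of the IPv4
 * header), sum the header, and store the complement.
 */
void ip_csum_recompute(uint8_t *hdr, size_t hdr_len)
{
	uint16_t csum;

	hdr[10] = 0;
	hdr[11] = 0;
	csum = (uint16_t)~csum_fold_sum(hdr, hdr_len);
	hdr[10] = (uint8_t)(csum >> 8);
	hdr[11] = (uint8_t)(csum & 0xff);
}
```

Validation leaves the packet untouched, so it could equally be expressed as a match; recomputation modifies the packet and therefore has to be an action.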

> > Ok, I think both approaches are correct. ematch does the check/get
> > essentially; and action will create the set/tracking if needed.
> > For the example i gave, you are absolutely correct, ematch is
> > sufficient.
> 
> Right, so we can do something like the meta ematch/action split. What
> attributes do you intend to be modifiable? 

Essentially on ingress create state; i have to find my notes to give you
a precise answer. But one of the parameters was to select the level of
state tracking (such as "track IP only" - not sure how doable that is
with conntrack)
 
> A neat thing would be
> to overwrite the state and thus assign a packet to another connection
> which could be used to reimplement fast nat together with pedit.

Stateless NAT doesn't really need conntracking. pedit (taught to speak
English) + eaction csum should do it.
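
To show why rewriting an address and then fixing the checksum is enough for stateless NAT, here is a user-space sketch of the incremental checksum update from RFC 1624 (HC' = ~(~HC + ~m + m')). It is an illustration only, not pedit or the csum action; csum_update16() and csum_update32() are names invented for this example.

```c
#include <stdint.h>

/*
 * Update a 16-bit Internet checksum after one 16-bit word of the
 * covered data changes from old_word to new_word (RFC 1624, eqn. 3).
 */
uint16_t csum_update16(uint16_t old_csum, uint16_t old_word,
		       uint16_t new_word)
{
	uint32_t sum = (uint16_t)~old_csum;

	sum += (uint16_t)~old_word;
	sum += new_word;
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);	/* end-around carry */
	return (uint16_t)~sum;
}

/* Same update for a rewritten 32-bit IPv4 address, one half at a time. */
uint16_t csum_update32(uint16_t csum, uint32_t old_addr, uint32_t new_addr)
{
	csum = csum_update16(csum, (uint16_t)(old_addr >> 16),
			     (uint16_t)(new_addr >> 16));
	return csum_update16(csum, (uint16_t)(old_addr & 0xffff),
			     (uint16_t)(new_addr & 0xffff));
}
```

No per-connection state is consulted: the new checksum depends only on the old checksum and the two address values, which is what makes the translation stateless.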

cheers,
jamal


* Re: dummy as IMQ replacement
  2005-01-31 16:27 ` Andre Correa
@ 2005-01-31 16:51   ` Jamal Hadi Salim
  0 siblings, 0 replies; 126+ messages in thread
From: Jamal Hadi Salim @ 2005-01-31 16:51 UTC (permalink / raw)
  To: andre.correa
  Cc: netdev, Nguyen Dinh Nam, Remus, Andre Tomt, syrius.ml,
	Andy Furniss, Damion de Soto

On Mon, 2005-01-31 at 11:27, Andre Correa wrote:
> Hi all,
> 
> It has been a year since we (me and some cool folks) brought the original 
> IMQ back from "death". During this year we updated kernel and iptables 
> patches for every available version, created some new features (like 
> hooking after and before NAT, multiple IMQ devices, solved module 
> problems, etc), and helped lots of users on our mailing list. The wish 
> list grew, we created a site/FAQ/wiki. We are still missing "dumb device" 
> functionality. Our site is www.linuximq.net
> 

nice. Since you are deeply involved i think you can help bring closure to
this.

> Complicated or not, clean or not, it has been working in some interesting 
> scenarios with lots of load on it. I feel fine being able to help 
> the community somehow with it. I have found no time yet to check Jamal's 
> new patches but we would use dummy as the base for "real device" 
> functionality development.
> 
> At least it's nice to find we are discussing how to do it, no longer 
> whether IMQ functionality is needed, because it really is.
> 

The people have spoken, i suppose, is the right way to describe it.
If a lot of people use it, then its existence is justified. The problem
has been misrepresentation of why it's needed.

> Going one way or another we should not leave users alone again with nobody 
> taking care of this like it happened before. I plan to keep IMQ updated 
> with new kernel versions as usual.
> 
> Jamal, when you say "to replace" do you mean it may get into the vanilla 
> kernel? Do you plan to keep it updated from now on?
> 

The plan is to make that small update to the kernel to achieve the
functionality that IMQ provides. It doesn't have to be me who does the
updating thereafter; you can own this for example. The goal is to meet
those requirements with little noise in the kernel. If we have things in
the kernel, then there should be no need to maintain separate patches. 

> Either way, can we call this new thing something else? Current 
> users may not want to migrate, so both should work together. A user 
> should be able to patch a kernel with both.

I don't have an issue with renaming but i don't see any overwhelming
reason to do it on a new device when dummy seems to be sufficient.
Take a look at that patch and see what functionality is missing.
Forget about the iptables hooks. Read the discussion thread
to see what the plan to meet those requirements looks like - see if
something is missing. Please read the text i posted - it is verbose but
should give a good explanation.

> We (at linuximq.net) would be more than happy to help with it.

Like i said you guys can own this - just wanna reduce cruft in the
kernel.

cheers,
jamal


* Re: dummy as IMQ replacement
  2005-01-31 14:46                 ` Hasso Tepper
  2005-01-31 15:34                   ` jamal
@ 2005-01-31 18:00                   ` Lennert Buytenhek
  2005-01-31 20:08                     ` jamal
  1 sibling, 1 reply; 126+ messages in thread
From: Lennert Buytenhek @ 2005-01-31 18:00 UTC (permalink / raw)
  To: Hasso Tepper
  Cc: hadi, netdev, Nguyen Dinh Nam, Remus, Andre Tomt, syrius.ml,
	Andy Furniss, Damion de Soto

On Mon, Jan 31, 2005 at 04:46:14PM +0200, Hasso Tepper wrote:

> This is somewhat related to killing the chance to use iptables as well ... 
> Iptables has better documentation and people use it just because of that.

I'm afraid I have to agree on this one.  The idea behind iptables is
easy to grasp, whereas tc isn't totally obvious, and all tc 'tutorials'
out there just give you a long list of commands to type in but don't
really explain to you what goes on under the hood.

And you can't just expect everyone to "Go look at the source."


--L


* Re: dummy as IMQ replacement
  2005-01-31 16:40           ` jamal
@ 2005-01-31 18:15             ` Thomas Graf
  2005-01-31 20:18               ` jamal
  0 siblings, 1 reply; 126+ messages in thread
From: Thomas Graf @ 2005-01-31 18:15 UTC (permalink / raw)
  To: jamal
  Cc: netdev, Nguyen Dinh Nam, Remus, Andre Tomt, syrius.ml,
	Andy Furniss, Damion de Soto

> Would be interesting to combine policing and random dropping to see 
> what happens. 

Indeed.

> Probably Linux may not even be the right place to do it to start with
> - rather simulations until you get it right then code it into Linux.

I haven't left ns sim so far, my calculations are based on some of the
linux specific cc modifications though, i.e. I modified ns sim a bit
to provide the same information. I'm thinking about moving to umlsim,
it should provide a better real world simulation.

> true - i was thinking of restoring stateless NAT at this level as well.
> So csum would be needed. The csum could be programmed to either
> validate only or recompute; those are the only two arguments to it that
> i could think of. I suppose the first thing is to put out the eaction patch
> then add this action. I will try to sneak in some time this week and
> write the eaction.

Sounds good, we could put up an ematch csum for validation and an eaction
for recomputation. I'll wait for your code to show up.

> > Right, so we can do something like the meta ematch/action split. What
> > attributes do you intend to be modifiable? 
> 
> Essentially on ingress create state; i have to find my notes to give you
> a precise answer. But one of the parameters was to select the level of
> state tracking (such as "track IP only" - not sure how doable that is
> with conntrack)

So you want to have a generic conntrack action capable of dynamically
taking into account whatever information the user requests? This
reminds me of the esfq effort, which could benefit from this; it
extends sfq to take the definition of a flow as a parameter. We could
share some code here.
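
As a rough picture of what "taking the definition of a flow as a parameter" could mean, here is a small user-space sketch in which the caller chooses which header fields take part in the flow hash. It is not esfq code; the flag names, struct flow_tuple and flow_hash() are hypothetical, and FNV-1a is used only because it is a simple, well-known hash.

```c
#include <stdint.h>

/* Hypothetical selector flags: which fields define "a flow". */
enum flow_field {
	FLOW_SRC   = 1 << 0,
	FLOW_DST   = 1 << 1,
	FLOW_SPORT = 1 << 2,
	FLOW_DPORT = 1 << 3,
	FLOW_PROTO = 1 << 4
};

struct flow_tuple {
	uint32_t src, dst;
	uint16_t sport, dport;
	uint8_t proto;
};

/* Mix one 32-bit value into a running FNV-1a hash, byte by byte. */
static uint32_t fnv1a_mix(uint32_t h, uint32_t v)
{
	int i;

	for (i = 0; i < 4; i++) {
		h ^= (v >> (8 * i)) & 0xff;
		h *= 16777619u;		/* FNV prime */
	}
	return h;
}

/*
 * Map a packet to one of `buckets` queues using only the selected
 * fields. buckets must be non-zero.
 */
uint32_t flow_hash(const struct flow_tuple *t, unsigned int fields,
		   uint32_t buckets)
{
	uint32_t h = 2166136261u;	/* FNV offset basis */

	if (fields & FLOW_SRC)
		h = fnv1a_mix(h, t->src);
	if (fields & FLOW_DST)
		h = fnv1a_mix(h, t->dst);
	if (fields & FLOW_SPORT)
		h = fnv1a_mix(h, t->sport);
	if (fields & FLOW_DPORT)
		h = fnv1a_mix(h, t->dport);
	if (fields & FLOW_PROTO)
		h = fnv1a_mix(h, t->proto);
	return h % buckets;
}
```

Calling flow_hash() with FLOW_SRC | FLOW_DST corresponds to the "track IP only" level mentioned above: two connections between the same pair of hosts land in the same bucket regardless of their ports.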

> Stateless NAT doesn't really need conntracking. pedit (taught to speak
> English) + eaction csum should do it.

Right, given we don't need any reverse translation. Still it would be
neat to set the conntrack attributes so one could use iptables later
on, I'm not sure how doable this is though.

Something different...

This all sounds very good but I think we're still successfully ignoring
one of the most important points, usability. Most modifications over
the last few months have complicated things, introduced different behaviour
depending on compile time options, and left userspace tools which are either
outdated or have features that are completely undocumented. Some of the
recent additions don't even show up in the usage text of iproute2. So
I think we should at least part time focus a little more on the big
picture and make things consistent and more usable. At least 50% of the
functionality currently in mainline is completely unused because nobody
knows about it. I'm in no way against any of the recent additions but
maybe we can also put some more effort into usability.


* Re: dummy as IMQ replacement
  2005-01-31 18:00                   ` Lennert Buytenhek
@ 2005-01-31 20:08                     ` jamal
  0 siblings, 0 replies; 126+ messages in thread
From: jamal @ 2005-01-31 20:08 UTC (permalink / raw)
  To: Lennert Buytenhek
  Cc: Hasso Tepper, netdev, Nguyen Dinh Nam, Remus, Andre Tomt,
	syrius.ml, Andy Furniss, Damion de Soto

On Mon, 2005-01-31 at 13:00, Lennert Buytenhek wrote:
> On Mon, Jan 31, 2005 at 04:46:14PM +0200, Hasso Tepper wrote:
> 
> > This is somewhat related to killing the chance to use iptables as well ... 
> > Iptables has better documentation and people use it just because of that.
> 
> I'm afraid I have to agree on this one.  

Well, if you look at the 2 requirements behind IMQ, they have nothing to do
with iptables, i.e. they do not at all require the presence of iptables.
So the motivation is to meet those requirements, not to kill iptables.

> The idea behind iptables is
> easy to grasp, whereas tc isn't totally obvious, and all tc 'tutorials'
> out there just give you a long list of commands to type in but don't
> really explain to you what goes on under the hood.
> 
> And you can't just expect everyone to "Go look at the source."

Agreed, tc is less usable and has a lot fewer people puking code at it.
The usability part has to be fixed. And i think you will see that with
ematch and eaction code showing up. Credit goes to Bart and co and their
website for putting a lot of docs together. Usability certainly needs to
improve!

cheers,
jamal


* Re: dummy as IMQ replacement
  2005-01-31 18:15             ` Thomas Graf
@ 2005-01-31 20:18               ` jamal
  2005-01-31 22:53                 ` Thomas Graf
  0 siblings, 1 reply; 126+ messages in thread
From: jamal @ 2005-01-31 20:18 UTC (permalink / raw)
  To: Thomas Graf
  Cc: netdev, Nguyen Dinh Nam, Remus, Andre Tomt, syrius.ml,
	Andy Furniss, Damion de Soto

On Mon, 2005-01-31 at 13:15, Thomas Graf wrote:

> I haven't left ns sim so far, my calculations are based on some of the
> linux specific cc modifications though, i.e. I modified ns sim a bit
> to provide the same information. I'm thinking about moving to umlsim,
> it should provide a better real world simulation.
> 

The problem is that if there are any bugs in the way algos are implemented
in Linux, you are influenced by that truth. Starting with ns and then
validating on Linux is a great way to do it.

> Sounds good, we could put up an ematch csum for validation and an eaction
> for recomputation. I'll wait for your code to show up.
> 

Cross your fingers; worst case by weekend i should get something out.

> > Essentially on ingress create state; i have to find my notes to give you
> > a precise answer. But one of the parameters was to select the level of
> > state tracking (such as "track IP only" - not sure how doable that is
> > with conntrack)
> 
> So you want to have a generic conntrack action capable of dynamically
> taking into account whatever information the user requests? This
> reminds me of the esfq effort, which could benefit from this; it
> extends sfq to take the definition of a flow as a parameter. We could
> share some code here.
> 

I don't think conntrack was designed for this kind of effort. If we totally
fail to do it using conntrack then we could go a different path.
sfq already stores some rough view of the state; not sure if it can
benefit from this.

> > Stateless NAT doesn't really need conntracking. pedit (taught to speak
> > English) + eaction csum should do it.
> 
> Right, given we don't need any reverse translation. Still it would be
> neat to set the conntrack attributes so one could use iptables later
> on, I'm not sure how doable this is though.
> 

If you are NATing (stateless) you should enter rules for both
directions; Maybe we could write a wrapper where user only enters
outgoing rule and that automatically generates the incoming rule as
well.
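
The wrapper idea can be pictured with a few lines of user-space C: given the rule the user entered for the outgoing direction, derive the mirror rule for the incoming direction. This is only a sketch of the concept; struct nat_rule and nat_reverse() are hypothetical and do not correspond to any existing tc or pedit interface.

```c
#include <stdint.h>

enum nat_dir { NAT_EGRESS, NAT_INGRESS };
enum nat_field { NAT_SRC, NAT_DST };

/* Hypothetical stateless NAT rule: where to match, what to rewrite. */
struct nat_rule {
	enum nat_dir dir;	/* direction the rule is attached to */
	enum nat_field field;	/* which IPv4 address is matched/rewritten */
	uint32_t match;		/* address to match */
	uint32_t rewrite;	/* address to write in its place */
};

/*
 * Derive the reverse rule: opposite direction, opposite address field,
 * and match/rewrite swapped.
 */
struct nat_rule nat_reverse(struct nat_rule out)
{
	struct nat_rule in;

	in.dir = (out.dir == NAT_EGRESS) ? NAT_INGRESS : NAT_EGRESS;
	in.field = (out.field == NAT_SRC) ? NAT_DST : NAT_SRC;
	in.match = out.rewrite;
	in.rewrite = out.match;
	return in;
}
```

For example, "on egress rewrite source 10.0.0.22 to 192.0.2.1" yields "on ingress rewrite destination 192.0.2.1 to 10.0.0.22". Applying nat_reverse() twice gives back the original rule.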

> Something different...
> 
> This all sounds very good but I think we're still successfully ignoring
> one of the most important points, usability. 

Absolutely.

> Most modifications over
> the last few months have complicated things, introduced different behaviour
> depending on compile time options, and left userspace tools which are either
> outdated or have features that are completely undocumented. Some of the
> recent additions don't even show up in the usage text of iproute2. So
> I think we should at least part time focus a little more on the big
> picture and make things consistent and more usable. At least 50% of the
> functionality currently in mainline is completely unused because nobody
> knows about it. I'm in no way against any of the recent additions but
> maybe we can also put some more effort into usability.

I think the eactions etc are adding a lot of value towards usability.
Hasso Tepper was earlier complaining about this same issue. 
As an example, I think u32 and ematches would improve a great deal now
and be more understandable. True, work/time still needs to be invested.

cheers,
jamal


* Re: dummy as IMQ replacement
  2005-01-31 15:40       ` jamal
  2005-01-31 15:59         ` Thomas Graf
@ 2005-01-31 20:28         ` David S. Miller
  1 sibling, 0 replies; 126+ messages in thread
From: David S. Miller @ 2005-01-31 20:28 UTC (permalink / raw)
  To: hadi
  Cc: tgraf, netdev, nguyendinhnam, rmocius, andre, syrius.ml,
	andy.furniss, damion

On 31 Jan 2005 10:40:44 -0500
jamal <hadi@cyberus.ca> wrote:

> My experience is that you end up dropping no more than a packet in a
> burst with policing before TCP adjusts. Also depending on the gap
> between bursts, that may be the only packet you drop altogether.
> In long flows such as file transfers, an average of one packet ever gets
> dropped.

Keep in mind that this does not help people with connection
heavy access patterns.  If you have a lot of people doing
small transactions, ACK pacing as well as data traffic
dropping is necessary.

The heart of TCP pacing is ACK rates.  All of its data
sending is clocked via ACK arrival.

Therefore the best scheme seems to be ACK pacing along
with data dropping.  The ACK pacing is the "nice" policing
whereas the data dropping is the big hammer.  Ideally, the
ACK pacing will produce the desired data rate and thus the
data dropping will not be necessary.

ACK pacing is more desirable also because of schemes such
as VEGAS congestion control which wish to test the limits
of a link without any data drops.  Its basic idea is that
"if my delay increases, yet my throughput does not, I am
 doing nothing more than eating router queue space and
 therefore have gone beyond the limits of this path, back off"

I know there are problems with VEGAS, but it is a good example
to use in showing that the way to tame TCP's data sending rate
is by controlling the ACKs not by dropping the data, as a first
order method of policing.
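
The Vegas idea quoted above can be written down in a few lines. The sketch below follows the commonly described Vegas rule: compare the expected rate (cwnd / base RTT) with the actual rate (cwnd / current RTT) and treat the difference as an estimate of packets sitting in router queues. The function name and the alpha/beta thresholds are assumptions for illustration; this is not the Linux tcp_vegas code.

```c
enum vegas_action { VEGAS_INCREASE, VEGAS_HOLD, VEGAS_DECREASE };

/*
 * cwnd:     congestion window, in packets
 * base_rtt: smallest RTT seen on the path, in seconds
 * rtt:      current RTT, in seconds (rtt >= base_rtt > 0)
 * alpha, beta: lower/upper bounds on queued packets (alpha <= beta)
 */
enum vegas_action vegas_decide(double cwnd, double base_rtt, double rtt,
			       double alpha, double beta)
{
	double expected = cwnd / base_rtt;	/* rate with empty queues */
	double actual = cwnd / rtt;		/* rate actually achieved */
	/* Estimated number of our packets queued in the network. */
	double queued = (expected - actual) * base_rtt;

	if (queued < alpha)
		return VEGAS_INCREASE;	/* path not full yet */
	if (queued > beta)
		return VEGAS_DECREASE;	/* only eating queue space */
	return VEGAS_HOLD;
}
```

Because the decision depends only on delay, a shaper that delays ACKs (or data) can steer such a sender without dropping anything, which is the point being made about ACK pacing.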


* Re: dummy as IMQ replacement
  2005-01-30 22:12 dummy as IMQ replacement Jamal Hadi Salim
                   ` (2 preceding siblings ...)
  2005-01-31 16:27 ` Andre Correa
@ 2005-01-31 22:39 ` Andy Furniss
  2005-02-01 11:49   ` jamal
  2005-02-01 11:32 ` Andy Furniss
       [not found] ` <0fcf01c5077f$579e4b80$6e69690a@RIMAS>
  5 siblings, 1 reply; 126+ messages in thread
From: Andy Furniss @ 2005-01-31 22:39 UTC (permalink / raw)
  To: hadi
  Cc: netdev, Nguyen Dinh Nam, Remus, Andre Tomt, syrius.ml, Damion de Soto

Jamal Hadi Salim wrote:
> This is in relation to providing functionality that IMQ was intending
> to using the dummy device and tc actions. Ive copied as many people as i
> could dig who i know may have interest in this.
> Please forward this to any other list which may have interest
> in the subject. It still needs some cleaning up; however, i dont wanna
> sit on it for another year - and now that mirred is out there, this is a
> good time.
> 
> Advantage over current IMQ: cleaner, in particular on SMP,
> with a _lot_ less code.
> Old Dummy device functionality is preserved while the new one only
> kicks in if you use actions. Didn't have to write a new device and finally
> made a real dumb device a little smarter ;->
> 
> IMQ USES
> --------
> As far as i know the reasons listed below are why people use IMQ. 
> It would be nice to know of anything else that i missed because this
> is the requirements list i used.
> 
> 1) qdiscs/policies that are per device as opposed to system wide.
> IMQ allows for sharing across multiple devices.
> 
> 2) Allows for queueing incoming traffic for shaping instead of
> dropping. I am not aware of any study that shows policing is 
> worse than shaping in achieving the end goal of rate control.

I would say the end goal is shaping not just rate control. Shaping 
meaning different things to different people and ingress shaping being 
different from egress.

For me it's from the wrong end of a relatively narrow (512kbit) 
bottleneck link that has a 600ms fifo at the other end. My aim is to 
sacrifice as little bandwidth as possible while not adding latency 
bursts for gaming, and to have per-user bandwidth allocation (with sharing 
of unused bandwidth) and sfq within that for bulk tcp traffic.

If I was shaping LAN traffic, then policers/drops would be OK for me - 
but for a slow link I think queueing and dropping are better/give more 
control eg. I get to use sfq which should not drop the one packet a 56k 
user has managed to send me in the face of lots of incoming from low 
latency high bandwidth servers.

Even if it's possible I bet few can easily get policers to set up the 
complex sharing/prioritisations that you can with HTB or HFSC.


> I would be interested if anyone is experimenting. Nevertheless,
> this is still an alternative as opposed to making a system wide
> ingress change.
> 
> 3) Very interesting use: if you are serving p2p you may wanna give 
> preference to your own locally originated traffic (when responses come
> back) vs someone using your system to do bittorrent. So QoSing based on
> state comes in as the solution. What people did to achieve this was stick
> the IMQ somewhere at the prelocal hook.
> I think this is a pretty neat feature to have in Linux in general.
> (i.e not just for IMQ).

I think flexibility is always good - tunnels, ipsec etc. may need it - I 
don't know from personal use, though.

> But i wont go back to putting netfilter hooks in the device to satisfy
> this.  I also dont think its worth it hacking dummy some more to be 
> aware of say L3 info and play ip rule tricks to achieve this.
> --> Instead the plan is to have a conntrack related action. This action
> will selectively either query/create conntrack state on incoming packets.

I don't understand exactly what you mean here - for my setup to work I 
need to see denatted addresses and mark (connbytes - it helps me be 
extra nasty to multiple simultaneous connections in slowstart and 
prioritise browsing over bulk) in prerouting mangle. Of course if I can 
use netfilter to classify and save into conntrack then I could do 
everything in netfilter and then use something like connmark to save it 
per connection.


> Packets could then be redirected to dummy based on what happens -> eg 
> on incoming packets; if we find they are of known state we could send to
> a different queue than one which didnt have existing state. This
> all however is dependent on whatever rules the admin enters.


How does the admin enter the rules - netfilter or other?


Andy.


* Re: dummy as IMQ replacement
  2005-01-31 20:18               ` jamal
@ 2005-01-31 22:53                 ` Thomas Graf
  2005-02-01 12:02                   ` jamal
  0 siblings, 1 reply; 126+ messages in thread
From: Thomas Graf @ 2005-01-31 22:53 UTC (permalink / raw)
  To: jamal
  Cc: netdev, Nguyen Dinh Nam, Remus, Andre Tomt, syrius.ml,
	Andy Furniss, Damion de Soto

* jamal <1107202715.1075.559.camel@jzny.localdomain> 2005-01-31 15:18
> On Mon, 2005-01-31 at 13:15, Thomas Graf wrote:
> 
> > I haven't left ns sim so far, my calculations are based on some of the
> > linux specific cc modifications though, i.e. I modified ns sim a bit
> > to provide the same information. I'm thinking about moving to umlsim,
> > it should provide a better real world simulation.
> > 
> 
> The problem is if theres any bugs in the way algos are implemented in
> Linux you are influenced by that truth. Starting with ns and then
> validating on Linux is a great way to do it.

Absolutely, that's why I first went to ns sim but a nice theory is
worth nothing if it doesn't work in the real world.

> I dont think contrack was designed for this kind of effort. If we totaly
> fail to do it using contrack then we could go a different path.
> sfq already stores some rough view of the state; not sure if it can
> benefit from this.

I was thinking of the parameters to define what a flow consists of.
Extended SFQ basically allows you to define the hash function. I think
I misunderstood you before and you don't want to allow adjustable
states on only a subset of the attributes, e.g. only L3 data.

> If you are NATing (stateless) you should enter rules for both
> directions; Maybe we could write a wrapper where user only enters
> outgoing rule and that automatically generates the incoming rule as
> well.

Agreed iff we don't enforce it.

> I think the eactions etc are adding a lot of value towards usability.
> Hasso Tepper was ealrier complaining about this same issue. 
> As an example, I think u32 and ematches would improve a great deal now
> and be more understandable. True, work/time still needs to be invested.

I'd guess that the basic classifier will win the race because the
documentation will be smaller due to the lack of parameters. ;->
But yes I agree, I think we're making small steps forward and hopefully
the network config shell/tool/whatever will ease the steps to configure
things. My primary goal is to allow using it without looking up
parameters all the time, given one is aware of the common terms and basic
concepts. I'll have some more time next week and will try to implement the
traffic control bits or at least some of them. The wind forecast is pretty
good for the next few days so I won't have too much time. ;->


* Re: dummy as IMQ replacement
  2005-01-31 15:15     ` Thomas Graf
  2005-01-31 15:40       ` jamal
@ 2005-02-01  1:02       ` Andy Furniss
  2005-02-01 13:31         ` Thomas Graf
  1 sibling, 1 reply; 126+ messages in thread
From: Andy Furniss @ 2005-02-01  1:02 UTC (permalink / raw)
  To: Thomas Graf
  Cc: jamal, netdev, Nguyen Dinh Nam, Remus, Andre Tomt, syrius.ml,
	Damion de Soto

Thomas Graf wrote:
>>Or dropping packets. TCP will adjust itself either way; at least
>>thats true according to this formula [rfc3448] (originally derived from
>>Reno, but people are finding it works fine with all other variants of
>>TCP CC):
>>
>>-----
>>The throughput equation is:
>>
>>                                   s
>>   X =  ----------------------------------------------------------
>>        R*sqrt(2*b*p/3) + (t_RTO * (3*sqrt(3*b*p/8) * p * (1+32*p^2)))
>>
>>
>>Where:
>>
>>      X is the transmit rate in bytes/second.
>>      s is the packet size in bytes.
>>      R is the round trip time in seconds.
>>      p is the loss event rate, between 0 and 1.0, of the number of loss
>>        events as a fraction of the number of packets transmitted.
>>      t_RTO is the TCP retransmission timeout value in seconds.
>>      b is the number of packets acknowledged by a single TCP
>>        acknowledgement.

WRT policers I never figured out where you would put the effects of 
playing with the burst size parameter, its effects with few/many 
connections, and any burstiness caused into an equation like that.
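
As a rough illustration of the equation quoted above, here is a minimal
Python sketch that evaluates X for given s, R and p. The function name
tfrc_throughput is made up for this example; the defaults b = 1 and
t_RTO = 4*R are the simplifications RFC 3448 itself suggests, not
something stated in this thread. It does not model a policer's burst
parameter, which is the effect that is hard to place in the formula.

```python
import math


def tfrc_throughput(s, R, p, t_RTO=None, b=1):
    """Evaluate the RFC 3448 throughput equation.

    s      packet size in bytes
    R      round trip time in seconds
    p      loss event rate, 0 < p <= 1
    t_RTO  retransmission timeout in seconds (assumed 4*R if omitted)
    b      packets acknowledged by one TCP acknowledgement
    Returns X, the transmit rate in bytes/second.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    if t_RTO is None:
        t_RTO = 4 * R  # simplification suggested by RFC 3448
    denominator = (
        R * math.sqrt(2 * b * p / 3)
        + t_RTO * (3 * math.sqrt(3 * b * p / 8) * p * (1 + 32 * p ** 2))
    )
    return s / denominator


if __name__ == "__main__":
    # Same link, increasing loss event rate: the allowed rate falls.
    for loss in (0.001, 0.01, 0.05):
        print(loss, round(tfrc_throughput(1460, 0.1, loss)))
```

Whether the loss comes from a policer drop or a full shaper queue, it
enters the formula only through p (and, for a queue, through a larger R).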

>>----
> 
> 
> Agreed, this was my first attempt and my current code is still based on
> this. I'm trying to avoid a retransmit battle, therefore I try to
> delay packets if possible with the hope that it's either just a peak
> or the slow down is fast enough. I use a simplified RED and
> tcp_xmit_retransmit_queue() input to avoid flick flack effects which
> works pretty well for bulky streams. A burst buffer takes care
> of interactive traffic with peaks but this doesn't work perfectly fine
> yet. Overall, my attempt works pretty well if the other side uses
> reno/bic and quite well for westwood and vegas. The problem is not that
> it doesn't work at all but achieving a certain _stable_ rate is very
> difficult, the delta of the requested and real rate is up to 25% depending
> on the constancy of the rtt and wether they follow one of the proposed
> tcp cc algorithms. The cc guessing code helps a bit but isn't very
> accurate.
> 

This sounds cool. For me, in some ways I think it could be nicer (in the 
case of shaping from the wrong end of a slow link) to delay the real 
packets - that way the tcps of the clients get to see the smoothed 
version of the traffic and you can delay udp as well.

How intelligent and how much, if any, per connection state do you/could 
you keep? I think being able to set a class that behaves as full before 
it is, removing the s from sfq, de piggybacking acks and singling out 
and handling slowstart connections specially could really help the world 
of shaping from the wrong end of slow links.

There's always playing with rwin, but maybe that's a bit OTT :-)

Andy.


* Re: dummy as IMQ replacement
  2005-01-30 22:12 dummy as IMQ replacement Jamal Hadi Salim
                   ` (3 preceding siblings ...)
  2005-01-31 22:39 ` Andy Furniss
@ 2005-02-01 11:32 ` Andy Furniss
       [not found] ` <0fcf01c5077f$579e4b80$6e69690a@RIMAS>
  5 siblings, 0 replies; 126+ messages in thread
From: Andy Furniss @ 2005-02-01 11:32 UTC (permalink / raw)
  To: hadi; +Cc: netdev

I sent two replies to this thread last night, which haven't shown up yet 
- did anyone get them?

Andy.


* Re: dummy as IMQ replacement
  2005-01-31 22:39 ` Andy Furniss
@ 2005-02-01 11:49   ` jamal
  2005-02-01 14:53     ` Andy Furniss
  0 siblings, 1 reply; 126+ messages in thread
From: jamal @ 2005-02-01 11:49 UTC (permalink / raw)
  To: Andy Furniss
  Cc: netdev, Nguyen Dinh Nam, Remus, Andre Tomt, syrius.ml, Damion de Soto

On Mon, 2005-01-31 at 17:39, Andy Furniss wrote:
> Jamal Hadi Salim wrote:

> > 2) Allows for queueing incoming traffic for shaping instead of
> > dropping. I am not aware of any study that shows policing is 
> > worse than shaping in achieving the end goal of rate control.
> 
> I would say the end goal is shaping not just rate control. Shaping 
> meaning different things to different people and ingress shaping being 
> different from egress.

I know for a while the Bart howto was mislabeling the meaning of
policing - not sure about shaping. 
Shaping has a precise definition that involves a queue and a
non-work-conserving scheduler in order to rate control. It doesn't
matter where it happens (egress or ingress). Policing on the other hand
is work conserving etc.
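
To make the distinction concrete, here is a small Python sketch
contrasting the two, assuming a single token bucket, packets given as
(arrival_time, size) pairs in time order, an unbounded shaper queue and
no packet larger than the burst size. All names (TokenBucket, police,
shape) are hypothetical, and this is not how the kernel's police action
or any qdisc is implemented.

```python
class TokenBucket:
    """Token bucket with `rate` bytes/second and depth `burst` bytes."""

    def __init__(self, rate, burst):
        self.rate = rate
        self.burst = burst
        self.tokens = burst   # start with a full bucket
        self.last = 0.0       # time of the last refill

    def refill(self, now):
        elapsed = now - self.last
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now


def police(packets, rate, burst):
    """No queue: a packet either conforms and passes now, or is dropped."""
    bucket = TokenBucket(rate, burst)
    passed = []
    for arrival, size in packets:
        bucket.refill(arrival)
        if bucket.tokens >= size:
            bucket.tokens -= size
            passed.append((arrival, size))
    return passed


def shape(packets, rate, burst):
    """Queue: a nonconforming packet is held until enough tokens exist."""
    bucket = TokenBucket(rate, burst)
    departures = []
    for arrival, size in packets:
        # FIFO: a packet cannot leave before the one ahead of it.
        now = max(arrival, bucket.last)
        bucket.refill(now)
        if bucket.tokens < size:
            now += (size - bucket.tokens) / rate  # wait for missing tokens
            bucket.refill(now)
        bucket.tokens -= size
        departures.append((now, size))
    return departures


if __name__ == "__main__":
    burst_of_five = [(0.0, 1000)] * 5
    print("policed:", police(burst_of_five, rate=1000, burst=1000))
    print("shaped: ", shape(burst_of_five, rate=1000, burst=1000))
```

With the same rate and burst, the policer forwards only what fits in the
bucket at arrival time, while the shaper forwards everything but spreads
it out in time.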

> For me it's from the wrong end of a relativly narrow (512kbit) 
> bottleneck link that has a 600ms fifo at the other end. My aim to 
> sacrifice as little bandwidth as possible while not adding latency 
> bursts for gaming and per user bandwidth allocation (with sharing of 
> unused) and sfq within that for bulk tcp traffic.
>
> If I was shaping LAN traffic, then policers/drops would be OK for me - 
> but for a slow link I think queueing and dropping are better/give more 
> control eg. I get to use sfq which should not drop the one packet a 56k 
> user has managed to send me in the face of lots of incoming from low 
> latency high bandwidth servers.
>
> Even if it's possible I bet few can easily get policers to setup the 
> complex sharing/prioritisations that you can with HTB or HFSC.

sfq has a built in classifier that can efficiently separate those
flows so you can achieve semi-fairness; so it's not the shaping per se
that helps, rather you ended up using a clever scheme that can isolate
flows and benefited from shaping as a result; the hashing function
should prove weak when a lot of flows start showing up.
You could write some interesting classifier (as an example steal the one
that sfq has) and achieve the same end results with policing. This would
be easier now with ematches .. 

> > But i wont go back to putting netfilter hooks in the device to satisfy
> > this.  I also dont think its worth it hacking dummy some more to be 
> > aware of say L3 info and play ip rule tricks to achieve this.
> > --> Instead the plan is to have a contrack related action. This action
> > will selectively either query/create contrack state on incoming packets.
> 
> I don't understand exactly what you mean here - for my setup to work I 
> need to see denatted addresses and mark (connbytes - it helps me be 
> extra nasty to multiple simoultaneous connections in slowstart and 
> prioritise browsing over bulk) in prerouting mangle. Of course if I can 
> use netfilter to classify and save into contrack then I could do 
> evrything in netfilter and then use something like connmark to save it 
> per connection.
> 

You may be referring to requirement #3 then? 
In other words what you are doing is best served by knowing the state?
Are pre/post routing sufficient as netfilter hooks for your case?

> > Packets could then be redirected to dummy based on what happens -> eg 
> > on incoming packets; if we find they are of known state we could send to
> > a different queue than one which didnt have existing state. This
> > all however is dependent on whatever rules the admin enters.
> 
> 
> How does the admin enter the rules - netfilter or other?
> 

Just like i showed in that post (It was long - so dont wanna cutnpaste
here).

cheers,
jamal


* Re: dummy as IMQ replacement
  2005-01-31 22:53                 ` Thomas Graf
@ 2005-02-01 12:02                   ` jamal
  2005-02-01 12:51                     ` Thomas Graf
  0 siblings, 1 reply; 126+ messages in thread
From: jamal @ 2005-02-01 12:02 UTC (permalink / raw)
  To: Thomas Graf
  Cc: netdev, Nguyen Dinh Nam, Remus, Andre Tomt, syrius.ml,
	Andy Furniss, Damion de Soto

On Mon, 2005-01-31 at 17:53, Thomas Graf wrote:

> I was thinking of the parameters to define what a flow consists of.
> Extended SFQ basically allows you to define the hash function. I think
> I misunderstood you before and you don't want allow adjustable
> states on only a subset of the attributes, e.g. only L3 data.

Why bother putting extra classifier functionality into a qdisc? 
you should be able to rip off the classifier from sfq so you dont depend
on it; you can then select one of n queues (eaction meta set class 1:X
based on result of sfq classifier - or you can have it set the classids
based on resulting hash index) 

> > I think the eactions etc are adding a lot of value towards usability.
> > Hasso Tepper was ealrier complaining about this same issue. 
> > As an example, I think u32 and ematches would improve a great deal now
> > and be more understandable. True, work/time still needs to be invested.
> 
> I'd guess that the basic classifier will make the race because the
> documentation will be smaller due to the lack of parameters. ;->

Well, even if it is just being able to describe in English the u32
parameters and displaying them in English (by using a stored ID)
it's already huge progress.

> But yes I agree, I think we're making small step forwards and hopefully
> the network config shell/tool/whatever will ease the steps to configure
> things. My primary goal is to allow using it without looking up
> parameters all the time, given one is aware of the common terms and basic
> concepts. 

Online easy to use help is always valuable.

> I'll have some more time next week and will try to implement the
> traffic control bits or at least some of them. The wind forecast is pretty
> good for the next days so I won't have too much time. ;->

Weather is also predicted to be good here for the week; we are planning to 
get out of our igloos and go tobogganing ;->

cheers,
jamal
 


* Re: dummy as IMQ replacement
  2005-02-01 12:02                   ` jamal
@ 2005-02-01 12:51                     ` Thomas Graf
  2005-02-01 13:13                       ` jamal
  0 siblings, 1 reply; 126+ messages in thread
From: Thomas Graf @ 2005-02-01 12:51 UTC (permalink / raw)
  To: jamal
  Cc: netdev, Nguyen Dinh Nam, Remus, Andre Tomt, syrius.ml,
	Andy Furniss, Damion de Soto

> Why bother putting extra classifier functionality into a qdisc? 
> you should be able to rip off the classifier from sfq so you dont depend
> on it; you can then select one of n queues (eaction meta set class 1:X
> based on result of sfq classifier - or you can have it set the classids
> based on resulting hash index) 

Excellent idea, this would allow for various hash functions to be used
in a single sfq. We can use skb->tc_index for it so we can easily combine
it with an underlying dsmark. The hardest part is to find an intuitive
form to define the hash; it should be possible, for example, to define
a hash based on daddr + hproto only, completely ignoring saddr. The
perturbation must be made optional; sometimes the hash will not produce
any unwanted collisions (hash based on dscp for example) so modifying it
wouldn't make sense. We can fork sfq and make a gsfq which takes the
hash from tc_index and disables perturbation if it is set to 0.
Thoughts?
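
A minimal Python sketch of such a user-defined flow hash, under stated
assumptions: packets are plain dicts, the field names (saddr, daddr,
proto) and the function flow_hash are invented for illustration, and
CRC32 stands in for whatever hash sfq really uses. A perturbation of 0
leaves the hash untouched, mirroring the "disabled if set to 0" idea.

```python
import zlib


def flow_hash(pkt, fields, divisor=1024, perturbation=0):
    """Map a packet to a bucket using only the selected header fields.

    pkt           dict of header fields (hypothetical representation)
    fields        names of the fields that define a "flow"
    divisor       number of buckets
    perturbation  seed mixed into the hash; 0 means no perturbation
    """
    key = "|".join(str(pkt[name]) for name in fields).encode()
    # crc32's second argument is the starting value, used here as a seed.
    return zlib.crc32(key, perturbation & 0xFFFFFFFF) % divisor


if __name__ == "__main__":
    a = {"saddr": "10.0.0.1", "daddr": "192.168.1.1", "proto": 6}
    b = {"saddr": "10.0.0.2", "daddr": "192.168.1.1", "proto": 6}
    # Ignoring saddr puts both packets into the same bucket.
    print(flow_hash(a, ("daddr", "proto")), flow_hash(b, ("daddr", "proto")))
    # Including saddr treats them as separate flows (collisions aside).
    print(flow_hash(a, ("saddr", "daddr", "proto")),
          flow_hash(b, ("saddr", "daddr", "proto")))
```

The resulting bucket index is the kind of value that could be carried in
tc_index or turned into a classid by a separate action.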

> Well, even if it is just being able to describe in english the u32
> parameters and displaying them in english (by using a ID stored)
> its already huge progress.

True. I've some notes on paper describing a match definition db which
basically defines a u32 like match and assigns a name and id to it; it
is stored in an external database file so everyone can define their
own predefined matches without recompiling.

I've put together some code printing a tc tree as a whole and added
it to netconfig. It's just a start and still contains redundant
information which can be removed but I think it's already a step
forward because it all gets down to one command. Currently it's only
possible to filter on the device but I'll extend this later so one
can extract a part of the tree. One pretty ugly thing is that cbq
creates a qdisc and class with the same handle which gets quite
confusing if one wants to list the filters attached to a certain handle
because they will show up for both the qdisc and the root class.

Full output for a default ethernet device:
lsx# tc tree full where device eth0 
eth0 ether 00:02:44:63:ed:53 mtu 1500 <BROADCAST,MULTICAST,UP>
    txqlen 1000 weight 64 qdisc pfifo_fast irq 19 index 4 brd ff:ff:ff:ff:ff:ff
  pfifo_fast qdisc dev eth0 handle none parent none bands 3
      refcnt 1 priomap [1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1]
            besteffort => 1                0x8 => 1
                filler => 2                0x9 => 1
                  bulk => 2                0xa => 1
                   0x3 => 2                0xb => 1
      interactive_bulk => 1                0xc => 1
                   0x5 => 2                0xd => 1
           interactive => 0                0xe => 1
               control => 0                0xf => 1

Brief output for some cbq classes:
lsx# tc tree where device eth0 
eth0 ether 00:02:44:63:ed:53 mtu 1500 <BROADCAST,MULTICAST,UP>
  cbq qdisc dev eth0 handle 10: parent none rate 11.92MiB/s (95Mbit) prio 8
    cbq class dev eth0 handle 10: parent root rate 11.92MiB/s (95Mbit) prio 8
      u32 cls dev eth0 handle none parent 10: prio 10 protocol ip
      u32 cls dev eth0 handle 8000: parent 10: prio 10 protocol ip divisor 1
      u32 cls dev eth0 handle 8000:800 parent 10: prio 10 protocol ip target 10:12
      cbq class dev eth0 handle 10:12 parent 10: rate 11.92MiB/s (95Mbit) prio 3
        sfq qdisc dev eth0 handle 8003: parent 10:12 quantum 1514 perturb 0us

Full output with stats for some cbq classes:
lsx# tc tree stats where device eth0 
eth0 ether 00:02:44:63:ed:53 mtu 1500 <BROADCAST,MULTICAST,UP>
    txqlen 1000 weight 64 qdisc cbq irq 19 index 4 brd ff:ff:ff:ff:ff:ff
    Stats:    bytes    packets     errors    dropped   fifo-err compressed
    RX    46.65 MiB      42211          0          0          0          0
    TX     1.91 MiB      16234          0          0          0          0
    Errors:  length       over        crc      frame     missed  multicast
    RX            0          0          0          0          0          0
    Errors: aborted    carrier  heartbeat     window  collision
    TX            0          0          0          0          0
  cbq qdisc dev eth0 handle 10: parent none rate 11.92MiB/s (95Mbit) prio 8
      refcnt 1 avgpkt 1400 mpu 64 cell 16 allot 1514 weight 95Mbit
      minidle 65535999us maxidle 2us offtime 0us level 1 ewma_log 5
      penalty 0us strategy classic split none defmap 0x00000000 police ok
      Stats:    bytes    packets      drops overlimits       qlen    backlog
            44.02 KiB        380          0          0          0          0
               0.00 B/s        0 pps
              borrows    overact    avgidle  undertime
                    0          0        114          0
    cbq class dev eth0 handle 10: parent root rate 11.92MiB/s (95Mbit) prio 8
        avgpkt 1400 mpu 64 cell 16 allot 1514 weight 95Mbit
        minidle 65535999us maxidle 2us offtime 0us level 1 ewma_log 5
        penalty 0us strategy classic split none defmap 0x00000000 police ok
        Stats:    bytes    packets      drops overlimits       qlen    backlog
              44.02 KiB        380          0          0          0          0
                 0.00 B/s        0 pps
                borrows    overact    avgidle  undertime
                      0          0        114          0
      u32 cls dev eth0 handle none parent 10: prio 10 protocol ip
          Stats:    bytes    packets      drops overlimits       qlen    backlog
                   0.00 B          0          0          0          0          0
                   0.00 B/s        0 pps
      u32 cls dev eth0 handle 8000: parent 10: prio 10 protocol ip divisor 1
          Stats:    bytes    packets      drops overlimits       qlen    backlog
                   0.00 B          0          0          0          0          0
                   0.00 B/s        0 pps
      u32 cls dev eth0 handle 8000:800 parent 10: prio 10 protocol ip target 10:12
          nkeys 1 ht key 0x800 hash 0x0 <TERMINAL>
              match u32 at 8 & 0x00ff0000 == 0x00010000 successful 34
          Stats:    bytes    packets      drops overlimits       qlen    backlog
                   0.00 B          0          0          0          0          0
                   0.00 B/s        0 pps
               successful       hits
                       34        379

      cbq class dev eth0 handle 10:12 parent 10: rate 11.92MiB/s (95Mbit) prio 3
          child-qdisc 8003: avgpkt 500 mpu 0 cell 8 allot 1514 weight 95Mbit
          minidle 65535999us maxidle 0us offtime 0us level 0 ewma_log 5
          penalty 0us strategy classic split none defmap 0x00000000 police ok
          Stats:    bytes    packets      drops overlimits       qlen    backlog
                 3.25 KiB         34          0          0          0          0
                   0.00 B/s        0 pps
                  borrows    overact    avgidle  undertime
                        0          0         40          0
        sfq qdisc dev eth0 handle 8003: parent 10:12 quantum 1514 perturb 0us
            refcnt 1 limit 128 divisor 1024 flows 128
            Stats:    bytes    packets      drops overlimits       qlen    backlog
                   3.25 KiB         34          0          0          0          0
                     0.00 B/s        0 pps


* Re: dummy as IMQ replacement
  2005-02-01 12:51                     ` Thomas Graf
@ 2005-02-01 13:13                       ` jamal
  2005-02-01 22:44                         ` Thomas Graf
  0 siblings, 1 reply; 126+ messages in thread
From: jamal @ 2005-02-01 13:13 UTC (permalink / raw)
  To: Thomas Graf
  Cc: netdev, Nguyen Dinh Nam, Remus, Andre Tomt, syrius.ml,
	Andy Furniss, Damion de Soto

On Tue, 2005-02-01 at 07:51, Thomas Graf wrote:
> > Why bother putting extra classifier functionality into a qdisc? 
> > you should be able to rip off the classifier from sfq so you dont depend
> > on it; you can then select one of n queues (eaction meta set class 1:X
> > based on result of sfq classifier - or you can have it set the classids
> > based on resulting hash index) 
> 
> Excellent idea, this would allow for various hash functions to be used
> in a single sfq. We can use skb->tc_index for it so we can easly combine

Let the meta action do that. Just set the skb->tc_classid in my opinion;
recall we can change classid now ..

> it with a underlying dsmark. The hardest part is to find a intuitive
> form to define the hash, it should be possible to for example define
> a hash based on daddr + hproto only completely ignoring saddr. The
> perutrbation must be made optional, sometimes the hash will not produce
> any unwanted collisions (hash based on dscp for example) so modifying it
> wouldn't make sense. We can fork sfq and make a gsfq which takes the
> hash from tc_index and disabled perturbation if it is set to 0.
> Thoughts?

You can let the user define that via tc but have a default;
eg: 
tc dev eth0 add sfq ematch
tc dev eth0 set sfq perturb xxx

match u32 ...
ematch sfq
ematch meta classid 1:2 
eaction meta set tcindex 101
eaction meta set fwmark ..

etc

I have to run, havent looked at the rest of your email - will later.

cheers,
jamal 


* Re: dummy as IMQ replacement
  2005-02-01  1:02       ` Andy Furniss
@ 2005-02-01 13:31         ` Thomas Graf
  2005-02-01 15:03           ` Andy Furniss
  0 siblings, 1 reply; 126+ messages in thread
From: Thomas Graf @ 2005-02-01 13:31 UTC (permalink / raw)
  To: Andy Furniss
  Cc: jamal, netdev, Nguyen Dinh Nam, Remus, Andre Tomt, syrius.ml,
	Damion de Soto

> >>  X =  ----------------------------------------------------------
> >>       R*sqrt(2*b*p/3) + (t_RTO * (3*sqrt(3*b*p/8) * p * (1+32*p^2)))
> >>
> >>Where:
> >>
> >>     X is the transmit rate in bytes/second.
> >>     s is the packet size in bytes.
> >>     R is the round trip time in seconds.
> >>     p is the loss event rate, between 0 and 1.0, of the number of loss
> >>       events as a fraction of the number of packets transmitted.
> >>     t_RTO is the TCP retransmission timeout value in seconds.
> >>     b is the number of packets acknowledged by a single TCP
> >>       acknowledgement.
> 
> WRT policers I never figured out where you would put the effects of 
> playing with the burst size parameter and it's effects with few/many 
> connections and any burstiness caused into an equasion like that.

A burst buffer affects R for later packets; it can "smooth" R
and X and thus result in more stable rates. Depending on the actual
burst, it can avoid retransmits which stabilizes the rate as well.

> This sounds cool. For me in someways I think it could be nicer (in the 
> case of shaping from the wrong end of a slow link) to delay the real 
> packets - that way the tcps of the clients get to see the smoothed 
> version of the traffic and you can delay udp aswell.

It's impossible to never drop anything; for udp we can either drop
it or use ECN and hope the other ip stack takes care of it or the
application implements its own cc algorithm. Basically you can already
do that with (G)RED. Most UDP users which provide a continuous stream,
such as video streams, implement some kind of key datagram which contains
the number of datagrams received since the last key datagram and the
application throttles down based on that, so dropping is often the only
way to achieve a general working solution. Delaying UDP packets and
then dropping them if the buffer is full is very dangerous; often the
protocols based on UDP rely on the assumption that datagrams get lost
randomly and not successively. We can think about precise policing
for UDP again once the current poor application level cc algorithms
have failed and the industry has accepted ECN as the right thing. For
now most of them still suffer from the NIH syndrome in this area.

> How intelligent and how much, if any, per connection state do you/could 
> you keep?

I keep a rate estimator for every flow on ingress in a hash table and
look it up on egress with the flow parameters reversed. It gets
pretty expensive with huge numbers of connections, but usually one doesn't
want to do per connection policing on such boxes. ;->
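
The lookup scheme described above can be sketched in Python as follows.
This is only an illustration under assumptions: the estimator is a simple
exponentially weighted moving average, the flow key is the usual 5-tuple,
and the names RateEstimator, on_ingress and lookup_on_egress are invented;
none of this is taken from the actual code being discussed.

```python
class RateEstimator:
    """EWMA of the observed rate in bytes/second (assumed estimator)."""

    def __init__(self, alpha=0.25):
        self.alpha = alpha
        self.rate = 0.0
        self.last = None  # arrival time of the previous packet

    def update(self, now, size):
        if self.last is not None and now > self.last:
            sample = size / (now - self.last)
            self.rate += self.alpha * (sample - self.rate)
        self.last = now


# One estimator per flow, keyed by (saddr, sport, daddr, dport, proto).
flows = {}


def on_ingress(now, saddr, sport, daddr, dport, proto, size):
    """Account an incoming packet to its flow, creating state if needed."""
    key = (saddr, sport, daddr, dport, proto)
    flows.setdefault(key, RateEstimator()).update(now, size)


def lookup_on_egress(saddr, sport, daddr, dport, proto):
    """Find the ingress state for an outgoing packet.

    The outgoing packet's source is the incoming flow's destination,
    so the key is built with source and destination swapped.
    """
    return flows.get((daddr, dport, saddr, sport, proto))


if __name__ == "__main__":
    on_ingress(0.0, "10.0.0.1", 80, "10.0.0.2", 5000, "tcp", 1000)
    on_ingress(1.0, "10.0.0.1", 80, "10.0.0.2", 5000, "tcp", 1000)
    est = lookup_on_egress("10.0.0.2", 5000, "10.0.0.1", 80, "tcp")
    print(est.rate if est else "no state")
```

The cost noted above shows up here as one dictionary entry and one
update per packet for every tracked connection.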


* Re: dummy as IMQ replacement
  2005-02-01 11:49   ` jamal
@ 2005-02-01 14:53     ` Andy Furniss
  2005-02-02 14:05       ` jamal
  0 siblings, 1 reply; 126+ messages in thread
From: Andy Furniss @ 2005-02-01 14:53 UTC (permalink / raw)
  To: hadi
  Cc: netdev, Nguyen Dinh Nam, Remus, Andre Tomt, syrius.ml, Damion de Soto

jamal wrote:
> On Mon, 2005-01-31 at 17:39, Andy Furniss wrote:
> 
>>Jamal Hadi Salim wrote:
> 
> 
>>>2) Allows for queueing incoming traffic for shaping instead of
>>>dropping. I am not aware of any study that shows policing is 
>>>worse than shaping in achieving the end goal of rate control.
>>
>>I would say the end goal is shaping not just rate control. Shaping 
>>meaning different things to different people and ingress shaping being 
>>different from egress.
> 
> 
> I know for a while the Bart howto was mislabeling the meaning of
> policing - not sure about shaping. 
> Shaping has a precise definition that involves a queue and a
> non-working-conserving scheduler in order to rate control. This doesnt
> matter where it happens (egress or ingress). Policing on the other hand
> is work conserving etc.

Ok, but shaping to LARTC posters means being able to classify traffic 
and set up sharing/prioritising of classes. This is the reason most need 
to be able to queue - they want to use htb/hfsc for complicated setups 
and until now were not aware that it was even possible to replicate this 
in policers, and if it becomes feasible it will probably appear daunting 
to do compared with HTB and all the existing docs/scripts.

For me, I think queuing and dropping is better than just dropping, you 
can affect tcp by delaying eg. 1 ack per packet rather than delayed acks 
and clocking out the packets helps smooth burstiness, which hurts 
latency if you are doing traffic control from the wrong end of the 
bottleneck.

> 
>>For me it's from the wrong end of a relativly narrow (512kbit) 
>>bottleneck link that has a 600ms fifo at the other end. My aim to 
>>sacrifice as little bandwidth as possible while not adding latency 
>>bursts for gaming and per user bandwidth allocation (with sharing of 
>>unused) and sfq within that for bulk tcp traffic.
>>
>>If I was shaping LAN traffic, then policers/drops would be OK for me - 
>>but for a slow link I think queueing and dropping are better/give more 
>>control eg. I get to use sfq which should not drop the one packet a 56k 
>>user has managed to send me in the face of lots of incoming from low 
>>latency high bandwidth servers.
>>
>>Even if it's possible I bet few can easily get policers to setup the 
>>complex sharing/prioritisations that you can with HTB or HFSC.
> 
> 
> sfq has a built in classifier that can efficiently separate those
> flows so you can achieve semi-fairness; so its not the shaping perse
> that helps, rather you ended up using a clever scheme that can isolate
> flows and benefited from shaping as a result; the hashing function
> should prove weak when a lot of flows start showing up.
> You could write some interesting classifier (as an example steal the one
> that sfq has) and achieve the same end results with policing. This would
> be easier now with ematches .. 

The idea of losing the s from sfq and doing multilevel hash/mapping is 
attractive - of course I would want to queue each flow and have the 
queue be variable length for each flow depending on occupancy of other 
flows. I suppose a per flow intelligent dropping scheme would be even 
better. It would be nice to be able to set/control queuelength for link 
bandwidth, nothing classful in linux tc does this.


> 
> 
>>>But i wont go back to putting netfilter hooks in the device to satisfy
>>>this.  I also dont think its worth it hacking dummy some more to be 
>>>aware of say L3 info and play ip rule tricks to achieve this.
>>>--> Instead the plan is to have a contrack related action. This action
>>>will selectively either query/create contrack state on incoming packets.
>>
>>I don't understand exactly what you mean here - for my setup to work I 
>>need to see denatted addresses and mark (connbytes - it helps me be 
>>extra nasty to multiple simoultaneous connections in slowstart and 
>>prioritise browsing over bulk) in prerouting mangle. Of course if I can 
>>use netfilter to classify and save into contrack then I could do 
>>evrything in netfilter and then use something like connmark to save it 
>>per connection.
>>
> 
> 
> You may be refering to requirement #3 then? 
> In other words what you are doing is best served by knowing the state?

As long as I could use netfilter to mark/classify connections then I 
think I can sort my setup, don't know about others.


> Are pre/post routing sufficient as netfilter hooks for your case?

Yes but depends on where in pre/postrouting. For me after/before nat, as 
I say above though I could workaround if I classify connections with 
netfilter. For others as long as they can filter on a mark/classify set 
in forward, then I think it will be OK for them.

> 
> 
>>>Packets could then be redirected to dummy based on what happens -> eg 
>>>on incoming packets; if we find they are of known state we could send to
>>>a different queue than one which didnt have existing state. This
>>>all however is dependent on whatever rules the admin enters.
>>
>>
>>How does the admin enter the rules - netfilter or other?
>>
>  
> Just like i showed in that post (It was long - so dont wanna cutnpaste
> here).
> 

I am not sure what exactly can and can't be done in your example:



 ># redirect all IP packets arriving in eth0 to dummy0
 ># use mark 1 --> puts them onto class 1:1
 >$TC filter add dev eth0 parent ffff: protocol ip prio 10 u32 \
 >match u32 0 0 flowid 1:1 \

What I can do here depends where it hooks packets.

 >action ipt -j MARK --set-mark 1 \

I don't know what table I am using - may be OK as long as I can test for 
a mark I made earlier in the egress dummy case or test connmark/state I 
set for that connection in the ingress case.

 >action mirred egress redirect dev dummy0

Andy.


> cheers,
> jamal
> 
> 


* Re: dummy as IMQ replacement
  2005-02-01 13:31         ` Thomas Graf
@ 2005-02-01 15:03           ` Andy Furniss
  2005-02-02 13:28             ` Thomas Graf
  0 siblings, 1 reply; 126+ messages in thread
From: Andy Furniss @ 2005-02-01 15:03 UTC (permalink / raw)
  To: Thomas Graf
  Cc: jamal, netdev, Nguyen Dinh Nam, Remus, Andre Tomt, syrius.ml,
	Damion de Soto

Thomas Graf wrote:
>>>> X =  ----------------------------------------------------------
>>>>      R*sqrt(2*b*p/3) + (t_RTO * (3*sqrt(3*b*p/8) * p * (1+32*p^2)))
>>>>
>>>>Where:
>>>>
>>>>    X is the transmit rate in bytes/second.
>>>>    s is the packet size in bytes.
>>>>    R is the round trip time in seconds.
>>>>    p is the loss event rate, between 0 and 1.0, of the number of loss
>>>>      events as a fraction of the number of packets transmitted.
>>>>    t_RTO is the TCP retransmission timeout value in seconds.
>>>>    b is the number of packets acknowledged by a single TCP
>>>>      acknowledgement.
>>
>>WRT policers I never figured out where you would put the effects of 
>>playing with the burst size parameter and it's effects with few/many 
>>connections and any burstiness caused into an equasion like that.
> 
> 
> A burst buffer has impact on R on later packets, it can "smooth" R
> and X and thus results in more stable rates. Depending on the actual
> burst, it can avoid retransmits which stabilizes the rate as well.

But it's not a real rate limiting buffer in the policer case is it?
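
The rate equation quoted at the top of this message can be evaluated numerically to see how quickly the rate collapses as the loss event rate grows. The following is only a sketch: it assumes the numerator is s (the packet size, as in the TFRC throughput equation of RFC 3448, which this appears to be), and it assumes the simplification t_RTO = 4*R when no timeout is given. The function name and the sample values are made up for illustration.

```python
import math


def tcp_throughput(s, rtt, p, t_rto=None, b=1):
    """Estimate the transmit rate X in bytes/second.

    s     -- packet size in bytes
    rtt   -- round trip time R in seconds
    p     -- loss event rate, 0 < p <= 1
    t_rto -- retransmission timeout in seconds (assumed 4*R if omitted)
    b     -- packets acknowledged by a single TCP acknowledgement
    """
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    if t_rto is None:
        t_rto = 4 * rtt  # assumption, not taken from the thread
    denom = (rtt * math.sqrt(2 * b * p / 3)
             + t_rto * (3 * math.sqrt(3 * b * p / 8) * p * (1 + 32 * p ** 2)))
    return s / denom


if __name__ == "__main__":
    # Hypothetical link: 1460-byte packets, 100 ms round trip time.
    for p in (0.001, 0.01, 0.05):
        x = tcp_throughput(s=1460, rtt=0.1, p=p)
        print(f"p={p:<6} X={x / 1000:8.1f} kB/s")
```

Note that the burst parameter of a policer does not appear anywhere in this formula; its effect would only show up indirectly through p and R.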

> 
> 
>>This sounds cool. For me in someways I think it could be nicer (in the 
>>case of shaping from the wrong end of a slow link) to delay the real 
>>packets - that way the tcps of the clients get to see the smoothed 
>>version of the traffic and you can delay udp aswell.
> 
> 
> It's impossible to never drop anything, for udp we can either drop
> it or use ECN and hope the other ip stack takes care of it or the
> application implements its own cc algorithm. Basically you can already
> do that with (G)RED. Most UDP users which provide a continous stream
> such as video streams, implement some kind of key datagram which contains
> the number of datagrams received since the last key datagram and the
> application throttles down based on that so dropping is often the only
> way to achieve a general working solution. Delaying UDP packets and
> then drop them if the buffer is full is very dangerous, often the
> protocols based on UDP rely on the assumption that datagrams get lost
> randomly and not succcessive. We can think about precicse policing
> for UDP again once the current poor application level cc algorithms
> have failed and the industry accepted ECN as the right thing. For
> now most of them still suffer from the NIH syndrom in this area.

Interesting stuff. I was thinking of game udp where just dropping would 
simulate what the user should have done anyway, but costing you 
bandwidth. If a lot of gamers share a slow link then if you lag them out 
they know it's time to turn the rate down.

> 
> 
>>How intelligent and how much, if any, per connection state do you/could 
>>you keep?
> 
> 
> I keep a rate estimator for every flow on ingress in a hash table and
> lookup it up on egress with the flow parameters reversed. It gets
> pretty expensive on huge amounts of connection usually one doesn't
> want to do per connection policing on such boxes. ;->
> 

Nice - are you planning to add anything to tweak things for the wrong 
end of the bottleneck problems?

Andy.


* Re: dummy as IMQ replacement
  2005-02-01 13:13                       ` jamal
@ 2005-02-01 22:44                         ` Thomas Graf
  2005-02-02 14:24                           ` jamal
  0 siblings, 1 reply; 126+ messages in thread
From: Thomas Graf @ 2005-02-01 22:44 UTC (permalink / raw)
  To: jamal
  Cc: netdev, Nguyen Dinh Nam, Remus, Andre Tomt, syrius.ml,
	Andy Furniss, Damion de Soto

> Let the meta action do that. Just set the skb->tc_classid in my opinion;
> recall we can change classid now ..

True, I don't really care but it's already quite confusing. The priority
of the packet is described in various fields depending on which qdisc/cls
is being used: we have skb->priority with sch_prio, tc_index for dsmark
and cls_tcindex, and now tc_classid directly.  Some even use u32 to
match on DSCP and set an nfmark.  I can already feel the perfect confusion
once we open up access for rt_classid, realm and other routing fields.
I'm always aiming for easy to understand solutions; this doesn't mean
they have to be simple. Why is nfmark so heavily used? Because it's damn
simple to understand: you can set and read it and that's it. The only thing
one has to care about is to make sure that it actually gets set before it
is read and to make sure all userspace apps read it in the same base ;->
(This is basically one of the issues in usability: the lack of clarity
about the base in which numbers are read and displayed. Often they get
displayed in hex without a 0x prefix but are read with strtol(...,0),
resulting in a decimal reading if no prefix is specified.)
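
The base mismatch described in the parenthesis is easy to reproduce. The sketch below uses Python's int(text, 0) as a rough stand-in for C's strtol(..., 0): both pick the base from the prefix and fall back to decimal when there is none (one difference: Python rejects a bare leading zero such as "010", where strtol would read it as octal). The helper names are hypothetical.

```python
def show_hex_without_prefix(value):
    """Display a mark the way the complaint describes: hex, no 0x prefix."""
    return format(value, "x")


def parse_base_from_prefix(text):
    """Rough stand-in for strtol(text, NULL, 0): the prefix selects the base."""
    return int(text, 0)


def round_trip(value):
    """Display a value, then read the displayed text back in."""
    return parse_base_from_prefix(show_hex_without_prefix(value))


if __name__ == "__main__":
    mark = 0x10
    shown = show_hex_without_prefix(mark)
    print("stored:", mark)
    print("shown: ", shown)
    print("read back without prefix:", parse_base_from_prefix(shown))
    print("read back with 0x prefix:", parse_base_from_prefix("0x" + shown))
```

The round trip only preserves the value when the displayed text happens to mean the same thing in decimal, which is why always printing the 0x prefix (or always printing decimal) avoids the problem.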

Long rant, short statement: I'm pleading for a generic way to transfer
such things between a classifier and a qdisc because it's simply
easier to explain and use.

... eaction meta set tc_index ip_saddr_proto_hash
... qdisc sfq tcindex-hash

where ip_saddr_proto_hash is not a variable but rather a special meta
value calculated in the kernel.

> You can let the user define that via tc but have a default;
> eg: 
> tc dev eth0 add sfq ematch
> tc dev eth0 set sfq pertub xxx
> 
> match u32 ...
> ematch sfq
> ematch meta classid 1:2 
> eaction meta set tcindex 101
> eaction meta set fwmark ..

I think we're on the same road or at least going in the same direction.
However I'm not sure whether it's good to have ematches return
values other than true/false. I'd rather see an eaction store
it in the skb and the sfq pick it up again. Of course we can have the
userspace part be configured within the sfq.


* Re: dummy as IMQ replacement
  2005-02-01 15:03           ` Andy Furniss
@ 2005-02-02 13:28             ` Thomas Graf
  0 siblings, 0 replies; 126+ messages in thread
From: Thomas Graf @ 2005-02-02 13:28 UTC (permalink / raw)
  To: Andy Furniss
  Cc: jamal, netdev, Nguyen Dinh Nam, Remus, Andre Tomt, syrius.ml,
	Damion de Soto

> >>WRT policers I never figured out where you would put the effects of 
> >>playing with the burst size parameter and it's effects with few/many 
> >>connections and any burstiness caused into an equasion like that.
> >
> >
> >A burst buffer has impact on R on later packets, it can "smooth" R
> >and X and thus results in more stable rates. Depending on the actual
> >burst, it can avoid retransmits which stabilizes the rate as well.
> 
> But it's not a real rate limiting buffer in the policer case is it?

Abstractly speaking, burst specifies the maximum amount of time allowed
for a single packet to sit in the burst buffer. Although the burst is
configured as the size of the buffer, it is transformed into a time
delta before being passed to the kernel. Because the policer doesn't
enqueue anything, the packet simply gets dropped if it would exceed that
time. It's not _exactly_ like this but it gives you an idea of what
happens; net/sched/police.c isn't that big so one coffee should do it.
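
As a rough illustration of the description above, here is a toy token bucket that keeps its credit in units of time rather than bytes and drops instead of queueing. It is a simplified sketch and not the code in net/sched/police.c; the class name, the method names and the example rate are assumptions made up for this example.

```python
class ToyPolicer:
    """Token bucket policer whose bucket is measured in seconds."""

    def __init__(self, rate_bps, burst_bytes):
        self.rate = rate_bps / 8.0                 # bytes per second
        self.burst_time = burst_bytes / self.rate  # burst size as a time delta
        self.tokens = self.burst_time              # start with a full bucket
        self.last = 0.0                            # time of last accepted packet

    def offer(self, now, length):
        """Return True if the packet conforms, False if it is dropped."""
        # Credit earned since the last accepted packet, capped at the burst.
        toks = min(self.tokens + (now - self.last), self.burst_time)
        # Sending `length` bytes costs the time they take at the given rate.
        toks -= length / self.rate
        if toks < 0:
            return False  # no queue to wait in, so the packet is dropped
        self.tokens = toks
        self.last = now
        return True


if __name__ == "__main__":
    # Hypothetical setup: 8 kbit/s (1000 bytes/s) with a 2000-byte burst.
    policer = ToyPolicer(rate_bps=8000, burst_bytes=2000)
    for t in (0.0, 0.0, 0.0, 0.5, 1.0):
        verdict = "pass" if policer.offer(t, 1000) else "drop"
        print(f"t={t:.1f}s 1000-byte packet: {verdict}")
```

A shaper would differ only in the failing branch: instead of returning False it would hold the packet until enough credit had accumulated.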

> Nice - are you planning to add anything to tweak things for the wrong 
> end of the bottleneck problems?

I hope so, once I have figured out an acceptable compromise between a good
result and simplicity. Currently it would be way too expensive and hard
to use.


* Re: dummy as IMQ replacement
  2005-02-01 14:53     ` Andy Furniss
@ 2005-02-02 14:05       ` jamal
  2005-02-04  0:33         ` Andy Furniss
  0 siblings, 1 reply; 126+ messages in thread
From: jamal @ 2005-02-02 14:05 UTC (permalink / raw)
  To: Andy Furniss
  Cc: netdev, Nguyen Dinh Nam, Remus, Andre Tomt, syrius.ml, Damion de Soto

On Tue, 2005-02-01 at 09:53, Andy Furniss wrote:
> jamal wrote: 
> > I know for a while the Bart howto was mislabeling the meaning of
> > policing - not sure about shaping. 
> > Shaping has a precise definition that involves a queue and a
> > non-working-conserving scheduler in order to rate control. This doesnt
> > matter where it happens (egress or ingress). Policing on the other hand
> > is work conserving etc.
> 
> Ok, but shaping to LARTC posters means being able to classify traffic 
> and set up sharing/priorotising of classes. 
>
> This is the reason most need 
> to be able to queue - they want to use htb/hfsc for complicated setups 

Close - but you are missing the rate control requirement. 
You can do the above with the prio qdisc for example, but that does not
equate to shaping. Understood about LARTC users' definitions. However,
note that they are influenced by what people tell them or what people
write in docs. Help in making sure the redefinition doesn't propagate.
There's a very precise meaning to shaping and it is _exactly_ the way i
described it above. Clue in people who ask questions.

> and until now were not aware that it was even possible to replicate this 
> in policers 

I am sure i have written about 50 emails on this topic over the last 5
years ;->  Look at the archives. I even joked about it here:
http://www.cyberus.ca/~hadi/patches/action/README over 2 years ago.
Look at the text reading "it must be the summer heat again; weve had
someone doing that every year around summer".
Unfortunately people who are writing docs haven't picked it up for
whatever reason. I am hoping we finally get this documented somewhere.
Can you help fix this?

> and if it becomes feasable it will probably appear daunting 
> to do compared with HTB and all the existing docs/scripts.
> 

From a usability perspective i agree with you. 
It's still doable is all i can say ;-> (but you are correct in that it
may not be for the faint of heart).
As a reminder of some of the big discussions on shared and hierarchical
policing - look at the many, many discussions I had with devik on this
topic a few years back.
It resulted in the birth of HTB (which is essentially a hierarchy of the
same token bucket meters used in policers; hierarchical shared policers
are doable - look at iproute2/examples/diffserv). HTB otoh has proven
useful due to its simplicity - so it stands on its own merit now.
I think there may be peculiar occasions where you may need to have
queues to shape traffic to a local app - but that's peculiar. 

> For me, I think queuing and dropping is better than just dropping, you 
> can affect tcp by delaying eg. 1 ack per packet rather than delayed acks 
> and clocking out the packets helps smooth burstiness, 

True - but i question the usefulness of shaping locally terminating TCP
packets. For packets being forwarded, the shaping happens on
egress.

> which hurts 
> latency if you are doing traffic control from the wrong end of the 
> bottleneck.
> 

Not sure i followed the latency connection.

> As long as I could use netfilter to mark/classify connections then I 
> think I can sort my setup, don't know about others.
> 
> 

Great. yes, you can.

> > Are pre/post routing sufficient as netfilter hooks for your case?
> 
> Yes but depends on where in pre/postrouting. For me after/before nat, as 
> I say above though I could workaround if I classify connections with 
> netfilter. For others as long as they can filter on a mark/classify set 
> in forward, then I think it will be OK for them.
> 

You can mark in netfilter or even in tc and use those marks in both
places.


> I am not sure what exactly can can't be done in your example:
>
>  ># redirect all IP packets arriving in eth0 to dummy0
>  ># use mark 1 --> puts them onto class 1:1
>  >$TC filter add dev eth0 parent ffff: protocol ip prio 10 u32 \
>  >match u32 0 0 flowid 1:1 \
> 
> What I can do here depends where it hooks packets.
> 
>  >action ipt -j MARK --set-mark 1 \
> 
> I don't know what table I am using - may be OK as long as I can test for 
> a mark I made earlier in the egress dummy case or test connmark/state I 
> set for that connection in the ingress case.
> 

That would be doable. Thanks for taking the time Andy.

cheers,
jamal


* Re: dummy as IMQ replacement
  2005-02-01 22:44                         ` Thomas Graf
@ 2005-02-02 14:24                           ` jamal
  2005-02-02 15:40                             ` Thomas Graf
  0 siblings, 1 reply; 126+ messages in thread
From: jamal @ 2005-02-02 14:24 UTC (permalink / raw)
  To: Thomas Graf
  Cc: netdev, Nguyen Dinh Nam, Remus, Andre Tomt, syrius.ml,
	Andy Furniss, Damion de Soto

On Tue, 2005-02-01 at 17:44, Thomas Graf wrote:
> > Let the meta action do that. Just set the skb->tc_classid in my opinion;
> > recall we can change classid now ..
> 
> True, I don't really care but it's already quite confusing. The priority
> of the packet is described in viarous field depeding on which qdisc/cls
> being used, we have skb->priority with sch_prio, tc_index for dsmark
> and cls_tcindex and now tc_classid directly.  Some even use u32 to
> match on DSCP and set a nfmark.  I can already feel the perfect confusion
> once we open up access for rt_classid, realm and other routing fields.
> I'm always aiming for easy to understand solutions, this doesn't mean
> it to be simple. Why is nfmark so heavly used? Because it's damn simple
> to undertand, you can set and read it and that's it. The only thing one
> has to care about is to make sure that is actually gets set before it being
> read and to make sure all userspace apps read it in the same base ;->
> (This is basically one of the issue in usability, the lack of clearliness
> in what base number are read the displayed. Often they get displayed in
> hex without a 0x prefix but are read with strtol(...,0) resulting in
> a decimal reading if no prefix is specified)

So let me put it this way:
You can't avoid passing around metadata between the different blocks.
Whether the metadata is set by the admin or by some other block along
the packet path is a way of life. 
All of the metadata defined and attached to skbs so far has a
standalone semantic meaning which unfortunately can't just be hidden
from the user. It's the unfortunate consequence of giving someone a
weapon (they may shoot their toe off). 
As an example:
People have been setting flowid/classid for years via the classifiers
to stamp the session a flow belongs to. All we are doing with
skb->tc_classid is giving more power to them, i.e. before you get to the
queue, given certain computation/state, you may decide to belong to a
different session.
sfq, in setting the hash, is computing which flow you belong to,
and that's why i suggested tc_classid (in this case not set by the admin,
but rather by a smart stateful classifier).

> Long rant short statement, I'm pleading for a generic way to transfer
> such things between a classifier and a qdisc because it's simply
> easier to explain and use.
> ... eaction meta set tc_index ip_saddr_proto_hash
> ... qdisc sfq tcindex-hash

> where ip_saddr_proto_hash is not a variable but rather a special meta
> value calulated in the kernel.
> 

Let me see if i understood correctly: Instead of giving static values
(such as 0x10) you want to assign to tcindex a variable
(ip_saddr_proto_hash) which is computed at runtime? 
That's a parallel issue, though indeed a useful one.

> > You can let the user define that via tc but have a default;
> > eg: 
> > tc dev eth0 add sfq ematch
> > tc dev eth0 set sfq pertub xxx
> > 
> > match u32 ...
> > ematch sfq
> > ematch meta classid 1:2 
> > eaction meta set tcindex 101
> > eaction meta set fwmark ..
> 
> I think we're on the same road or at least going into the same direction.
> However I'm not sure whether it's a good to have ematches return
> some values besides true/false. I'd rather like to see an eaction store
> it in the skb and the sfq catching it up again. Of course we can have the
> userspace part be configured within the sfq.

A classifier is allowed to select/set the class/flow/sessionID. 
The sfq hash result should at least set/map to the minor value of the
classid.
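
To make the "minor value of the classid" concrete: a tc classid is a 32-bit handle with a 16-bit major and a 16-bit minor part, written in hex as major:minor. The sketch below shows one way a flow hash could be folded into the minor part. The hash used here (CRC32 over a text key) and the bucket count are purely illustrative assumptions and are not what sfq does; all function names are hypothetical.

```python
import zlib


def make_classid(major, minor):
    """Pack major:minor into a 32-bit tc handle."""
    return ((major & 0xFFFF) << 16) | (minor & 0xFFFF)


def split_classid(classid):
    """Unpack a 32-bit tc handle into (major, minor)."""
    return classid >> 16, classid & 0xFFFF


def format_classid(classid):
    """Render a handle the way tc prints it: hex major:minor."""
    major, minor = split_classid(classid)
    return f"{major:x}:{minor:x}"


def flow_minor(src, dst, proto, buckets=1024):
    """Map a flow to a minor in 1..buckets (minor 0 names the qdisc itself)."""
    key = f"{src}|{dst}|{proto}".encode()
    return zlib.crc32(key) % buckets + 1


if __name__ == "__main__":
    minor = flow_minor("10.0.0.1", "10.0.0.2", 6)
    print(format_classid(make_classid(1, minor)))
```

With this layout a stateful classifier only has to compute the minor; the major stays whatever qdisc handle the admin configured.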

cheers,
jamal


* Re: dummy as IMQ replacement
  2005-02-02 14:24                           ` jamal
@ 2005-02-02 15:40                             ` Thomas Graf
  2005-02-02 15:55                               ` Thomas Graf
  0 siblings, 1 reply; 126+ messages in thread
From: Thomas Graf @ 2005-02-02 15:40 UTC (permalink / raw)
  To: jamal
  Cc: netdev, Nguyen Dinh Nam, Remus, Andre Tomt, syrius.ml,
	Andy Furniss, Damion de Soto

First of all, sorry for the massive amount of typos in my last post.
I could barely see anything due to the sun shining onto my display.

> You cant avoid passing around metadata between the different blocks.
> Whether the metadata is set by the admin or by some other block along
> the packet path is way of life. 

Agreed.

> All of the metadata defined and attached around skbs so far has a
> standalone semantical meaning whic unfortunately cant just be hidden
> from the user. Its the unfortunate consequence of giving someone a
> weapon )they may shoot their toe off). 

Agreed, I'm just trying to avoid further confusion. I think my country
has one of the highest, if not the highest, density of fully automatic
assault weapons (because everyone liable to military service needs
to have one at home), everyone owning one is forced to practice once
a year, and shooting is a common sport.  OTOH, we have one of the lowest
crime rates. Why's that?  Because almost everyone is well educated
in terms of weapon safety, so I think this should be our way as well.
So yes, we can definitely add more complexity but we should try to make
it easy to understand and use.

> sfq as a matter of setting the hash is computing what flow you belong to
> and thats why i suggested tc_classid (in this case not set by the admin,
> rather by a smart stateful classifier).

If the value for tc_classid is directly set by the user then I agree.
What I want to avoid is having hidden uses of parameters which can also
be modified by the user. It results in a backward compatibility hell
later on because we can't just add another use for it without possibly
breaking scripts.

> Let me see if i understood correctly: Instead of giving static values
> (such as 0x10) you want to assign a variable(ip_saddr_proto_hash) which
> is computed at runtime to tcindex? 
> Thats a parallel issue though but indeed useful .

OK, so basically we weren't talking about exactly the same thing. In a
context where only the user sets the value, your argument makes sense.
Let me follow this thought a little further: what I basically want is a
generic way to influence various qdiscs, be it a hashing index for sfq, a
priority value for priority band based qdiscs, etc.

tc_classid isn't a bad choice but it gets complex once we want a
classful qdisc to be able to use this input parameter.

Summarizing what we currently have:
  tc_index: May contain a dscp value if dsmark is told to fetch the dscp
            field, the minor part of priority if dsmark is told to map
            priority values via handle values, or the minor part of the
            classid in a classifier result via ingress classification or
            a classifier attached to a dsmark. cls_tcindex, gred, and
            meta ematch use it as input value.

  nf_mark:  cls_fw maps it to classids, meta ematch may read it, meta
            eaction may set it.

  tc_classid: Actions may set it to overwrite the result of a classifier,
              meta ematch may read it and I guess meta eaction may
              write to it.

  tc_verd: Set early in net stack, used to track location and tc
           relevant flags.

  tclassid: Set within the routing db, may be read via meta ematch.

At the moment all of them can be described properly and it should be
easy to understand if the relations are outlined properly.

Assuming we allow setting tc_classid to overwrite the sfq internal
hash, we introduce a not so obvious double use because tc_classid is
assumed to at least partly point to a class. We can redefine tc_classid
as being a generic flow/session descriptor but then it should no longer
be used to overwrite the classid within actions. Assume I
have a classifier which normally classifies into a child class but
sometimes I want the traffic to go into a leaf sfq qdisc by using the
action to overwrite the result. It will then be impossible to overwrite
the sfq hash because I would no longer be able to overwrite the
classifier result. It's probably possible to find some working solution
by having the minor part be the sfq input or vice versa but it
gets really nasty. Therefore I think we should distinguish between
the current use of tc_classid to overwrite the classifier result and
giving qdiscs some kind of input not directly related to their handle.

Thoughts?


* Re: dummy as IMQ replacement
  2005-02-02 15:40                             ` Thomas Graf
@ 2005-02-02 15:55                               ` Thomas Graf
  0 siblings, 0 replies; 126+ messages in thread
From: Thomas Graf @ 2005-02-02 15:55 UTC (permalink / raw)
  To: jamal
  Cc: netdev, Nguyen Dinh Nam, Remus, Andre Tomt, syrius.ml,
	Andy Furniss, Damion de Soto

>   tc_index: May contain a dscp value if dsmark is told to fetch the dscp
>             field, the minor part of priority if dsmark is told to map
>             priority values via handle values, or the minor part of the
>             classid in a classifier result via ingress classification or
>             a classifier attached to a dsmark. cls_tcindex, gred, and
>             meta ematch use it as input value.

Assuming we use tc_index to provide the hash...

- we don't need to worry about any definitions. tc_index already stands for
  some kind of index grouping together various packets.
- we can directly use sfq to do fair queueing on dscp values and skb
  priority, including specialized maps with an underlying dsmark or cls_tcindex.


* Re: dummy as IMQ replacement
  2005-02-02 14:05       ` jamal
@ 2005-02-04  0:33         ` Andy Furniss
  0 siblings, 0 replies; 126+ messages in thread
From: Andy Furniss @ 2005-02-04  0:33 UTC (permalink / raw)
  To: hadi
  Cc: netdev, Nguyen Dinh Nam, Remus, Andre Tomt, syrius.ml, Damion de Soto

jamal wrote:
> On Tue, 2005-02-01 at 09:53, Andy Furniss wrote:
> 
>>jamal wrote: 
>>
>>>I know for a while the Bart howto was mislabeling the meaning of
>>>policing - not sure about shaping. 
>>>Shaping has a precise definition that involves a queue and a
>>>non-working-conserving scheduler in order to rate control. This doesnt
>>>matter where it happens (egress or ingress). Policing on the other hand
>>>is work conserving etc.
>>
>>Ok, but shaping to LARTC posters means being able to classify traffic 
>>and set up sharing/priorotising of classes. 
>>
>>This is the reason most need 
>>to be able to queue - they want to use htb/hfsc for complicated setups 
> 
> 
> Close - but you are missing the rate control requirement. 
> You can do the above with prio qdisc for example but that does not
> equate to shaping. Understood about Lartc users definitions. However,
> note that they are influenced by what people tell them or what people
> write in docs. Help in making sure the redefinition doesnt propagate.
> Theres a very precise meaning to shaping and it is _exactly_ the way i
> described it above. Clue people who ask questions.

I see your point.

> 
> 
>>and until now were not aware that it was even possible to replicate this 
>>in policers 
> 
> 
> I am sure i have written about 50 emails on this topic over the last 5
> years;->  look at the archives. I even joked about it here:
> http://www.cyberus.ca/~hadi/patches/action/README over 2 years ago.
> look at the text reading "it must be the summer heat again; weve had
> someone doing that every year around summer"
> Unfortunately people who are writting docs havent picked it up for
> whatever reasons. I am hoping we finaly get this documented somewhere.
> Can you help fix this?

I could write up some "what I did" type stuff. Once I work out what to do 
and how to do it :-)

> 
> 
>>and if it becomes feasable it will probably appear daunting 
>>to do compared with HTB and all the existing docs/scripts.
>>
> 
> 
> From a usability perspective i agree with you. 
> Its still doable is all i can say ;-> (but you are correct in that it
> may not be for the weak hearts)
> As a reminder of some of the big discussions on shared and hierachical
> policing - look at the many many discussions I had with devik on this
> topic a few years back.
> It resulted in the birth of HTB (which is essentially a hierachy of the
> same token bucket meters used in policers; hierachical shared policers
> are doable - look at iproute2/examples/diffserv). HTB otoh has proven
> useful due to simplicty - so it stands on its own merit now.
> I think there may be peculiar occasions where you may need to have
> queues to shape traffic to a local app - but thats peculiar. 
> 

I'll have to read up a bit.

> 
>>For me, I think queuing and dropping is better than just dropping, you 
>>can affect tcp by delaying eg. 1 ack per packet rather than delayed acks 
>>and clocking out the packets helps smooth burstiness, 
> 
> 
> True - but i question the usefulness of localy terminating TCP packets
> being shaped. For packets being forwarded, the shaping happens on
> egress.

I know it's a bit odd, but then if I had just one PC I would want to 
shape rather than police, for the reasons below.

> 
> 
>>which hurts 
>>latency if you are doing traffic control from the wrong end of the 
>>bottleneck.
>>
> 
> 
> Not sure i followed the latency connection.

I am shaping a relatively slow link from the wrong end. My objective is 
to avoid the 600ms buffer at the ISP/telco getting filled, as it adds 
latency for my interactive traffic. If I have a dozen bulk tcp 
connections running then policing encourages each to send data in bursts 
at link speed, because delayed acks will pair packets and if, say, a group 
of four passes without dropping, it causes another group of four from that 
connection at link speed. Add to that the different or variable rtts of 
the 12 connections and it means that there will be times when large bunches 
of big packets arrive together and delay the interactive traffic.

If I shape and dequeue each connection round robin and the aggregate 
rate is below link speed then the aggregate flow is smoothed better. If 
the rates are low enough I will delay longer than the delayed ack timers 
and get one packet per ack as well. It's still not perfect of course.

> 
> 
>>As long as I could use netfilter to mark/classify connections then I 
>>think I can sort my setup, don't know about others.
>>
>>
> 
> 
> Great. yes, you can.
> 
> 
>>>Are pre/post routing sufficient as netfilter hooks for your case?
>>
>>Yes but depends on where in pre/postrouting. For me after/before nat, as 
>>I say above though I could workaround if I classify connections with 
>>netfilter. For others as long as they can filter on a mark/classify set 
>>in forward, then I think it will be OK for them.
>>
> 
> 
> You can mark in netfilter or even in tc and use those marks in both
> places.

Great.

> 
> 
> 
>>I am not sure what exactly can can't be done in your example:
>>
>> ># redirect all IP packets arriving in eth0 to dummy0
>> ># use mark 1 --> puts them onto class 1:1
>> >$TC filter add dev eth0 parent ffff: protocol ip prio 10 u32 \
>> >match u32 0 0 flowid 1:1 \
>>
>>What I can do here depends where it hooks packets.
>>
>> >action ipt -j MARK --set-mark 1 \
>>
>>I don't know what table I am using - may be OK as long as I can test for 
>>a mark I made earlier in the egress dummy case or test connmark/state I 
>>set for that connection in the ingress case.
>>
> 
> 
> That would be doable. Thanks for taking the time Andy.

Glad I can help.

Andy.


> 
> cheers,
> jamal
> 
> 


* Re: dummy as IMQ replacement
       [not found]   ` <1107174142.8021.121.camel@jzny.localdomain>
@ 2005-03-09 14:30     ` Remus
  2005-03-09 14:38       ` jamal
  0 siblings, 1 reply; 126+ messages in thread
From: Remus @ 2005-03-09 14:30 UTC (permalink / raw)
  To: hadi
  Cc: netdev, Nguyen Dinh Nam, Andre Tomt, syrius.ml, Andy Furniss,
	Damion de Soto

Hi Jamal,


I have a problem with iproute2:
1) if I use the current iproute2 I get this error when I run this command:
tc filter add dev eth2 parent ffff: protocol ip prio 10 u32 match u32 0 0 
flowid 1:1 action ipt -j MARK --set-mark 1 action mirred egress redirect dev 
dummy0
/usr/local/lib/iptables/libipt_mark.so: undefined symbol: register_match
 failed to find target MARK

bad action parsing
parse_action: bad value (11:ipt)!
Illegal "action"


2) if I use iproute2-2.6.9-jamal  I get this error:
 tc filter add dev eth2 parent ffff: protocol ip prio 10 u32 match u32 0 0 
flowid 1:1 action ipt -j MARK --set-mark 1 action mirred egress redirect dev 
dummy0
bad action type ipt
Usage: ... gact <ACTION> [RAND] [INDEX]
Where: ACTION := reclassify | drop | continue | pass RAND := random 
<RANDTYPE> <ACTION> <VAL>RANDTYPE := netrand | determVAL : = value not 
exceeding 10000INDEX := index value used
bad action parsing
parse_action: bad value (11:ipt)!
Illegal "action"

Any ideas?

Can I use some iptables command to forward all incoming traffic to dummy?

Regards


Remus


* Re: dummy as IMQ replacement
  2005-03-09 14:30     ` Remus
@ 2005-03-09 14:38       ` jamal
  2005-03-10  1:06         ` Jamal Hadi Salim
  0 siblings, 1 reply; 126+ messages in thread
From: jamal @ 2005-03-09 14:38 UTC (permalink / raw)
  To: Remus
  Cc: netdev, Nguyen Dinh Nam, Andre Tomt, syrius.ml, Andy Furniss,
	Damion de Soto

On Wed, 2005-03-09 at 09:30, Remus wrote:
> Hi Jamal,
> 
> 
> I have problem with iproute2:
> 1) if I use current iproute2 I get this error when I run this command:
> tc filter add dev eth2 parent ffff: protocol ip prio 10 u32 match u32 0 0 
> flowid 1:1 action ipt -j MARK --set-mark 1 action mirred egress redirect dev 
> dummy0
> /usr/local/lib/iptables/libipt_mark.so: undefined symbol: register_match
>  failed to find target MARK
> 

That should work. Don't bother with iproute2-2.6.9-jamal.
What version of iptables are you using? Unfortunately i have to keep
track of changes happening in iptables as well and they keep changing
the interface from under me. Try to use the same iptables version as the
one whose headers are found in the iproute2 version you are using.

I have to go to work - so won't have time to look at this for some time.
Maybe some of the netfilter folks like Patrick can solve it for you
before i get back.

cheers,
jamal


* Re: dummy as IMQ replacement
  2005-03-09 14:38       ` jamal
@ 2005-03-10  1:06         ` Jamal Hadi Salim
  2005-03-10  9:18           ` Remus
  0 siblings, 1 reply; 126+ messages in thread
From: Jamal Hadi Salim @ 2005-03-10  1:06 UTC (permalink / raw)
  To: Remus
  Cc: netdev, Nguyen Dinh Nam, Andre Tomt, syrius.ml, Andy Furniss,
	Damion de Soto

[-- Attachment #1: Type: text/plain, Size: 428 bytes --]

Hi Remus,

Please try this patch on top of the latest iproute2. Credit to Patrick for
spotting it. I don't know when or who made this change - in any case it
doesn't matter if it works for you.

cheers,

On Wed, 2005-03-09 at 09:38, jamal wrote:

> I have to go to work - so wont have time to look at this for sometime.
> Maybe some of the netfilter folks like Patrick can solve it for you
> before i get back.
> 
> cheers,
> jamal
> 

[-- Attachment #2: ipt-p --]
[-- Type: text/plain, Size: 372 bytes --]

--- iproute2-a/tc/m_ipt.c	2005/03/10 00:59:38	1.1
+++ iproute2-b/tc/m_ipt.c	2005/03/10 01:00:05
@@ -72,7 +72,6 @@
 static unsigned int global_option_offset = 0;
 #define OPTION_OFFSET 256
 
-#if 0
 /* no clue why register match is within targets
  figure out later. Talk to Harald -- JHS
 */
@@ -91,7 +90,6 @@
 	t_list = me;
 
 }
-#endif
 
 void
 exit_tryhelp(int status)


* Re: dummy as IMQ replacement
  2005-03-10  1:06         ` Jamal Hadi Salim
@ 2005-03-10  9:18           ` Remus
  2005-03-10 11:22             ` jamal
  0 siblings, 1 reply; 126+ messages in thread
From: Remus @ 2005-03-10  9:18 UTC (permalink / raw)
  To: hadi
  Cc: jamal, netdev, Nguyen Dinh Nam, Andre Tomt, syrius.ml,
	Andy Furniss, Damion de Soto

Hi Jamal,

Thanks for the patch.
That error is gone but I got a new error:

ifconfig dummy0 up
tc qdisc add dev eth2 ingress
tc filter add dev eth2 parent ffff: protocol ip prio 10 u32 match u32 0 0 \
    flowid 1:1 action ipt -j MARK --set-mark 1 \
    action mirred egress redirect dev dummy0
iptables: calloc failed: Cannot allocate memory

I use a 2.6.11.2 kernel patched with your dummy patch, iptables 1.3.1 and the
latest iproute2 patched with the patch you sent yesterday.


Regards

Remus

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: dummy as IMQ replacement
  2005-03-10  9:18           ` Remus
@ 2005-03-10 11:22             ` jamal
  2005-03-19  1:09               ` Andy Furniss
  0 siblings, 1 reply; 126+ messages in thread
From: jamal @ 2005-03-10 11:22 UTC (permalink / raw)
  To: Remus
  Cc: netdev, Nguyen Dinh Nam, Andre Tomt, syrius.ml, Andy Furniss,
	Damion de Soto

Hi Remus,
I could not reproduce this one - it is also a bit odd for calloc to
fail. I don't have iptables 1.3.1 but I will get it and retry.
Does this happen all the time?

cheers,
jamal

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: dummy as IMQ replacement
  2005-03-10 11:22             ` jamal
@ 2005-03-19  1:09               ` Andy Furniss
  2005-03-19  1:45                 ` jamal
  2005-03-21 13:14                 ` iptables breakage WAS(Re: " jamal
  0 siblings, 2 replies; 126+ messages in thread
From: Andy Furniss @ 2005-03-19  1:09 UTC (permalink / raw)
  To: hadi
  Cc: Remus, netdev, Nguyen Dinh Nam, Andre Tomt, syrius.ml, Damion de Soto

jamal wrote:
> Hi Remus,
> I could not reproduce this one - it is also a bit odd for calloc to
> fail. I dont have iptables 1.3.1 but i will get and retry.
> Does this happen all the time?

I get the same with iptables 1.3.1 and 1.3.0

iptables: calloc failed: Cannot allocate memory

using kernel 2.6.11.3 and tc iproute2-ss050314

If I try an earlier iptables (tested 1.2.9, 1.2.10, 1.2.11) I get

tablename: mangle hook: NF_IP_PRE_ROUTING
         target: MARK set 0x1  index 0
bad action type mirred
Usage: ... gact <ACTION> [RAND] [INDEX]
Where: ACTION := reclassify | drop | continue | pass RAND := random 
<RANDTYPE> <ACTION> <VAL>RANDTYPE := netrand | determVAL : = value not 
exceeding 10000INDEX := index value used
bad action parsing
parse_action: bad value (5:mirred)!
Illegal "action"

Andy.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: dummy as IMQ replacement
  2005-03-19  1:09               ` Andy Furniss
@ 2005-03-19  1:45                 ` jamal
  2005-03-19 10:23                   ` Andy Furniss
  2005-03-21 13:14                 ` iptables breakage WAS(Re: " jamal
  1 sibling, 1 reply; 126+ messages in thread
From: jamal @ 2005-03-19  1:45 UTC (permalink / raw)
  To: Andy Furniss
  Cc: Remus, netdev, Nguyen Dinh Nam, Andre Tomt, syrius.ml, Damion de Soto


Remus seems to have got it working: iptables versions earlier than
1.3.0 work fine for him.
Unfortunately I am not close to my machines and won't be for a while.
Can you try leaving mirred out of the command line and see if that
works?

cheers,
jamal

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: dummy as IMQ replacement
  2005-03-19  1:45                 ` jamal
@ 2005-03-19 10:23                   ` Andy Furniss
  2005-03-20 13:20                     ` jamal
  0 siblings, 1 reply; 126+ messages in thread
From: Andy Furniss @ 2005-03-19 10:23 UTC (permalink / raw)
  To: hadi
  Cc: Remus, netdev, Nguyen Dinh Nam, Andre Tomt, syrius.ml, Damion de Soto

jamal wrote:
> Remus seems to have got it working with iptables earlier than
> 1.3.0 works fine.
> Unfortunately i am not close to my machines and wont be for a while.
> Can you try not to pass mirred in the command line and see if that
> works?
> 
> cheers,
> jamal

If I just do -

....
$TC filter add dev eth0 parent ffff: protocol ip prio 10 u32 \
match u32 0 0 flowid 1:1 \
action ipt -j MARK --set-mark 1

It still gives the memory error with 1.3.1; with 1.2.11 it parses OK but I
get bogus stats (the hit count is OK)

[root@amd /home/andy/Qos]# tc -s filter ls dev eth0 parent ffff:

filter protocol ip pref 10 u32
filter protocol ip pref 10 u32 fh 800: ht divisor 1
filter protocol ip pref 10 u32 fh 800::800 order 2048 key ht 800 bkt 0 
flowid 1:1  (rule hit 12 success 12)
   match 00000000/00000000 at 0 (success 12 )
         action order 1: tablename: mangle  hook: NF_IP_PRE_ROUTING
         target MARK set 0x1
         index 1 ref 1 bind 1 installed 251 sec expires 1 sec
         Action statistics:
         Sent 7630953 bytes 0 pkt
         rate 3146Kbit 1095565348pps

If I try with the lines below added

action egress redirect dev dummy0 or
action redirect dev dummy0

I just get errors on whatever is after action - or memory errors with 1.3.1.

Using tc iproute2-ss050112 + patch for these tests.

I don't know if it's relevant but I am using gcc-2.95.3, which meant I 
had to change the dummy patch a bit as it doesn't (for me) like variable 
declarations in the middle of functions.

drivers/net/dummy.c: In function `dummy_xmit':
drivers/net/dummy.c:218: parse error before `from'
drivers/net/dummy.c:219: `from' undeclared (first use in this function)

So I changed to

static int dummy_xmit(struct sk_buff *skb, struct net_device *dev)
{
	struct dummy_private *dp = ((struct net_device *)dev)->priv;
	struct net_device_stats *stats = &dp->stats;
	int ret = 0;
	__u32 from;

	{
	stats->tx_packets++;
	stats->tx_bytes+=skb->len;
	}
#ifdef CONFIG_NET_CLS_ACT
	from = G_TC_FROM(skb->tc_verd);

Is there a switch for this? - or am I the only one who actually does 
kernel stuff with 2.95.3 (using LFS 5.1, which uses 3.3 by default but 
keeps 2.95.3 in /opt specially for kernels)

I am using it for iptables and tc as well (though I tried 3.3 - but not
for the kernel yet)

I have manually loaded modules

dummy                   3972  0
mirred                  6400  0
pedit                   6624  0
gact                    5984  0
ipt_MARK                2432  0
ipt                     6944  0
ip_tables              19728  2 ipt_MARK,ipt
sch_tbf                 5120  1
sch_ingress             3136  1
cls_fw                  3904  2
sch_sfq                 5184  2
sch_prio                4736  1
cls_u32                 6916  0

Andy.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: dummy as IMQ replacement
  2005-03-19 10:23                   ` Andy Furniss
@ 2005-03-20 13:20                     ` jamal
  2005-03-20 13:55                       ` jamal
  2005-03-21 22:08                       ` Andy Furniss
  0 siblings, 2 replies; 126+ messages in thread
From: jamal @ 2005-03-20 13:20 UTC (permalink / raw)
  To: Andy Furniss
  Cc: Thomas Graf, Remus, netdev, Nguyen Dinh Nam, Andre Tomt,
	syrius.ml, Damion de Soto

[-- Attachment #1: Type: text/plain, Size: 3083 bytes --]

Hi Andy,
Apologies again - I won't be able to get access to my test machine until
Tuesday.

On Sat, 2005-03-19 at 05:23, Andy Furniss wrote:

> $TC filter add dev eth0 parent ffff: protocol ip prio 10 u32 \
> match u32 0 0 flowid 1:1 \
> action ipt -j MARK --set-mark 1
> 
> It still gives memory error with 1.3.1 , with 1.2.11 it parses OK but I 
> get bogus stats - hit count is OK
> 
> [root@amd /home/andy/Qos]# tc -s filter ls dev eth0 parent ffff:
> 
> filter protocol ip pref 10 u32
> filter protocol ip pref 10 u32 fh 800: ht divisor 1
> filter protocol ip pref 10 u32 fh 800::800 order 2048 key ht 800 bkt 0 
> flowid 1:1  (rule hit 12 success 12)
>    match 00000000/00000000 at 0 (success 12 )
>          action order 1: tablename: mangle  hook: NF_IP_PRE_ROUTING
>          target MARK set 0x1
>          index 1 ref 1 bind 1 installed 251 sec expires 1 sec
>          Action statistics:
>          Sent 7630953 bytes 0 pkt
>          rate 3146Kbit 1095565348pps
> 

Ok, this seems to be a bug in the stats - I think it may have been
introduced during the new kernel stats code updates.
I've cc'ed Thomas, who added that code; he may be able to figure it out
before I get back.

> If I try with the lines below added
> 
> action egress redirect dev dummy0 or
> action redirect dev dummy0
> 
> I just get errors on whatever is after action - or memory errors with 1.3.1.
> 
> Using tc iproute2-ss050112 + patch for these tests.
> 

So if I have understood you correctly, with this version of tc and
version of iproute2, you have no problems other than stats being messed
up? i.e. action ipt .. action mirred .. looks/works fine?

I think iptables >= 1.2.11 may have broken backward compatibility; I
will investigate when I get back.
Let's narrow down at which version of iproute2 things break - stick
with iptables 1.2.11.

> I don't know if it's relevant but I am using gcc-2.95.3, which meant I 
> had to change the dummy patch a bit as it doesn't (for me) like variable 
> declarations in the middle of functions.
> 
> drivers/net/dummy.c: In function `dummy_xmit':
> drivers/net/dummy.c:218: parse error before `from'
> drivers/net/dummy.c:219: `from' undeclared (first use in this function)
> 
> So I changed to
> 
> static int dummy_xmit(struct sk_buff *skb, struct net_device *dev)
> {
> 	struct dummy_private *dp = ((struct net_device *)dev)->priv;
> 	struct net_device_stats *stats = &dp->stats;
> 	int ret = 0;
> 	__u32 from;
> 
> 	{
> 	stats->tx_packets++;
> 	stats->tx_bytes+=skb->len;
> 	}
> #ifdef CONFIG_NET_CLS_ACT
> 	from = G_TC_FROM(skb->tc_verd);
> 
> Is there a switch for this? - or am I the only one who actually does 
> kernel stuff with 2.95.3 (using LFS 5.1, which uses 3.3 by default but 
> keeps 2.95.3 in /opt specially for kernels)
> 

The change you have above is needed - I don't recall any gcc switches
which would resolve this.
I fixed this for Remus as well back when he was testing (attached is the
dummy.c he is using); the stats were also a little misleading on what's rx
or tx. So those are fixed too in the attached version.
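
For reference, gcc 2.95.3 follows the C89 rule that declarations must come
before the first statement in a block, which is why the original patch
failed to parse. A minimal sketch of the accepted pattern follows; the
struct, function name and mask are hypothetical stand-ins, not the real
dummy.c code.

```c
/* Hypothetical stand-in for per-device counters; not the real dummy.c types. */
struct demo_stats {
	unsigned long tx_packets;
	unsigned long tx_bytes;
};

/*
 * C89-friendly layout: every declaration sits at the top of the block,
 * before the first statement, so older compilers accept it.
 */
static unsigned int demo_xmit(struct demo_stats *stats, unsigned int len,
			      unsigned int verdict)
{
	unsigned int from;	/* declared up front, assigned later */

	stats->tx_packets++;
	stats->tx_bytes += len;

	from = verdict & 0x3;	/* illustrative mask only */
	return from;
}
```

Moving the declaration of "from" below the counter updates is what a C89
compiler rejects with a parse error like the one quoted above.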

cheers,
jamal

[-- Attachment #2: dummy.c.gz --]
[-- Type: application/x-gzip, Size: 2952 bytes --]

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: dummy as IMQ replacement
  2005-03-20 13:20                     ` jamal
@ 2005-03-20 13:55                       ` jamal
  2005-03-20 18:31                         ` jamal
  2005-03-21 22:08                       ` Andy Furniss
  1 sibling, 1 reply; 126+ messages in thread
From: jamal @ 2005-03-20 13:55 UTC (permalink / raw)
  To: Andy Furniss
  Cc: Thomas Graf, Remus, netdev, Nguyen Dinh Nam, Andre Tomt,
	syrius.ml, Damion de Soto

On Sun, 2005-03-20 at 08:20, jamal wrote:

> On Sat, 2005-03-19 at 05:23, Andy Furniss wrote:
> 

> > [root@amd /home/andy/Qos]# tc -s filter ls dev eth0 parent ffff:
> > 
> > filter protocol ip pref 10 u32
> > filter protocol ip pref 10 u32 fh 800: ht divisor 1
> > filter protocol ip pref 10 u32 fh 800::800 order 2048 key ht 800 bkt 0 
> > flowid 1:1  (rule hit 12 success 12)
> >    match 00000000/00000000 at 0 (success 12 )
> >          action order 1: tablename: mangle  hook: NF_IP_PRE_ROUTING
> >          target MARK set 0x1
> >          index 1 ref 1 bind 1 installed 251 sec expires 1 sec
> >          Action statistics:
> >          Sent 7630953 bytes 0 pkt
> >          rate 3146Kbit 1095565348pps
> > 
> 
> Ok, this seems to be a bug in the stats - I think it may have been
> introduced during the new kernel stats code updates.
> Ive cced Thomas who added that code, he may be able to figure it oput
> before i get back
> 

OTOH, this may be a kernel issue. There have been some changes recently
which updated some counters from 32 bit to 64 bit ;-> Clearly this
will break the ABI and will give crap stats.
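
To illustrate why widening a counter breaks the ABI: a reader built against
the old 32-bit layout that is handed data in the new 64-bit layout picks up
the wrong bytes for every field after the first. The sketch below uses
hypothetical struct and function names and only simulates the mismatch in
user space.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical layouts, for illustration only. */
struct demo_times_old {
	uint32_t install;
	uint32_t lastuse;
	uint32_t expires;
};

struct demo_times_new {
	uint64_t install;
	uint64_t lastuse;
	uint64_t expires;
};

/*
 * Simulates an old binary parsing a buffer that was filled in with the
 * new layout. Returns the 'lastuse' value the old reader would see.
 */
static uint32_t demo_old_reader_lastuse(const struct demo_times_new *sent)
{
	struct demo_times_old seen;

	/* The old reader only knows about the first sizeof(seen) bytes. */
	memcpy(&seen, sent, sizeof(seen));
	return seen.lastuse;
}
```

With install=251 and lastuse=3 on the sending side, the old reader's
"lastuse" comes out as one half of the 64-bit install field rather than 3,
whichever byte order the machine uses.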

Also try kernel 2.6.10 if you can.
I think we've narrowed down iptables to be working if <= 1.2.11.
It will help me if you can narrow down the iproute2 version as well
as the kernel version where things start breaking.

cheers,
jamal

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: dummy as IMQ replacement
  2005-03-20 13:55                       ` jamal
@ 2005-03-20 18:31                         ` jamal
  0 siblings, 0 replies; 126+ messages in thread
From: jamal @ 2005-03-20 18:31 UTC (permalink / raw)
  To: Andy Furniss
  Cc: Thomas Graf, Remus, netdev, Nguyen Dinh Nam, Andre Tomt,
	syrius.ml, Damion de Soto

[-- Attachment #1: Type: text/plain, Size: 2011 bytes --]


[It's amazing how much time I seem to have when I have no test machine].

Andy, don't bother trying to figure out which kernels break. I think that
I may have found the bug, though I am not 100% sure - it's subtle but tricky.

I think my suspicion was correct - the stats changes that happened a
while back are causing havoc. Patch "p_kstats" should resolve a kernel
side bug introduced at the time. Patch "p_tcstats" for now should
resolve the tc side. I have not tested but was able to compile courtesy
of someone's machine.
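
The idea behind moving TCA_ACT_STATS between enums can be sketched as
follows: an attribute number is just a position in an enum, so declaring
the same attribute in a different enum silently changes its value and the
two sides stop agreeing. The enum members below are hypothetical and
shortened; the real headers have different and longer lists.

```c
/* Hypothetical, shortened attribute lists; only the positions matter here. */
enum demo_top_attr {
	DEMO_TOP_UNSPEC,
	DEMO_TOP_KIND,
	DEMO_TOP_OPTIONS,
	DEMO_TOP_STATS,
	DEMO_TOP_RATE,
	DEMO_TOP_ACT_STATS	/* stats attribute in the wrong enum: value 5 */
};

enum demo_act_attr {
	DEMO_ACT_UNSPEC,
	DEMO_ACT_KIND,
	DEMO_ACT_OPTIONS,
	DEMO_ACT_INDEX,
	DEMO_ACT_STATS		/* same attribute next to its siblings: value 4 */
};

/* Number a sender would use if it relied on the misplaced definition. */
static int demo_sender_type(void)
{
	return DEMO_TOP_ACT_STATS;
}

/* A receiver using the action enum only recognises value 4 as stats. */
static int demo_receiver_accepts(int type)
{
	return type == DEMO_ACT_STATS;
}
```

In this sketch the sender emits 5 while the receiver looks for 4, so the
stats attribute is never matched.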

cheers,
jamal

[-- Attachment #2: p_kstats --]
[-- Type: text/plain, Size: 414 bytes --]

--- a/include/linux/rtnetlink.h	2005/03/20 17:53:07	1.1
+++ b/include/linux/rtnetlink.h	2005/03/20 17:53:34
@@ -699,7 +699,6 @@
 	TCA_RATE,
 	TCA_FCNT,
 	TCA_STATS2,
-	TCA_ACT_STATS,
 	__TCA_MAX
 };
 
--- a/include/linux/pkt_cls.h	2005/03/22 17:54:23	1.1
+++ b/include/linux/pkt_cls.h	2005/03/22 17:55:15
@@ -80,6 +80,7 @@
 	TCA_ACT_KIND,
 	TCA_ACT_OPTIONS,
 	TCA_ACT_INDEX,
+	TCA_ACT_STATS,
 	__TCA_ACT_MAX
 };
 

[-- Attachment #3: p_tcstats --]
[-- Type: text/plain, Size: 616 bytes --]

--- a/include/linux/rtnetlink.h	2005/03/20 17:56:53	1.1
+++ b/include/linux/rtnetlink.h	2005/03/20 17:57:17
@@ -699,7 +699,6 @@
 	TCA_RATE,
 	TCA_FCNT,
 	TCA_STATS2,
-	TCA_ACT_STATS,
 	__TCA_MAX
 };
 
--- a/include/linux/pkt_cls.h	2005-03-20 08:45:44.000000000 -0500
+++ b/include/linux/pkt_cls.h	2005-03-20 12:56:19.000000000 -0500
@@ -78,6 +78,7 @@
 	TCA_ACT_KIND,
 	TCA_ACT_OPTIONS,
 	TCA_ACT_INDEX,
+	TCA_ACT_STATS,
 	__TCA_ACT_MAX
 };
 
@@ -136,9 +137,9 @@
 
 struct tcf_t
 {
-	__u32   install;
-	__u32   lastuse;
-	__u32   expires;
+	__u64   install;
+	__u64   lastuse;
+	__u64   expires;
 };
 
 struct tc_cnt

^ permalink raw reply	[flat|nested] 126+ messages in thread

* iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-19  1:09               ` Andy Furniss
  2005-03-19  1:45                 ` jamal
@ 2005-03-21 13:14                 ` jamal
  2005-03-21 21:50                   ` Andy Furniss
  2005-03-23  1:31                   ` Patrick McHardy
  1 sibling, 2 replies; 126+ messages in thread
From: jamal @ 2005-03-21 13:14 UTC (permalink / raw)
  To: Andy Furniss
  Cc: Harald Welte, Patrick McHardy, Remus, netdev, Nguyen Dinh Nam,
	Andre Tomt, syrius.ml, Damion de Soto

On Fri, 2005-03-18 at 20:09, Andy Furniss wrote:
> jamal wrote:
> > Hi Remus,
> > I could not reproduce this one - it is also a bit odd for calloc to
> > fail. I dont have iptables 1.3.1 but i will get and retry.
> > Does this happen all the time?
> 
> I get the same with iptables 1.3.1 and 1.3.0
> 
> iptables: calloc failed: Cannot allocate memory
> 
> using kernel 2.6.11.3 and tc iproute2-ss050314
> 
> If I try an earlier iptables (tested 9, 10, 11) I get
> 

Ok, I think I figured this one out as well - sorry, I still don't have
access to my test hardware to verify.

As I was suspecting, this is related to iptables breaking backwards
compatibility. Starting with 1.3.0 the target structure changed ;->
(right at the top is a new field called version)
I suspect the iptables folks may be unaware that there are other users of
iptables and assume that anyone needing to use new iptables will
recompile everything from scratch. BAD! BAD!
I am cc'ing the necessary evildoers (Harald and Patrick - at least they
would know who the real evildoer is).
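
The failure mode described above can be sketched in user space. Assuming a
hypothetical target structure that gains a new first field (the real
iptables structures have more members), code compiled against the old
header reads the wrong member:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical "old header" view of a target description. */
struct demo_target_old {
	const char *name;
	size_t size;
};

/* Hypothetical "new header" view: a field was added at the top. */
struct demo_target_new {
	const char *version;
	const char *name;
	size_t size;
};

/*
 * Simulates a caller built against the old header being handed an object
 * laid out by the new one. It reads 'name' at the old offset.
 */
static const char *demo_name_seen_by_old_caller(const struct demo_target_new *t)
{
	struct demo_target_old as_old;

	/* The old caller interprets the leading bytes with its own layout. */
	memcpy(&as_old, t, sizeof(as_old));
	return as_old.name;
}
```

Here the old caller gets the version string where it expects the name, and
its "size" would likewise come from the wrong member, which could plausibly
explain an absurd allocation request.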

To test the theory, copy iptables.h and iptables_common.h from
iptables-1.3.1/include into iproute2/include with the latest iproute2
and recompile. Make sure m_ipt.c is recompiled - you may have to do a
make clean in iproute2/tc/

I should be able to validate all this stuff starting tomorrow evening.
Also I have a feeling that if you make this change, things will not work
for iptables 1.2.9/10/11. Can you verify that?

cheers,
jamal

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-21 13:14                 ` iptables breakage WAS(Re: " jamal
@ 2005-03-21 21:50                   ` Andy Furniss
  2005-03-21 22:41                     ` jamal
  2005-03-23  1:31                   ` Patrick McHardy
  1 sibling, 1 reply; 126+ messages in thread
From: Andy Furniss @ 2005-03-21 21:50 UTC (permalink / raw)
  To: hadi
  Cc: Harald Welte, Patrick McHardy, Remus, netdev, Nguyen Dinh Nam,
	Andre Tomt, syrius.ml, Damion de Soto

jamal wrote:
> On Fri, 2005-03-18 at 20:09, Andy Furniss wrote:
> 
>>jamal wrote:
>>
>>>Hi Remus,
>>>I could not reproduce this one - it is also a bit odd for calloc to
>>>fail. I dont have iptables 1.3.1 but i will get and retry.
>>>Does this happen all the time?
>>
>>I get the same with iptables 1.3.1 and 1.3.0
>>
>>iptables: calloc failed: Cannot allocate memory
>>
>>using kernel 2.6.11.3 and tc iproute2-ss050314
>>
>>If I try an earlier iptables (tested 9, 10, 11) I get
>>
> 
> 
> Ok, I think i figured this one out as well - sorry dont have access to
> my test hardware still to verify.
> 
> As i was suspecting this is related to iptables breaking backwards
> compatibility. Starting with 1.3.0 the target structure changed ;->
> (right at the top is a new field called version)
> I suspect the iptables folks maybe unaware that there are other users of
> iptables and assume that anyone needing to use new iptables will
> recompile everything from scratch. BAD! BAD!
> I am ccing the necessary evil doers (Harald and Patrick - at least they
> would know who the real evildoer is). 
> 
> To test the theory copy iptables.h and iptables_common.h from
> iptables-1.3.1/include into iproute2/include with the latest iproute2
> and recompile. Make sure m_ipt.c is recompiled - you may have to do a 
> make clean in iproute2/tc/

I haven't done a new kernel with stats patched yet. Using iptables 1.3.1 
and iproute2-ss050314 with iptables headers I now get below instead of 
memory error.

++ /usr/sbin/tc filter add dev eth0 parent ffff: protocol ip prio 10 u32 
match u32 0 0 flowid 1:1 action ipt -j MARK --set-mark 1 action mirred 
egress redirect dev dummy0
tablename: mangle hook: NF_IP_PRE_ROUTING
         target: MARK set 0x1  index 0
bad action type mirred
Usage: ... gact <ACTION> [RAND] [INDEX]
Where: ACTION := reclassify | drop | continue | pass RAND := random 
<RANDTYPE> <ACTION> <VAL>RANDTYPE := netrand | determVAL : = value not 
exceeding 10000INDEX := index value used
bad action parsing
parse_action: bad value (5:mirred)!
Illegal "action"

I will try with new kernel later tonight.

> 
> I should be able to validate all this stuff starting tommorow evening.
> Also I have a feeling if you make this change, things will not work for
> iptables <=1.2.9/10/11. Can you verify that?
>

Yes it segfaults with iptables v1.2.11


++ /usr/sbin/tc filter add dev eth0 parent ffff: protocol ip prio 10 u32 
match u32 0 0 flowid 1:1 action ipt -j MARK --set-mark 1 action mirred 
egress redirect dev dummy0
./dummy-ingress-2: line 43:  1345 Segmentation fault      $TC filter add 
dev eth0 parent ffff: protocol ip prio 10 u32 match u32 0 0 flowid 1:1 
action ipt -j MARK --set-mark 1 action mirred egress redirect dev dummy0

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: dummy as IMQ replacement
  2005-03-20 13:20                     ` jamal
  2005-03-20 13:55                       ` jamal
@ 2005-03-21 22:08                       ` Andy Furniss
  1 sibling, 0 replies; 126+ messages in thread
From: Andy Furniss @ 2005-03-21 22:08 UTC (permalink / raw)
  To: hadi
  Cc: Thomas Graf, Remus, netdev, Nguyen Dinh Nam, Andre Tomt,
	syrius.ml, Damion de Soto

jamal wrote:
> Hi Andy,
> Apologies again - I wont be able to get access to my test machine until
> tuesday.
> 
> On Sat, 2005-03-19 at 05:23, Andy Furniss wrote:
> 
> 
>>$TC filter add dev eth0 parent ffff: protocol ip prio 10 u32 \
>>match u32 0 0 flowid 1:1 \
>>action ipt -j MARK --set-mark 1
>>
>>It still gives memory error with 1.3.1 , with 1.2.11 it parses OK but I 
>>get bogus stats - hit count is OK
>>
>>[root@amd /home/andy/Qos]# tc -s filter ls dev eth0 parent ffff:
>>
>>filter protocol ip pref 10 u32
>>filter protocol ip pref 10 u32 fh 800: ht divisor 1
>>filter protocol ip pref 10 u32 fh 800::800 order 2048 key ht 800 bkt 0 
>>flowid 1:1  (rule hit 12 success 12)
>>   match 00000000/00000000 at 0 (success 12 )
>>         action order 1: tablename: mangle  hook: NF_IP_PRE_ROUTING
>>         target MARK set 0x1
>>         index 1 ref 1 bind 1 installed 251 sec expires 1 sec
>>         Action statistics:
>>         Sent 7630953 bytes 0 pkt
>>         rate 3146Kbit 1095565348pps
>>
> 
> 
> Ok, this seems to be a bug in the stats - I think it may have been
> introduced during the new kernel stats code updates.
> Ive cced Thomas who added that code, he may be able to figure it oput
> before i get back
> 
> 
>>If I try with the lines below added
>>
>>action egress redirect dev dummy0 or
>>action redirect dev dummy0
>>
>>I just get errors on whatever is after action - or memory errors with 1.3.1.
>>
>>Using tc iproute2-ss050112 + patch for these tests.
>>
> 
> 
> So if i have understood you correctly, with this version of tc and
> version of iproute2, you have no problems other than stats being messed
> up? i.e action ipt .. action mirred .. looks/works fine?

No, I haven't got anything to work with action mirred; the stats were from
just using

$TC filter add dev eth0 parent ffff: protocol ip prio 10 u32 \
match u32 0 0 flowid 1:1 \
action ipt -j MARK --set-mark 1

Andy.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-21 21:50                   ` Andy Furniss
@ 2005-03-21 22:41                     ` jamal
  2005-03-22  1:15                       ` Andy Furniss
  0 siblings, 1 reply; 126+ messages in thread
From: jamal @ 2005-03-21 22:41 UTC (permalink / raw)
  To: Andy Furniss
  Cc: Harald Welte, Patrick McHardy, Remus, netdev, Nguyen Dinh Nam,
	Andre Tomt, syrius.ml, Damion de Soto

On Mon, 2005-03-21 at 16:50, Andy Furniss wrote:
> jamal wrote:

> > To test the theory copy iptables.h and iptables_common.h from
> > iptables-1.3.1/include into iproute2/include with the latest iproute2
> > and recompile. Make sure m_ipt.c is recompiled - you may have to do a 
> > make clean in iproute2/tc/
> 
> I haven't done a new kernel with stats patched yet. 

Thanks for catching that btw - it was tricky; I have a strong feeling it
was resolved by the patch I sent.

> Using iptables 1.3.1 
> and iproute2-ss050314 with iptables headers I now get below instead of 
> memory error.
> 
> ++ /usr/sbin/tc filter add dev eth0 parent ffff: protocol ip prio 10 u32 
> match u32 0 0 flowid 1:1 action ipt -j MARK --set-mark 1 action mirred 
> egress redirect dev dummy0
> tablename: mangle hook: NF_IP_PRE_ROUTING
>          target: MARK set 0x1  index 0
> bad action type mirred
> Usage: ... gact <ACTION> [RAND] [INDEX]
> Where: ACTION := reclassify | drop | continue | pass RAND := random 
> <RANDTYPE> <ACTION> <VAL>RANDTYPE := netrand | determVAL : = value not 
> exceeding 10000INDEX := index value used
> bad action parsing
> parse_action: bad value (5:mirred)!
> Illegal "action"
> 

But what happens when you try without mirred? Lets debug that first.

The fact that mirred fails is very strange - it shouldn't.
[You could try something like "action ok" instead of "action mirred .."
and see if cascading of actions works ..]. Remus didn't seem to have this
specific issue.

> I will try with new kernel later tonight.
> 
> > 
> > I should be able to validate all this stuff starting tommorow evening.
> > Also I have a feeling if you make this change, things will not work for
> > iptables <=1.2.9/10/11. Can you verify that?
> >
> 
> Yes it segfaults with iptables v1.2.11


So the changes that happened on iptables are neither forward nor
backward compatible. 
I am beginning to question the wisdom of putting the header files
in iproute2. We may have to make a call and say we are going to work
only on iptables >= 1.3.0 - would this make sense?

cheers,
jamal

* Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-21 22:41                     ` jamal
@ 2005-03-22  1:15                       ` Andy Furniss
  2005-03-22  3:31                         ` jamal
  0 siblings, 1 reply; 126+ messages in thread
From: Andy Furniss @ 2005-03-22  1:15 UTC (permalink / raw)
  To: hadi
  Cc: Harald Welte, Patrick McHardy, Remus, netdev, Nguyen Dinh Nam,
	Andre Tomt, syrius.ml, Damion de Soto

jamal wrote:
> On Mon, 2005-03-21 at 16:50, Andy Furniss wrote:
> 
>>jamal wrote:

> 
> But what happens when you try without mirred? Lets debug that first.
> 
> The fact that mirred fails is very strange - shouldnt;
> [You could try something like  "action ok" instead of "action mirred .."
> and see if cascading of actions works ..]. Remus didnt seem to have this
> specific issue.

Using 2.6.11.5 with new dummy.c and p_kstats.

p_tcstats wouldn't apply to latest iproute2 so used patched 
iproute2-ss050112 + p_tcstats

With iptables 1.3.1 and tc with its iptables.h and iptables_common.h 
all I can do is -

++ /usr/sbin/tc filter add dev eth0 parent ffff: protocol ip prio 10 u32 
match u32 0 0 flowid 1:1 action ok action ok

6 packets transmitted, 6 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.337/0.566/1.630/0.476 ms
[root@amd /home/andy/Qos]# tc -s filter ls dev eth0 parent ffff:
filter protocol ip pref 10 u32
filter protocol ip pref 10 u32 fh 800: ht divisor 1
filter protocol ip pref 10 u32 fh 800::800 order 2048 key ht 800 bkt 0 
flowid 1:1  (rule hit 6 success 6)
   match 00000000/00000000 at 0 (success 6 )
         action order 1: gact action pass
          random type none pass val 0
          index 3 ref 1 bind 1 installed 115 sec used 3 sec
         Action statistics:
         Sent 504 bytes 6 pkt (dropped 0, overlimits 0 requeues 0)
         rate 0bit 0pps backlog 0b 0p requeues 0

         action order 2: gact action pass
          random type none pass val 0
          index 4 ref 1 bind 1 installed 115 sec used 115 sec
         Action statistics:
         Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
         rate 0bit 0pps backlog 0b 0p requeues 0

ipt MARK now fails though -

++ /usr/sbin/tc filter add dev eth0 parent ffff: protocol ip prio 10 u32 
match u32 0 0 flowid 1:1 action ipt -j MARK --set-mark 1 action ok
tablename: mangle hook: NF_IP_PRE_ROUTING
         target: MARK set 0x1  index 0
RTNETLINK answers: Invalid argument
We have an error talking to the kernel


If I build same tc with iptables 1.2.11 headers and use iptables 1.2.11 
the above works.

mirred still fails whatever I try.

Andy.

* Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-22  1:15                       ` Andy Furniss
@ 2005-03-22  3:31                         ` jamal
  2005-03-22 21:09                           ` Andy Furniss
  0 siblings, 1 reply; 126+ messages in thread
From: jamal @ 2005-03-22  3:31 UTC (permalink / raw)
  To: Andy Furniss
  Cc: Harald Welte, Patrick McHardy, Remus, netdev, Nguyen Dinh Nam,
	Andre Tomt, syrius.ml, Damion de Soto

Andy,
Thanks for all your efforts.
I will be back on my regular setup by tomorrow evening and should be
able to hopefully test this. I am going to try:

- latest iproute2 with 1.3.x ipt changes
- i am just gonna jump to iptables 1.3.x - we are going to ignore 1.2.11
and below 
- kernel 2.6.11.5 patches with stats

Issues seen so far - the following dont work:

a) tc filter add dev eth0 parent ffff: protocol ip prio 10 u32 \
match u32 0 0 flowid 1:1 action ipt -j MARK --set-mark
[Actually did you test this?]

b) above with mirred as the next action fails in user space

c) a) with a simple "action ok" is also rejected by the kernel
with "Invalid argument"

Did i miss anything else?

cheers,
jamal

* Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-22  3:31                         ` jamal
@ 2005-03-22 21:09                           ` Andy Furniss
  2005-03-23  3:57                             ` jamal
  0 siblings, 1 reply; 126+ messages in thread
From: Andy Furniss @ 2005-03-22 21:09 UTC (permalink / raw)
  To: hadi
  Cc: Harald Welte, Patrick McHardy, Remus, netdev, Nguyen Dinh Nam,
	Andre Tomt, syrius.ml, Damion de Soto

jamal wrote:
> Andy,
> Thanks for all your efforts.
> I will be back on my regular setup by tomorrow evening and should be
> able to hopefully test this. I am going to try:
> 
> - latest iproute2 with 1.3.x ipt changes
> - i am just gonna jump to iptables 1.3.x - we are going to ignore 1.2.11
> and below 
> - kernel 2.6.11.5 patches with stats
> 
> Issues seen so far - the following dont work:
> 
> a) tc filter add dev eth0 parent ffff: protocol ip prio 10 u32 \
> match u32 0 0 flowid 1:1 action ipt -j MARK --set-mark
> [Actually did you test this?]

Not without the 1 - If I do I get

++ /usr/sbin/tc filter add dev eth0 parent ffff: protocol ip prio 10 u32 
match u32 0 0 flowid 1:1 action ipt -j MARK --set-mark
ipt: option `--set-mark' requires an argument
tablename: mangle hook: NF_IP_PRE_ROUTING
         target: MARK set 0x0  index 0
RTNETLINK answers: Invalid argument
We have an error talking to the kernel

With the one -

++ /usr/sbin/tc filter add dev eth0 parent ffff: protocol ip prio 10 u32 
match u32 0 0 flowid 1:1 action ipt -j MARK --set-mark 1
tablename: mangle hook: NF_IP_PRE_ROUTING
         target: MARK set 0x1  index 0
RTNETLINK answers: Invalid argument
We have an error talking to the kernel

> 
> b) above with mirred as the next action fails in user space

Yes -

++ /usr/sbin/tc filter add dev eth0 parent ffff: protocol ip prio 10 u32 
match u32 0 0 flowid 1:1 action ipt -j MARK --set-mark 1 action mirred 
egress redirect dev dummy0
tablename: mangle hook: NF_IP_PRE_ROUTING
         target: MARK set 0x1  index 0
bad action type mirred
Usage: ... gact <ACTION> [RAND] [INDEX]
Where: ACTION := reclassify | drop | continue | pass RAND := random 
<RANDTYPE> <ACTION> <VAL>RANDTYPE := netrand | determVAL : = value not 
exceeding 10000INDEX := index value used
bad action parsing
parse_action: bad value (5:mirred)!
Illegal "action"

I notice if I grep iproute for "bad action type" it's in m_gact.c which 
does not contain the word mirred to test at all.

> 
> c) a) with a simple "action ok" is also rejected by the kernel
> with "Invalid argument"

Yes.

> 
> Did i miss anything else?

Don't think so - I can get a and c to work with older iptables and headers.

Andy.

* Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-21 13:14                 ` iptables breakage WAS(Re: " jamal
  2005-03-21 21:50                   ` Andy Furniss
@ 2005-03-23  1:31                   ` Patrick McHardy
  2005-03-23  4:01                     ` jamal
  1 sibling, 1 reply; 126+ messages in thread
From: Patrick McHardy @ 2005-03-23  1:31 UTC (permalink / raw)
  To: hadi
  Cc: Andy Furniss, Harald Welte, Remus, netdev, Nguyen Dinh Nam,
	Andre Tomt, syrius.ml, Damion de Soto,
	Netfilter Development Mailinglist

jamal wrote:
> As i was suspecting this is related to iptables breaking backwards
> compatibility. Starting with 1.3.0 the target structure changed ;->
> (right at the top is a new field called version)
> I suspect the iptables folks maybe unaware that there are other users of
> iptables and assume that anyone needing to use new iptables will
> recompile everything from scratch. BAD! BAD!
> I am ccing the necessary evil doers (Harald and Patrick - at least they
> would know who the real evildoer is). 

We'll try to keep this in mind in the future. We could move
the version field to the end, but I guess it's already too
late. What do you think?
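
To illustrate why the position of a new field matters for code compiled
against the old headers, here is a minimal sketch. The struct names and
members are hypothetical and do not reproduce the real iptables target
structure; the point is only that a field added at the top shifts every
existing member, while a field appended at the end leaves the old
offsets alone.

```c
#include <stddef.h>

/* Hypothetical "old" layout that existing binaries were compiled against. */
struct target_old {
	const char *name;
	size_t size;
};

/* New field added at the top: every existing member moves. */
struct target_new_front {
	const char *version;
	const char *name;
	size_t size;
};

/* New field appended at the end: existing members keep their offsets. */
struct target_new_back {
	const char *name;
	size_t size;
	const char *version;
};

/* Returns 1 if name and size sit at the same offsets as in target_old. */
int front_keeps_offsets(void)
{
	return offsetof(struct target_new_front, name) ==
		       offsetof(struct target_old, name) &&
	       offsetof(struct target_new_front, size) ==
		       offsetof(struct target_old, size);
}

/* Same check for the layout where the new field is appended. */
int back_keeps_offsets(void)
{
	return offsetof(struct target_new_back, name) ==
		       offsetof(struct target_old, name) &&
	       offsetof(struct target_new_back, size) ==
		       offsetof(struct target_old, size);
}
```

Appending only helps code that never depends on the total size of the
struct, so it is a mitigation rather than a full compatibility
guarantee.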

Regards
Patrick

* Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-22 21:09                           ` Andy Furniss
@ 2005-03-23  3:57                             ` jamal
  2005-03-23 19:33                               ` Andy Furniss
  0 siblings, 1 reply; 126+ messages in thread
From: jamal @ 2005-03-23  3:57 UTC (permalink / raw)
  To: Andy Furniss
  Cc: Harald Welte, Patrick McHardy, Remus, netdev, Nguyen Dinh Nam,
	Andre Tomt, syrius.ml, Damion de Soto

[-- Attachment #1: Type: text/plain, Size: 4282 bytes --]

Ok, Andy - I have tested this and it should all work.
Can you double check on your side before i push kernel patch to Dave? I
tested on ubuntu distro on an AMD athlon.
Attached tar.gz with necessary patches. I only bothered to do 2 out of 3
tests. The second one covers the third. iptables libraries at runtime:
1.3.1

cheers,
jamal

-- start details (collected while i was testing) -----------

patch to kernel 2.6.11.5:
1)stats fix - attached as p_kernel

patch to tc:
1) stats - in patch file p_tc
2) mirred structure - in patch file p_tc
3) iptables headers copied from iptables 1.3.1 - both files in
attachment

bantu:~# uname -a
Linux bantu.foo 2.6.11.5 #1 Mon Mar 21 23:23:51 EST 2005 i686 GNU/Linux
bantu:~#

bantu:~# tc -V
tc utility, iproute2-ss050314
bantu:~#

TEST1:

Check if ipt works on its own and stats are fixed.

tc qdisc del dev eth0 ingress
tc qdisc add dev eth0 ingress

tc filter add dev eth0 parent ffff: protocol ip prio 6 u32 \
match ip src 10.0.2.24/32 flowid 1:16 \
action ipt -j TOS --set-tos Maximize-Reliability

** machine 10.0.2.24/32 is directly connected (via switch) to eth0

tc -s filter ls dev eth0 parent ffff:

bantu:~# tc -s filter ls dev eth0 parent ffff:
filter protocol ip pref 6 u32
filter protocol ip pref 6 u32 fh 800: ht divisor 1
filter protocol ip pref 6 u32 fh 800::800 order 2048 key ht 800 bkt 0
flowid 1:16  (rule hit 0 success 0)
  match 0a000218/ffffffff at 12 (success 0 )
        action order 1: tablename: mangle  hook: NF_IP_PRE_ROUTING
        target TOS set Maximize-Reliability
        index 5 ref 1 bind 1 installed 10 sec used 10 sec
        Action statistics:
        Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
        rate 0bit 0pps backlog 0b 0p requeues 0

ke82:~# ping -c 2 10.0.2.24
PING 10.0.2.24 (10.0.2.24) 56(84) bytes of data.
64 bytes from 10.0.2.24: icmp_seq=1 ttl=64 time=36.1 ms
64 bytes from 10.0.2.24: icmp_seq=2 ttl=64 time=3.79 ms

--- 10.0.2.24 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1005ms
rtt min/avg/max/mdev = 3.798/19.960/36.122/16.162 ms
bantu:~#

bantu:~# tc -s filter ls dev eth0 parent ffff:
filter protocol ip pref 6 u32
filter protocol ip pref 6 u32 fh 800: ht divisor 1
filter protocol ip pref 6 u32 fh 800::800 order 2048 key ht 800 bkt 0
flowid 1:16  (rule hit 2 success 2)
  match 0a000218/ffffffff at 12 (success 2 )
        action order 1: tablename: mangle  hook: NF_IP_PRE_ROUTING
        target TOS set Maximize-Reliability
        index 5 ref 1 bind 1 installed 109 sec used 36 sec
        Action statistics:
        Sent 168 bytes 2 pkt (dropped 0, overlimits 0 requeues 0)
        rate 0bit 0pps backlog 0b 0p requeues 0

TEST2:
- check if ipt followed by another action works.
- check if mirred works

tc qdisc del dev eth0 ingress
tc qdisc add dev eth0 ingress

tc filter add dev eth0 parent ffff: protocol ip prio 6 \
u32 match ip src 10.0.2.24/32 flowid 1:16 \
action ipt -j TOS --set-tos Maximize-Reliability \
action mirred egress redirect dev lo

--> Installs fine

Ping replies should never be seen since they are redirected to the
loopback device; tcpdump on dev lo should show them. Actually even
tcpdump on eth0 should see them - they just dont make it up the stack.

bantu:~# ping -c 2 10.0.2.24
PING 10.0.2.24 (10.0.2.24) 56(84) bytes of data.

--- 10.0.2.24 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1145ms

bantu:~#

bantu:~# tc -s filter ls dev eth0 parent ffff:
filter protocol ip pref 6 u32
filter protocol ip pref 6 u32 fh 800: ht divisor 1
filter protocol ip pref 6 u32 fh 800::800 order 2048 key ht 800 bkt 0
flowid 1:16  (rule hit 2 success 2)
  match 0a000218/ffffffff at 12 (success 2 )
        action order 1: tablename: mangle  hook: NF_IP_PRE_ROUTING
        target TOS set Maximize-Reliability
        index 6 ref 1 bind 1 installed 128 sec used 123 sec
        Action statistics:
        Sent 168 bytes 2 pkt (dropped 0, overlimits 0 requeues 0)
        rate 0bit 0pps backlog 0b 0p requeues 0

        action order 2: mirred (Egress Redirect to device lo) stolen
        index 1 ref 1 bind 1 installed 128 sec used 123 sec
        Action statistics:
        Sent 168 bytes 2 pkt (dropped 0, overlimits 0 requeues 0)
        rate 0bit 0pps backlog 0b 0p requeues 0



[-- Attachment #2: iptmir.tgz --]
[-- Type: application/x-gzip, Size: 2358 bytes --]

* Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-23  1:31                   ` Patrick McHardy
@ 2005-03-23  4:01                     ` jamal
  0 siblings, 0 replies; 126+ messages in thread
From: jamal @ 2005-03-23  4:01 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Andy Furniss, Harald Welte, Remus, netdev, Nguyen Dinh Nam,
	Andre Tomt, syrius.ml, Damion de Soto,
	Netfilter Development Mailinglist

On Tue, 2005-03-22 at 20:31, Patrick McHardy wrote:

> We'll try to keep this in mind in the future. We could move
> the version field to the end, but I guess it's already too
> late. What do you think?
> 

I think it's ok for now - we'll say if you want to use ipt you have to
use iptables 1.3.1 and above.
Just keep me in mind in the future. Like i suggested a while back,
since i am ripping code off iptables anyways, if that code gets
modularized and put in a library then the maintenance of this should be
easier.

cheers,
jamal

* Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-23  3:57                             ` jamal
@ 2005-03-23 19:33                               ` Andy Furniss
  2005-03-23 19:45                                 ` jamal
  0 siblings, 1 reply; 126+ messages in thread
From: Andy Furniss @ 2005-03-23 19:33 UTC (permalink / raw)
  To: hadi
  Cc: Harald Welte, Patrick McHardy, Remus, netdev, Nguyen Dinh Nam,
	Andre Tomt, syrius.ml, Damion de Soto

jamal wrote:
> Ok, Andy - I have tested this and it should all work.
> Can you double check on your side before i push kernel patch to Dave? I
> tested on ubuntu distro on an AMD athlon.
> Attached tar.gz with necessary patches. I only bothered to do 2 out of 3
> tests. The second one covers the third. iptables libraries at runtime:
> 1.3.1

OK rebuilt with those versions and patches.

> TEST1:
> 
> Check if ipt works on its own and stats are fixed.
> 
> tc qdisc del dev eth0 ingress
> tc qdisc add dev eth0 ingress
> 
> tc filter add dev eth0 parent ffff: protocol ip prio 6 u32 \
> match ip src 10.0.2.24/32 flowid 1:16 \
> action ipt -j TOS --set-tos Maximize-Reliability

Yes this works OK

> TEST2:
> - check if ipt followed by another action works.
> - check if mirred works
> 
> tc qdisc del dev eth0 ingress
> tc qdisc add dev eth0 ingress
> 
> tc filter add dev eth0 parent ffff: protocol ip prio 6 \
> u32 match ip src 10.0.2.24/32 flowid 1:16 \
> action ipt -j TOS --set-tos Maximize-Reliability \
> action mirred egress redirect dev lo

Also works OK

> bantu:~# tc -s filter ls dev eth0 parent ffff:

didn't get bash prompt back after doing this till <ctrl><c> but works 
and looks OK. Works if I direct to dummy0 as well :-)

The thing that still fails is trying to use MARK - but I guess that's 
not to do with mirred as I don't get any mention of it anymore.

[root@amd /home/andy/Qos]# tc qdisc del dev eth0 ingress
RTNETLINK answers: No such file or directory
[root@amd /home/andy/Qos]# tc qdisc add dev eth0 ingress
[root@amd /home/andy/Qos]# tc filter add dev eth0 parent ffff: protocol 
ip prio 6 \
 > u32 match ip src 10.0.2.24/32 flowid 1:16 \
 > action ipt -j MARK --set-mark 1
tablename: mangle hook: NF_IP_PRE_ROUTING
         target: MARK set 0x1  index 0
RTNETLINK answers: Invalid argument
We have an error talking to the kernel

I get exactly the same error if I also add action mirred egress redirect 
dev lo - before I would get different.

Andy.

* Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-23 19:33                               ` Andy Furniss
@ 2005-03-23 19:45                                 ` jamal
  2005-03-23 20:53                                   ` Andy Furniss
  0 siblings, 1 reply; 126+ messages in thread
From: jamal @ 2005-03-23 19:45 UTC (permalink / raw)
  To: Andy Furniss
  Cc: Harald Welte, Patrick McHardy, Remus, netdev, Nguyen Dinh Nam,
	Andre Tomt, syrius.ml, Damion de Soto

On Wed, 2005-03-23 at 14:33, Andy Furniss wrote:

> > bantu:~# tc -s filter ls dev eth0 parent ffff:
> 
> didn't get bash prompt back after doing this till <ctrl><c> but works 
> and looks OK. 

Needs investigation. Let's defer for now, and see if it continues to
happen.

> Works if I direct to dummy0 as well :-)
> 

Good - hopefully we can now get to where you started ;-> 
I will send the kernel patch to Dave later.

> The thing that still fails is trying to use MARK - but I guess that's 
> not to do with mirred as I don't get any mention of it anymore.
> 


For me all targets are compiled into the kernel; I didnt try with
modules. If you have any modules try to compile in and see what happens.
If it works it could spell trouble perhaps with some of the module
replay code added recently.

> [root@amd /home/andy/Qos]# tc qdisc del dev eth0 ingress
> RTNETLINK answers: No such file or directory
> [root@amd /home/andy/Qos]# tc qdisc add dev eth0 ingress
> [root@amd /home/andy/Qos]# tc filter add dev eth0 parent ffff: protocol 
> ip prio 6 \
>  > u32 match ip src 10.0.2.24/32 flowid 1:16 \
>  > action ipt -j MARK --set-mark 1
> tablename: mangle hook: NF_IP_PRE_ROUTING
>          target: MARK set 0x1  index 0
> RTNETLINK answers: Invalid argument
> We have an error talking to the kernel
> 

Ok, try the module thing; actually try to modprobe mark target first and
see if that works as well.

> I get exactly the same error if I also add action mirred egress redirect 
> dev lo - before I would get different.
> 

Didnt follow - still related to ipt?

cheers,
jamal

* Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-23 19:45                                 ` jamal
@ 2005-03-23 20:53                                   ` Andy Furniss
  2005-03-23 21:07                                     ` jamal
  0 siblings, 1 reply; 126+ messages in thread
From: Andy Furniss @ 2005-03-23 20:53 UTC (permalink / raw)
  To: hadi
  Cc: Harald Welte, Patrick McHardy, Remus, netdev, Nguyen Dinh Nam,
	Andre Tomt, syrius.ml, Damion de Soto

jamal wrote:

> 
> Ok, try the module thing; actually try to modprobe mark target first and
> see if that works as well.

Looks like they load OK - anyway I rebooted and modprobed ipt and 
ipt_MARK before test and it still fails - will do new kernel a bit later.

> 
> 
>>I get exactly the same error if I also add action mirred egress redirect 
>>dev lo - before I would get different.
>>
> 
> 
> Didnt follow - still related to ipt?

When action ipt MARK failed in previous tests and was followed by an 
action mirred ...

I would get an error like
...
bad action type mirred
...

but I can now follow the action ipt MARK line with an action mirred ..

and I just get the MARK error

tablename: mangle hook: NF_IP_PRE_ROUTING
         target: MARK set 0x1  index 0
RTNETLINK answers: Invalid argument
We have an error talking to the kernel



Andy.

* Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-23 20:53                                   ` Andy Furniss
@ 2005-03-23 21:07                                     ` jamal
  2005-03-23 22:46                                       ` Andy Furniss
  0 siblings, 1 reply; 126+ messages in thread
From: jamal @ 2005-03-23 21:07 UTC (permalink / raw)
  To: Andy Furniss
  Cc: Harald Welte, Patrick McHardy, Remus, netdev, Nguyen Dinh Nam,
	Andre Tomt, syrius.ml, Damion de Soto

On Wed, 2005-03-23 at 15:53, Andy Furniss wrote:

> 
> but I can now follow the action ipt MARK line with an action mirred ..
> 
> and I just get the MARK error
> 
> tablename: mangle hook: NF_IP_PRE_ROUTING
>          target: MARK set 0x1  index 0
> RTNETLINK answers: Invalid argument
> We have an error talking to the kernel
> 
> 

Ok, this is my worry - that it works when everything is compiled in
but not when as modules.
So when you rebuild compile everything in.

cheers,
jamal

* Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-23 21:07                                     ` jamal
@ 2005-03-23 22:46                                       ` Andy Furniss
  2005-03-23 23:12                                         ` Andy Furniss
  0 siblings, 1 reply; 126+ messages in thread
From: Andy Furniss @ 2005-03-23 22:46 UTC (permalink / raw)
  To: hadi
  Cc: Harald Welte, Patrick McHardy, Remus, netdev, Nguyen Dinh Nam,
	Andre Tomt, syrius.ml, Damion de Soto

jamal wrote:
> On Wed, 2005-03-23 at 15:53, Andy Furniss wrote:
> 
> 
>>but I can now follow the action ipt MARK line with an action mirred ..
>>
>>and I just get the MARK error
>>
>>tablename: mangle hook: NF_IP_PRE_ROUTING
>>         target: MARK set 0x1  index 0
>>RTNETLINK answers: Invalid argument
>>We have an error talking to the kernel
>>
>>
> 
> 
> Ok, this is my worry - that it works when everything is compiled in
> but not when as modules.
> So when you rebuild compile everything in.

Compiled everything in but it still doesn't work.

Andy.

* Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-23 22:46                                       ` Andy Furniss
@ 2005-03-23 23:12                                         ` Andy Furniss
  2005-03-24  0:34                                           ` jamal
  2005-03-24  0:53                                           ` jamal
  0 siblings, 2 replies; 126+ messages in thread
From: Andy Furniss @ 2005-03-23 23:12 UTC (permalink / raw)
  To: Andy Furniss
  Cc: hadi, Harald Welte, Patrick McHardy, Remus, netdev,
	Nguyen Dinh Nam, Andre Tomt, syrius.ml, Damion de Soto

Andy Furniss wrote:
> jamal wrote:
> 
>> On Wed, 2005-03-23 at 15:53, Andy Furniss wrote:
>>
>>
>>> but I can now follow the action ipt MARK line with an action mirred ..
>>>
>>> and I just get the MARK error
>>>
>>> tablename: mangle hook: NF_IP_PRE_ROUTING
>>>         target: MARK set 0x1  index 0
>>> RTNETLINK answers: Invalid argument
>>> We have an error talking to the kernel
>>>
>>>
>>
>>
>> Ok, this is my worry - that it works when everything is compiled in
>> but not when as modules.
>> So when you rebuild compile everything in.
> 
> 
> Compiled everything in but it still doesn't work.
> 
> Andy.

Noticed I get this in logs

Mar 23 23:11:18 amd kernel: MARK: targinfosize 8 != 4

Andy.

* Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-23 23:12                                         ` Andy Furniss
@ 2005-03-24  0:34                                           ` jamal
  2005-03-24  1:00                                             ` Andy Furniss
  2005-03-24  0:53                                           ` jamal
  1 sibling, 1 reply; 126+ messages in thread
From: jamal @ 2005-03-24  0:34 UTC (permalink / raw)
  To: Andy Furniss
  Cc: Harald Welte, Patrick McHardy, Remus, netdev, Nguyen Dinh Nam,
	Andre Tomt, syrius.ml, Damion de Soto

On Wed, 2005-03-23 at 18:12, Andy Furniss wrote:

> 
> Noticed I get this in logs
> 
> Mar 23 23:11:18 amd kernel: MARK: targinfosize 8 != 4
> 

Aha!
The finger is still pointing to the iptables version thing.
More breakage than i thought.

I dont get this message and it works just fine.
What iptables version are you using? I tested with 1.3.1.

cheers,
jamal

* Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-23 23:12                                         ` Andy Furniss
  2005-03-24  0:34                                           ` jamal
@ 2005-03-24  0:53                                           ` jamal
  2005-03-24  1:08                                             ` Andy Furniss
  1 sibling, 1 reply; 126+ messages in thread
From: jamal @ 2005-03-24  0:53 UTC (permalink / raw)
  To: Andy Furniss
  Cc: Harald Welte, Patrick McHardy, Remus, netdev, Nguyen Dinh Nam,
	Andre Tomt, syrius.ml, Damion de Soto


Never mind, I have reproduced this as well. It doesnt happen in all
targets it seems - just some. 

I will look at the netfilter code later and try and figure out how to unbreak
this. I think i will have to find a big bat and flog some of the
netfilter people responsible for breaking this ABI. 

cheers,
jamal

On Wed, 2005-03-23 at 18:12, Andy Furniss wrote:
> Andy Furniss wrote:
> > jamal wrote:
> > 
> >> On Wed, 2005-03-23 at 15:53, Andy Furniss wrote:
> >>
> >>
> >>> but I can now follow the action ipt MARK line with an action mirred ..
> >>>
> >>> and I just get the MARK error
> >>>
> >>> tablename: mangle hook: NF_IP_PRE_ROUTING
> >>>         target: MARK set 0x1  index 0
> >>> RTNETLINK answers: Invalid argument
> >>> We have an error talking to the kernel
> >>>
> >>>
> >>
> >>
> >> Ok, this is my worry - that it works when everything is compiled in
> >> but not when as modules.
> >> So when you rebuild compile everything in.
> > 
> > 
> > Compiled everything in but it still doesn't work.
> > 
> > Andy.
> 
> Noticed I get this in logs
> 
> Mar 23 23:11:18 amd kernel: MARK: targinfosize 8 != 4
> 
> Andy.
> 
> 
> 
> 

* Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-24  0:34                                           ` jamal
@ 2005-03-24  1:00                                             ` Andy Furniss
  0 siblings, 0 replies; 126+ messages in thread
From: Andy Furniss @ 2005-03-24  1:00 UTC (permalink / raw)
  To: hadi
  Cc: Harald Welte, Patrick McHardy, Remus, netdev, Nguyen Dinh Nam,
	Andre Tomt, syrius.ml, Damion de Soto

jamal wrote:
> On Wed, 2005-03-23 at 18:12, Andy Furniss wrote:
> 
> 
>>Noticed I get this in logs
>>
>>Mar 23 23:11:18 amd kernel: MARK: targinfosize 8 != 4
>>
> 
> 
> Aha!
> The finger is still pointing to iptables version thing.
> More breakage than i thought.
> 
> I dont get this message and it works just fine.
> What iptables version are you using? I tested with 1.3.1.

It's 1.3.1

FWIW I can use MARK OK from netfilter without the message ie.

iptables -A PREROUTING -t mangle -j MARK --set-mark 1

works fine.

Andy.

* Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-24  0:53                                           ` jamal
@ 2005-03-24  1:08                                             ` Andy Furniss
  2005-03-24 11:32                                               ` jamal
  0 siblings, 1 reply; 126+ messages in thread
From: Andy Furniss @ 2005-03-24  1:08 UTC (permalink / raw)
  To: hadi
  Cc: Harald Welte, Patrick McHardy, Remus, netdev, Nguyen Dinh Nam,
	Andre Tomt, syrius.ml, Damion de Soto

jamal wrote:
> Never mind, I have reproduced this as well. It doesnt happen in all
> targets it seems - just some. 

Whoo - I was starting to think it was me being lame somehow :-)

Andy.

* Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-24  1:08                                             ` Andy Furniss
@ 2005-03-24 11:32                                               ` jamal
  2005-03-24 11:57                                                 ` jamal
  0 siblings, 1 reply; 126+ messages in thread
From: jamal @ 2005-03-24 11:32 UTC (permalink / raw)
  To: Andy Furniss
  Cc: Harald Welte, Patrick McHardy, Remus, netdev, Nguyen Dinh Nam,
	Andre Tomt, syrius.ml, Damion de Soto

On Wed, 2005-03-23 at 20:08, Andy Furniss wrote:
> jamal wrote:
> > Never mind, I have reproduced this as well. It doesnt happen in all
> > targets it seems - just some. 
> 
> Whoo - I was starting to think it was me being lame somehow :-)

I can confirm your sanity ;->

Ok, I have figured the cause fatale at least - some targets have
multiple versions. MARK happens to be one of those. The reason TOS and
others worked is because they only have one version.

What happens when you go looking for the target is you get the new
version as a default ;-> I think the default should be to get the old
version so old binaries continue to work.
I believe you may have to explicitly go and ask for the old version
or you may have to do something funky to get the new version passed to
the kernel.
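
A rough sketch of how a revision mismatch could surface as the
"targinfosize 8 != 4" log line reported earlier. The two structs are
simplified stand-ins, assumed from the general shape of the MARK target
data on a 32-bit build rather than copied from the kernel headers, and
the function names are hypothetical: one revision carries only a mark
value, the other adds a mode byte, so the padded payload sizes differ.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed stand-in for the older MARK payload: just the mark value. */
struct mark_info_rev0 {
	uint32_t mark;
};

/* Assumed stand-in for the newer MARK payload: mark plus a mode byte.
 * The compiler pads this to a multiple of the uint32_t alignment. */
struct mark_info_rev1 {
	uint32_t mark;
	uint8_t mode;
};

/* Payload size that a given revision expects (0 for unknown revisions). */
size_t mark_payload_size(unsigned int revision)
{
	switch (revision) {
	case 0:
		return sizeof(struct mark_info_rev0);
	case 1:
		return sizeof(struct mark_info_rev1);
	default:
		return 0;
	}
}

/* Models a size check: user space built the payload for user_rev, but
 * the receiving side looked up kernel_rev. Returns 1 when sizes agree. */
int size_check_passes(unsigned int user_rev, unsigned int kernel_rev)
{
	return mark_payload_size(user_rev) == mark_payload_size(kernel_rev);
}
```

Under these assumptions, a payload built for revision 1 but checked
against revision 0 is rejected, which is consistent with targets that
have only one revision (TOS here) working while MARK fails.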

Need caffeine, I think i will find some workaround - I should probably
put it in user space.

cheers,
jamal

* Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-24 11:32                                               ` jamal
@ 2005-03-24 11:57                                                 ` jamal
  2005-03-24 15:41                                                   ` Andy Furniss
  0 siblings, 1 reply; 126+ messages in thread
From: jamal @ 2005-03-24 11:57 UTC (permalink / raw)
  To: Andy Furniss
  Cc: Harald Welte, Patrick McHardy, Remus, netdev, Nguyen Dinh Nam,
	Andre Tomt, syrius.ml, Damion de Soto

[-- Attachment #1: Type: text/plain, Size: 214 bytes --]

On Thu, 2005-03-24 at 06:32, jamal wrote:

> 
> Need caffeine, I think i will find some workaround - I should probably
> put it in user space.
> 

Ok, try attached patch on tc - seems to work for me

cheers,
jamal

[-- Attachment #2: p_mipt --]
[-- Type: text/plain, Size: 854 bytes --]

--- a/tc/m_ipt.c	2005-03-14 17:23:54.000000000 -0500
+++ b/tc/m_ipt.c	2005-03-24 06:53:31.000000000 -0500
@@ -337,6 +337,17 @@
 	return &addr;
 }
 
+static void set_revision(char *name, u_int8_t revision)
+{
+	/* Old kernel sources don't have ".revision" field,
+	*  but we stole a byte from name. */
+	name[IPT_FUNCTION_MAXNAMELEN - 2] = '\0';
+	name[IPT_FUNCTION_MAXNAMELEN - 1] = revision;
+}
+
+/* 
+ * we may need to check for version mismatch
+*/
 int
 build_st(struct iptables_target *target, struct ipt_entry_target *t)
 {
@@ -350,8 +361,11 @@
 
 		if (NULL == t) {
 			target->t = fw_calloc(1, size);
-			target->init(target->t, &nfcache);
 			target->t->u.target_size = size;
+
+			if (target->init != NULL)
+				target->init(target->t, &nfcache);
+			set_revision(target->t->u.user.name, target->revision);
 		} else {
 			target->t = t;
 		}
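
The idea behind set_revision() in the patch above can be tried in
isolation. In the sketch below, NAME_BUF_LEN is a local constant assumed
to stand in for IPT_FUNCTION_MAXNAMELEN, and get_revision() is a
hypothetical helper added only to read the byte back; neither is taken
from the iproute2 or kernel sources.

```c
#include <stdint.h>

/* Assumption: stands in for IPT_FUNCTION_MAXNAMELEN. */
#define NAME_BUF_LEN 30

/* Store the revision in the last byte of the name buffer and force a
 * terminator just before it, mirroring the helper in the patch. */
void set_revision(char *name, uint8_t revision)
{
	name[NAME_BUF_LEN - 2] = '\0';
	name[NAME_BUF_LEN - 1] = (char)revision;
}

/* Hypothetical reader: recover the revision from the stolen byte. */
uint8_t get_revision(const char *name)
{
	return (uint8_t)name[NAME_BUF_LEN - 1];
}
```

Because the terminator is written at NAME_BUF_LEN - 2, a short name such
as "MARK" still reads as the same C string after the revision byte is
set.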

* Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-24 11:57                                                 ` jamal
@ 2005-03-24 15:41                                                   ` Andy Furniss
  2005-03-25 11:13                                                     ` jamal
  0 siblings, 1 reply; 126+ messages in thread
From: Andy Furniss @ 2005-03-24 15:41 UTC (permalink / raw)
  To: hadi
  Cc: Harald Welte, Patrick McHardy, Remus, netdev, Nguyen Dinh Nam,
	Andre Tomt, syrius.ml, Damion de Soto

jamal wrote:
> On Thu, 2005-03-24 at 06:32, jamal wrote:
> 
> 
>>Need caffeine, I think i will find some workaround - I should probably
>>put it in user space.
>>
> 
> 
> Ok, try attached patch on tc - seems to work for me

Yes - it works fine now - thanks.

I can still get

tc -s filter ls dev eth0 parent ffff:

to not exit till <ctrl><c> (ps shows it as well) - I tested from clean 
boot without X/KDE etc. It only happens with action mirred egress redirect.

action ipt or action ok don't cause it.

Andy.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-24 15:41                                                   ` Andy Furniss
@ 2005-03-25 11:13                                                     ` jamal
  2005-03-25 12:39                                                       ` jamal
  2005-03-25 19:59                                                       ` Andy Furniss
  0 siblings, 2 replies; 126+ messages in thread
From: jamal @ 2005-03-25 11:13 UTC (permalink / raw)
  To: Andy Furniss
  Cc: Harald Welte, Patrick McHardy, Remus, netdev, Nguyen Dinh Nam,
	Andre Tomt, syrius.ml, Damion de Soto

On Thu, 2005-03-24 at 10:41, Andy Furniss wrote:

> I can still get
> 
> tc -s filter ls dev eth0 parent ffff:
> 
> to not exit till <ctrl><c> (ps shows it as well) - I tested from clean 
> boot without X/KDE etc. It only happens with action mirred egress redirect.
> 
> action ipt or action ok don't cause it.
> 

I have reproduced this as well ;-> Reproduced typically means it will be
fixed!
I gotta give it to you - you have helped isolate more bugs in this one
session than all others in the past summed up.

cheers,
jamal

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-25 11:13                                                     ` jamal
@ 2005-03-25 12:39                                                       ` jamal
  2005-03-25 17:27                                                         ` Patrick McHardy
  2005-03-25 19:59                                                       ` Andy Furniss
  1 sibling, 1 reply; 126+ messages in thread
From: jamal @ 2005-03-25 12:39 UTC (permalink / raw)
  To: Andy Furniss
  Cc: Harald Welte, Patrick McHardy, Remus, netdev, Nguyen Dinh Nam,
	Andre Tomt, syrius.ml, Damion de Soto

[-- Attachment #1: Type: text/plain, Size: 664 bytes --]

On Fri, 2005-03-25 at 06:13, jamal wrote:

> I have reproduced this as well ;-> Reproduced typically means it will be
> fixed!
> I gotta give it to you - you have helped isolate more bugs in this one
> session than all others in the past summed up.
> 

Dang - this is a _serious_ kernel bug. I went back to some of the rcX
kernels pre-2.6.11 and it's there too.
Essentially what happens is that once you enter netlink from user space
you can't go back in to query.

I am attaching a workaround patch for tc - actually it is a solution -
but this means we have a kernel bug that needs investigation. In other
words, we can close the issue as far as mirred is concerned.

cheers,
jamal


[-- Attachment #2: p2_mirred --]
[-- Type: text/plain, Size: 322 bytes --]

--- iproute2-2.6.11/tc/m_mirred.c	2005/03/25 12:32:39	1.2
+++ iproute2-2.6.11/tc/m_mirred.c	2005/03/25 12:34:05
@@ -263,7 +263,10 @@
 	}
 	p = RTA_DATA(tb[TCA_MIRRED_PARMS]);
 
+	/*
 	ll_init_map(&rth);
+	*/
+
 
 	if ((dev = ll_index_to_name(p->ifindex)) == 0) {
 		fprintf(stderr, "Cannot find device %d\n", p->ifindex);

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-25 12:39                                                       ` jamal
@ 2005-03-25 17:27                                                         ` Patrick McHardy
  2005-03-25 18:34                                                           ` jamal
  0 siblings, 1 reply; 126+ messages in thread
From: Patrick McHardy @ 2005-03-25 17:27 UTC (permalink / raw)
  To: hadi
  Cc: Andy Furniss, Harald Welte, Remus, netdev, Nguyen Dinh Nam,
	Andre Tomt, syrius.ml, Damion de Soto

jamal wrote:
> On Fri, 2005-03-25 at 06:13, jamal wrote:
> 
> 
>>I have reproduced this as well ;-> Reproduced typically means it will be
>>fixed!
>>I gotta give it to you - you have helped isolate more bugs in this one
>>session than all others in the past summed up.
>>
> 
> Dang - this is a _serious_ kernel bug. I went back to some of the rcX
> kernels pre-2.6.11 and it's there too.
> Essentially what happens is that once you enter netlink from user space
> you can't go back in to query.
> 
> I am attaching a workaround patch for tc - actually it is a solution -
> but this means we have a kernel bug that needs investigation. In other
> words, we can close the issue as far as mirred is concerned.

What does ps -eo args,wchan show?

Regards
Patrick

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-25 17:27                                                         ` Patrick McHardy
@ 2005-03-25 18:34                                                           ` jamal
  2005-03-25 19:01                                                             ` Patrick McHardy
  2005-03-25 19:08                                                             ` jamal
  0 siblings, 2 replies; 126+ messages in thread
From: jamal @ 2005-03-25 18:34 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Andy Furniss, Harald Welte, Remus, netdev, Nguyen Dinh Nam,
	Andre Tomt, syrius.ml, Damion de Soto

On Fri, 2005-03-25 at 12:27, Patrick McHardy wrote:

> What does ps -eo args,wchan show?
> 

It shows tc stuck on wait_for_packet; dump is:

------
tc            S C06493A0     0 20153  20074                     (NOTLB)
c3e4fc1c 00000086 c4ea8d70 c06493a0 000005b4 00000000 00000000 00000000 
       00000000 00000000 00000000 00022e09 b5edbac0 000f48bb c4ea8d70
c4ea8ed8 
       00000000 7fffffff c3e4fca0 c3e4fc78 c04b28d4 c015a52d cffebc80
c3e4fc44 
Call Trace:
 [<c04b28d4>] schedule_timeout+0xd4/0xe0
 [<c03ae4f0>] wait_for_packet+0xb0/0x110
 [<c03ae6a3>] skb_recv_datagram+0x153/0x220
 [<c03eef68>] netlink_recvmsg+0x58/0x210
 [<c03a70ac>] sock_recvmsg+0xcc/0xf0
 [<c03a8c9b>] sys_recvmsg+0x13b/0x200
 [<c03a8f8d>] sys_socketcall+0x22d/0x240
 [<c0103c0d>] sysenter_past_esp+0x52/0x75
------

user space is stuck in recvmsg(). It seems to be waiting for an
NLMSG_DONE to complete the transaction - but that never comes.

One thing i've verified so far is that it has nothing to do with the module
replay code. I also doubt it has anything to do with locks in
the kernel. It's also possible that something changed in
iproute2 that leaves it stuck waiting for NLMSG_DONE.
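
To make the NLMSG_DONE point concrete, here is a rough sketch of how a
dump reader decides that a multipart reply is finished. The struct and
macros are local stand-ins (assumed to mirror struct nlmsghdr, the
4-byte netlink alignment and NLMSG_DONE == 3 from linux/netlink.h) so
the sketch builds anywhere; demo_saw_done() is a hypothetical helper,
not iproute2 code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_NLMSG_DONE 3                  /* assumed equal to NLMSG_DONE */
#define DEMO_ALIGN(n)   (((n) + 3u) & ~3u) /* netlink messages are 4-byte aligned */

/* Local stand-in for struct nlmsghdr. */
struct demo_nlmsghdr {
	uint32_t len;    /* length of message including this header */
	uint16_t type;
	uint16_t flags;
	uint32_t seq;
	uint32_t pid;
};

/* Walk one receive buffer; return 1 if it contains the DONE marker,
 * 0 if the caller has to go back to recvmsg() for more. */
int demo_saw_done(const unsigned char *buf, size_t len)
{
	size_t off = 0;

	while (len - off >= sizeof(struct demo_nlmsghdr)) {
		struct demo_nlmsghdr h;

		memcpy(&h, buf + off, sizeof h);
		if (h.len < sizeof h || h.len > len - off)
			break;                     /* truncated or corrupt message */
		if (h.type == DEMO_NLMSG_DONE)
			return 1;
		off += DEMO_ALIGN(h.len);
		if (off > len)
			break;
	}
	return 0;
}
```

If the marker never reaches the reader, the loop around recvmsg() has
no reason to stop, which matches the wait_for_packet trace above.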

cheers,
jamal

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-25 18:34                                                           ` jamal
@ 2005-03-25 19:01                                                             ` Patrick McHardy
  2005-03-25 20:07                                                               ` Patrick McHardy
  2005-03-25 19:08                                                             ` jamal
  1 sibling, 1 reply; 126+ messages in thread
From: Patrick McHardy @ 2005-03-25 19:01 UTC (permalink / raw)
  To: hadi
  Cc: Andy Furniss, Harald Welte, Remus, netdev, Nguyen Dinh Nam,
	Andre Tomt, syrius.ml, Damion de Soto

jamal wrote:
> On Fri, 2005-03-25 at 12:27, Patrick McHardy wrote:
> 
>>What does ps -eo args,wchan show?
>
> It shows tc stuck on wait_for_packet; 
> 
> user space is stuck in recvmsg(). It seems to be waiting for an
> NLMSG_DONE to complete the transaction - but that never comes.
> 
> One thing i've verified so far is that it has nothing to do with the module
> replay code. I also doubt it has anything to do with locks in
> the kernel. It's also possible that something changed in
> iproute2 that leaves it stuck waiting for NLMSG_DONE.

Could it be that it is simply not making any forward progress?
tcf_dump_walker() doesn't save the number of skipped entries, but
the last order dumped, so it could dump the same entries again
and again when they exceed the room in the skb.

Regards
Patrick

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-25 18:34                                                           ` jamal
  2005-03-25 19:01                                                             ` Patrick McHardy
@ 2005-03-25 19:08                                                             ` jamal
  2005-03-25 19:22                                                               ` jamal
  1 sibling, 1 reply; 126+ messages in thread
From: jamal @ 2005-03-25 19:08 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Andy Furniss, Harald Welte, Remus, netdev, Nguyen Dinh Nam,
	Andre Tomt, syrius.ml, Damion de Soto


Ok, false alarm. 
The behavior the kernel exhibits is the same as it has always been.
I went back about 10 kernels, to around 2.6.8, with the same iproute2
code.
It's narrowed down to a user space problem. Investigating ..
I also found that the kernel does send NLMSG_DONE; somehow
user space misses it.

cheers,
jamal

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-25 19:08                                                             ` jamal
@ 2005-03-25 19:22                                                               ` jamal
  0 siblings, 0 replies; 126+ messages in thread
From: jamal @ 2005-03-25 19:22 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Andy Furniss, Harald Welte, Remus, netdev, Nguyen Dinh Nam,
	Andre Tomt, syrius.ml, Damion de Soto

On Fri, 2005-03-25 at 14:08, jamal wrote:

> It's narrowed down to a user space problem. Investigating ..

sigh. I just wasted 3 hours on this.

Seems someone (the patch did not come from me) made a change to the mirred
code in user space to ensure only a single socket was being used. This
will never work if you have a dump inside a dump. To avoid this, the
patch that i sent to Stephen for mirred used two separate sockets - one
for each dump. I wish i had remembered this a few hours back.
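
A toy model of why a dump inside a dump can lose its end marker on a
shared socket. This is only a sketch under loud assumptions: the queue,
the sequence numbers and every demo_* name are made up for illustration,
and real netlink delivers dump replies incrementally rather than all at
once. The point it shows is just that a nested reader on the same queue
can swallow the outer reader's DONE.

```c
#include <assert.h>

#define DEMO_QLEN 16

struct demo_msg  { int seq; int done; };
struct demo_sock { struct demo_msg q[DEMO_QLEN]; int head, tail; };

void demo_push(struct demo_sock *s, int seq, int done)
{
	assert(s->tail < DEMO_QLEN);
	s->q[s->tail].seq = seq;
	s->q[s->tail].done = done;
	s->tail++;
}

/* Read until the DONE for `seq` shows up. Messages that belong to another
 * request are thrown away. Returns 1 on DONE, 0 if the queue runs dry
 * (the point where a real recvmsg() would block). */
int demo_drain(struct demo_sock *s, int seq)
{
	while (s->head < s->tail) {
		struct demo_msg m = s->q[s->head++];

		if (m.seq != seq)
			continue;
		if (m.done)
			return 1;
	}
	return 0;
}

/* Outer dump (seq 1) whose first record triggers an inner dump (seq 2). */
int demo_nested_dump_completes(int separate_sockets)
{
	struct demo_sock outer = { .head = 0, .tail = 0 };
	struct demo_sock other = { .head = 0, .tail = 0 };
	struct demo_sock *inner = separate_sockets ? &other : &outer;

	demo_push(&outer, 1, 0);   /* outer record */
	demo_push(&outer, 1, 1);   /* outer DONE */
	outer.head++;              /* outer reader consumes its first record */

	demo_push(inner, 2, 0);    /* inner record */
	demo_push(inner, 2, 1);    /* inner DONE */
	if (!demo_drain(inner, 2))
		return 0;

	return demo_drain(&outer, 1); /* does the outer reader still see DONE? */
}
```

With one shared queue demo_nested_dump_completes(0) returns 0, because
the inner reader discarded the outer DONE; with a second socket,
demo_nested_dump_completes(1) returns 1.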

Anyways i will stick with the patch i sent Andy.

cheers,
jamal

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-25 11:13                                                     ` jamal
  2005-03-25 12:39                                                       ` jamal
@ 2005-03-25 19:59                                                       ` Andy Furniss
  2005-03-25 20:09                                                         ` Patrick McHardy
  2005-03-25 20:10                                                         ` jamal
  1 sibling, 2 replies; 126+ messages in thread
From: Andy Furniss @ 2005-03-25 19:59 UTC (permalink / raw)
  To: hadi
  Cc: Harald Welte, Patrick McHardy, Remus, netdev, Nguyen Dinh Nam,
	Andre Tomt, syrius.ml, Damion de Soto

jamal wrote:

> I gotta give it to you - you have helped isolate more bugs in this one
> session than all others in the past summed up.

LOL - there may be more to come, though I haven't used CONNMARK before 
and may be missing something about it - and maybe there is a tc way to 
look at connmark anyway?

Works as I expect -

iptables -A POSTROUTING -t mangle -j CONNMARK --set-mark 1
iptables -A INPUT -t mangle -m mark --mark 1
iptables -A PREROUTING -t mangle -j CONNMARK --restore-mark

I ping a lan pc and I get matches for the marked incoming packets in INPUT.

If I do

iptables -A POSTROUTING -t mangle -j CONNMARK --set-mark 1
iptables -A INPUT -t mangle -m mark --mark 1
tc qdisc add dev eth0 ingress
tc filter add dev eth0 parent ffff: protocol ip prio 6 u32 match ip src 0/0 \
	flowid 1:1 action ipt -j CONNMARK --restore-mark

It doesn't mark the packets.

3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.355/0.415/0.529/0.081 ms
[root@amd /home/andy/Qos]# tc -s filter ls dev eth0 parent ffff:
filter protocol ip pref 6 u32
filter protocol ip pref 6 u32 fh 800: ht divisor 1
filter protocol ip pref 6 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:1  (rule hit 3 success 3)
   match 00000000/00000000 at 12 (success 3 )
         action order 1: tablename: mangle  hook: NF_IP_PRE_ROUTING
         target CONNMARK restore
         index 2 ref 1 bind 1 installed 44 sec used 31 sec
         Action statistics:
         Sent 252 bytes 3 pkt (dropped 0, overlimits 0 requeues 0)
         rate 0bit 0pps backlog 0b 0p requeues 0

[root@amd /home/andy/Qos]# iptables -L -t mangle -v
Chain PREROUTING (policy ACCEPT 364 packets, 401K bytes)
  pkts bytes target     prot opt in     out     source               destination

Chain INPUT (policy ACCEPT 364 packets, 401K bytes)
  pkts bytes target     prot opt in     out     source               destination
     0     0            all  --  any    any     anywhere             anywhere            MARK match 0x1

Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
  pkts bytes target     prot opt in     out     source               destination

Chain OUTPUT (policy ACCEPT 300 packets, 19702 bytes)
  pkts bytes target     prot opt in     out     source               destination

Chain POSTROUTING (policy ACCEPT 300 packets, 19702 bytes)
  pkts bytes target     prot opt in     out     source               destination
     3   252 CONNMARK   all  --  any    any     anywhere             anywhere            CONNMARK set 0x1

Andy.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-25 19:01                                                             ` Patrick McHardy
@ 2005-03-25 20:07                                                               ` Patrick McHardy
  2005-03-25 20:31                                                                 ` jamal
  0 siblings, 1 reply; 126+ messages in thread
From: Patrick McHardy @ 2005-03-25 20:07 UTC (permalink / raw)
  To: hadi
  Cc: Andy Furniss, Harald Welte, Remus, netdev, Nguyen Dinh Nam,
	Andre Tomt, syrius.ml, Damion de Soto

[-- Attachment #1: Type: text/plain, Size: 553 bytes --]

Patrick McHardy wrote:
> tcf_dump_walker() doesn't save the number of skipped entries, but
> the last order dumped, so it could dump the same entries again
> and again when they exceed the room in the skb.

How about this patch? It fixes two problems:

- off-by-one while skipping entries: index is incremented before the
   comparison with s_i, so it will start dumping at entry s_i-1 instead
   of s_i
- problem described above. n_i doesn't include how many empty hash
   chains were skipped, so adding it to cb->args[0] is incorrect
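
A small user-space sketch of the resume logic the two points describe:
count every entry visited, and store the index of the next entry to
dump as the cursor. All demo_* names are hypothetical, and starting
index at 0 is an assumption of this sketch (the initial value in
pkt_act.h is outside the hunk below).

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_BUCKETS 4

struct demo_entry { int id; struct demo_entry *next; };
struct demo_cb    { long args0; };   /* stand-in for cb->args[0] */

/* Dump at most `room` entries into out[], resuming where the previous
 * call stopped. Returns the number of entries dumped by this call. */
int demo_dump_walker(struct demo_entry *table[DEMO_BUCKETS],
		     struct demo_cb *cb, int *out, int room)
{
	int index = 0;               /* assumption: cursor counts from 0 */
	int s_i = (int)cb->args0;    /* entries already dumped earlier */
	int n_i = 0;

	for (int i = 0; i < DEMO_BUCKETS; i++) {
		for (struct demo_entry *p = table[i]; p; p = p->next) {
			if (index < s_i) {   /* skip, but still count it */
				index++;
				continue;
			}
			if (n_i >= room)     /* "skb" is full, stop here */
				goto done;
			out[n_i++] = p->id;
			index++;             /* count only after a successful dump */
		}
	}
done:
	cb->args0 = index;           /* absolute position, not a delta */
	return n_i;
}
```

Calling it repeatedly with a small room until it returns 0 should visit
every entry exactly once, regardless of how many hash chains are empty.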

Regards
Patrick

[-- Attachment #2: x --]
[-- Type: text/plain, Size: 848 bytes --]

===== include/net/pkt_act.h 1.10 vs edited =====
--- 1.10/include/net/pkt_act.h	2005-01-10 22:54:01 +01:00
+++ edited/include/net/pkt_act.h	2005-03-25 20:58:28 +01:00
@@ -102,20 +102,21 @@
 		p = tcf_ht[tcf_hash(i)];
 
 		for (; p; p = p->next) {
-			index++;
-			if (index < s_i)
+			if (index < s_i) {
+				index++;
 				continue;
+			}
 			a->priv = p;
 			a->order = n_i;
 			r = (struct rtattr*) skb->tail;
 			RTA_PUT(skb, a->order, 0, NULL);
 			err = tcf_action_dump_1(skb, a, 0, 0);
 			if (0 > err) {
-				index--;
 				skb_trim(skb, (u8*)r - skb->data);
 				goto done;
 			}
 			r->rta_len = skb->tail - (u8*)r;
+			index++;
 			n_i++;
 			if (n_i >= TCA_ACT_MAX_PRIO) {
 				goto done;
@@ -124,8 +125,7 @@
 	}
 done:
 	read_unlock(&tcf_t_lock);
-	if (n_i)
-		cb->args[0] += n_i;
+	cb->args[0] = index;
 	return n_i;
 
 rtattr_failure:

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-25 19:59                                                       ` Andy Furniss
@ 2005-03-25 20:09                                                         ` Patrick McHardy
  2005-03-25 20:42                                                           ` Andy Furniss
  2005-03-25 20:10                                                         ` jamal
  1 sibling, 1 reply; 126+ messages in thread
From: Patrick McHardy @ 2005-03-25 20:09 UTC (permalink / raw)
  To: Andy Furniss
  Cc: hadi, Harald Welte, Remus, netdev, Nguyen Dinh Nam, Andre Tomt,
	syrius.ml, Damion de Soto

Andy Furniss wrote:
> iptables -A POSTROUTING -t mangle -j CONNMARK --set-mark 1
> iptables -A INPUT -t mangle -m mark --mark 1
> tc qdisc add dev eth0 ingress
> tc filter add dev eth0 parent ffff: protocol ip prio 6 u32 match ip src 
> 0/0 flowid 1:1 action ipt -j CONNMARK --restore-mark
> 
> It doesn't mark the packets.

With tc actions the ingress qdisc gets packets before connection
tracking, so CONNMARK doesn't have a connection tracking entry to
mark.
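
The ordering issue can be pictured with a toy receive path. This is a
sketch only: the structs and demo_* functions are invented for
illustration and are not kernel interfaces. It assumes restore-mark
copies the connection's mark onto the packet and does nothing when no
connection is attached yet.

```c
#include <assert.h>
#include <stddef.h>

struct demo_conn { unsigned int mark; };
struct demo_pkt  { unsigned int mark; struct demo_conn *ct; };

/* restore-mark: copy the connection mark to the packet, if there is one. */
void demo_connmark_restore(struct demo_pkt *p)
{
	if (p->ct)
		p->mark = p->ct->mark;
}

/* conntrack lookup: attach the connection entry to the packet. */
void demo_conntrack_lookup(struct demo_pkt *p, struct demo_conn *c)
{
	p->ct = c;
}

/* Returns the packet mark seen after the receive path has run. */
unsigned int demo_receive(struct demo_conn *conn, int restore_at_ingress)
{
	struct demo_pkt p = { 0, NULL };

	if (restore_at_ingress)
		demo_connmark_restore(&p);   /* tc ingress: runs before conntrack */
	demo_conntrack_lookup(&p, conn);
	if (!restore_at_ingress)
		demo_connmark_restore(&p);   /* mangle PREROUTING: runs after it */
	return p.mark;
}
```

With a connection marked 1, the ingress variant leaves the packet mark
at 0 while the PREROUTING variant yields 1, which is the difference
between the two rule sets quoted above.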

Regards
Patrick

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-25 19:59                                                       ` Andy Furniss
  2005-03-25 20:09                                                         ` Patrick McHardy
@ 2005-03-25 20:10                                                         ` jamal
  2005-03-25 20:18                                                           ` Patrick McHardy
                                                                             ` (3 more replies)
  1 sibling, 4 replies; 126+ messages in thread
From: jamal @ 2005-03-25 20:10 UTC (permalink / raw)
  To: Andy Furniss
  Cc: Harald Welte, Patrick McHardy, Remus, netdev, Nguyen Dinh Nam,
	Andre Tomt, syrius.ml, Damion de Soto

On Fri, 2005-03-25 at 14:59, Andy Furniss wrote:
> jamal wrote:
> 
> > I gotta give it to you - you have helped isolate more bugs in this one
> > session than all others in the past summed up.
> 
> LOL - there may be more to come,

;-> You know what they say: 
Trust in God only (if you believe in one) and not software. 
And you've probably heard the famous last words just before the big crash
which sound like "that was the last bug" ;->

>  though I haven't used CONNMARK before 
> and may be missing something about it - and maybe there is a tc way to 
> look at connmark anyway?
> 

I don't think connmark will work - yet. Patrick? I think you need
something attached to the skb that is derived from the netfilter
conntrack code for it to be usable.

Things will work once the "action track" is in place; i.e. you would
then say:
"match xxx .. \
 action track \
 action connmark"

If i were to prioritize my time for new actions - how important is this?
I also wish someone else would start writing some of these actions ;->
Wanna write the tracking one? I could help - wink.

cheers,
jamal

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-25 20:10                                                         ` jamal
@ 2005-03-25 20:18                                                           ` Patrick McHardy
  2005-03-25 20:45                                                             ` jamal
  2005-03-25 20:20                                                           ` Thomas Graf
                                                                             ` (2 subsequent siblings)
  3 siblings, 1 reply; 126+ messages in thread
From: Patrick McHardy @ 2005-03-25 20:18 UTC (permalink / raw)
  To: hadi
  Cc: Andy Furniss, Harald Welte, Remus, netdev, Nguyen Dinh Nam,
	Andre Tomt, syrius.ml, Damion de Soto

jamal wrote:
> I don't think connmark will work - yet. Patrick? I think you need
> something attached to the skb that is derived from the netfilter
> conntrack code for it to be usable.

Correct.

> Things will work once the  "action track" is in place; i.e you would
> then say:
> "match xxx .. \
>  action track \
>  action connmark"
> 
> If i were to prioritize my time for new actions - how important is this?
> I also wish someone else would start writing some of these actions ;->
> Wanna write the tracking one? I could help - wink.

Before this the ipt action needs to make sure the packets are in a valid
state from the point of view of conntrack/ip_tables. Right now it doesn't
even check if it's IP. Both assume the length checks in ip_rcv() have been
performed; it actually creates security problems in a few places if
they haven't - length calculations can underflow and bad things will
happen.
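
For illustration, the kind of sanity checks meant here can be sketched
in user space as below. demo_ipv4_header_ok() is a hypothetical helper
working on a raw byte buffer, not the ip_rcv() code itself, and it
leaves out the header checksum.

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if the buffer starts with a plausible IPv4 header, else 0. */
int demo_ipv4_header_ok(const unsigned char *data, size_t len)
{
	size_t ihl_bytes, tot_len;

	if (len < 20)                        /* smallest possible IPv4 header */
		return 0;
	if ((data[0] >> 4) != 4)             /* version field */
		return 0;
	ihl_bytes = (size_t)(data[0] & 0x0f) * 4;
	if (ihl_bytes < 20 || ihl_bytes > len)
		return 0;                        /* header shorter than 5 words or cut off */
	tot_len = ((size_t)data[2] << 8) | data[3];
	if (tot_len < ihl_bytes || tot_len > len)
		return 0;                        /* tot_len - ihl would underflow, or overrun */
	return 1;
}
```

Code that later computes "total length minus header length" is only
safe once a check like the tot_len one has passed.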

Regards
Patrick

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-25 20:10                                                         ` jamal
  2005-03-25 20:18                                                           ` Patrick McHardy
@ 2005-03-25 20:20                                                           ` Thomas Graf
  2005-03-25 20:48                                                             ` jamal
  2005-03-25 20:39                                                           ` Patrick McHardy
  2005-03-25 21:18                                                           ` Andy Furniss
  3 siblings, 1 reply; 126+ messages in thread
From: Thomas Graf @ 2005-03-25 20:20 UTC (permalink / raw)
  To: jamal
  Cc: Andy Furniss, Harald Welte, Patrick McHardy, Remus, netdev,
	Nguyen Dinh Nam, Andre Tomt, syrius.ml, Damion de Soto

* jamal <1111781443.1092.631.camel@jzny.localdomain> 2005-03-25 15:10
> Things will work once the  "action track" is in place; i.e you would
> then say:
> "match xxx .. \
>  action track \
>  action connmark"
> 
> If i was to prioritize my time for new actions - how important is this?

7/10 because the meta ematch could make great use of this. Matching
on netfilter meta data is in my local tree but I guess I won't
have time to test everything in the next 2 weeks so it will probably
be too late for 2.6.12.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-25 20:07                                                               ` Patrick McHardy
@ 2005-03-25 20:31                                                                 ` jamal
  2005-03-25 20:37                                                                   ` Patrick McHardy
  0 siblings, 1 reply; 126+ messages in thread
From: jamal @ 2005-03-25 20:31 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Andy Furniss, Harald Welte, Remus, netdev, Nguyen Dinh Nam,
	Andre Tomt, syrius.ml, Damion de Soto


From the outset this looks fine. What would be a good test case?
Something that will ensure we go beyond 4K(NLMSG_GOODSIZE) for a dump?

cheers,
jamal

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-25 20:31                                                                 ` jamal
@ 2005-03-25 20:37                                                                   ` Patrick McHardy
  2005-03-25 20:54                                                                     ` jamal
  0 siblings, 1 reply; 126+ messages in thread
From: Patrick McHardy @ 2005-03-25 20:37 UTC (permalink / raw)
  To: hadi
  Cc: Andy Furniss, Harald Welte, Remus, netdev, Nguyen Dinh Nam,
	Andre Tomt, syrius.ml, Damion de Soto

jamal wrote:
> From the outset this looks fine. What would be a good test case?
> Something that will ensure we go beyond 4K(NLMSG_GOODSIZE) for a dump?

Yes, that should work.

Regards
Patrick

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-25 20:10                                                         ` jamal
  2005-03-25 20:18                                                           ` Patrick McHardy
  2005-03-25 20:20                                                           ` Thomas Graf
@ 2005-03-25 20:39                                                           ` Patrick McHardy
  2005-03-25 20:55                                                             ` jamal
  2005-03-25 21:18                                                           ` Andy Furniss
  3 siblings, 1 reply; 126+ messages in thread
From: Patrick McHardy @ 2005-03-25 20:39 UTC (permalink / raw)
  To: hadi
  Cc: Andy Furniss, Harald Welte, Remus, netdev, Nguyen Dinh Nam,
	Andre Tomt, syrius.ml, Damion de Soto

jamal wrote:
> Things will work once the  "action track" is in place; i.e you would
> then say:
> "match xxx .. \
>  action track \
>  action connmark"

Thinking again, is it really necessary? Please look at the problem
and patch Phil Oester just posted, I would prefer if we could keep
conntrack out of areas with queues :)

Regards
Patrick

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-25 20:09                                                         ` Patrick McHardy
@ 2005-03-25 20:42                                                           ` Andy Furniss
  0 siblings, 0 replies; 126+ messages in thread
From: Andy Furniss @ 2005-03-25 20:42 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: hadi, Harald Welte, Remus, netdev, Nguyen Dinh Nam, Andre Tomt,
	syrius.ml, Damion de Soto

Patrick McHardy wrote:
> Andy Furniss wrote:
> 
>> iptables -A POSTROUTING -t mangle -j CONNMARK --set-mark 1
>> iptables -A INPUT -t mangle -m mark --mark 1
>> tc qdisc add dev eth0 ingress
>> tc filter add dev eth0 parent ffff: protocol ip prio 6 u32 match ip 
>> src 0/0 flowid 1:1 action ipt -j CONNMARK --restore-mark
>>
>> It doesn't mark the packets.
> 
> 
> With tc actions the ingress qdisc gets packets before connection
> tracking, so CONNMARK doesn't have a connection tracking entry to
> mark.

Ahh - thanks, I misunderstood the talk of being able to mark connections 
earlier in this thread and thought it was hooking after conntrack.

Andy.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-25 20:18                                                           ` Patrick McHardy
@ 2005-03-25 20:45                                                             ` jamal
  2005-03-25 21:10                                                               ` Patrick McHardy
  0 siblings, 1 reply; 126+ messages in thread
From: jamal @ 2005-03-25 20:45 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Andy Furniss, Harald Welte, Remus, netdev, Nguyen Dinh Nam,
	Andre Tomt, syrius.ml, Damion de Soto

On Fri, 2005-03-25 at 15:18, Patrick McHardy wrote:

> Before this the ipt action needs to make sure the packets are in valid
> state from the view of conntrack/ip_tables. Right now it doesn't even
> check if its IP. 

In regards to ipt:
This is true and the checking needs to be done. 
At the moment it is expected that the user will only direct IP packets at
ipt. Note, however - the desire is not to stick to just iptables
but rather to also accept arp packets and use the targets arptables has etc. 
In such cases it will be important that checks are made.
Even in this case though - there will be targets which probably won't care
if i gave them a decnet packet or IP - for example mark. Is this correct? I
can understand the need when headers are to be mucked with.

In regards to tracking:
We will have actions that will do all those validations - but the choice
will be up to the user's policy. Will tracking have issues if i passed it
a packet that didn't have the correct checksum?

> Both assume the length checks in ip_rcv() have been
> performed, it actually creates security problems in a few places if
> they haven't - length calculations can underflow and bad things will
> happen.
> 

I haven't really stared at the conntrack code - if i ask it to track for
me though, would it have issues?
Recall that the packets at the two tc spots (ingress/egress) already
have their skb pointers in the right spots.

cheers,
jamal

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-25 20:20                                                           ` Thomas Graf
@ 2005-03-25 20:48                                                             ` jamal
  2005-03-25 21:01                                                               ` Thomas Graf
  0 siblings, 1 reply; 126+ messages in thread
From: jamal @ 2005-03-25 20:48 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Andy Furniss, Harald Welte, Patrick McHardy, Remus, netdev,
	Nguyen Dinh Nam, Andre Tomt, syrius.ml, Damion de Soto

On Fri, 2005-03-25 at 15:20, Thomas Graf wrote:

> 7/10 because the meta ematch could make great use of this. Matching
> on netfilter meta data is in my local tree but I guess I won't
> have time to test everything in the next 2 weeks so it will probably
> be too late for 2.6.12.

/me looks at Thomas ;-> Could you be convinced to do this so i can play
with my ipsec toy? ;->

cheers,
jamal

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-25 20:37                                                                   ` Patrick McHardy
@ 2005-03-25 20:54                                                                     ` jamal
  2005-03-25 21:23                                                                       ` Patrick McHardy
  0 siblings, 1 reply; 126+ messages in thread
From: jamal @ 2005-03-25 20:54 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Andy Furniss, Harald Welte, Remus, netdev, Nguyen Dinh Nam,
	Andre Tomt, syrius.ml, Damion de Soto

[-- Attachment #1: Type: text/plain, Size: 360 bytes --]

On Fri, 2005-03-25 at 15:37, Patrick McHardy wrote:

> Yes, that should work.
> 

I think the attached script may help if you call it with something like
1000;
You may need to adapt it slightly so you add the actions with new
filters instead of directly. I would have to do a lot of plumbing from
my scripts to give you one which does just that

cheers,
jamal

[-- Attachment #2: add_actions.sh --]
[-- Type: text/x-sh, Size: 147 bytes --]

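#!/bin/bash
# Usage: add_actions.sh COUNT
# The C-style for loop below is a bash extension, hence bash rather than sh.

# Start from a clean ingress qdisc on eth1 (the delete fails harmlessly if none exists).
tc qdisc del dev eth1 ingress 2>/dev/null
tc qdisc add dev eth1 ingress

# Create COUNT standalone drop actions with indices 1..COUNT.
for ((i = 1 ; i <= $1 ; i++))
do
	tc actions add action drop index $i
done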
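
The adaptation mentioned in the mail (adding the actions through new filters rather than directly) might look roughly like the sketch below. It is not the script jamal had in mind. The 10.0.x.y source addresses, prio 10 and flowid 1:1 are placeholders, and the `run` helper and `DRY_RUN` variable are invented here so the commands can be printed without root. It assumes the ingress qdisc on eth1 already exists.

```shell
#!/bin/sh
# Sketch: create one u32 filter per drop action on the ingress qdisc (ffff:).
# DRY_RUN=1 (the default) only prints the tc commands; DRY_RUN=0 runs them.
DRY_RUN=${DRY_RUN:-1}
DEV=${DEV:-eth1}   # same device as the attached script

run() {
	if [ "$DRY_RUN" = 1 ]; then
		echo "$*"
	else
		"$@"
	fi
}

add_filters() {
	count=$1
	i=1
	while [ "$i" -le "$count" ]; do
		# Placeholder match: a distinct /32 source address per filter.
		src="10.0.$((i / 256)).$((i % 256))/32"
		run tc filter add dev "$DEV" parent ffff: protocol ip prio 10 \
			u32 match ip src "$src" flowid 1:1 action drop
		i=$((i + 1))
	done
}

add_filters "${1:-3}"
```

Called with 1000 and DRY_RUN=0 as root, this would install 1000 filters, each carrying its own drop action.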

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-25 20:39                                                           ` Patrick McHardy
@ 2005-03-25 20:55                                                             ` jamal
  2005-03-25 21:00                                                               ` Patrick McHardy
  0 siblings, 1 reply; 126+ messages in thread
From: jamal @ 2005-03-25 20:55 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Andy Furniss, Harald Welte, Remus, netdev, Nguyen Dinh Nam,
	Andre Tomt, syrius.ml, Damion de Soto

On Fri, 2005-03-25 at 15:39, Patrick McHardy wrote:

> Thinking again, is it really necessary? Please look at the problem
> and patch Phil Oester just posted, I would prefer if we could keep
> conntrack out of areas with queues :)
> 

But it is already there ;-> Recall you already end up in the qdisc
queues today ;->

cheers,
jamal

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-25 20:55                                                             ` jamal
@ 2005-03-25 21:00                                                               ` Patrick McHardy
  2005-03-25 21:44                                                                 ` jamal
  0 siblings, 1 reply; 126+ messages in thread
From: Patrick McHardy @ 2005-03-25 21:00 UTC (permalink / raw)
  To: hadi
  Cc: Andy Furniss, Harald Welte, Remus, netdev, Nguyen Dinh Nam,
	Andre Tomt, syrius.ml, Damion de Soto

jamal wrote:
> On Fri, 2005-03-25 at 15:39, Patrick McHardy wrote:
> 
> 
>>Thinking again, is it really necessary? Please look at the problem
>>and patch Phil Oester just posted, I would prefer if we could keep
>>conntrack out of areas with queues :)
>>
> 
> 
> But it is already there ;-> Recall you already end up in the qdisc
> queues today ;->

I asked Phil to send a new patch which drops the reference when
the packet leaves IP. We can't make assumptions about the packet's
fate after that, and the problem with hanging conntrack unload
really should get fixed once and for all.

Regards
Patrick

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-25 20:48                                                             ` jamal
@ 2005-03-25 21:01                                                               ` Thomas Graf
  2005-03-25 21:48                                                                 ` jamal
  0 siblings, 1 reply; 126+ messages in thread
From: Thomas Graf @ 2005-03-25 21:01 UTC (permalink / raw)
  To: jamal
  Cc: Andy Furniss, Harald Welte, Patrick McHardy, Remus, netdev,
	Nguyen Dinh Nam, Andre Tomt, syrius.ml, Damion de Soto

* jamal <1111783686.1089.661.camel@jzny.localdomain> 2005-03-25 15:48
> On Fri, 2005-03-25 at 15:20, Thomas Graf wrote:
> 
> > 7/10 because the meta ematch could make great use of this. Matching
> > on netfilter meta data is in my local tree but I guess I won't
> > have time to test everything in the next 2 weeks so it will probably
> > be too late for 2.6.12.
> 
> /me looks at Thomas ;-> Could you be convinced to do this so i can play
> with my ipsec toy? ;->

I can enqueue it to the following working queue, won't be dequeued
for quite a while though.
 resolve an open htb+gred issue
 libqsearch thing + ematch
 push meta ematch changes
 libnl + netconfig release

If you want to see results soon, look for someone else. ;->

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-25 20:45                                                             ` jamal
@ 2005-03-25 21:10                                                               ` Patrick McHardy
  2005-03-25 21:57                                                                 ` jamal
  0 siblings, 1 reply; 126+ messages in thread
From: Patrick McHardy @ 2005-03-25 21:10 UTC (permalink / raw)
  To: hadi
  Cc: Andy Furniss, Harald Welte, Remus, netdev, Nguyen Dinh Nam,
	Andre Tomt, syrius.ml, Damion de Soto

jamal wrote:
> At the moment it is expected the user will only direct IP packets at
> ipt. Note, however - desire is not to just stick to iptables
> but rather also accept arp packets and use targets arptables has etc. 
> In such cases it will be important that checks are made.
> Even in this case though - there will be targets which probably wont care
> if i gave them a decnet packet or IP - for example mark. Is this correct? I
> can understand the need for checks when headers are to be mucked with.

That is correct.

> in regards to tracking:
> We will have actions that will do all those validations - but the choice
> will be up to the user's policy. Will tracking have issues if i passed it
> a packet that didnt have the correct checksum?

No, it might (TCP) simply ignore them. NAT usually does incremental
checksumming, except for ICMP errors. As for validation - I think we
have two things, necessary validations, these can't be optional,
and useless validations, since they are not necessary :) TCP checksum
for example would be useless, since everything in iptables that cares
about it needs to verify it itself anyway.

>>Both assume the length checks in ip_rcv() have been
>>performed, it actually creates security problems in a few places if
>>they haven't - length calculations can underflow and bad things will
>>happen.
> 
> I havent really stared at the conntrack code - If i ask it to track for
> me though, would it have issues?
> Recall that the packets at the two tc spots (ingress/egress) already
> have their skb pointers in the right spots.

It will try to track. The problematic spots are length calculations,
it is assumed that skb->len >= iph->ihl*4.

Regards
Patrick

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-25 20:10                                                         ` jamal
                                                                             ` (2 preceding siblings ...)
  2005-03-25 20:39                                                           ` Patrick McHardy
@ 2005-03-25 21:18                                                           ` Andy Furniss
  2005-03-25 22:12                                                             ` IMQ again WAS(Re: " jamal
  3 siblings, 1 reply; 126+ messages in thread
From: Andy Furniss @ 2005-03-25 21:18 UTC (permalink / raw)
  To: hadi
  Cc: Harald Welte, Patrick McHardy, Remus, netdev, Nguyen Dinh Nam,
	Andre Tomt, syrius.ml, Damion de Soto

jamal wrote:

> 
> Things will work once the  "action track" is in place; i.e you would
> then say:
> "match xxx .. \
>  action track \
>  action connmark"

OK I would need that to recreate what I do now with IMQ hooked after 
deNAT so I can see local addresses and use connbytes in prerouting 
mangle (though that's on my 2.4 I can't get connbytes to work with 
latest netfilter yet anyway)

> 
> If i was to prioritize my time for new actions - how important is this?

Things are OK for me with IMQ - low bandwidth and not many filters seem 
fine. At high bandwidth/lots of filters it seems problematic - but then 
most people can use dummy now :-)

I'll have to re-run a test I did recently which was lots of tc filter 
matches at 8000pps - on egress IMQ was almost as good as directly on 
eth0. On ingress it was more than 10X worse.

> I also wish someone else would start writing some of these actions ;->
> Wanna write the tracking one? I could help - wink.

LOL - you'd probably end up writing it all anyway.

I really should try and get into coding more though, apart from a few 
small hacks I have had no practice with C/kernel stuff.

Andy.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-25 20:54                                                                     ` jamal
@ 2005-03-25 21:23                                                                       ` Patrick McHardy
  0 siblings, 0 replies; 126+ messages in thread
From: Patrick McHardy @ 2005-03-25 21:23 UTC (permalink / raw)
  To: hadi
  Cc: Andy Furniss, Harald Welte, Remus, netdev, Nguyen Dinh Nam,
	Andre Tomt, syrius.ml, Damion de Soto

jamal wrote:
> I think the attached script may help if you call it with something like
> 1000;
> You may need to adapt it slightly so you add the actions with new
> filters instead of directly. I would have to do a lot of plumbing from
> my scripts to give you one which does just that

The patch was wrong, sorry, index is initialized to -1. I'll have
another look at the second problem after getting something to eat.

Regards
Patrick

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-25 21:00                                                               ` Patrick McHardy
@ 2005-03-25 21:44                                                                 ` jamal
  0 siblings, 0 replies; 126+ messages in thread
From: jamal @ 2005-03-25 21:44 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Andy Furniss, Harald Welte, Remus, netdev, Nguyen Dinh Nam,
	Andre Tomt, syrius.ml, Damion de Soto

On Fri, 2005-03-25 at 16:00, Patrick McHardy wrote:

> I asked Phil to send a new patch which drops the reference when
> the packet leaves IP. We can't make assumptions about the packet's
> fate after that, and the problem with hanging conntrack unload
> really should get fixed once and for all.
> 

Queues which are not getting consumed are always a problem with
skbs.
One of the classical problems i have seen posted is some person
running some IDS or some other thing using BPF with more than one
socket and having his low mem box pounded by some DoS.
Soon OOM kicks in and starts randomly killing processes
because skbs are still being refcounted by the user space app that is
now unable to keep up.
i.e it is a generic problem that would happen even with NAPI with lack
of proper feedback. i suppose a conntrack reference adds more of a twist
to it ;->

cheers,
jamal

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-25 21:01                                                               ` Thomas Graf
@ 2005-03-25 21:48                                                                 ` jamal
  2005-03-25 22:03                                                                   ` Thomas Graf
  0 siblings, 1 reply; 126+ messages in thread
From: jamal @ 2005-03-25 21:48 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Andy Furniss, Harald Welte, Patrick McHardy, Remus, netdev,
	Nguyen Dinh Nam, Andre Tomt, syrius.ml, Damion de Soto

On Fri, 2005-03-25 at 16:01, Thomas Graf wrote:
> * jamal <1111783686.1089.661.camel@jzny.localdomain> 2005-03-25 15:48

> I can enqueue it to the following working queue, won't be dequeued
> for quite a while though.
>  resolve an open htb+gred issue

What is the issue with htb+gred?

>  libqsearch thing + ematch

yes, I am waiting for this too you know ;->

>  push meta ematch changes

Dare i say i am working on the meta action? Have been talking about it
for a while now ;->
I have to send the simple action patch to Dave first

>  libnl + netconfig release
> 

Are you working with the netconfig code thats out there?

> If you want to see results soon, look for someone else. ;->

Well, lets queue it here for now (I or some braver person may dequeue it
from here;->). Patrick is providing some valuable insight on what could
go wrong.

cheers,
jamal

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-25 21:10                                                               ` Patrick McHardy
@ 2005-03-25 21:57                                                                 ` jamal
  0 siblings, 0 replies; 126+ messages in thread
From: jamal @ 2005-03-25 21:57 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Andy Furniss, Harald Welte, Remus, netdev, Nguyen Dinh Nam,
	Andre Tomt, syrius.ml, Damion de Soto

On Fri, 2005-03-25 at 16:10, Patrick McHardy wrote:

> No, it might (TCP) simply ignore them. NAT usually does incremental
> checksumming, except for ICMP errors. As for validation - I think we
> have two things, necessary validations, these can't be optional,
> and useless validations, since they are not necessary :) TCP checksum
> for example would be useless, since everything in iptables that cares
> about it needs to verify it itself anyway.
> 

This is very useful info.

> >>Both assume the length checks in ip_rcv() have been
> >>performed, it actually creates security problems in a few places if
> >>they haven't - length calculations can underflow and bad things will
> >>happen.
> > 
> > I havent really stared at the conntrack code - If i ask it to track for
> > me though, would it have issues?
> > Recall that the packets at the two tc spots (ingress/egress) already
> > have their skb pointers in the right spots.
> 
> It will try to track. The problematic spots are length calculations,
> it is assumed that skb->len >= iph->ihl*4.
> 

Those kinds of things may be fine actually but not the checks.
The classifier depends on some of them being correct, i.e
you can be assured the ip header will be at skb->nh.iph when we pass
the packet.
There is a theory that ip_rcv() kind of checks in any case belong to some
action attached to a netdevice owning a table that contains IP addresses
for that netdevice ;-> This would really ease management of per device
IP addresses - maybe someday ;->.

cheers,
jamal

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-25 21:48                                                                 ` jamal
@ 2005-03-25 22:03                                                                   ` Thomas Graf
  2005-03-25 22:20                                                                     ` jamal
  0 siblings, 1 reply; 126+ messages in thread
From: Thomas Graf @ 2005-03-25 22:03 UTC (permalink / raw)
  To: jamal
  Cc: Andy Furniss, Harald Welte, Patrick McHardy, Remus, netdev,
	Nguyen Dinh Nam, Andre Tomt, syrius.ml, Damion de Soto

* jamal <1111787325.1092.690.camel@jzny.localdomain> 2005-03-25 16:48
> On Fri, 2005-03-25 at 16:01, Thomas Graf wrote:
> > * jamal <1111783686.1089.661.camel@jzny.localdomain> 2005-03-25 15:48
> 
> > I can enqueue it to the following working queue, won't be dequeued
> > for quite a while though.
> >  resolve an open htb+gred issue
> 
> What is the issue with htb+gred?

Remember your change to gred where you fixed a bug reported
by stanford checker? I have two reports that this broke things
for HTB+GRED although the change looks ok to me and is definitely
needed.

> >  libnl + netconfig release
> 
> Are you working with the netconfig code thats out there?

I did not receive a reply from the original author of the
code but I'm using it as a base. I simply imported the code
into netconfig and changed it to use libnl for now. I left
the ioctl there so it should be easy to have a compile flag
to get back to ioctl.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* IMQ again WAS(Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-25 21:18                                                           ` Andy Furniss
@ 2005-03-25 22:12                                                             ` jamal
  2005-03-25 23:26                                                               ` Andy Furniss
  2005-03-27 19:35                                                               ` Andy Furniss
  0 siblings, 2 replies; 126+ messages in thread
From: jamal @ 2005-03-25 22:12 UTC (permalink / raw)
  To: Andy Furniss
  Cc: Harald Welte, Patrick McHardy, Remus, netdev, Nguyen Dinh Nam,
	Andre Tomt, syrius.ml, Damion de Soto


Changed subject to whats being discussed ;->

On Fri, 2005-03-25 at 16:18, Andy Furniss wrote:
> jamal wrote:
[..]
> OK I would need that to recreate what I do now with IMQ hooked after 
> deNAT so I can see local addresses and use connbytes in prerouting 
> mangle (though that's on my 2.4 I can't get connbytes to work with 
> latest netfilter yet anyway)
> 

What exactly do you use such a scenario for?

> > 
> > If i was to prioritize my time for new actions - how important is this?
> 
> Things are OK for me with IMQ - low bandwidth and not many filters seem 
> fine. At high bandwidth/lots of filters it seems problematic - but then 
> most people can use dummy now :-)
> 
> I'll have to re-run a test I did recently which was lots of tc filter 
> matches at 8000pps - on egress IMQ was almost as good as directly on 
> eth0. On ingress it was more than 10X worse.
> 

How many filters? I wont suspect any difference between ingress 
and egress.

> > I also wish someone else would start writing some of these actions ;->
> > Wanna write the tracking one? I could help - wink.
> 
> LOL - you'd probably end up writing it all anyway.
> 
> I really should try and get into coding more though, apart from a few 
> small hacks I have had no practice with C/kernel stuff.
> 

Hey, you want to get started let me know ;-> Thomas and myself plan to
do good documentation on the actions and ematch as they say Real Soon
Now ;->

cheers,
jamal

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-25 22:03                                                                   ` Thomas Graf
@ 2005-03-25 22:20                                                                     ` jamal
  0 siblings, 0 replies; 126+ messages in thread
From: jamal @ 2005-03-25 22:20 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Andy Furniss, Harald Welte, Patrick McHardy, Remus, netdev,
	Nguyen Dinh Nam, Andre Tomt, syrius.ml, Damion de Soto

On Fri, 2005-03-25 at 17:03, Thomas Graf wrote:
> * jamal <1111787325.1092.690.camel@jzny.localdomain> 2005-03-25 16:48

> > 
> > What is the issue with htb+gred?
> 
> Remember your change to gred where you fixed a bug reported
> by stanford checker? I have two reports that this broke things
> for HTB+GRED although the change looks ok to me and is definitely
> needed.
> 

I dont remember the change ;-> But it may be time to make gred classful
as Werner was pushing for many years.

> I did not receive a reply from the original author of the
> code but I'm using it as a base. I simply imported the code
> into netconfig and changed it to use libnl for now. I left
> the ioctl there so it should be easy to have a compile flag
> to get back to ioctl.

selection of ioctls at compile time would be useful.

Just fork the project if they are not responding. I guarantee they will
show up when they hear you forked their code ;->
BTW, I have interest in running this code; if you have made improvements i
dont mind playing with it.

cheers,
jamal

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: IMQ again WAS(Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-25 22:12                                                             ` IMQ again WAS(Re: " jamal
@ 2005-03-25 23:26                                                               ` Andy Furniss
  2005-03-27 19:35                                                               ` Andy Furniss
  1 sibling, 0 replies; 126+ messages in thread
From: Andy Furniss @ 2005-03-25 23:26 UTC (permalink / raw)
  To: hadi
  Cc: Harald Welte, Patrick McHardy, Remus, netdev, Nguyen Dinh Nam,
	Andre Tomt, syrius.ml, Damion de Soto

jamal wrote:
> Changed subject to whats being discussed ;->
> 
> On Fri, 2005-03-25 at 16:18, Andy Furniss wrote:
> 
>>jamal wrote:
> 
> [..]
> 
>>OK I would need that to recreate what I do now with IMQ hooked after 
>>deNAT so I can see local addresses and use connbytes in prerouting 
>>mangle (though that's on my 2.4 I can't get connbytes to work with 
>>latest netfilter yet anyway)
>>
> 
> 
> What exactly do you use such a scenario for?

IMQ because my shaping box counts as my 3rd PC and sometimes runs bt, 
mldonkey, wget.

After deNAT so I can do per user fairness.

Connbytes has dual use - I mark first 80KB of bulk tcp with it and send 
it to a shortish queue which I hacked to head drop and has half my 
512kbit bandwidth.

This either prioritises browsing in the presence of bulk or stops
multiple connections in slowstart causing latency bumps if I am gaming 
and someone else is browsing big web pages.

Doesn't fix game + bulk + browsing - I think only a hack to htb/hfsc to
have a class behave as full before it actually is would help this.
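
The connbytes marking described above could be expressed along these lines. This is a sketch using the option names from current iptables-extensions (--connbytes-dir, --connbytes-mode); the 2.4 patch Andy ran may have used different syntax. The mark value 1 is a placeholder, and the `run` helper and `DRY_RUN` variable are invented here so the rule can be printed without touching the firewall.

```shell
#!/bin/sh
# Sketch: mark packets belonging to the first 80KB of each TCP connection.
# DRY_RUN=1 (the default) only prints the command; DRY_RUN=0 runs it as root.
DRY_RUN=${DRY_RUN:-1}
LIMIT_BYTES=$((80 * 1024))   # "first 80KB" from the message
MARK=1                       # placeholder mark value

run() {
	if [ "$DRY_RUN" = 1 ]; then
		echo "$*"
	else
		"$@"
	fi
}

mark_connection_start() {
	# Count bytes in both directions, matching while the total is in 0..LIMIT_BYTES.
	run iptables -t mangle -A PREROUTING -p tcp \
		-m connbytes --connbytes "0:$LIMIT_BYTES" \
		--connbytes-dir both --connbytes-mode bytes \
		-j MARK --set-mark "$MARK"
}

mark_connection_start
```

A filter on the shaping device would then steer packets carrying that mark into the short queue.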

> 
> 
>>>If i was to prioritize my time for new actions - how important is this?
>>
>>Things are OK for me with IMQ - low bandwidth and not many filters seem 
>>fine. At high bandwidth/lots of filters it seems problematic - but then 
>>most people can use dummy now :-)
>>
>>I'll have to re-run a test I did recently which was lots of tc filter 
>>matches at 8000pps - on egress IMQ was almost as good as directly on 
>>eth0. On ingress it was more than 10X worse.
>>
> 
> 
> How many filters? I wont suspect any difference between ingress 
> and egress.

I'll have to run again to be sure but I saw a big difference - on egress 
I could generate 8000pps and have each packet tested by about 1500 filters.

On ingress I saw packet loss with only a couple of hundred or so. It
was a tcp test though, so it backed off; the loss was deduced by looking
at netstat retrans on the sender, I couldn't see it on any stats. This
was with netperf. Maybe I should think of a better test - I tried udp
and it whacked my PC so much I thought it had locked up.

> 
> Hey, you want to get started let me know ;-> Thomas and myself plan to
> do good documentation on the actions and ematch as they say Real Soon
> Now ;->

I look forward to reading them :-)

Andy.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: IMQ again WAS(Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-25 22:12                                                             ` IMQ again WAS(Re: " jamal
  2005-03-25 23:26                                                               ` Andy Furniss
@ 2005-03-27 19:35                                                               ` Andy Furniss
  2005-03-28 13:39                                                                 ` Andy Furniss
  1 sibling, 1 reply; 126+ messages in thread
From: Andy Furniss @ 2005-03-27 19:35 UTC (permalink / raw)
  To: hadi
  Cc: Harald Welte, Patrick McHardy, Remus, netdev, Nguyen Dinh Nam,
	Andre Tomt, syrius.ml, Damion de Soto

jamal wrote:

>>I'll have to re-run a test I did recently which was lots of tc filter 
>>matches at 8000pps - on egress IMQ was almost as good as directly on 
>>eth0. On ingress it was more than 10X worse.
>>
> How many filters? I wont suspect any difference between ingress 
> and egress.

You are right - the test was to blame.

I was using my old PC as sender, it's frozen in time at 2.4.20 which for 
some reason has a txqueuelen on eth0 of 0. It doesn't show using netperf 
when just testing LAN speed - but makes a lot of difference for the test
I did - ifconfig eth0 txqueuelen 1000 fixed it.
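
A quick way to spot this kind of misconfigured sender is to read the queue length before testing. The sketch below assumes a kernel with sysfs (2.6 or later), so it would not run on the 2.4.20 box itself, where `ifconfig eth0` shows the value instead. `SYSFS_NET` and the function names are invented for illustration.

```shell
#!/bin/sh
# Sketch: report a device's transmit queue length and flag a zero value.
SYSFS_NET=${SYSFS_NET:-/sys/class/net}   # overridable, e.g. for testing

txqueuelen() {
	cat "$SYSFS_NET/$1/tx_queue_len"
}

check_txqueuelen() {
	dev=$1
	len=$(txqueuelen "$dev") || return 2
	if [ "$len" -eq 0 ]; then
		echo "$dev: txqueuelen is 0; try: ip link set dev $dev txqueuelen 1000"
		return 1
	fi
	echo "$dev: txqueuelen is $len"
}

# Example: check_txqueuelen eth0
```

The `ip link set dev eth0 txqueuelen 1000` hint is the iproute2 equivalent of the ifconfig command above.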

Andy.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: IMQ again WAS(Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-27 19:35                                                               ` Andy Furniss
@ 2005-03-28 13:39                                                                 ` Andy Furniss
  2005-03-28 13:45                                                                   ` jamal
  0 siblings, 1 reply; 126+ messages in thread
From: Andy Furniss @ 2005-03-28 13:39 UTC (permalink / raw)
  To: Andy Furniss
  Cc: hadi, Harald Welte, Patrick McHardy, Remus, netdev,
	Nguyen Dinh Nam, Andre Tomt, syrius.ml, Damion de Soto

Andy Furniss wrote:
> jamal wrote:
> 
>>> I'll have to re-run a test I did recently which was lots of tc filter 
>>> matches at 8000pps - on egress IMQ was almost as good as directly on 
>>> eth0. On ingress it was more than 10X worse.
>>>
>> How many filters? I wont suspect any difference between ingress and 
>> egress.
> 
> 
> You are right - the test was to blame.
> 
> I was using my old PC as sender, it's frozen in time at 2.4.20 which for 
> some reason has a txqueuelen on eth0 of 0. It doesn't show using netperf 
> when just testing LAN speed - but makes a lot of difference for the test 
> I did - ifconfig eth0 txqueuelen 1000 fixed it.

Hmm - I just tried to recreate another test I did - which was using IMQ 
to shape for a single duplex link. I was going to redo it with dummy, 
but don't seem to be able to put an egress filter on eth0 - eg. Your 
example from the first post in this thread -

What you can do with dummy currently with actions
--------------------------------------------------

Lets say you are policing packets from alias 192.168.200.200/32
you dont want those to exceed 100kbps going out.

tc filter add dev eth0 parent 1: protocol ip prio 10 u32 \
match ip src 192.168.200.200/32 flowid 1:2 \
action police rate 100kbit burst 90k drop

Gives me -

RTNETLINK answers: Invalid argument
We have an error talking to the kernel

Andy.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: IMQ again WAS(Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-28 13:39                                                                 ` Andy Furniss
@ 2005-03-28 13:45                                                                   ` jamal
  2005-03-28 13:55                                                                     ` Andy Furniss
  2005-03-28 13:57                                                                     ` jamal
  0 siblings, 2 replies; 126+ messages in thread
From: jamal @ 2005-03-28 13:45 UTC (permalink / raw)
  To: Andy Furniss
  Cc: Harald Welte, Patrick McHardy, Remus, netdev, Nguyen Dinh Nam,
	Andre Tomt, syrius.ml, Damion de Soto

On Mon, 2005-03-28 at 08:39, Andy Furniss wrote:

> Hmm - I just tried to recreate another test I did - which was using IMQ 
> to shape for a single duplex link. I was going to redo it with dummy, 
> but don't seem to be able to put an egress filter on eth0 - eg. Your 
> example from the first post in this thread -
> 
> What you can do with dummy currently with actions
> --------------------------------------------------
> 
> Lets say you are policing packets from alias 192.168.200.200/32
> you dont want those to exceed 100kbps going out.
> 
> tc filter add dev eth0 parent 1: protocol ip prio 10 u32 \
> match ip src 192.168.200.200/32 flowid 1:2 \
> action police rate 100kbit burst 90k drop
> 
> Gives me -
> 
> RTNETLINK answers: Invalid argument
> We have an error talking to the kernel
> 

Dont see why this shouldnt work. You are saying it works with
non-aliased interface addresses?

cheers,
jamal

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: IMQ again WAS(Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-28 13:45                                                                   ` jamal
@ 2005-03-28 13:55                                                                     ` Andy Furniss
  2005-03-28 14:08                                                                       ` jamal
  2005-03-28 13:57                                                                     ` jamal
  1 sibling, 1 reply; 126+ messages in thread
From: Andy Furniss @ 2005-03-28 13:55 UTC (permalink / raw)
  To: hadi
  Cc: Harald Welte, Patrick McHardy, Remus, netdev, Nguyen Dinh Nam,
	Andre Tomt, syrius.ml, Damion de Soto

jamal wrote:
> On Mon, 2005-03-28 at 08:39, Andy Furniss wrote:
> 
> 
>>Hmm - I just tried to recreate another test I did - which was using IMQ 
>>to shape for a single duplex link. I was going to redo it with dummy, 
>>but don't seem to be able to put an egress filter on eth0 - eg. Your 
>>example from the first post in this thread -
>>
>>What you can do with dummy currently with actions
>>--------------------------------------------------
>>
>>Lets say you are policing packets from alias 192.168.200.200/32
>>you dont want those to exceed 100kbps going out.
>>
>>tc filter add dev eth0 parent 1: protocol ip prio 10 u32 \
>>match ip src 192.168.200.200/32 flowid 1:2 \
>>action police rate 100kbit burst 90k drop
>>
>>Gives me -
>>
>>RTNETLINK answers: Invalid argument
>>We have an error talking to the kernel
>>
> 
> 
> Dont see why this shouldnt work. You are saying it works with
> non-aliased interface addresses?

No it doesn't work at all.

I noticed default qdisc has handle 0:

#tc -s qdisc ls dev eth0
qdisc pfifo_fast 0: bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
  Sent 13024 bytes 95 pkt (dropped 0, overlimits 0 requeues 0)
  rate 0bit 0pps backlog 0b 0p requeues 0

but using parent 0: instead of 1: still fails.

Andy.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: IMQ again WAS(Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-28 13:45                                                                   ` jamal
  2005-03-28 13:55                                                                     ` Andy Furniss
@ 2005-03-28 13:57                                                                     ` jamal
  2005-03-28 14:12                                                                       ` Andy Furniss
  1 sibling, 1 reply; 126+ messages in thread
From: jamal @ 2005-03-28 13:57 UTC (permalink / raw)
  To: Andy Furniss
  Cc: Harald Welte, Patrick McHardy, Remus, netdev, Nguyen Dinh Nam,
	Andre Tomt, syrius.ml, Damion de Soto

On Mon, 2005-03-28 at 08:45, jamal wrote:

> Dont see why this shouldnt work. You are saying it works with
> non-aliased interface addresses?
> 

Tested - worked fine here. I know those error messages suck and we can
do better in the future. Double check your kernel has police compiled
in.

cheers,
jamal

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: IMQ again WAS(Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-28 13:55                                                                     ` Andy Furniss
@ 2005-03-28 14:08                                                                       ` jamal
  0 siblings, 0 replies; 126+ messages in thread
From: jamal @ 2005-03-28 14:08 UTC (permalink / raw)
  To: Andy Furniss
  Cc: Harald Welte, Patrick McHardy, Remus, netdev, Nguyen Dinh Nam,
	Andre Tomt, syrius.ml, Damion de Soto

On Mon, 2005-03-28 at 08:55, Andy Furniss wrote:

> > Dont see why this shouldnt work. You are saying it works with
> > non-aliased interface addreses?
> 
> No it doesn't work at all.
> 
> I noticed default qdisc has handle 0:
> 
> #tc -s qdisc ls dev eth0
> qdisc pfifo_fast 0: bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>   Sent 13024 bytes 95 pkt (dropped 0, overlimits 0 requeues 0)
>   rate 0bit 0pps backlog 0b 0p requeues 0
> 
> but using parent 0: instead of 1: still fails.
> 

Aha! You need to create root qdisc 1: for it to work.
pfifo_fast is essentially just a basic prio, and it was supposed to be
hidden so the user doesn't know it exists (blame me for why it is visible).
So try installing prio and see if it still fails.

cheers,
jamal
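
A minimal sketch of the suggestion above: install a prio root qdisc with an
explicit handle 1:, then attach the police filter to it. The function names
and the TC, DEV and SRC variables are illustrative assumptions, not part of
the original mail. TC defaults to "echo tc", so the commands are only
printed; run as root with TC=tc to actually apply them.

```shell
#!/bin/sh
# Sketch only: prints the tc commands unless TC=tc is set (needs root).
# DEV and SRC are illustrative values taken from the examples in the thread.
TC="${TC:-echo tc}"
DEV="${DEV:-eth0}"
SRC="${SRC:-192.168.200.200/32}"

install_root_prio() {
    # Replace the default pfifo_fast with a prio root qdisc that has handle 1:
    $TC qdisc add dev "$DEV" root handle 1: prio
}

add_police_filter() {
    # Attach the u32 filter to the root qdisc created above (parent 1:)
    $TC filter add dev "$DEV" parent 1: protocol ip prio 10 u32 \
        match ip src "$SRC" flowid 1:2 \
        action police rate 100kbit burst 90k drop
}

install_root_prio
add_police_filter
```

The filter line is the same one used earlier in the thread; only the root
qdisc step is new.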

* Re: IMQ again WAS(Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-28 13:57                                                                     ` jamal
@ 2005-03-28 14:12                                                                       ` Andy Furniss
  2005-03-28 14:20                                                                         ` jamal
  0 siblings, 1 reply; 126+ messages in thread
From: Andy Furniss @ 2005-03-28 14:12 UTC (permalink / raw)
  To: hadi
  Cc: Harald Welte, Patrick McHardy, Remus, netdev, Nguyen Dinh Nam,
	Andre Tomt, syrius.ml, Damion de Soto

jamal wrote:
> On Mon, 2005-03-28 at 08:45, jamal wrote:
> 
> 
>>Dont see why this shouldnt work. You are saying it works with
>>non-aliased interface addreses?
>>
> 
> 
> Tested - worked fine here. I know those error messages suck and we can
> do better in the future. Double check your kernel has police compiled
> in.

Hmm - it works on ingress and all I really wanted to do was

tc filter add dev eth0 parent 0: protocol ip prio 10 u32 match ip src 0/0 \
    action mirred egress redirect dev dummy0

[root@amd /home/andy/Qos]# tc qdisc add dev eth0 ingress
[root@amd /home/andy/Qos]# tc filter add dev eth0 parent ffff: protocol ip prio 10 \
    u32 match ip src 192.168.200.200/32 flowid 1:2 \
    action police rate 100kbit burst 90k drop

[root@amd /home/andy/Qos]# tc -s filter ls dev eth0 parent ffff:
filter protocol ip pref 10 u32
filter protocol ip pref 10 u32 fh 800: ht divisor 1
filter protocol ip pref 10 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:2  (rule hit 3 success 0)
   match c0a8c8c8/ffffffff at 12 (success 0 )
         action order 1:  police 0x1 rate 100000bit burst 90Kb mtu 2Kb action drop
ref 1 bind 1
         Action statistics:
         Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
         rate 0bit 0pps backlog 0b 0p requeues 0

Andy.

* Re: IMQ again WAS(Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-28 14:12                                                                       ` Andy Furniss
@ 2005-03-28 14:20                                                                         ` jamal
  2005-03-28 14:28                                                                           ` Andy Furniss
  0 siblings, 1 reply; 126+ messages in thread
From: jamal @ 2005-03-28 14:20 UTC (permalink / raw)
  To: Andy Furniss
  Cc: Harald Welte, Patrick McHardy, Remus, netdev, Nguyen Dinh Nam,
	Andre Tomt, syrius.ml, Damion de Soto

On Mon, 2005-03-28 at 09:12, Andy Furniss wrote:

> 
> Hmm - it works on ingress and all I really wanted to do was
> 

Yes but you installed ingress qdisc ;->

> tc filter add dev eth0 parent 0: protocol ip prio 10 u32 match ip src 
> 0/0 action mirred egress redirect dev dummy0
> 

Likewise, you need to install an egress qdisc.

Alexey did warn about making the default qdisc visible - that people would
come back and ask for more ;->
If we are going to allow this then I think we should make ingress also
on by default (so the install-before-use rule doesn't apply there either).

cheers,
jamal
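
A rough sketch of what "install egress qdisc" could look like for the
mirred case quoted above. It assumes a dummy0 device that accepts redirected
packets as discussed in this thread, and that the device has to be up first.
The flowid 1:1, the function name, and the TC, IPCMD, DEV and REDIR variables
are illustrative assumptions. Both tools default to "echo", so nothing is
applied unless TC=tc and IPCMD=ip are set (as root).

```shell
#!/bin/sh
# Sketch only: prints the commands unless TC=tc and IPCMD=ip are set (needs root).
TC="${TC:-echo tc}"
IPCMD="${IPCMD:-echo ip}"
DEV="${DEV:-eth0}"
REDIR="${REDIR:-dummy0}"

redirect_egress_to_dummy() {
    # Assumption: the redirect target should be up before traffic is sent to it
    $IPCMD link set dev "$REDIR" up

    # Install an explicit root (egress) qdisc so the filter has a parent 1:
    $TC qdisc add dev "$DEV" root handle 1: prio

    # Match all IP traffic leaving DEV and redirect it to REDIR
    $TC filter add dev "$DEV" parent 1: protocol ip prio 10 u32 \
        match ip src 0/0 flowid 1:1 \
        action mirred egress redirect dev "$REDIR"
}

redirect_egress_to_dummy
```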

* Re: IMQ again WAS(Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-28 14:20                                                                         ` jamal
@ 2005-03-28 14:28                                                                           ` Andy Furniss
  2005-03-28 14:36                                                                             ` Andy Furniss
  0 siblings, 1 reply; 126+ messages in thread
From: Andy Furniss @ 2005-03-28 14:28 UTC (permalink / raw)
  To: hadi
  Cc: Harald Welte, Patrick McHardy, Remus, netdev, Nguyen Dinh Nam,
	Andre Tomt, syrius.ml, Damion de Soto

jamal wrote:
> On Mon, 2005-03-28 at 09:12, Andy Furniss wrote:
> 
> 
>>Hmm - it works on ingress and all I really wanted to do was
>>
> 
> 
> Yes but you installed ingress qdisc ;->
> 
>>tc filter add dev eth0 parent 0: protocol ip prio 10 u32 match ip src 
>>0/0 action mirred egress redirect dev dummy0
>>
> 
> 
> Likewise you need to install egress qdisc
> 
> Alexey did warn about making default qdisc visible - that people would
> come back and ask for more ;->
> If we are going to allow this then i think we should make ingress also
> on by default (so the install before use doesnt apply there either).
> 
> cheers,
> jamal
> 
> 

Still having probs - rebooted and tried -

[root@amd /home/andy/Qos]# tc qdisc add dev eth0 root pfifo

[root@amd /home/andy/Qos]# tc -s qdisc ls dev eth0
qdisc pfifo 8001: limit 1000p
  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
  rate 0bit 0pps backlog 0b 0p requeues 0

[root@amd /home/andy/Qos]# tc filter add dev eth0 parent 8001: protocol ip prio 10 \
    u32 match ip src 192.168.200.200/32 flowid 1:2 \
    action police rate 100kbit burst 90k drop
RTNETLINK answers: Invalid argument
We have an error talking to the kernel

[root@amd /home/andy/Qos]# tc filter add dev eth0 parent 1: protocol ip prio 10 \
    u32 match ip src 192.168.200.200/32 flowid 1:2 \
    action police rate 100kbit burst 90k drop
RTNETLINK answers: Invalid argument
We have an error talking to the kernel

[root@amd /home/andy/Qos]# tc filter add dev eth0 parent 0x8001: protocol ip prio 10 \
    u32 match ip src 192.168.200.200/32 flowid 1:2 \
    action police rate 100kbit burst 90k drop
RTNETLINK answers: Invalid argument
We have an error talking to the kernel

[root@amd /home/andy/Qos]# tc qdisc del dev eth0 root pfifo

[root@amd /home/andy/Qos]# tc qdisc add dev eth0 root handle 1:0 pfifo

[root@amd /home/andy/Qos]# tc -s qdisc ls dev eth0
qdisc pfifo 1: limit 1000p
  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
  rate 0bit 0pps backlog 0b 0p requeues 0

[root@amd /home/andy/Qos]# tc filter add dev eth0 parent 1: protocol ip prio 10 \
    u32 match ip src 192.168.200.200/32 flowid 1:2 \
    action police rate 100kbit burst 90k drop
RTNETLINK answers: Invalid argument
We have an error talking to the kernel

Andy.

* Re: IMQ again WAS(Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-28 14:28                                                                           ` Andy Furniss
@ 2005-03-28 14:36                                                                             ` Andy Furniss
  2005-03-28 15:24                                                                               ` Andy Furniss
  0 siblings, 1 reply; 126+ messages in thread
From: Andy Furniss @ 2005-03-28 14:36 UTC (permalink / raw)
  To: Andy Furniss
  Cc: hadi, Harald Welte, Patrick McHardy, Remus, netdev,
	Nguyen Dinh Nam, Andre Tomt, syrius.ml, Damion de Soto

Andy Furniss wrote:

> 
> Still having probs - rebooted and tried -

<snip>

It works if I use HTB instead of pfifo though.

Andy.

* Re: IMQ again WAS(Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-28 14:36                                                                             ` Andy Furniss
@ 2005-03-28 15:24                                                                               ` Andy Furniss
  2005-03-28 19:27                                                                                 ` jamal
  0 siblings, 1 reply; 126+ messages in thread
From: Andy Furniss @ 2005-03-28 15:24 UTC (permalink / raw)
  To: Andy Furniss
  Cc: hadi, Harald Welte, Patrick McHardy, Remus, netdev,
	Nguyen Dinh Nam, Andre Tomt, syrius.ml, Damion de Soto

Andy Furniss wrote:
> Andy Furniss wrote:
> 
>>
>> Still having probs - rebooted and tried -
> 
> 
> <snip>
> 
> It works if I use HTB instead of pfifo though.
> 

Played a bit more; it seems like the qdisc has to be classful.

prio, htb, cbq, hfsc work

sfq, pfifo, tbf fail

Andy.
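
The result above can be captured in a small helper that checks a qdisc name
before trying to attach a filter to it. This is only a sketch: the function
name is hypothetical, and the two lists reflect nothing more than what was
reported as working or failing in this thread.

```shell
#!/bin/sh
# Returns 0 for qdiscs reported to accept filters as root in this thread,
# 1 for those reported to fail, and 2 for anything not tested here.
qdisc_accepts_filters() {
    case "$1" in
        prio|htb|cbq|hfsc) return 0 ;;   # reported to work
        sfq|pfifo|tbf)     return 1 ;;   # reported to fail
        *)                 return 2 ;;   # not covered by the tests above
    esac
}

for q in prio htb pfifo; do
    if qdisc_accepts_filters "$q"; then
        echo "$q: usable as root qdisc for filters"
    else
        echo "$q: not usable (or untested)"
    fi
done
```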

* Re: IMQ again WAS(Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-28 15:24                                                                               ` Andy Furniss
@ 2005-03-28 19:27                                                                                 ` jamal
  2005-03-28 20:13                                                                                   ` Andy Furniss
  0 siblings, 1 reply; 126+ messages in thread
From: jamal @ 2005-03-28 19:27 UTC (permalink / raw)
  To: Andy Furniss
  Cc: Harald Welte, Patrick McHardy, Remus, netdev, Nguyen Dinh Nam,
	Andre Tomt, syrius.ml, Damion de Soto

On Mon, 2005-03-28 at 10:24, Andy Furniss wrote:

> 
> Played a bit more, it seems liks the qdisc has to be classful
> 

Sorry, yes, should have made that clear.

cheers,
jamal

* Re: IMQ again WAS(Re: iptables breakage WAS(Re: dummy as IMQ replacement
  2005-03-28 19:27                                                                                 ` jamal
@ 2005-03-28 20:13                                                                                   ` Andy Furniss
  0 siblings, 0 replies; 126+ messages in thread
From: Andy Furniss @ 2005-03-28 20:13 UTC (permalink / raw)
  To: hadi
  Cc: Harald Welte, Patrick McHardy, Remus, netdev, Nguyen Dinh Nam,
	Andre Tomt, syrius.ml, Damion de Soto

jamal wrote:
> On Mon, 2005-03-28 at 10:24, Andy Furniss wrote:
> 
> 
>>Played a bit more, it seems liks the qdisc has to be classful
>>
> 
> 
> Sorry, yes, should have made that clear.
> 

My fault - you said try prio but I read it as try pfifo. At least I
know classless qdiscs don't work now :-)

Andy.

Thread overview: 126+ messages
2005-01-30 22:12 dummy as IMQ replacement Jamal Hadi Salim
2005-01-31  8:20 ` Hasso Tepper
2005-01-31 12:25   ` jamal
2005-01-31 12:38     ` Hasso Tepper
2005-01-31 12:47       ` jamal
2005-01-31 13:02         ` Hasso Tepper
2005-01-31 13:28           ` Thomas Graf
2005-01-31 13:45             ` jamal
2005-01-31 14:06               ` Thomas Graf
2005-01-31 14:29                 ` jamal
2005-01-31 13:39           ` jamal
2005-01-31 14:14             ` Hasso Tepper
2005-01-31 14:25               ` jamal
2005-01-31 14:46                 ` Hasso Tepper
2005-01-31 15:34                   ` jamal
2005-01-31 18:00                   ` Lennert Buytenhek
2005-01-31 20:08                     ` jamal
2005-01-31 13:58 ` Thomas Graf
2005-01-31 14:19   ` jamal
2005-01-31 15:15     ` Thomas Graf
2005-01-31 15:40       ` jamal
2005-01-31 15:59         ` Thomas Graf
2005-01-31 16:40           ` jamal
2005-01-31 18:15             ` Thomas Graf
2005-01-31 20:18               ` jamal
2005-01-31 22:53                 ` Thomas Graf
2005-02-01 12:02                   ` jamal
2005-02-01 12:51                     ` Thomas Graf
2005-02-01 13:13                       ` jamal
2005-02-01 22:44                         ` Thomas Graf
2005-02-02 14:24                           ` jamal
2005-02-02 15:40                             ` Thomas Graf
2005-02-02 15:55                               ` Thomas Graf
2005-01-31 20:28         ` David S. Miller
2005-02-01  1:02       ` Andy Furniss
2005-02-01 13:31         ` Thomas Graf
2005-02-01 15:03           ` Andy Furniss
2005-02-02 13:28             ` Thomas Graf
2005-01-31 16:27 ` Andre Correa
2005-01-31 16:51   ` Jamal Hadi Salim
2005-01-31 22:39 ` Andy Furniss
2005-02-01 11:49   ` jamal
2005-02-01 14:53     ` Andy Furniss
2005-02-02 14:05       ` jamal
2005-02-04  0:33         ` Andy Furniss
2005-02-01 11:32 ` Andy Furniss
     [not found] ` <0fcf01c5077f$579e4b80$6e69690a@RIMAS>
     [not found]   ` <1107174142.8021.121.camel@jzny.localdomain>
2005-03-09 14:30     ` Remus
2005-03-09 14:38       ` jamal
2005-03-10  1:06         ` Jamal Hadi Salim
2005-03-10  9:18           ` Remus
2005-03-10 11:22             ` jamal
2005-03-19  1:09               ` Andy Furniss
2005-03-19  1:45                 ` jamal
2005-03-19 10:23                   ` Andy Furniss
2005-03-20 13:20                     ` jamal
2005-03-20 13:55                       ` jamal
2005-03-20 18:31                         ` jamal
2005-03-21 22:08                       ` Andy Furniss
2005-03-21 13:14                 ` iptables breakage WAS(Re: " jamal
2005-03-21 21:50                   ` Andy Furniss
2005-03-21 22:41                     ` jamal
2005-03-22  1:15                       ` Andy Furniss
2005-03-22  3:31                         ` jamal
2005-03-22 21:09                           ` Andy Furniss
2005-03-23  3:57                             ` jamal
2005-03-23 19:33                               ` Andy Furniss
2005-03-23 19:45                                 ` jamal
2005-03-23 20:53                                   ` Andy Furniss
2005-03-23 21:07                                     ` jamal
2005-03-23 22:46                                       ` Andy Furniss
2005-03-23 23:12                                         ` Andy Furniss
2005-03-24  0:34                                           ` jamal
2005-03-24  1:00                                             ` Andy Furniss
2005-03-24  0:53                                           ` jamal
2005-03-24  1:08                                             ` Andy Furniss
2005-03-24 11:32                                               ` jamal
2005-03-24 11:57                                                 ` jamal
2005-03-24 15:41                                                   ` Andy Furniss
2005-03-25 11:13                                                     ` jamal
2005-03-25 12:39                                                       ` jamal
2005-03-25 17:27                                                         ` Patrick McHardy
2005-03-25 18:34                                                           ` jamal
2005-03-25 19:01                                                             ` Patrick McHardy
2005-03-25 20:07                                                               ` Patrick McHardy
2005-03-25 20:31                                                                 ` jamal
2005-03-25 20:37                                                                   ` Patrick McHardy
2005-03-25 20:54                                                                     ` jamal
2005-03-25 21:23                                                                       ` Patrick McHardy
2005-03-25 19:08                                                             ` jamal
2005-03-25 19:22                                                               ` jamal
2005-03-25 19:59                                                       ` Andy Furniss
2005-03-25 20:09                                                         ` Patrick McHardy
2005-03-25 20:42                                                           ` Andy Furniss
2005-03-25 20:10                                                         ` jamal
2005-03-25 20:18                                                           ` Patrick McHardy
2005-03-25 20:45                                                             ` jamal
2005-03-25 21:10                                                               ` Patrick McHardy
2005-03-25 21:57                                                                 ` jamal
2005-03-25 20:20                                                           ` Thomas Graf
2005-03-25 20:48                                                             ` jamal
2005-03-25 21:01                                                               ` Thomas Graf
2005-03-25 21:48                                                                 ` jamal
2005-03-25 22:03                                                                   ` Thomas Graf
2005-03-25 22:20                                                                     ` jamal
2005-03-25 20:39                                                           ` Patrick McHardy
2005-03-25 20:55                                                             ` jamal
2005-03-25 21:00                                                               ` Patrick McHardy
2005-03-25 21:44                                                                 ` jamal
2005-03-25 21:18                                                           ` Andy Furniss
2005-03-25 22:12                                                             ` IMQ again WAS(Re: " jamal
2005-03-25 23:26                                                               ` Andy Furniss
2005-03-27 19:35                                                               ` Andy Furniss
2005-03-28 13:39                                                                 ` Andy Furniss
2005-03-28 13:45                                                                   ` jamal
2005-03-28 13:55                                                                     ` Andy Furniss
2005-03-28 14:08                                                                       ` jamal
2005-03-28 13:57                                                                     ` jamal
2005-03-28 14:12                                                                       ` Andy Furniss
2005-03-28 14:20                                                                         ` jamal
2005-03-28 14:28                                                                           ` Andy Furniss
2005-03-28 14:36                                                                             ` Andy Furniss
2005-03-28 15:24                                                                               ` Andy Furniss
2005-03-28 19:27                                                                                 ` jamal
2005-03-28 20:13                                                                                   ` Andy Furniss
2005-03-23  1:31                   ` Patrick McHardy
2005-03-23  4:01                     ` jamal
