All of lore.kernel.org
 help / color / mirror / Atom feed
* tun: Use netif_receive_skb instead of netif_rx
@ 2010-05-19  7:57 Herbert Xu
  2010-05-19  8:09 ` Eric Dumazet
  0 siblings, 1 reply; 55+ messages in thread
From: Herbert Xu @ 2010-05-19  7:57 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Thomas Graf, Neil Horman, netdev

Hi:

tun: Use netif_receive_skb instead of netif_rx

First a bit of history as I recall, Dave can correct me where
he recalls differently :)

1) There was netif_rx and everyone had to use that.
2) Everyone had to use that, including drivers/net/tun.c.
3) NAPI brings us netif_receive_skb.
4) About the same time people noticed that tun.c can cause wild
   fluctuations in latency because of its use of netif_rx with IRQs
   enabled.
5) netif_rx_ni was added to address this.

However, netif_rx_ni was really a bit of a roundabout way of
injecting a packet if you think about it.  What ends up happening
is that we always queue the packet into the backlog, and then
immediately process it.  Which is what would happen if we simply
called netif_receive_skb directly.

So this patch just does the obvious thing and makes tun.c call
netif_receive_skb, albeit through the netif_receive_skb_ni wrapper
which does the necessary things for calling it in process context.

Now apart from potential performance gains from eliminating
unnecessary steps in the process, this has the benefit of keeping
the process context for the packet processing.  This is needed
by cgroups to shape network traffic based on the original process.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 4326520..0eed49f 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -667,7 +667,7 @@ static __inline__ ssize_t tun_get_user(struct tun_struct *tun,
 		skb_shinfo(skb)->gso_segs = 0;
 	}
 
-	netif_rx_ni(skb);
+	netif_receive_skb_ni(skb);
 
 	tun->dev->stats.rx_packets++;
 	tun->dev->stats.rx_bytes += len;
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index fa8b476..34bb405 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1562,6 +1562,18 @@ extern int		netif_rx(struct sk_buff *skb);
 extern int		netif_rx_ni(struct sk_buff *skb);
 #define HAVE_NETIF_RECEIVE_SKB 1
 extern int		netif_receive_skb(struct sk_buff *skb);
+
+static inline int netif_receive_skb_ni(struct sk_buff *skb)
+{
+	int err;
+
+	local_bh_disable();
+	err = netif_receive_skb(skb);
+	local_bh_enable();
+
+	return err;
+}
+
 extern gro_result_t	dev_gro_receive(struct napi_struct *napi,
 					struct sk_buff *skb);
 extern gro_result_t	napi_skb_finish(gro_result_t ret, struct sk_buff *skb);

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: tun: Use netif_receive_skb instead of netif_rx
  2010-05-19  7:57 tun: Use netif_receive_skb instead of netif_rx Herbert Xu
@ 2010-05-19  8:09 ` Eric Dumazet
  2010-05-19  8:18   ` Eric Dumazet
  2010-05-19  8:20   ` Herbert Xu
  0 siblings, 2 replies; 55+ messages in thread
From: Eric Dumazet @ 2010-05-19  8:09 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David S. Miller, Thomas Graf, Neil Horman, netdev

Le mercredi 19 mai 2010 à 17:57 +1000, Herbert Xu a écrit :
> Hi:
> 
> tun: Use netif_receive_skb instead of netif_rx
> 
> First a bit of history as I recall, Dave can correct me where
> he recalls differently :)
> 
> 1) There was netif_rx and everyone had to use that.
> 2) Everyone had to use that, including drivers/net/tun.c.
> 3) NAPI brings us netif_receive_skb.
> 4) About the same time people noticed that tun.c can cause wild
>    fluctuations in latency because of its use of netif_rx with IRQs
>    enabled.
> 5) netif_rx_ni was added to address this.
> 

6) netif_rx() pro is that packet processing is done while stack usage is
guaranteed to be low (from process_backlog, using a special softirq
stack, instead of current stack)

After your patch, tun will use more stack. Is it safe on all contexts ?

Another concern I have is about RPS.

netif_receive_skb() must be called from process_backlog() context, or
there is no guarantee the IPI will be sent if this skb is enqueued for
another cpu.

> However, netif_rx_ni
>  was really a bit of a roundabout way of
> injecting a packet if you think about it.  What ends up happening
> is that we always queue the packet into the backlog, and then
> immediately process it.  Which is what would happen if we simply
> called netif_receive_skb directly.
> 
> So this patch just does the obvious thing and makes tun.c call
> netif_receive_skb, albeit through the netif_receive_skb_ni wrapper
> which does the necessary things for calling it in process context.
> 
> Now apart from potential performance gains from eliminating
> unnecessary steps in the process, this has the benefit of keeping
> the process context for the packet processing.  This is needed
> by cgroups to shape network traffic based on the original process.
> 
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
> 
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 4326520..0eed49f 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -667,7 +667,7 @@ static __inline__ ssize_t tun_get_user(struct tun_struct *tun,
>  		skb_shinfo(skb)->gso_segs = 0;
>  	}
>  
> -	netif_rx_ni(skb);
> +	netif_receive_skb_ni(skb);
>  
>  	tun->dev->stats.rx_packets++;
>  	tun->dev->stats.rx_bytes += len;
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index fa8b476..34bb405 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1562,6 +1562,18 @@ extern int		netif_rx(struct sk_buff *skb);
>  extern int		netif_rx_ni(struct sk_buff *skb);
>  #define HAVE_NETIF_RECEIVE_SKB 1
>  extern int		netif_receive_skb(struct sk_buff *skb);
> +
> +static inline int netif_receive_skb_ni(struct sk_buff *skb)
> +{
> +	int err;
> +
> +	local_bh_disable();
> +	err = netif_receive_skb(skb);
> +	local_bh_enable();
> +
> +	return err;
> +}
> +
>  extern gro_result_t	dev_gro_receive(struct napi_struct *napi,
>  					struct sk_buff *skb);
>  extern gro_result_t	napi_skb_finish(gro_result_t ret, struct sk_buff *skb);
> 
> Cheers,



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: tun: Use netif_receive_skb instead of netif_rx
  2010-05-19  8:09 ` Eric Dumazet
@ 2010-05-19  8:18   ` Eric Dumazet
  2010-05-19  8:21     ` Herbert Xu
  2010-05-19 12:05     ` Neil Horman
  2010-05-19  8:20   ` Herbert Xu
  1 sibling, 2 replies; 55+ messages in thread
From: Eric Dumazet @ 2010-05-19  8:18 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David S. Miller, Thomas Graf, Neil Horman, netdev

Le mercredi 19 mai 2010 à 10:09 +0200, Eric Dumazet a écrit :

> Another concern I have is about RPS.
> 
> netif_receive_skb() must be called from process_backlog() context, or
> there is no guarantee the IPI will be sent if this skb is enqueued for
> another cpu.

Hmm, I just checked again, and this is wrong.

In case we enqueue skb on a remote cpu backlog, we also
do __raise_softirq_irqoff(NET_RX_SOFTIRQ); so the IPI will be done
later.




^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: tun: Use netif_receive_skb instead of netif_rx
  2010-05-19  8:09 ` Eric Dumazet
  2010-05-19  8:18   ` Eric Dumazet
@ 2010-05-19  8:20   ` Herbert Xu
  2010-05-19  8:27     ` Eric Dumazet
  2010-05-19 20:14     ` David Miller
  1 sibling, 2 replies; 55+ messages in thread
From: Herbert Xu @ 2010-05-19  8:20 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David S. Miller, Thomas Graf, Neil Horman, netdev

On Wed, May 19, 2010 at 10:09:42AM +0200, Eric Dumazet wrote:
> 
> 6) netif_rx() pro is that packet processing is done while stack usage is
> guaranteed to be low (from process_backlog, using a special softirq
> stack, instead of current stack)
> 
> After your patch, tun will use more stack. Is it safe on all contexts ?

Dave also raised this but I believe nothing changes with regards
to the stack.  We currently call do_softirq which does not switch
stacks.

Only a real interrupt would switch stacks.

> Another concern I have is about RPS.
> 
> netif_receive_skb() must be called from process_backlog() context, or
> there is no guarantee the IPI will be sent if this skb is enqueued for
> another cpu.

Can you explain this a bit more?

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: tun: Use netif_receive_skb instead of netif_rx
  2010-05-19  8:18   ` Eric Dumazet
@ 2010-05-19  8:21     ` Herbert Xu
  2010-05-19 12:05     ` Neil Horman
  1 sibling, 0 replies; 55+ messages in thread
From: Herbert Xu @ 2010-05-19  8:21 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David S. Miller, Thomas Graf, Neil Horman, netdev

On Wed, May 19, 2010 at 10:18:09AM +0200, Eric Dumazet wrote:
> Le mercredi 19 mai 2010 à 10:09 +0200, Eric Dumazet a écrit :
> 
> > Another concern I have is about RPS.
> > 
> > netif_receive_skb() must be called from process_backlog() context, or
> > there is no guarantee the IPI will be sent if this skb is enqueued for
> > another cpu.
> 
> Hmm, I just checked again, and this is wrong.
> 
> In case we enqueue skb on a remote cpu backlog, we also
> do __raise_softirq_irqoff(NET_RX_SOFTIRQ); so the IPI will be done
> later.

OK your concern is only with the stack usage, right?

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: tun: Use netif_receive_skb instead of netif_rx
  2010-05-19  8:20   ` Herbert Xu
@ 2010-05-19  8:27     ` Eric Dumazet
  2010-05-19  8:44       ` Herbert Xu
  2010-05-19 20:14     ` David Miller
  1 sibling, 1 reply; 55+ messages in thread
From: Eric Dumazet @ 2010-05-19  8:27 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David S. Miller, Thomas Graf, Neil Horman, netdev

Le mercredi 19 mai 2010 à 18:20 +1000, Herbert Xu a écrit :
> On Wed, May 19, 2010 at 10:09:42AM +0200, Eric Dumazet wrote:
> > 
> > 6) netif_rx() pro is that packet processing is done while stack usage is
> > guaranteed to be low (from process_backlog, using a special softirq
> > stack, instead of current stack)
> > 
> > After your patch, tun will use more stack. Is it safe on all contexts ?
> 
> Dave also raised this but I believe nothing changes with regards
> to the stack.  We currently call do_softirq which does not switch
> stacks.
> 
> Only a real interrupt would switch stacks.

This is a bit wrong, at least here (CONFIG_4KSTACKS=y)

Some people still use 32bits these days ;)

Please check arch/x86/kernel/irq_32.c

asmlinkage void do_softirq(void)
{
        unsigned long flags;
        struct thread_info *curctx;
        union irq_ctx *irqctx;
        u32 *isp;

        if (in_interrupt())
                return;

        local_irq_save(flags);

        if (local_softirq_pending()) {
                curctx = current_thread_info();
                irqctx = __get_cpu_var(softirq_ctx);
                irqctx->tinfo.task = curctx->task;
                irqctx->tinfo.previous_esp = current_stack_pointer;

                /* build the stack frame on the softirq stack */
                isp = (u32 *) ((char *)irqctx + sizeof(*irqctx));

                call_on_stack(__do_softirq, isp);
                /*
                 * Shouldnt happen, we returned above if in_interrupt():
                 */
                WARN_ON_ONCE(softirq_count());
        }

        local_irq_restore(flags);
}




^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: tun: Use netif_receive_skb instead of netif_rx
  2010-05-19  8:27     ` Eric Dumazet
@ 2010-05-19  8:44       ` Herbert Xu
  0 siblings, 0 replies; 55+ messages in thread
From: Herbert Xu @ 2010-05-19  8:44 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David S. Miller, Thomas Graf, Neil Horman, netdev

On Wed, May 19, 2010 at 10:27:17AM +0200, Eric Dumazet wrote:
> 
> This is a bit wrong, at least here (CONFIG_4KSTACKS=y)
> 
> Some people still use 32bits these days ;)
> 
> Please check arch/x86/kernel/irq_32.c

Right, I was looking at the generic version.

Still, as I'm only changing tun.c where we know the call comes
from a syscall, I don't think the stack is a great issue.

Besides, for TX we already perform everything from the same stack
depth and the TX path isn't that much shallower compared to the
RX path.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: tun: Use netif_receive_skb instead of netif_rx
  2010-05-19  8:18   ` Eric Dumazet
  2010-05-19  8:21     ` Herbert Xu
@ 2010-05-19 12:05     ` Neil Horman
  2010-05-19 12:55       ` Neil Horman
  2010-05-19 14:10       ` tun: Use netif_receive_skb instead of netif_rx Eric Dumazet
  1 sibling, 2 replies; 55+ messages in thread
From: Neil Horman @ 2010-05-19 12:05 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Herbert Xu, David S. Miller, Thomas Graf, netdev

On Wed, May 19, 2010 at 10:18:09AM +0200, Eric Dumazet wrote:
> Le mercredi 19 mai 2010 à 10:09 +0200, Eric Dumazet a écrit :
> 
> > Another concern I have is about RPS.
> > 
> > netif_receive_skb() must be called from process_backlog() context, or
> > there is no guarantee the IPI will be sent if this skb is enqueued for
> > another cpu.
> 
> Hmm, I just checked again, and this is wrong.
> 
> In case we enqueue skb on a remote cpu backlog, we also
> do __raise_softirq_irqoff(NET_RX_SOFTIRQ); so the IPI will be done
> later.
> 
But if this happens, then we loose the connection between the packet being
received and the process doing the reception, so the network cgroup classifier
breaks again.

Performance gains are still a big advantage here of course.
Neil


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: tun: Use netif_receive_skb instead of netif_rx
  2010-05-19 12:05     ` Neil Horman
@ 2010-05-19 12:55       ` Neil Horman
  2010-05-19 18:00         ` Neil Horman
  2010-05-19 14:10       ` tun: Use netif_receive_skb instead of netif_rx Eric Dumazet
  1 sibling, 1 reply; 55+ messages in thread
From: Neil Horman @ 2010-05-19 12:55 UTC (permalink / raw)
  To: Neil Horman
  Cc: Eric Dumazet, Herbert Xu, David S. Miller, Thomas Graf, netdev

On Wed, May 19, 2010 at 08:05:47AM -0400, Neil Horman wrote:
> On Wed, May 19, 2010 at 10:18:09AM +0200, Eric Dumazet wrote:
> > Le mercredi 19 mai 2010 à 10:09 +0200, Eric Dumazet a écrit :
> > 
> > > Another concern I have is about RPS.
> > > 
> > > netif_receive_skb() must be called from process_backlog() context, or
> > > there is no guarantee the IPI will be sent if this skb is enqueued for
> > > another cpu.
> > 
> > Hmm, I just checked again, and this is wrong.
> > 
> > In case we enqueue skb on a remote cpu backlog, we also
> > do __raise_softirq_irqoff(NET_RX_SOFTIRQ); so the IPI will be done
> > later.
> > 
> But if this happens, then we loose the connection between the packet being
> received and the process doing the reception, so the network cgroup classifier
> breaks again.
> 
> Performance gains are still a big advantage here of course.
> Neil
> 
Scratch what I said here, Herbert corrected me on this, and we're ok, as tun has
no rps map.

I'll test this patch out in just a bit
Neil

> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: tun: Use netif_receive_skb instead of netif_rx
  2010-05-19 12:05     ` Neil Horman
  2010-05-19 12:55       ` Neil Horman
@ 2010-05-19 14:10       ` Eric Dumazet
  2010-05-19 14:31         ` Neil Horman
  1 sibling, 1 reply; 55+ messages in thread
From: Eric Dumazet @ 2010-05-19 14:10 UTC (permalink / raw)
  To: Neil Horman; +Cc: Herbert Xu, David S. Miller, Thomas Graf, netdev

Le mercredi 19 mai 2010 à 08:05 -0400, Neil Horman a écrit :
> On Wed, May 19, 2010 at 10:18:09AM +0200, Eric Dumazet wrote:
> > Le mercredi 19 mai 2010 à 10:09 +0200, Eric Dumazet a écrit :
> > 
> > > Another concern I have is about RPS.
> > > 
> > > netif_receive_skb() must be called from process_backlog() context, or
> > > there is no guarantee the IPI will be sent if this skb is enqueued for
> > > another cpu.
> > 
> > Hmm, I just checked again, and this is wrong.
> > 
> > In case we enqueue skb on a remote cpu backlog, we also
> > do __raise_softirq_irqoff(NET_RX_SOFTIRQ); so the IPI will be done
> > later.
> > 
> But if this happens, then we loose the connection between the packet being
> received and the process doing the reception, so the network cgroup classifier
> breaks again.
> 
> Performance gains are still a big advantage here of course.

RPS is enabled on a per device (or more precisely per subqueue) basis, and disabled
by default, so if cgroup classifier is needed, it should work as is.

Speaking of net/sched/cls_cgroup.c, I am contemplating following
sequence :

rcu_read_lock();
classid = task_cls_state(current)->classid;
rcu_read_unlock();

RCU is definitly so special (should I say magic ?), that we use it even
if not needed. It makes us happy...





^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: tun: Use netif_receive_skb instead of netif_rx
  2010-05-19 14:10       ` tun: Use netif_receive_skb instead of netif_rx Eric Dumazet
@ 2010-05-19 14:31         ` Neil Horman
  0 siblings, 0 replies; 55+ messages in thread
From: Neil Horman @ 2010-05-19 14:31 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Neil Horman, Herbert Xu, David S. Miller, Thomas Graf, netdev

On Wed, May 19, 2010 at 04:10:29PM +0200, Eric Dumazet wrote:
> Le mercredi 19 mai 2010 à 08:05 -0400, Neil Horman a écrit :
> > On Wed, May 19, 2010 at 10:18:09AM +0200, Eric Dumazet wrote:
> > > Le mercredi 19 mai 2010 à 10:09 +0200, Eric Dumazet a écrit :
> > > 
> > > > Another concern I have is about RPS.
> > > > 
> > > > netif_receive_skb() must be called from process_backlog() context, or
> > > > there is no guarantee the IPI will be sent if this skb is enqueued for
> > > > another cpu.
> > > 
> > > Hmm, I just checked again, and this is wrong.
> > > 
> > > In case we enqueue skb on a remote cpu backlog, we also
> > > do __raise_softirq_irqoff(NET_RX_SOFTIRQ); so the IPI will be done
> > > later.
> > > 
> > But if this happens, then we loose the connection between the packet being
> > received and the process doing the reception, so the network cgroup classifier
> > breaks again.
> > 
> > Performance gains are still a big advantage here of course.
> 
> RPS is enabled on a per device (or more precisely per subqueue) basis, and disabled
> by default, so if cgroup classifier is needed, it should work as is.
> 
> Speaking of net/sched/cls_cgroup.c, I am contemplating following
> sequence :
> 
> rcu_read_lock();
> classid = task_cls_state(current)->classid;
> rcu_read_unlock();
> 
Yeah, I'd noticed there was no write side to that, but hadn't quite gotten to
investigating  that further :)

> RCU is definitly so special (should I say magic ?), that we use it even
> if not needed. It makes us happy...
> 
> 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: tun: Use netif_receive_skb instead of netif_rx
  2010-05-19 12:55       ` Neil Horman
@ 2010-05-19 18:00         ` Neil Horman
  2010-05-19 20:24           ` Neil Horman
  2010-05-19 20:49           ` Thomas Graf
  0 siblings, 2 replies; 55+ messages in thread
From: Neil Horman @ 2010-05-19 18:00 UTC (permalink / raw)
  To: Neil Horman
  Cc: Eric Dumazet, Herbert Xu, David S. Miller, Thomas Graf, netdev

On Wed, May 19, 2010 at 08:55:43AM -0400, Neil Horman wrote:
> On Wed, May 19, 2010 at 08:05:47AM -0400, Neil Horman wrote:
> > On Wed, May 19, 2010 at 10:18:09AM +0200, Eric Dumazet wrote:
> > > Le mercredi 19 mai 2010 à 10:09 +0200, Eric Dumazet a écrit :
> > > 
> > > > Another concern I have is about RPS.
> > > > 
> > > > netif_receive_skb() must be called from process_backlog() context, or
> > > > there is no guarantee the IPI will be sent if this skb is enqueued for
> > > > another cpu.
> > > 
> > > Hmm, I just checked again, and this is wrong.
> > > 
> > > In case we enqueue skb on a remote cpu backlog, we also
> > > do __raise_softirq_irqoff(NET_RX_SOFTIRQ); so the IPI will be done
> > > later.
> > > 
> > But if this happens, then we loose the connection between the packet being
> > received and the process doing the reception, so the network cgroup classifier
> > breaks again.
> > 
> > Performance gains are still a big advantage here of course.
> > Neil
> > 
> Scratch what I said here, Herbert corrected me on this, and we're ok, as tun has
> no rps map.
> 
> I'll test this patch out in just a bit
> Neil
> 

I'm currently testing this, unfortunately, and its not breaking anything, but it
doesn't allow cgroups to classify frames comming from tun interfaces.  I'm still
investigating, but I think the issue is that, because we call local_bh_disable
with this patch, we wind up raising the count at SOFTIRQ_OFFSET in preempt_count
for the task.  Since the cgroup classifier has this check:

if (softirq_count() != SOFTIRQ_OFFSET))
	return -1;

We still fail to classify the frame.  the cgroup classifier is assuming that any
frame arriving with a softirq count of 1 means we came directly from the
dev_queue_xmit routine and is safe to check current().  Any less than that, and
something is wrong (as we at least need the local_bh_disable in dev_queue_xmit),
and any more implies that we have nested calls to local_bh_disable, meaning
we're really handling a softirq context.

Neil

> > --
> > To unsubscribe from this list: send the line "unsubscribe netdev" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: tun: Use netif_receive_skb instead of netif_rx
  2010-05-19  8:20   ` Herbert Xu
  2010-05-19  8:27     ` Eric Dumazet
@ 2010-05-19 20:14     ` David Miller
  1 sibling, 0 replies; 55+ messages in thread
From: David Miller @ 2010-05-19 20:14 UTC (permalink / raw)
  To: herbert; +Cc: eric.dumazet, tgraf, nhorman, netdev

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 19 May 2010 18:20:47 +1000

> On Wed, May 19, 2010 at 10:09:42AM +0200, Eric Dumazet wrote:
>> 
>> 6) netif_rx() pro is that packet processing is done while stack usage is
>> guaranteed to be low (from process_backlog, using a special softirq
>> stack, instead of current stack)
>> 
>> After your patch, tun will use more stack. Is it safe on all contexts ?
> 
> Dave also raised this but I believe nothing changes with regards
> to the stack.  We currently call do_softirq which does not switch
> stacks.

do_softirq() _does_ switch stacks, it's a per-arch function that
does the stack switch and calls __do_softirq() on the softirq
stack.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: tun: Use netif_receive_skb instead of netif_rx
  2010-05-19 18:00         ` Neil Horman
@ 2010-05-19 20:24           ` Neil Horman
  2010-05-19 20:49           ` Thomas Graf
  1 sibling, 0 replies; 55+ messages in thread
From: Neil Horman @ 2010-05-19 20:24 UTC (permalink / raw)
  To: Neil Horman
  Cc: Eric Dumazet, Herbert Xu, David S. Miller, Thomas Graf, netdev

On Wed, May 19, 2010 at 02:00:53PM -0400, Neil Horman wrote:
> On Wed, May 19, 2010 at 08:55:43AM -0400, Neil Horman wrote:
> > On Wed, May 19, 2010 at 08:05:47AM -0400, Neil Horman wrote:
> > > On Wed, May 19, 2010 at 10:18:09AM +0200, Eric Dumazet wrote:
> > > > Le mercredi 19 mai 2010 à 10:09 +0200, Eric Dumazet a écrit :
> > > > 
> > > > > Another concern I have is about RPS.
> > > > > 
> > > > > netif_receive_skb() must be called from process_backlog() context, or
> > > > > there is no guarantee the IPI will be sent if this skb is enqueued for
> > > > > another cpu.
> > > > 
> > > > Hmm, I just checked again, and this is wrong.
> > > > 
> > > > In case we enqueue skb on a remote cpu backlog, we also
> > > > do __raise_softirq_irqoff(NET_RX_SOFTIRQ); so the IPI will be done
> > > > later.
> > > > 
> > > But if this happens, then we loose the connection between the packet being
> > > received and the process doing the reception, so the network cgroup classifier
> > > breaks again.
> > > 
> > > Performance gains are still a big advantage here of course.
> > > Neil
> > > 
> > Scratch what I said here, Herbert corrected me on this, and we're ok, as tun has
> > no rps map.
> > 
> > I'll test this patch out in just a bit
> > Neil
> > 
> 
> I'm currently testing this, unfortunately, and its not breaking anything, but it
> doesn't allow cgroups to classify frames comming from tun interfaces.  I'm still
> investigating, but I think the issue is that, because we call local_bh_disable
> with this patch, we wind up raising the count at SOFTIRQ_OFFSET in preempt_count
> for the task.  Since the cgroup classifier has this check:
> 
> if (softirq_count() != SOFTIRQ_OFFSET))
> 	return -1;
> 
> We still fail to classify the frame.  the cgroup classifier is assuming that any
> frame arriving with a softirq count of 1 means we came directly from the
> dev_queue_xmit routine and is safe to check current().  Any less than that, and
> something is wrong (as we at least need the local_bh_disable in dev_queue_xmit),
> and any more implies that we have nested calls to local_bh_disable, meaning
> we're really handling a softirq context.
> 
> Neil
> 
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe netdev" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe netdev" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


Just out of curiosity, how unsavory would it be if we were to dedicate the upper
bit in the SOFTIRQ_BITS range to be an indicator of weather we were actually
executing softirqs?  As noted above, we're tripping over the ambiguity here
between running in softirq context and actually just having softirqs disabled.
Would it be against anyones sensibilities if we were to dedicate the upper bit
in the softirq_count range to disambiguate the two conitions (or use a separate
flag for that matter)?

Neil


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: tun: Use netif_receive_skb instead of netif_rx
  2010-05-19 18:00         ` Neil Horman
  2010-05-19 20:24           ` Neil Horman
@ 2010-05-19 20:49           ` Thomas Graf
  2010-05-19 21:00             ` Brian Bloniarz
  1 sibling, 1 reply; 55+ messages in thread
From: Thomas Graf @ 2010-05-19 20:49 UTC (permalink / raw)
  To: Neil Horman
  Cc: Neil Horman, Eric Dumazet, Herbert Xu, David S. Miller, netdev

On Wed, 2010-05-19 at 14:00 -0400, Neil Horman wrote: 
> I'm currently testing this, unfortunately, and its not breaking anything, but it
> doesn't allow cgroups to classify frames comming from tun interfaces.  I'm still
> investigating, but I think the issue is that, because we call local_bh_disable
> with this patch, we wind up raising the count at SOFTIRQ_OFFSET in preempt_count
> for the task.  Since the cgroup classifier has this check:
> 
> if (softirq_count() != SOFTIRQ_OFFSET))
> 	return -1;
> 
> We still fail to classify the frame.  the cgroup classifier is assuming that any
> frame arriving with a softirq count of 1 means we came directly from the
> dev_queue_xmit routine and is safe to check current().  Any less than that, and
> something is wrong (as we at least need the local_bh_disable in dev_queue_xmit),
> and any more implies that we have nested calls to local_bh_disable, meaning
> we're really handling a softirq context.

It is a hack but the only method to check for softirq context I found. I
would favor using a flag if there was one.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: tun: Use netif_receive_skb instead of netif_rx
  2010-05-19 20:49           ` Thomas Graf
@ 2010-05-19 21:00             ` Brian Bloniarz
  2010-05-20  2:55               ` David Miller
  0 siblings, 1 reply; 55+ messages in thread
From: Brian Bloniarz @ 2010-05-19 21:00 UTC (permalink / raw)
  To: tgraf
  Cc: Neil Horman, Neil Horman, Eric Dumazet, Herbert Xu,
	David S. Miller, netdev

On 05/19/2010 04:49 PM, Thomas Graf wrote:
> On Wed, 2010-05-19 at 14:00 -0400, Neil Horman wrote: 
>> I'm currently testing this, unfortunately, and its not breaking anything, but it
>> doesn't allow cgroups to classify frames comming from tun interfaces.  I'm still
>> investigating, but I think the issue is that, because we call local_bh_disable
>> with this patch, we wind up raising the count at SOFTIRQ_OFFSET in preempt_count
>> for the task.  Since the cgroup classifier has this check:
>>
>> if (softirq_count() != SOFTIRQ_OFFSET))
>> 	return -1;
>>
>> We still fail to classify the frame.  the cgroup classifier is assuming that any
>> frame arriving with a softirq count of 1 means we came directly from the
>> dev_queue_xmit routine and is safe to check current().  Any less than that, and
>> something is wrong (as we at least need the local_bh_disable in dev_queue_xmit),
>> and any more implies that we have nested calls to local_bh_disable, meaning
>> we're really handling a softirq context.
> 
> It is a hack but the only method to check for softirq context I found. I
> would favor using a flag if there was one.

Eric probably has some thoughts on this -- his scheduler-batching patch RFC
from last year needed the same bit of info:
http://patchwork.ozlabs.org/patch/24536/
(see the changes to trace_softirq_context).

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: tun: Use netif_receive_skb instead of netif_rx
  2010-05-19 21:00             ` Brian Bloniarz
@ 2010-05-20  2:55               ` David Miller
  2010-05-20  2:57                 ` Herbert Xu
  0 siblings, 1 reply; 55+ messages in thread
From: David Miller @ 2010-05-20  2:55 UTC (permalink / raw)
  To: bmb; +Cc: tgraf, nhorman, nhorman, eric.dumazet, herbert, netdev


Since it seems now inevitable to me that we'll need to store the
'classid' in struct sk_buff, I've prepared two patches to do that at
zero cost.

Someone please go with this.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: tun: Use netif_receive_skb instead of netif_rx
  2010-05-20  2:55               ` David Miller
@ 2010-05-20  2:57                 ` Herbert Xu
  2010-05-20  3:05                   ` David Miller
  0 siblings, 1 reply; 55+ messages in thread
From: Herbert Xu @ 2010-05-20  2:57 UTC (permalink / raw)
  To: David Miller; +Cc: bmb, tgraf, nhorman, nhorman, eric.dumazet, netdev

On Wed, May 19, 2010 at 07:55:33PM -0700, David Miller wrote:
> 
> Since it seems now inevitable to me that we'll need to store the
> 'classid' in struct sk_buff, I've prepared two patches to do that at
> zero cost.

Actually I'm currently working on a patch to store this in struct
sock.  The reason I prefer struct sock is because we can then
centralise the place where we set the classid rather than having
it spread out through every place that creates an skb.

I'll post it soon.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: tun: Use netif_receive_skb instead of netif_rx
  2010-05-20  2:57                 ` Herbert Xu
@ 2010-05-20  3:05                   ` David Miller
  2010-05-20  3:34                     ` Herbert Xu
  0 siblings, 1 reply; 55+ messages in thread
From: David Miller @ 2010-05-20  3:05 UTC (permalink / raw)
  To: herbert; +Cc: bmb, tgraf, nhorman, nhorman, eric.dumazet, netdev

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Thu, 20 May 2010 12:57:42 +1000

> On Wed, May 19, 2010 at 07:55:33PM -0700, David Miller wrote:
>> 
>> Since it seems now inevitable to me that we'll need to store the
>> 'classid' in struct sk_buff, I've prepared two patches to do that at
>> zero cost.
> 
> Actually I'm currently working on a patch to store this in struct
> sock.  The reason I prefer struct sock is because we can then
> centralise the place where we set the classid rather than having
> it spread out through every place that creates an skb.
> 
> I'll post it soon.

Ok, looking forward to it.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: tun: Use netif_receive_skb instead of netif_rx
  2010-05-20  3:05                   ` David Miller
@ 2010-05-20  3:34                     ` Herbert Xu
  2010-05-20  3:42                       ` Herbert Xu
                                         ` (3 more replies)
  0 siblings, 4 replies; 55+ messages in thread
From: Herbert Xu @ 2010-05-20  3:34 UTC (permalink / raw)
  To: David Miller; +Cc: bmb, tgraf, nhorman, nhorman, eric.dumazet, netdev

On Wed, May 19, 2010 at 08:05:22PM -0700, David Miller wrote:
> Ok, looking forward to it.

Here we go:

cls_cgroup: Store classid in struct sock

Up until now cls_cgroup has relied on fetching the classid out of
the current executing thread.  This runs into trouble when a packet
processing is delayed in which case it may execute out of another
thread's context.

Furthermore, even when a packet is not delayed we may fail to
classify it if soft IRQs have been disabled, because this scenario
is indistinguishable from one where a packet unrelated to the
current thread is processed by a real soft IRQ.

As we already have a concept of packet ownership for accounting
purposes in the skb->sk pointer, this is a natural place to store
the classid in a persistent manner.

This patch adds the cls_cgroup classid in struct sock, filling up
an existing hole on 64-bit :)

The value is set at socket creation time.  So all sockets created
via socket(2) automatically gains the ID of the thread creating it.
Now you may argue that this may not be the same as the thread that
is sending the packet.  However, we already have a precedence where
an fd is passed to a different thread, its security property is
inherited.  In this case, inheriting the classid of the thread
creating the socket is also the logical thing to do.

For sockets created on inbound connections through accept(2), we
inherit the classid of the original listening socket through
sk_clone, possibly preceding the actual accept(2) call.

In order to minimise risks, I have not made this the authoritative
classid.  For now it is only used as a backup when we execute
with soft IRQs disabled.  Once we're completely happy with its
semantics we can use it as the sole classid.

Footnote: I have rearranged the error path on cls_group module
creation.  If we didn't do this, then there is a window where
someone could create a tc rule using cls_group before the cgroup
subsystem has been registered.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 8f78073..63c5036 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -410,6 +410,12 @@ struct cgroup_scanner {
 	void *data;
 };
 
+struct cgroup_cls_state
+{
+	struct cgroup_subsys_state css;
+	u32 classid;
+};
+
 /*
  * Add a new file to the given cgroup directory. Should only be
  * called by subsystems from within a populate() method
@@ -515,6 +521,10 @@ struct cgroup_subsys {
 	struct module *module;
 };
 
+#ifndef CONFIG_NET_CLS_CGROUP
+extern int net_cls_subsys_id;
+#endif
+
 #define SUBSYS(_x) extern struct cgroup_subsys _x ## _subsys;
 #include <linux/cgroup_subsys.h>
 #undef SUBSYS
diff --git a/include/net/sock.h b/include/net/sock.h
index 1ad6435..361a5dc 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -307,7 +307,7 @@ struct sock {
 	void			*sk_security;
 #endif
 	__u32			sk_mark;
-	/* XXX 4 bytes hole on 64 bit */
+	u32			sk_classid;
 	void			(*sk_state_change)(struct sock *sk);
 	void			(*sk_data_ready)(struct sock *sk, int bytes);
 	void			(*sk_write_space)(struct sock *sk);
diff --git a/net/core/sock.c b/net/core/sock.c
index c5812bb..70bc529 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -110,6 +110,7 @@
 #include <linux/tcp.h>
 #include <linux/init.h>
 #include <linux/highmem.h>
+#include <linux/cgroup.h>
 
 #include <asm/uaccess.h>
 #include <asm/system.h>
@@ -217,6 +218,32 @@ __u32 sysctl_rmem_default __read_mostly = SK_RMEM_MAX;
 int sysctl_optmem_max __read_mostly = sizeof(unsigned long)*(2*UIO_MAXIOV+512);
 EXPORT_SYMBOL(sysctl_optmem_max);
 
+#ifdef CONFIG_NET_CLS_CGROUP
+static inline u32 task_cls_classid(struct task_struct *p)
+{
+	return container_of(task_subsys_state(p, net_cls_subsys_id),
+			    struct cgroup_cls_state, css).classid;
+}
+#else
+int net_cls_subsys_id = -1;
+EXPORT_SYMBOL_GPL(net_cls_subsys_id);
+
+static inline u32 task_cls_classid(struct task_struct *p)
+{
+	int id;
+	u32 classid;
+
+	rcu_read_lock();
+	id = rcu_dereference(net_cls_subsys_id);
+	if (id >= 0)
+		classid = container_of(task_subsys_state(p, id),
+				       struct cgroup_cls_state, css)->classid;
+	rcu_read_unlock();
+
+	return classid;
+}
+#endif
+
 static int sock_set_timeout(long *timeo_p, char __user *optval, int optlen)
 {
 	struct timeval tv;
@@ -1064,6 +1091,9 @@ struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
 		sock_lock_init(sk);
 		sock_net_set(sk, get_net(net));
 		atomic_set(&sk->sk_wmem_alloc, 1);
+
+		if (!in_interrupt())
+			sk->sk_classid = task_cls_classid(current);
 	}
 
 	return sk;
diff --git a/net/sched/cls_cgroup.c b/net/sched/cls_cgroup.c
index 2211803..32c1296 100644
--- a/net/sched/cls_cgroup.c
+++ b/net/sched/cls_cgroup.c
@@ -16,14 +16,10 @@
 #include <linux/errno.h>
 #include <linux/skbuff.h>
 #include <linux/cgroup.h>
+#include <linux/rcupdate.h>
 #include <net/rtnetlink.h>
 #include <net/pkt_cls.h>
-
-struct cgroup_cls_state
-{
-	struct cgroup_subsys_state css;
-	u32 classid;
-};
+#include <net/sock.h>
 
 static struct cgroup_subsys_state *cgrp_create(struct cgroup_subsys *ss,
 					       struct cgroup *cgrp);
@@ -112,6 +108,10 @@ static int cls_cgroup_classify(struct sk_buff *skb, struct tcf_proto *tp,
 	struct cls_cgroup_head *head = tp->root;
 	u32 classid;
 
+	rcu_read_lock();
+	classid = task_cls_state(current)->classid;
+	rcu_read_unlock();
+
 	/*
 	 * Due to the nature of the classifier it is required to ignore all
 	 * packets originating from softirq context as accessing `current'
@@ -122,12 +122,12 @@ static int cls_cgroup_classify(struct sk_buff *skb, struct tcf_proto *tp,
 	 * calls by looking at the number of nested bh disable calls because
 	 * softirqs always disables bh.
 	 */
-	if (softirq_count() != SOFTIRQ_OFFSET)
-		return -1;
-
-	rcu_read_lock();
-	classid = task_cls_state(current)->classid;
-	rcu_read_unlock();
+	if (softirq_count() != SOFTIRQ_OFFSET) {
+	 	/* If there is an sk_classid we'll use that. */
+		if (!skb->sk)
+			return -1;
+		classid = skb->sk->sk_classid;
+	}
 
 	if (!classid)
 		return -1;
@@ -289,18 +289,35 @@ static struct tcf_proto_ops cls_cgroup_ops __read_mostly = {
 
 static int __init init_cgroup_cls(void)
 {
-	int ret = register_tcf_proto_ops(&cls_cgroup_ops);
-	if (ret)
-		return ret;
+	int ret;
+
 	ret = cgroup_load_subsys(&net_cls_subsys);
 	if (ret)
-		unregister_tcf_proto_ops(&cls_cgroup_ops);
+		goto out;
+
+#ifndef CONFIG_NET_CLS_CGROUP
+	/* We can't use rcu_assign_pointer because this is an int. */
+	smp_wmb();
+	net_cls_subsys_id = net_cls_subsys.subsys_id;
+#endif
+
+	ret = register_tcf_proto_ops(&cls_cgroup_ops);
+	if (ret)
+		cgroup_unload_subsys(&net_cls_subsys);
+
+out:
 	return ret;
 }
 
 static void __exit exit_cgroup_cls(void)
 {
 	unregister_tcf_proto_ops(&cls_cgroup_ops);
+
+#ifndef CONFIG_NET_CLS_CGROUP
+	net_cls_subsys_id = -1;
+	synchronize_rcu();
+#endif
+
 	cgroup_unload_subsys(&net_cls_subsys);
 }
 

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: tun: Use netif_receive_skb instead of netif_rx
  2010-05-20  3:34                     ` Herbert Xu
@ 2010-05-20  3:42                       ` Herbert Xu
  2010-05-20  3:46                       ` David Miller
                                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 55+ messages in thread
From: Herbert Xu @ 2010-05-20  3:42 UTC (permalink / raw)
  To: David Miller; +Cc: bmb, tgraf, nhorman, nhorman, eric.dumazet, netdev

On Thu, May 20, 2010 at 01:34:46PM +1000, Herbert Xu wrote:
>
> cls_cgroup: Store classid in struct sock
> 
> Up until now cls_cgroup has relied on fetching the classid out of
> the current executing thread.  This runs into trouble when a packet
> processing is delayed in which case it may execute out of another
> thread's context.

Oh please add this bit that I forgot about:

This also addresses the case where TCP transmits a queued packet
in response to an inbound ACK.  Currently this isn't classified
at all as we would be in softirq context.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: tun: Use netif_receive_skb instead of netif_rx
  2010-05-20  3:34                     ` Herbert Xu
  2010-05-20  3:42                       ` Herbert Xu
@ 2010-05-20  3:46                       ` David Miller
  2010-05-20  4:54                       ` Eric Dumazet
  2010-05-20  5:15                       ` Eric Dumazet
  3 siblings, 0 replies; 55+ messages in thread
From: David Miller @ 2010-05-20  3:46 UTC (permalink / raw)
  To: herbert; +Cc: bmb, tgraf, nhorman, nhorman, eric.dumazet, netdev

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Thu, 20 May 2010 13:34:46 +1000

> cls_cgroup: Store classid in struct sock

This looks great to me.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: tun: Use netif_receive_skb instead of netif_rx
  2010-05-20  3:34                     ` Herbert Xu
  2010-05-20  3:42                       ` Herbert Xu
  2010-05-20  3:46                       ` David Miller
@ 2010-05-20  4:54                       ` Eric Dumazet
  2010-05-20  5:01                         ` Herbert Xu
  2010-05-20  5:15                       ` Eric Dumazet
  3 siblings, 1 reply; 55+ messages in thread
From: Eric Dumazet @ 2010-05-20  4:54 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David Miller, bmb, tgraf, nhorman, nhorman, netdev

Le jeudi 20 mai 2010 à 13:34 +1000, Herbert Xu a écrit :
> On Wed, May 19, 2010 at 08:05:22PM -0700, David Miller wrote:
> > Ok, looking forward to it.
> 
> Here we go:
> 
> cls_cgroup: Store classid in struct sock
> 
> Up until now cls_cgroup has relied on fetching the classid out of
> the current executing thread.  This runs into trouble when a packet
> processing is delayed in which case it may execute out of another
> thread's context.
> 
> Furthermore, even when a packet is not delayed we may fail to
> classify it if soft IRQs have been disabled, because this scenario
> is indistinguishable from one where a packet unrelated to the
> current thread is processed by a real soft IRQ.
> 
> As we already have a concept of packet ownership for accounting
> purposes in the skb->sk pointer, this is a natural place to store
> the classid in a persistent manner.
> 
> This patch adds the cls_cgroup classid in struct sock, filling up
> an existing hole on 64-bit :)
> 
> The value is set at socket creation time.  So all sockets created
> via socket(2) automatically gains the ID of the thread creating it.
> Now you may argue that this may not be the same as the thread that
> is sending the packet.  However, we already have a precedence where
> an fd is passed to a different thread, its security property is
> inherited.  In this case, inheriting the classid of the thread
> creating the socket is also the logical thing to do.
> 
> For sockets created on inbound connections through accept(2), we
> inherit the classid of the original listening socket through
> sk_clone, possibly preceding the actual accept(2) call.
> 
> In order to minimise risks, I have not made this the authoritative
> classid.  For now it is only used as a backup when we execute
> with soft IRQs disabled.  Once we're completely happy with its
> semantics we can use it as the sole classid.
> 
> Footnote: I have rearranged the error path on cls_group module
> creation.  If we didn't do this, then there is a window where
> someone could create a tc rule using cls_group before the cgroup
> subsystem has been registered.
> 
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
> 
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 8f78073..63c5036 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -410,6 +410,12 @@ struct cgroup_scanner {
>  	void *data;
>  };
>  
> +struct cgroup_cls_state
> +{
> +	struct cgroup_subsys_state css;
> +	u32 classid;
> +};
> +
>  /*
>   * Add a new file to the given cgroup directory. Should only be
>   * called by subsystems from within a populate() method
> @@ -515,6 +521,10 @@ struct cgroup_subsys {
>  	struct module *module;
>  };
>  
> +#ifndef CONFIG_NET_CLS_CGROUP
> +extern int net_cls_subsys_id;
> +#endif
> +
>  #define SUBSYS(_x) extern struct cgroup_subsys _x ## _subsys;
>  #include <linux/cgroup_subsys.h>
>  #undef SUBSYS
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 1ad6435..361a5dc 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -307,7 +307,7 @@ struct sock {
>  	void			*sk_security;
>  #endif
>  	__u32			sk_mark;
> -	/* XXX 4 bytes hole on 64 bit */
> +	u32			sk_classid;
>  	void			(*sk_state_change)(struct sock *sk);
>  	void			(*sk_data_ready)(struct sock *sk, int bytes);
>  	void			(*sk_write_space)(struct sock *sk);
> diff --git a/net/core/sock.c b/net/core/sock.c
> index c5812bb..70bc529 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -110,6 +110,7 @@
>  #include <linux/tcp.h>
>  #include <linux/init.h>
>  #include <linux/highmem.h>
> +#include <linux/cgroup.h>
>  
>  #include <asm/uaccess.h>
>  #include <asm/system.h>
> @@ -217,6 +218,32 @@ __u32 sysctl_rmem_default __read_mostly = SK_RMEM_MAX;
>  int sysctl_optmem_max __read_mostly = sizeof(unsigned long)*(2*UIO_MAXIOV+512);
>  EXPORT_SYMBOL(sysctl_optmem_max);
>  
> +#ifdef CONFIG_NET_CLS_CGROUP
> +static inline u32 task_cls_classid(struct task_struct *p)
> +{
> +	return container_of(task_subsys_state(p, net_cls_subsys_id),
> +			    struct cgroup_cls_state, css).classid;
> +}
> +#else
> +int net_cls_subsys_id = -1;
> +EXPORT_SYMBOL_GPL(net_cls_subsys_id);
> +
> +static inline u32 task_cls_classid(struct task_struct *p)
> +{
> +	int id;
> +	u32 classid;
> +
> +	rcu_read_lock();
> +	id = rcu_dereference(net_cls_subsys_id);
> +	if (id >= 0)
> +		classid = container_of(task_subsys_state(p, id),
> +				       struct cgroup_cls_state, css)->classid;
> +	rcu_read_unlock();
> +
> +	return classid;
> +}
> +#endif
> +
>  static int sock_set_timeout(long *timeo_p, char __user *optval, int optlen)
>  {
>  	struct timeval tv;
> @@ -1064,6 +1091,9 @@ struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
>  		sock_lock_init(sk);
>  		sock_net_set(sk, get_net(net));
>  		atomic_set(&sk->sk_wmem_alloc, 1);
> +
> +		if (!in_interrupt())
> +			sk->sk_classid = task_cls_classid(current);

It means we cache current classid of this process.

Can it change later ?


>  	}
>  
>  	return sk;
> diff --git a/net/sched/cls_cgroup.c b/net/sched/cls_cgroup.c
> index 2211803..32c1296 100644
> --- a/net/sched/cls_cgroup.c
> +++ b/net/sched/cls_cgroup.c
> @@ -16,14 +16,10 @@
>  #include <linux/errno.h>
>  #include <linux/skbuff.h>
>  #include <linux/cgroup.h>
> +#include <linux/rcupdate.h>
>  #include <net/rtnetlink.h>
>  #include <net/pkt_cls.h>
> -
> -struct cgroup_cls_state
> -{
> -	struct cgroup_subsys_state css;
> -	u32 classid;
> -};
> +#include <net/sock.h>
>  
>  static struct cgroup_subsys_state *cgrp_create(struct cgroup_subsys *ss,
>  					       struct cgroup *cgrp);
> @@ -112,6 +108,10 @@ static int cls_cgroup_classify(struct sk_buff *skb, struct tcf_proto *tp,
>  	struct cls_cgroup_head *head = tp->root;
>  	u32 classid;
>  
> +	rcu_read_lock();
> +	classid = task_cls_state(current)->classid;
> +	rcu_read_unlock();
> +

It would be more readable to use :

static inline int task_cls_state(struct task_struct *p)
{
	int classid;

	rcu_read_lock();
	classid = container_of(task_subsys_state(p, net_cls_subsys_id),
				struct cgroup_cls_state, css)->classid;
	rcu_read_unlock();
	return classid;
}

...

classid = task_cls_classid(current);

>  	/*
>  	 * Due to the nature of the classifier it is required to ignore all
>  	 * packets originating from softirq context as accessing `current'
> @@ -122,12 +122,12 @@ static int cls_cgroup_classify(struct sk_buff *skb, struct tcf_proto *tp,
>  	 * calls by looking at the number of nested bh disable calls because
>  	 * softirqs always disables bh.
>  	 */
> -	if (softirq_count() != SOFTIRQ_OFFSET)
> -		return -1;
> -
> -	rcu_read_lock();
> -	classid = task_cls_state(current)->classid;
> -	rcu_read_unlock();
> +	if (softirq_count() != SOFTIRQ_OFFSET) {

Why do we still need this previous test ?

> +	 	/* If there is an sk_classid we'll use that. */
> +		if (!skb->sk)
> +			return -1;
> +		classid = skb->sk->sk_classid;
> +	}
>  
>  	if (!classid)
>  		return -1;
> @@ -289,18 +289,35 @@ static struct tcf_proto_ops cls_cgroup_ops __read_mostly = {
>  
>  static int __init init_cgroup_cls(void)
>  {
> -	int ret = register_tcf_proto_ops(&cls_cgroup_ops);
> -	if (ret)
> -		return ret;
> +	int ret;
> +
>  	ret = cgroup_load_subsys(&net_cls_subsys);
>  	if (ret)
> -		unregister_tcf_proto_ops(&cls_cgroup_ops);
> +		goto out;
> +
> +#ifndef CONFIG_NET_CLS_CGROUP
> +	/* We can't use rcu_assign_pointer because this is an int. */
> +	smp_wmb();
> +	net_cls_subsys_id = net_cls_subsys.subsys_id;
> +#endif
> +
> +	ret = register_tcf_proto_ops(&cls_cgroup_ops);
> +	if (ret)
> +		cgroup_unload_subsys(&net_cls_subsys);
> +
> +out:
>  	return ret;
>  }
>  
>  static void __exit exit_cgroup_cls(void)
>  {
>  	unregister_tcf_proto_ops(&cls_cgroup_ops);
> +
> +#ifndef CONFIG_NET_CLS_CGROUP
> +	net_cls_subsys_id = -1;
> +	synchronize_rcu();
> +#endif
> +
>  	cgroup_unload_subsys(&net_cls_subsys);
>  }
>  
> 
> Thanks,



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: tun: Use netif_receive_skb instead of netif_rx
  2010-05-20  4:54                       ` Eric Dumazet
@ 2010-05-20  5:01                         ` Herbert Xu
  0 siblings, 0 replies; 55+ messages in thread
From: Herbert Xu @ 2010-05-20  5:01 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, bmb, tgraf, nhorman, nhorman, netdev

On Thu, May 20, 2010 at 06:54:15AM +0200, Eric Dumazet wrote:
>
> > @@ -1064,6 +1091,9 @@ struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
> >  		sock_lock_init(sk);
> >  		sock_net_set(sk, get_net(net));
> >  		atomic_set(&sk->sk_wmem_alloc, 1);
> > +
> > +		if (!in_interrupt())
> > +			sk->sk_classid = task_cls_classid(current);
> 
> It means we cache current classid of this process.
> 
> Can it change later ?

Please read my patchset description.  I explained everything there.

> > @@ -112,6 +108,10 @@ static int cls_cgroup_classify(struct sk_buff *skb, struct tcf_proto *tp,
> >  	struct cls_cgroup_head *head = tp->root;
> >  	u32 classid;
> >  
> > +	rcu_read_lock();
> > +	classid = task_cls_state(current)->classid;
> > +	rcu_read_unlock();
> > +
> 
> It would be more readable to use :

Sure, however, I'm just moving the existing code around, so I
wanted it to be exactly the same.  Clean-ups should be done as
a follow-up, feel free to do this if you like.

> > @@ -122,12 +122,12 @@ static int cls_cgroup_classify(struct sk_buff *skb, struct tcf_proto *tp,
> >  	 * calls by looking at the number of nested bh disable calls because
> >  	 * softirqs always disables bh.
> >  	 */
> > -	if (softirq_count() != SOFTIRQ_OFFSET)
> > -		return -1;
> > -
> > -	rcu_read_lock();
> > -	classid = task_cls_state(current)->classid;
> > -	rcu_read_unlock();
> > +	if (softirq_count() != SOFTIRQ_OFFSET) {
> 
> Why do we still need this previous test ?

You didn't read my patchset description at all, did you? :)

This patchset is trying to fix the issue at hand with the minimum
amount of changes to current behaviour.  Once we're happy with the
new semantics, we can get rid of the softirq stuff.

That will be done in a later patch to minimise the risks.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: tun: Use netif_receive_skb instead of netif_rx
  2010-05-20  3:34                     ` Herbert Xu
                                         ` (2 preceding siblings ...)
  2010-05-20  4:54                       ` Eric Dumazet
@ 2010-05-20  5:15                       ` Eric Dumazet
  2010-05-20  5:20                         ` Herbert Xu
  3 siblings, 1 reply; 55+ messages in thread
From: Eric Dumazet @ 2010-05-20  5:15 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David Miller, bmb, tgraf, nhorman, nhorman, netdev

Le jeudi 20 mai 2010 à 13:34 +1000, Herbert Xu a écrit :

> The value is set at socket creation time.  So all sockets created
> via socket(2) automatically gains the ID of the thread creating it.
> Now you may argue that this may not be the same as the thread that
> is sending the packet.  However, we already have a precedence where
> an fd is passed to a different thread, its security property is
> inherited.  In this case, inheriting the classid of the thread
> creating the socket is also the logical thing to do.

I find this very biased, sorry.

In fact, fd passing is just fine today, if we consider that we classify
packets using the identity of the process *using* the fd, not the one
that *created* it.

Now your patch changes this, to the reverse, and you justify the caching
effect on socket. Sorry, this must be too convoluted for me.





^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: tun: Use netif_receive_skb instead of netif_rx
  2010-05-20  5:15                       ` Eric Dumazet
@ 2010-05-20  5:20                         ` Herbert Xu
  2010-05-20  5:36                           ` Eric Dumazet
  0 siblings, 1 reply; 55+ messages in thread
From: Herbert Xu @ 2010-05-20  5:20 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, bmb, tgraf, nhorman, nhorman, netdev

On Thu, May 20, 2010 at 07:15:07AM +0200, Eric Dumazet wrote:
> 
> I find this very biased, sorry.
> 
> In fact, fd passing is just fine today, if we consider that we classify
> packets using the identity of the process *using* the fd, not the one
> that *created* it.
> 
> Now your patch changes this, to the reverse, and you justify the caching
> effect on socket. Sorry, this must be too convoluted for me.

I'm sorry you find this convoluted, but using the sending process's
classid is inherently broken.

Here is why: consider a TCP socket shared by two processes with
different classids both writing data to it.  Now suppose further
that each writes just one byte, which is then coalesced into a
single skb.

Whose classid should we use on the resulting skb?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: tun: Use netif_receive_skb instead of netif_rx
  2010-05-20  5:20                         ` Herbert Xu
@ 2010-05-20  5:36                           ` Eric Dumazet
  2010-05-20  5:46                             ` Herbert Xu
  0 siblings, 1 reply; 55+ messages in thread
From: Eric Dumazet @ 2010-05-20  5:36 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David Miller, bmb, tgraf, nhorman, nhorman, netdev

Le jeudi 20 mai 2010 à 15:20 +1000, Herbert Xu a écrit :
> On Thu, May 20, 2010 at 07:15:07AM +0200, Eric Dumazet wrote:
> > 
> > I find this very biased, sorry.
> > 
> > In fact, fd passing is just fine today, if we consider that we classify
> > packets using the identity of the process *using* the fd, not the one
> > that *created* it.
> > 
> > Now your patch changes this, to the reverse, and you justify the caching
> > effect on socket. Sorry, this must be too convoluted for me.
> 
> I'm sorry you find this convoluted, but using the sending process's
> classid is inherently broken.
> 
> Here is why: consider a TCP socket shared by two processes with
> different classids both writing data to it.  Now suppose further
> that each writes just one byte, which is then coalesced into a
> single skb.
> 
> Whose classid should we use on the resulting skb?

I am ok with any kind of clarification, if its really documented as
such, not as indirect effects of changes.

Right now, I am not sure classification is performed by the current
process/cpu. Our queue handling can process packets queued by other cpus
while we own the queue (__QDISC_STATE_RUNNING,) anyway.




^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: tun: Use netif_receive_skb instead of netif_rx
  2010-05-20  5:36                           ` Eric Dumazet
@ 2010-05-20  5:46                             ` Herbert Xu
  2010-05-20  6:03                               ` Eric Dumazet
                                                 ` (2 more replies)
  0 siblings, 3 replies; 55+ messages in thread
From: Herbert Xu @ 2010-05-20  5:46 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, bmb, tgraf, nhorman, nhorman, netdev

On Thu, May 20, 2010 at 07:36:19AM +0200, Eric Dumazet wrote:
> 
> I am ok with any kind of clarification, if its really documented as
> such, not as indirect effects of changes.

But I did document this semantic change, quite verbosely too
at that :)

> Right now, I am not sure classification is performed by the current
> process/cpu. Our queue handling can process packets queued by other cpus
> while we own the queue (__QDISC_STATE_RUNNING,) anyway.

Classification should be done when the packet is enqueued.  So
you shouldn't really see other people's traffic.

The original requeueing would have broken this too, but that is
no longer with us :)

In my original submission I forgot to mention the case of TCP
transmission triggered by an incoming ACK, which is really the
same case as this.

So let me take this opportunity to resubmit with an updated
description:

cls_cgroup: Store classid in struct sock

Up until now cls_cgroup has relied on fetching the classid out of
the current executing thread.  This runs into trouble when a packet
processing is delayed in which case it may execute out of another
thread's context.

Furthermore, even when a packet is not delayed we may fail to
classify it if soft IRQs have been disabled, because this scenario
is indistinguishable from one where a packet unrelated to the
current thread is processed by a real soft IRQ.

In fact, the current semantics is inherently broken, as a single
skb may be constructed out of the writes of two different tasks.
A different manifestation of this problem is when the TCP stack
transmits in response of an incoming ACK.  This is currently
unclassified.

As we already have a concept of packet ownership for accounting
purposes in the skb->sk pointer, this is a natural place to store
the classid in a persistent manner.

This patch adds the cls_cgroup classid in struct sock, filling up
an existing hole on 64-bit :)

The value is set at socket creation time.  So all sockets created
via socket(2) automatically gains the ID of the thread creating it.
Now you may argue that this may not be the same as the thread that
is sending the packet.  However, we already have a precedence where
an fd is passed to a different thread, its security property is
inherited.  In this case, inheriting the classid of the thread
creating the socket is also the logical thing to do.

For sockets created on inbound connections through accept(2), we
inherit the classid of the original listening socket through
sk_clone, possibly preceding the actual accept(2) call.

In order to minimise risks, I have not made this the authoritative
classid.  For now it is only used as a backup when we execute
with soft IRQs disabled.  Once we're completely happy with its
semantics we can use it as the sole classid.

Footnote: I have rearranged the error path on cls_group module
creation.  If we didn't do this, then there is a window where
someone could create a tc rule using cls_group before the cgroup
subsystem has been registered.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 8f78073..63c5036 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -410,6 +410,12 @@ struct cgroup_scanner {
 	void *data;
 };
 
+struct cgroup_cls_state
+{
+	struct cgroup_subsys_state css;
+	u32 classid;
+};
+
 /*
  * Add a new file to the given cgroup directory. Should only be
  * called by subsystems from within a populate() method
@@ -515,6 +521,10 @@ struct cgroup_subsys {
 	struct module *module;
 };
 
+#ifndef CONFIG_NET_CLS_CGROUP
+extern int net_cls_subsys_id;
+#endif
+
 #define SUBSYS(_x) extern struct cgroup_subsys _x ## _subsys;
 #include <linux/cgroup_subsys.h>
 #undef SUBSYS
diff --git a/include/net/sock.h b/include/net/sock.h
index 1ad6435..361a5dc 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -307,7 +307,7 @@ struct sock {
 	void			*sk_security;
 #endif
 	__u32			sk_mark;
-	/* XXX 4 bytes hole on 64 bit */
+	u32			sk_classid;
 	void			(*sk_state_change)(struct sock *sk);
 	void			(*sk_data_ready)(struct sock *sk, int bytes);
 	void			(*sk_write_space)(struct sock *sk);
diff --git a/net/core/sock.c b/net/core/sock.c
index c5812bb..70bc529 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -110,6 +110,7 @@
 #include <linux/tcp.h>
 #include <linux/init.h>
 #include <linux/highmem.h>
+#include <linux/cgroup.h>
 
 #include <asm/uaccess.h>
 #include <asm/system.h>
@@ -217,6 +218,32 @@ __u32 sysctl_rmem_default __read_mostly = SK_RMEM_MAX;
 int sysctl_optmem_max __read_mostly = sizeof(unsigned long)*(2*UIO_MAXIOV+512);
 EXPORT_SYMBOL(sysctl_optmem_max);
 
+#ifdef CONFIG_NET_CLS_CGROUP
+static inline u32 task_cls_classid(struct task_struct *p)
+{
+	return container_of(task_subsys_state(p, net_cls_subsys_id),
+			    struct cgroup_cls_state, css).classid;
+}
+#else
+int net_cls_subsys_id = -1;
+EXPORT_SYMBOL_GPL(net_cls_subsys_id);
+
+static inline u32 task_cls_classid(struct task_struct *p)
+{
+	int id;
+	u32 classid;
+
+	rcu_read_lock();
+	id = rcu_dereference(net_cls_subsys_id);
+	if (id >= 0)
+		classid = container_of(task_subsys_state(p, id),
+				       struct cgroup_cls_state, css)->classid;
+	rcu_read_unlock();
+
+	return classid;
+}
+#endif
+
 static int sock_set_timeout(long *timeo_p, char __user *optval, int optlen)
 {
 	struct timeval tv;
@@ -1064,6 +1091,9 @@ struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
 		sock_lock_init(sk);
 		sock_net_set(sk, get_net(net));
 		atomic_set(&sk->sk_wmem_alloc, 1);
+
+		if (!in_interrupt())
+			sk->sk_classid = task_cls_classid(current);
 	}
 
 	return sk;
diff --git a/net/sched/cls_cgroup.c b/net/sched/cls_cgroup.c
index 2211803..32c1296 100644
--- a/net/sched/cls_cgroup.c
+++ b/net/sched/cls_cgroup.c
@@ -16,14 +16,10 @@
 #include <linux/errno.h>
 #include <linux/skbuff.h>
 #include <linux/cgroup.h>
+#include <linux/rcupdate.h>
 #include <net/rtnetlink.h>
 #include <net/pkt_cls.h>
-
-struct cgroup_cls_state
-{
-	struct cgroup_subsys_state css;
-	u32 classid;
-};
+#include <net/sock.h>
 
 static struct cgroup_subsys_state *cgrp_create(struct cgroup_subsys *ss,
 					       struct cgroup *cgrp);
@@ -112,6 +108,10 @@ static int cls_cgroup_classify(struct sk_buff *skb, struct tcf_proto *tp,
 	struct cls_cgroup_head *head = tp->root;
 	u32 classid;
 
+	rcu_read_lock();
+	classid = task_cls_state(current)->classid;
+	rcu_read_unlock();
+
 	/*
 	 * Due to the nature of the classifier it is required to ignore all
 	 * packets originating from softirq context as accessing `current'
@@ -122,12 +122,12 @@ static int cls_cgroup_classify(struct sk_buff *skb, struct tcf_proto *tp,
 	 * calls by looking at the number of nested bh disable calls because
 	 * softirqs always disables bh.
 	 */
-	if (softirq_count() != SOFTIRQ_OFFSET)
-		return -1;
-
-	rcu_read_lock();
-	classid = task_cls_state(current)->classid;
-	rcu_read_unlock();
+	if (softirq_count() != SOFTIRQ_OFFSET) {
+	 	/* If there is an sk_classid we'll use that. */
+		if (!skb->sk)
+			return -1;
+		classid = skb->sk->sk_classid;
+	}
 
 	if (!classid)
 		return -1;
@@ -289,18 +289,35 @@ static struct tcf_proto_ops cls_cgroup_ops __read_mostly = {
 
 static int __init init_cgroup_cls(void)
 {
-	int ret = register_tcf_proto_ops(&cls_cgroup_ops);
-	if (ret)
-		return ret;
+	int ret;
+
 	ret = cgroup_load_subsys(&net_cls_subsys);
 	if (ret)
-		unregister_tcf_proto_ops(&cls_cgroup_ops);
+		goto out;
+
+#ifndef CONFIG_NET_CLS_CGROUP
+	/* We can't use rcu_assign_pointer because this is an int. */
+	smp_wmb();
+	net_cls_subsys_id = net_cls_subsys.subsys_id;
+#endif
+
+	ret = register_tcf_proto_ops(&cls_cgroup_ops);
+	if (ret)
+		cgroup_unload_subsys(&net_cls_subsys);
+
+out:
 	return ret;
 }
 
 static void __exit exit_cgroup_cls(void)
 {
 	unregister_tcf_proto_ops(&cls_cgroup_ops);
+
+#ifndef CONFIG_NET_CLS_CGROUP
+	net_cls_subsys_id = -1;
+	synchronize_rcu();
+#endif
+
 	cgroup_unload_subsys(&net_cls_subsys);
 }
 
Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: tun: Use netif_receive_skb instead of netif_rx
  2010-05-20  5:46                             ` Herbert Xu
@ 2010-05-20  6:03                               ` Eric Dumazet
  2010-05-20  6:11                                 ` Herbert Xu
  2010-05-20  6:52                               ` Herbert Xu
  2010-05-20 17:29                               ` Neil Horman
  2 siblings, 1 reply; 55+ messages in thread
From: Eric Dumazet @ 2010-05-20  6:03 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David Miller, bmb, tgraf, nhorman, nhorman, netdev

Le jeudi 20 mai 2010 à 15:46 +1000, Herbert Xu a écrit :

> +#ifdef CONFIG_NET_CLS_CGROUP
> +static inline u32 task_cls_classid(struct task_struct *p)
> +{
> +	return container_of(task_subsys_state(p, net_cls_subsys_id),
> +			    struct cgroup_cls_state, css).classid;
> +}
> +#else
> +int net_cls_subsys_id = -1;
> +EXPORT_SYMBOL_GPL(net_cls_subsys_id);
> +
> +static inline u32 task_cls_classid(struct task_struct *p)
> +{
> +	int id;
> +	u32 classid;
> +
> +	rcu_read_lock();
> +	id = rcu_dereference(net_cls_subsys_id);
> +	if (id >= 0)
> +		classid = container_of(task_subsys_state(p, id),
> +				       struct cgroup_cls_state, css)->classid;
> +	rcu_read_unlock();
> +
> +	return classid;
> +}
> +#endif

I still dont understand why you introduce this function and not reuse it
in net/sched/cls_cgroup.c

You told me it needs a separate patch, I dont think so.
Just make it non static and EXPORTed



> +	rcu_read_lock();
> +	classid = task_cls_state(current)->classid;
> +	rcu_read_unlock();
> +




^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: tun: Use netif_receive_skb instead of netif_rx
  2010-05-20  6:03                               ` Eric Dumazet
@ 2010-05-20  6:11                                 ` Herbert Xu
  2010-05-20  6:19                                   ` Herbert Xu
  0 siblings, 1 reply; 55+ messages in thread
From: Herbert Xu @ 2010-05-20  6:11 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, bmb, tgraf, nhorman, nhorman, netdev

On Thu, May 20, 2010 at 08:03:35AM +0200, Eric Dumazet wrote:
> Le jeudi 20 mai 2010 à 15:46 +1000, Herbert Xu a écrit :
> 
> > +#ifdef CONFIG_NET_CLS_CGROUP
> > +static inline u32 task_cls_classid(struct task_struct *p)
> > +{
> > +	return container_of(task_subsys_state(p, net_cls_subsys_id),
> > +			    struct cgroup_cls_state, css).classid;
> > +}
> > +#else
> > +int net_cls_subsys_id = -1;
> > +EXPORT_SYMBOL_GPL(net_cls_subsys_id);
> > +
> > +static inline u32 task_cls_classid(struct task_struct *p)
> > +{
> > +	int id;
> > +	u32 classid;
> > +
> > +	rcu_read_lock();
> > +	id = rcu_dereference(net_cls_subsys_id);
> > +	if (id >= 0)
> > +		classid = container_of(task_subsys_state(p, id),
> > +				       struct cgroup_cls_state, css)->classid;
> > +	rcu_read_unlock();
> > +
> > +	return classid;
> > +}
> > +#endif
> 
> I still dont understand why you introduce this function and not reuse it
> in net/sched/cls_cgroup.c
> 
> You told me it needs a separate patch, I dont think so.
> Just make it non static and EXPORTed

Because it doesn't have to go through this RCU crap.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: tun: Use netif_receive_skb instead of netif_rx
  2010-05-20  6:11                                 ` Herbert Xu
@ 2010-05-20  6:19                                   ` Herbert Xu
  0 siblings, 0 replies; 55+ messages in thread
From: Herbert Xu @ 2010-05-20  6:19 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, bmb, tgraf, nhorman, nhorman, netdev

On Thu, May 20, 2010 at 04:11:38PM +1000, Herbert Xu wrote:
> On Thu, May 20, 2010 at 08:03:35AM +0200, Eric Dumazet wrote:
> > Le jeudi 20 mai 2010 à 15:46 +1000, Herbert Xu a écrit :
> > 
> > > +#ifdef CONFIG_NET_CLS_CGROUP
> > > +static inline u32 task_cls_classid(struct task_struct *p)
> > > +{
> > > +	return container_of(task_subsys_state(p, net_cls_subsys_id),
> > > +			    struct cgroup_cls_state, css).classid;
> > > +}
> > > +#else
> > > +int net_cls_subsys_id = -1;
> > > +EXPORT_SYMBOL_GPL(net_cls_subsys_id);
> > > +
> > > +static inline u32 task_cls_classid(struct task_struct *p)
> > > +{
> > > +	int id;
> > > +	u32 classid;
> > > +
> > > +	rcu_read_lock();
> > > +	id = rcu_dereference(net_cls_subsys_id);
> > > +	if (id >= 0)
> > > +		classid = container_of(task_subsys_state(p, id),
> > > +				       struct cgroup_cls_state, css)->classid;
> > > +	rcu_read_unlock();
> > > +
> > > +	return classid;
> > > +}
> > > +#endif
> > 
> > I still dont understand why you introduce this function and not reuse it
> > in net/sched/cls_cgroup.c
> > 
> > You told me it needs a separate patch, I dont think so.
> > Just make it non static and EXPORTed
> 
> Because it doesn't have to go through this RCU crap.

I'm talking about the rcu_dereference on net_cls_subsys_id of
course, not the rest of it.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: tun: Use netif_receive_skb instead of netif_rx
  2010-05-20  5:46                             ` Herbert Xu
  2010-05-20  6:03                               ` Eric Dumazet
@ 2010-05-20  6:52                               ` Herbert Xu
  2010-05-20  8:10                                 ` Thomas Graf
  2010-05-24  6:44                                 ` David Miller
  2010-05-20 17:29                               ` Neil Horman
  2 siblings, 2 replies; 55+ messages in thread
From: Herbert Xu @ 2010-05-20  6:52 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, bmb, tgraf, nhorman, nhorman, netdev

On Thu, May 20, 2010 at 03:46:42PM +1000, Herbert Xu wrote:
> 
> So let me take this opportunity to resubmit with an updated
> description:

OK Dave has convinced me that inheriting the cgroup of the process
creating a socket is not always the right thing to do.  So here's
a new patch that sets the socket classid to the last process with
a valid classid touching it.

cls_cgroup: Store classid in struct sock

Up until now cls_cgroup has relied on fetching the classid out of
the current executing thread.  This runs into trouble when a packet
processing is delayed in which case it may execute out of another
thread's context.

Furthermore, even when a packet is not delayed we may fail to
classify it if soft IRQs have been disabled, because this scenario
is indistinguishable from one where a packet unrelated to the
current thread is processed by a real soft IRQ.

In fact, the current semantics is inherently broken, as a single
skb may be constructed out of the writes of two different tasks.
A different manifestation of this problem is when the TCP stack
transmits in response of an incoming ACK.  This is currently
unclassified.

As we already have a concept of packet ownership for accounting
purposes in the skb->sk pointer, this is a natural place to store
the classid in a persistent manner.

This patch adds the cls_cgroup classid in struct sock, filling up
an existing hole on 64-bit :)

The value is set at socket creation time.  So all sockets created
via socket(2) automatically gains the ID of the thread creating it.
Whenever another process touches the socket by either reading or
writing to it, we will change the socket classid to that of the
process if it has a valid (non-zero) classid.

For sockets created on inbound connections through accept(2), we
inherit the classid of the original listening socket through
sk_clone, possibly preceding the actual accept(2) call.

In order to minimise risks, I have not made this the authoritative
classid.  For now it is only used as a backup when we execute
with soft IRQs disabled.  Once we're completely happy with its
semantics we can use it as the sole classid.

Footnote: I have rearranged the error path on cls_group module
creation.  If we didn't do this, then there is a window where
someone could create a tc rule using cls_group before the cgroup
subsystem has been registered.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/include/net/cls_cgroup.h b/include/net/cls_cgroup.h
new file mode 100644
index 0000000..a4cd15f
--- /dev/null
+++ b/include/net/cls_cgroup.h
@@ -0,0 +1,63 @@
+/*
+ * cls_cgroup.h			Control Group Classifier
+ * 
+ * Authors:	Thomas Graf <tgraf@suug.ch>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option) 
+ * any later version.
+ *
+ */
+
+#ifndef _NET_CLS_CGROUP_H
+#define _NET_CLS_CGROUP_H
+
+#include <linux/cgroup.h>
+#include <linux/hardirq.h>
+#include <linux/rcupdate.h>
+
+struct cgroup_cls_state
+{
+	struct cgroup_subsys_state css;
+	u32 classid;
+};
+
+#ifdef CONFIG_CGROUP
+#ifdef CONFIG_NET_CLS_CGROUP
+static inline u32 task_cls_classid(struct task_struct *p)
+{
+	if (in_interrupt())
+		return 0;
+
+	return container_of(task_subsys_state(p, net_cls_subsys_id),
+			    struct cgroup_cls_state, css).classid;
+}
+#else
+extern int net_cls_subsys_id;
+
+static inline u32 task_cls_classid(struct task_struct *p)
+{
+	int id;
+	u32 classid;
+
+	if (in_interrupt())
+		return 0;
+
+	rcu_read_lock();
+	id = rcu_dereference(net_cls_subsys_id);
+	if (id >= 0)
+		classid = container_of(task_subsys_state(p, id),
+				       struct cgroup_cls_state, css)->classid;
+	rcu_read_unlock();
+
+	return classid;
+}
+#endif
+#else
+static inline u32 task_cls_classid(struct task_struct *p)
+{
+	return 0;
+}
+#endif
+#endif  /* _NET_CLS_CGROUP_H */
diff --git a/include/net/sock.h b/include/net/sock.h
index 1ad6435..cf191ea 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -307,7 +307,7 @@ struct sock {
 	void			*sk_security;
 #endif
 	__u32			sk_mark;
-	/* XXX 4 bytes hole on 64 bit */
+	u32			sk_classid;
 	void			(*sk_state_change)(struct sock *sk);
 	void			(*sk_data_ready)(struct sock *sk, int bytes);
 	void			(*sk_write_space)(struct sock *sk);
@@ -1012,6 +1012,14 @@ extern void *sock_kmalloc(struct sock *sk, int size,
 extern void sock_kfree_s(struct sock *sk, void *mem, int size);
 extern void sk_send_sigurg(struct sock *sk);
 
+#ifdef CONFIG_CGROUP
+extern void sock_update_classid(struct sock *sk);
+#else
+static inline void sock_update_classid(struct sock *sk)
+{
+}
+#endif
+
 /*
  * Functions to fill in entries in struct proto_ops when a protocol
  * does not implement a particular function.
diff --git a/net/core/sock.c b/net/core/sock.c
index c5812bb..8f7fdf8 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -123,6 +123,7 @@
 #include <linux/net_tstamp.h>
 #include <net/xfrm.h>
 #include <linux/ipsec.h>
+#include <net/cls_cgroup.h>
 
 #include <linux/filter.h>
 
@@ -217,6 +218,11 @@ __u32 sysctl_rmem_default __read_mostly = SK_RMEM_MAX;
 int sysctl_optmem_max __read_mostly = sizeof(unsigned long)*(2*UIO_MAXIOV+512);
 EXPORT_SYMBOL(sysctl_optmem_max);
 
+#if defined(CONFIG_CGROUPS) && !defined(CONFIG_NET_CLS_CGROUP)
+int net_cls_subsys_id = -1;
+EXPORT_SYMBOL_GPL(net_cls_subsys_id);
+#endif
+
 static int sock_set_timeout(long *timeo_p, char __user *optval, int optlen)
 {
 	struct timeval tv;
@@ -1041,6 +1047,16 @@ static void sk_prot_free(struct proto *prot, struct sock *sk)
 	module_put(owner);
 }
 
+#ifdef CONFIG_CGROUP
+void sock_update_classid(struct sock *sk)
+{
+	u32 classid = task_cls_classid(current);
+
+	if (classid && classid != sk->sk_classid)
+		sk->classid = classid;
+}
+#endif
+
 /**
  *	sk_alloc - All socket objects are allocated here
  *	@net: the applicable net namespace
@@ -1064,6 +1080,8 @@ struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
 		sock_lock_init(sk);
 		sock_net_set(sk, get_net(net));
 		atomic_set(&sk->sk_wmem_alloc, 1);
+
+		sock_update_classid(sk);
 	}
 
 	return sk;
diff --git a/net/sched/cls_cgroup.c b/net/sched/cls_cgroup.c
index 2211803..766124b 100644
--- a/net/sched/cls_cgroup.c
+++ b/net/sched/cls_cgroup.c
@@ -16,14 +16,11 @@
 #include <linux/errno.h>
 #include <linux/skbuff.h>
 #include <linux/cgroup.h>
+#include <linux/rcupdate.h>
 #include <net/rtnetlink.h>
 #include <net/pkt_cls.h>
-
-struct cgroup_cls_state
-{
-	struct cgroup_subsys_state css;
-	u32 classid;
-};
+#include <net/sock.h>
+#include <net/cls_cgroup.h>
 
 static struct cgroup_subsys_state *cgrp_create(struct cgroup_subsys *ss,
 					       struct cgroup *cgrp);
@@ -112,6 +109,10 @@ static int cls_cgroup_classify(struct sk_buff *skb, struct tcf_proto *tp,
 	struct cls_cgroup_head *head = tp->root;
 	u32 classid;
 
+	rcu_read_lock();
+	classid = task_cls_state(current)->classid;
+	rcu_read_unlock();
+
 	/*
 	 * Due to the nature of the classifier it is required to ignore all
 	 * packets originating from softirq context as accessing `current'
@@ -122,12 +123,12 @@ static int cls_cgroup_classify(struct sk_buff *skb, struct tcf_proto *tp,
 	 * calls by looking at the number of nested bh disable calls because
 	 * softirqs always disables bh.
 	 */
-	if (softirq_count() != SOFTIRQ_OFFSET)
-		return -1;
-
-	rcu_read_lock();
-	classid = task_cls_state(current)->classid;
-	rcu_read_unlock();
+	if (softirq_count() != SOFTIRQ_OFFSET) {
+	 	/* If there is an sk_classid we'll use that. */
+		if (!skb->sk)
+			return -1;
+		classid = skb->sk->sk_classid;
+	}
 
 	if (!classid)
 		return -1;
@@ -289,18 +290,35 @@ static struct tcf_proto_ops cls_cgroup_ops __read_mostly = {
 
 static int __init init_cgroup_cls(void)
 {
-	int ret = register_tcf_proto_ops(&cls_cgroup_ops);
-	if (ret)
-		return ret;
+	int ret;
+
 	ret = cgroup_load_subsys(&net_cls_subsys);
 	if (ret)
-		unregister_tcf_proto_ops(&cls_cgroup_ops);
+		goto out;
+
+#ifndef CONFIG_NET_CLS_CGROUP
+	/* We can't use rcu_assign_pointer because this is an int. */
+	smp_wmb();
+	net_cls_subsys_id = net_cls_subsys.subsys_id;
+#endif
+
+	ret = register_tcf_proto_ops(&cls_cgroup_ops);
+	if (ret)
+		cgroup_unload_subsys(&net_cls_subsys);
+
+out:
 	return ret;
 }
 
 static void __exit exit_cgroup_cls(void)
 {
 	unregister_tcf_proto_ops(&cls_cgroup_ops);
+
+#ifndef CONFIG_NET_CLS_CGROUP
+	net_cls_subsys_id = -1;
+	synchronize_rcu();
+#endif
+
 	cgroup_unload_subsys(&net_cls_subsys);
 }
 
diff --git a/net/socket.c b/net/socket.c
index 5e8d0af..6a1970f 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -94,6 +94,7 @@
 
 #include <net/compat.h>
 #include <net/wext.h>
+#include <net/cls_cgroup.h>
 
 #include <net/sock.h>
 #include <linux/netfilter.h>
@@ -542,6 +543,8 @@ static inline int __sock_sendmsg(struct kiocb *iocb, struct socket *sock,
 	struct sock_iocb *si = kiocb_to_siocb(iocb);
 	int err;
 
+	sock_update_classid(sock->sk);
+
 	si->sock = sock;
 	si->scm = NULL;
 	si->msg = msg;
@@ -669,6 +672,8 @@ static inline int __sock_recvmsg_nosec(struct kiocb *iocb, struct socket *sock,
 {
 	struct sock_iocb *si = kiocb_to_siocb(iocb);
 
+	sock_update_classid(sock->sk);
+
 	si->sock = sock;
 	si->scm = NULL;
 	si->msg = msg;
@@ -762,6 +767,8 @@ static ssize_t sock_splice_read(struct file *file, loff_t *ppos,
 	if (unlikely(!sock->ops->splice_read))
 		return -EINVAL;
 
+	sock_update_classid(sock->sk);
+
 	return sock->ops->splice_read(sock, ppos, pipe, len, flags);
 }
 
@@ -3096,6 +3103,8 @@ int kernel_setsockopt(struct socket *sock, int level, int optname,
 int kernel_sendpage(struct socket *sock, struct page *page, int offset,
 		    size_t size, int flags)
 {
+	sock_update_classid(sock->sk);
+
 	if (sock->ops->sendpage)
 		return sock->ops->sendpage(sock, page, offset, size, flags);

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: tun: Use netif_receive_skb instead of netif_rx
  2010-05-20  6:52                               ` Herbert Xu
@ 2010-05-20  8:10                                 ` Thomas Graf
  2010-05-20  9:40                                   ` Thomas Graf
  2010-05-24  6:44                                 ` David Miller
  1 sibling, 1 reply; 55+ messages in thread
From: Thomas Graf @ 2010-05-20  8:10 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Eric Dumazet, David Miller, bmb, nhorman, nhorman, netdev

On Thu, 2010-05-20 at 16:52 +1000, Herbert Xu wrote: 
> The value is set at socket creation time.  So all sockets created
> via socket(2) automatically gains the ID of the thread creating it.
> Whenever another process touches the socket by either reading or
> writing to it, we will change the socket classid to that of the
> process if it has a valid (non-zero) classid.

There is a fundamental problem with this. The process needs to be
associated with the cgroup before any sockets get created. Sockets
are often created right after the application starts. This means that
the only viable option is to start each application in a wrapper which
assigns itself to the cgroup and then forks the application as its
child. If a task is associated with a cgroup after it has started it
may lead to unpredictable outcome because only some of the sockets
may end up being classified.

This was the actual reason for the old method.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: tun: Use netif_receive_skb instead of netif_rx
  2010-05-20  8:10                                 ` Thomas Graf
@ 2010-05-20  9:40                                   ` Thomas Graf
  0 siblings, 0 replies; 55+ messages in thread
From: Thomas Graf @ 2010-05-20  9:40 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Eric Dumazet, David Miller, bmb, nhorman, nhorman, netdev

On Thu, 2010-05-20 at 10:10 +0200, Thomas Graf wrote: 
> There is a fundamental problem with this. The process needs to be
> associated with the cgroup before any sockets get created. Sockets
> are often created right after the application starts. This means that
> the only viable option is to start each application in a wrapper which
> assigns itself to the cgroup and then forks the application as its
> child. If a task is associated with a cgroup after it has started it
> may lead to unpredictable outcome because only some of the sockets
> may end up being classified.
> 
> This was the actual reason for the old method.

Disregard this. I didn't read your latest patch correctly.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: tun: Use netif_receive_skb instead of netif_rx
  2010-05-20  5:46                             ` Herbert Xu
  2010-05-20  6:03                               ` Eric Dumazet
  2010-05-20  6:52                               ` Herbert Xu
@ 2010-05-20 17:29                               ` Neil Horman
  2010-05-20 23:16                                 ` Herbert Xu
  2 siblings, 1 reply; 55+ messages in thread
From: Neil Horman @ 2010-05-20 17:29 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Eric Dumazet, David Miller, bmb, tgraf, nhorman, netdev

On Thu, May 20, 2010 at 03:46:42PM +1000, Herbert Xu wrote:
> On Thu, May 20, 2010 at 07:36:19AM +0200, Eric Dumazet wrote:
> > 
> > I am ok with any kind of clarification, if its really documented as
> > such, not as indirect effects of changes.
> 
> But I did document this semantic change, quite verbosely too
> at that :)
> 
> > Right now, I am not sure classification is performed by the current
> > process/cpu. Our queue handling can process packets queued by other cpus
> > while we own the queue (__QDISC_STATE_RUNNING,) anyway.
> 
> Classification should be done when the packet is enqueued.  So
> you shouldn't really see other people's traffic.
> 
> The original requeueing would have broken this too, but that is
> no longer with us :)
> 
> In my original submission I forgot to mention the case of TCP
> transmission triggered by an incoming ACK, which is really the
> same case as this.
> 
> So let me take this opportunity to resubmit with an updated
> description:
> 
> cls_cgroup: Store classid in struct sock
> 
> Up until now cls_cgroup has relied on fetching the classid out of
> the current executing thread.  This runs into trouble when a packet
> processing is delayed in which case it may execute out of another
> thread's context.
> 
> Furthermore, even when a packet is not delayed we may fail to
> classify it if soft IRQs have been disabled, because this scenario
> is indistinguishable from one where a packet unrelated to the
> current thread is processed by a real soft IRQ.
> 
> In fact, the current semantics is inherently broken, as a single
> skb may be constructed out of the writes of two different tasks.
> A different manifestation of this problem is when the TCP stack
> transmits in response of an incoming ACK.  This is currently
> unclassified.
> 
> As we already have a concept of packet ownership for accounting
> purposes in the skb->sk pointer, this is a natural place to store
> the classid in a persistent manner.
> 
> This patch adds the cls_cgroup classid in struct sock, filling up
> an existing hole on 64-bit :)
> 
> The value is set at socket creation time.  So all sockets created
> via socket(2) automatically gains the ID of the thread creating it.
> Now you may argue that this may not be the same as the thread that
> is sending the packet.  However, we already have a precedence where
> an fd is passed to a different thread, its security property is
> inherited.  In this case, inheriting the classid of the thread
> creating the socket is also the logical thing to do.
> 
> For sockets created on inbound connections through accept(2), we
> inherit the classid of the original listening socket through
> sk_clone, possibly preceding the actual accept(2) call.
> 
> In order to minimise risks, I have not made this the authoritative
> classid.  For now it is only used as a backup when we execute
> with soft IRQs disabled.  Once we're completely happy with its
> semantics we can use it as the sole classid.
> 
> Footnote: I have rearranged the error path on cls_group module
> creation.  If we didn't do this, then there is a window where
> someone could create a tc rule using cls_group before the cgroup
> subsystem has been registered.
> 
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
> 
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 8f78073..63c5036 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -410,6 +410,12 @@ struct cgroup_scanner {
>  	void *data;
>  };
>  
> +struct cgroup_cls_state
> +{
> +	struct cgroup_subsys_state css;
> +	u32 classid;
> +};
> +
>  /*
>   * Add a new file to the given cgroup directory. Should only be
>   * called by subsystems from within a populate() method
> @@ -515,6 +521,10 @@ struct cgroup_subsys {
>  	struct module *module;
>  };
>  
> +#ifndef CONFIG_NET_CLS_CGROUP
> +extern int net_cls_subsys_id;
> +#endif
> +
>  #define SUBSYS(_x) extern struct cgroup_subsys _x ## _subsys;
>  #include <linux/cgroup_subsys.h>
>  #undef SUBSYS
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 1ad6435..361a5dc 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -307,7 +307,7 @@ struct sock {
>  	void			*sk_security;
>  #endif
>  	__u32			sk_mark;
> -	/* XXX 4 bytes hole on 64 bit */
> +	u32			sk_classid;
>  	void			(*sk_state_change)(struct sock *sk);
>  	void			(*sk_data_ready)(struct sock *sk, int bytes);
>  	void			(*sk_write_space)(struct sock *sk);
> diff --git a/net/core/sock.c b/net/core/sock.c
> index c5812bb..70bc529 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -110,6 +110,7 @@
>  #include <linux/tcp.h>
>  #include <linux/init.h>
>  #include <linux/highmem.h>
> +#include <linux/cgroup.h>
>  
>  #include <asm/uaccess.h>
>  #include <asm/system.h>
> @@ -217,6 +218,32 @@ __u32 sysctl_rmem_default __read_mostly = SK_RMEM_MAX;
>  int sysctl_optmem_max __read_mostly = sizeof(unsigned long)*(2*UIO_MAXIOV+512);
>  EXPORT_SYMBOL(sysctl_optmem_max);
>  
> +#ifdef CONFIG_NET_CLS_CGROUP
> +static inline u32 task_cls_classid(struct task_struct *p)
> +{
> +	return container_of(task_subsys_state(p, net_cls_subsys_id),
> +			    struct cgroup_cls_state, css).classid;
> +}
> +#else
> +int net_cls_subsys_id = -1;
> +EXPORT_SYMBOL_GPL(net_cls_subsys_id);
> +
> +static inline u32 task_cls_classid(struct task_struct *p)
> +{
> +	int id;
> +	u32 classid;
> +
> +	rcu_read_lock();
> +	id = rcu_dereference(net_cls_subsys_id);
> +	if (id >= 0)
> +		classid = container_of(task_subsys_state(p, id),
> +				       struct cgroup_cls_state, css)->classid;
> +	rcu_read_unlock();
> +
> +	return classid;
> +}
> +#endif
> +
>  static int sock_set_timeout(long *timeo_p, char __user *optval, int optlen)
>  {
>  	struct timeval tv;
> @@ -1064,6 +1091,9 @@ struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
>  		sock_lock_init(sk);
>  		sock_net_set(sk, get_net(net));
>  		atomic_set(&sk->sk_wmem_alloc, 1);
> +
> +		if (!in_interrupt())
> +			sk->sk_classid = task_cls_classid(current);
>  	}
>  
>  	return sk;
> diff --git a/net/sched/cls_cgroup.c b/net/sched/cls_cgroup.c
> index 2211803..32c1296 100644
> --- a/net/sched/cls_cgroup.c
> +++ b/net/sched/cls_cgroup.c
> @@ -16,14 +16,10 @@
>  #include <linux/errno.h>
>  #include <linux/skbuff.h>
>  #include <linux/cgroup.h>
> +#include <linux/rcupdate.h>
>  #include <net/rtnetlink.h>
>  #include <net/pkt_cls.h>
> -
> -struct cgroup_cls_state
> -{
> -	struct cgroup_subsys_state css;
> -	u32 classid;
> -};
> +#include <net/sock.h>
>  
>  static struct cgroup_subsys_state *cgrp_create(struct cgroup_subsys *ss,
>  					       struct cgroup *cgrp);
> @@ -112,6 +108,10 @@ static int cls_cgroup_classify(struct sk_buff *skb, struct tcf_proto *tp,
>  	struct cls_cgroup_head *head = tp->root;
>  	u32 classid;
>  
> +	rcu_read_lock();
> +	classid = task_cls_state(current)->classid;
> +	rcu_read_unlock();
> +
>  	/*
>  	 * Due to the nature of the classifier it is required to ignore all
>  	 * packets originating from softirq context as accessing `current'
> @@ -122,12 +122,12 @@ static int cls_cgroup_classify(struct sk_buff *skb, struct tcf_proto *tp,
>  	 * calls by looking at the number of nested bh disable calls because
>  	 * softirqs always disables bh.
>  	 */
> -	if (softirq_count() != SOFTIRQ_OFFSET)
> -		return -1;
> -
> -	rcu_read_lock();
> -	classid = task_cls_state(current)->classid;
> -	rcu_read_unlock();
> +	if (softirq_count() != SOFTIRQ_OFFSET) {
> +	 	/* If there is an sk_classid we'll use that. */
> +		if (!skb->sk)
> +			return -1;
> +		classid = skb->sk->sk_classid;
> +	}
>  
>  	if (!classid)
>  		return -1;
> @@ -289,18 +289,35 @@ static struct tcf_proto_ops cls_cgroup_ops __read_mostly = {
>  
>  static int __init init_cgroup_cls(void)
>  {
> -	int ret = register_tcf_proto_ops(&cls_cgroup_ops);
> -	if (ret)
> -		return ret;
> +	int ret;
> +
>  	ret = cgroup_load_subsys(&net_cls_subsys);
>  	if (ret)
> -		unregister_tcf_proto_ops(&cls_cgroup_ops);
> +		goto out;
> +
> +#ifndef CONFIG_NET_CLS_CGROUP
> +	/* We can't use rcu_assign_pointer because this is an int. */
> +	smp_wmb();
> +	net_cls_subsys_id = net_cls_subsys.subsys_id;
> +#endif
> +
> +	ret = register_tcf_proto_ops(&cls_cgroup_ops);
> +	if (ret)
> +		cgroup_unload_subsys(&net_cls_subsys);
> +
> +out:
>  	return ret;
>  }
>  
>  static void __exit exit_cgroup_cls(void)
>  {
>  	unregister_tcf_proto_ops(&cls_cgroup_ops);
> +
> +#ifndef CONFIG_NET_CLS_CGROUP
> +	net_cls_subsys_id = -1;
> +	synchronize_rcu();
> +#endif
> +
>  	cgroup_unload_subsys(&net_cls_subsys);
>  }
>  
> Cheers,
> -- 
> Visit Openswan at http://www.openswan.org/
> Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


So, I'm testing this patch out now, and unfotunately it doesn't seem to be
working.  Every frame seems to be holding a classid of 0.  Trying to figure out
why now.

Neil


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: tun: Use netif_receive_skb instead of netif_rx
  2010-05-20 17:29                               ` Neil Horman
@ 2010-05-20 23:16                                 ` Herbert Xu
  2010-05-21  0:39                                   ` Neil Horman
  0 siblings, 1 reply; 55+ messages in thread
From: Herbert Xu @ 2010-05-20 23:16 UTC (permalink / raw)
  To: Neil Horman; +Cc: Eric Dumazet, David Miller, bmb, tgraf, nhorman, netdev

On Thu, May 20, 2010 at 01:29:18PM -0400, Neil Horman wrote:
>
> So, I'm testing this patch out now, and unfotunately it doesn't seem to be
> working.  Every frame seems to be holding a classid of 0.  Trying to figure out
> why now.

Not very surprising since tun.c doesn't go through the normal
socket interface.  I'll send a additional patch for that.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: tun: Use netif_receive_skb instead of netif_rx
  2010-05-20 23:16                                 ` Herbert Xu
@ 2010-05-21  0:39                                   ` Neil Horman
  2010-05-21  1:02                                     ` Herbert Xu
  2010-05-21  5:49                                     ` David Miller
  0 siblings, 2 replies; 55+ messages in thread
From: Neil Horman @ 2010-05-21  0:39 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Eric Dumazet, David Miller, bmb, tgraf, nhorman, netdev

On Fri, May 21, 2010 at 09:16:30AM +1000, Herbert Xu wrote:
> On Thu, May 20, 2010 at 01:29:18PM -0400, Neil Horman wrote:
> >
> > So, I'm testing this patch out now, and unfotunately it doesn't seem to be
> > working.  Every frame seems to be holding a classid of 0.  Trying to figure out
> > why now.
> 
> Not very surprising since tun.c doesn't go through the normal
> socket interface.  I'll send a additional patch for that.
> 
I don't think thats it.  I think its a chicken and egg situation.  I think the
problem is that tasks can't be assigned to cgroups until their created, and in
that time a sock can be created.  Its a natural race.  If you create a socket
before you assign it to a cgroup, that socket retains a classid of zero.  I'm
going to try modify the patch to update sockets owned by tasks when the cgroup
is assigned.

Best
Neil

> Cheers,
> -- 
> Visit Openswan at http://www.openswan.org/
> Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: tun: Use netif_receive_skb instead of netif_rx
  2010-05-21  0:39                                   ` Neil Horman
@ 2010-05-21  1:02                                     ` Herbert Xu
  2010-05-21  1:16                                       ` Herbert Xu
  2010-05-21  5:49                                     ` David Miller
  1 sibling, 1 reply; 55+ messages in thread
From: Herbert Xu @ 2010-05-21  1:02 UTC (permalink / raw)
  To: Neil Horman; +Cc: Eric Dumazet, David Miller, bmb, tgraf, nhorman, netdev

On Thu, May 20, 2010 at 08:39:39PM -0400, Neil Horman wrote:
>
> > Not very surprising since tun.c doesn't go through the normal
> > socket interface.  I'll send a additional patch for that.
> > 
> I don't think thats it.  I think its a chicken and egg situation.  I think the
> problem is that tasks can't be assigned to cgroups until their created, and in
> that time a sock can be created.  Its a natural race.  If you create a socket
> before you assign it to a cgroup, that socket retains a classid of zero.  I'm
> going to try modify the patch to update sockets owned by tasks when the cgroup
> is assigned.

That's what I meant above.  My patch will make tun.c to the
classid update every time it sends out a packet.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: tun: Use netif_receive_skb instead of netif_rx
  2010-05-21  1:02                                     ` Herbert Xu
@ 2010-05-21  1:16                                       ` Herbert Xu
  2010-05-24  6:44                                         ` David Miller
  0 siblings, 1 reply; 55+ messages in thread
From: Herbert Xu @ 2010-05-21  1:16 UTC (permalink / raw)
  To: Neil Horman; +Cc: Eric Dumazet, David Miller, bmb, tgraf, nhorman, netdev

On Fri, May 21, 2010 at 11:02:11AM +1000, Herbert Xu wrote:
> 
> That's what I meant above.  My patch will make tun.c to the
> classid update every time it sends out a packet.

Here it is:

tun: Update classid on packet injection

This patch makes tun update its socket classid every time we
inject a packet into the network stack.  This is so that any
updates made by the admin to the process writing packets to
tun is effected.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 4326520..a8a9aa8 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -525,6 +525,8 @@ static inline struct sk_buff *tun_alloc_skb(struct tun_struct *tun,
 	struct sk_buff *skb;
 	int err;
 
+	sock_update_classid(sk);
+
 	/* Under a page?  Don't bother with paged skb. */
 	if (prepad + len < PAGE_SIZE || !linear)
 		linear = len;
diff --git a/net/core/sock.c b/net/core/sock.c
index 8f7fdf8..4969bd1 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1055,6 +1055,7 @@ void sock_update_classid(struct sock *sk)
 	if (classid && classid != sk->sk_classid)
 		sk->classid = classid;
 }
+EXPORT_SYMBOL(sock_update_classid);
 #endif
 
 /**

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: tun: Use netif_receive_skb instead of netif_rx
  2010-05-21  0:39                                   ` Neil Horman
  2010-05-21  1:02                                     ` Herbert Xu
@ 2010-05-21  5:49                                     ` David Miller
  2010-05-21 10:51                                       ` Neil Horman
  1 sibling, 1 reply; 55+ messages in thread
From: David Miller @ 2010-05-21  5:49 UTC (permalink / raw)
  To: nhorman; +Cc: herbert, eric.dumazet, bmb, tgraf, nhorman, netdev

From: Neil Horman <nhorman@tuxdriver.com>
Date: Thu, 20 May 2010 20:39:39 -0400

> On Fri, May 21, 2010 at 09:16:30AM +1000, Herbert Xu wrote:
>> On Thu, May 20, 2010 at 01:29:18PM -0400, Neil Horman wrote:
>> >
>> > So, I'm testing this patch out now, and unfotunately it doesn't seem to be
>> > working.  Every frame seems to be holding a classid of 0.  Trying to figure out
>> > why now.
>> 
>> Not very surprising since tun.c doesn't go through the normal
>> socket interface.  I'll send a additional patch for that.
>> 
> I don't think thats it.  I think its a chicken and egg situation.  I think the
> problem is that tasks can't be assigned to cgroups until their created, and in
> that time a sock can be created.  Its a natural race.  If you create a socket
> before you assign it to a cgroup, that socket retains a classid of zero.  I'm
> going to try modify the patch to update sockets owned by tasks when the cgroup
> is assigned.

Neil, you must not be using Herbert's most recent patch.

Either that or you haven't even read it.

Herbert's most recent patch doesn't create this chicken and egg
problem you mention because it explicitly watches for cgroupid changes
at all socket I/O operations including sendmsg() and sendmsg().  And
if it sees a different cgroupid at a socket I/O call, it updates the
cgroupid value in the socket.

So you very much can change the cgroup of the process mid-socket
ownership and it will work.

The only problem is, as Herbert stated, tun.  Because it does it's
networking I/O directly by calling netif_receive_skb() so it won't
hit any of Herbert's cgroup check points.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: tun: Use netif_receive_skb instead of netif_rx
  2010-05-21  5:49                                     ` David Miller
@ 2010-05-21 10:51                                       ` Neil Horman
  2010-05-21 11:08                                         ` Herbert Xu
  0 siblings, 1 reply; 55+ messages in thread
From: Neil Horman @ 2010-05-21 10:51 UTC (permalink / raw)
  To: David Miller; +Cc: herbert, eric.dumazet, bmb, tgraf, nhorman, netdev

On Thu, May 20, 2010 at 10:49:44PM -0700, David Miller wrote:
> From: Neil Horman <nhorman@tuxdriver.com>
> Date: Thu, 20 May 2010 20:39:39 -0400
> 
> > On Fri, May 21, 2010 at 09:16:30AM +1000, Herbert Xu wrote:
> >> On Thu, May 20, 2010 at 01:29:18PM -0400, Neil Horman wrote:
> >> >
> >> > So, I'm testing this patch out now, and unfotunately it doesn't seem to be
> >> > working.  Every frame seems to be holding a classid of 0.  Trying to figure out
> >> > why now.
> >> 
> >> Not very surprising since tun.c doesn't go through the normal
> >> socket interface.  I'll send a additional patch for that.
> >> 
> > I don't think thats it.  I think its a chicken and egg situation.  I think the
> > problem is that tasks can't be assigned to cgroups until their created, and in
> > that time a sock can be created.  Its a natural race.  If you create a socket
> > before you assign it to a cgroup, that socket retains a classid of zero.  I'm
> > going to try modify the patch to update sockets owned by tasks when the cgroup
> > is assigned.
> 
> Neil, you must not be using Herbert's most recent patch.
> 
I'm not, I was testing the version prior. I wrote my note before he posted the
update for the tun driver

> Either that or you haven't even read it.
> 
I read it, we just described the issue in diferent terms.

Neil

> Herbert's most recent patch doesn't create this chicken and egg
> problem you mention because it explicitly watches for cgroupid changes
> at all socket I/O operations including sendmsg() and sendmsg().  And
> if it sees a different cgroupid at a socket I/O call, it updates the
> cgroupid value in the socket.
> 
> So you very much can change the cgroup of the process mid-socket
> ownership and it will work.
> 
> The only problem is, as Herbert stated, tun.  Because it does it's
> networking I/O directly by calling netif_receive_skb() so it won't
> hit any of Herbert's cgroup check points.
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: tun: Use netif_receive_skb instead of netif_rx
  2010-05-21 10:51                                       ` Neil Horman
@ 2010-05-21 11:08                                         ` Herbert Xu
  2010-05-21 12:59                                           ` Neil Horman
  2010-05-21 16:40                                           ` Neil Horman
  0 siblings, 2 replies; 55+ messages in thread
From: Herbert Xu @ 2010-05-21 11:08 UTC (permalink / raw)
  To: Neil Horman; +Cc: David Miller, eric.dumazet, bmb, tgraf, nhorman, netdev

On Fri, May 21, 2010 at 06:51:28AM -0400, Neil Horman wrote:
>
> I read it, we just described the issue in diferent terms.

Yeah I wasn't clear enough with my email without an actual patch :)

Thanks for testing Neil!
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: tun: Use netif_receive_skb instead of netif_rx
  2010-05-21 11:08                                         ` Herbert Xu
@ 2010-05-21 12:59                                           ` Neil Horman
  2010-05-21 16:40                                           ` Neil Horman
  1 sibling, 0 replies; 55+ messages in thread
From: Neil Horman @ 2010-05-21 12:59 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Neil Horman, David Miller, eric.dumazet, bmb, tgraf, netdev

On Fri, May 21, 2010 at 09:08:47PM +1000, Herbert Xu wrote:
> On Fri, May 21, 2010 at 06:51:28AM -0400, Neil Horman wrote:
> >
> > I read it, we just described the issue in diferent terms.
> 
> Yeah I wasn't clear enough with my email without an actual patch :)
> 
No worries, its not an easy issue to describe :)

> Thanks for testing Neil!
Thanks for the patch, I'll let you know how it works out in just a bit here!
Neil

> -- 
> Visit Openswan at http://www.openswan.org/
> Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: tun: Use netif_receive_skb instead of netif_rx
  2010-05-21 11:08                                         ` Herbert Xu
  2010-05-21 12:59                                           ` Neil Horman
@ 2010-05-21 16:40                                           ` Neil Horman
  2010-05-22  1:49                                             ` cls_cgroup: Store classid in struct sock Herbert Xu
  1 sibling, 1 reply; 55+ messages in thread
From: Neil Horman @ 2010-05-21 16:40 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David Miller, eric.dumazet, bmb, tgraf, nhorman, netdev

On Fri, May 21, 2010 at 09:08:47PM +1000, Herbert Xu wrote:
> On Fri, May 21, 2010 at 06:51:28AM -0400, Neil Horman wrote:
> >
> > I read it, we just described the issue in diferent terms.
> 
> Yeah I wasn't clear enough with my email without an actual patch :)
> 
> Thanks for testing Neil!

So a bit of bad news.  I think you're patch is correct herbert but its still not
working.  When I initially started messing with this it was on 2.6.32.  But
since you're patches I (naturally) started testing them on 2.6.34.  Whereas in
2.6.32 the tasks classid value for the net_cls cgroup subsys was set properly,
in 2.6.34 its not.  It appears since we started using the cgroup subsys
registration calls, we never registered an attach method, so we never set the
tasks classid, so the comparison always fails.

There may also be an issue with the setting of the classid (possible use of the
wrong subsys id value when grabbing our cgroup_subsys_state), but I'm checking
on that now.


Best
Neil

> -- 
> Visit Openswan at http://www.openswan.org/
> Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* cls_cgroup: Store classid in struct sock
  2010-05-21 16:40                                           ` Neil Horman
@ 2010-05-22  1:49                                             ` Herbert Xu
  2010-05-22 12:26                                               ` Neil Horman
  2010-05-24  6:44                                               ` David Miller
  0 siblings, 2 replies; 55+ messages in thread
From: Herbert Xu @ 2010-05-22  1:49 UTC (permalink / raw)
  To: Neil Horman; +Cc: David Miller, eric.dumazet, bmb, tgraf, nhorman, netdev

On Fri, May 21, 2010 at 12:40:54PM -0400, Neil Horman wrote:
> 
> There may also be an issue with the setting of the classid (possible use of the
> wrong subsys id value when grabbing our cgroup_subsys_state), but I'm checking
> on that now.

Actually, I think it's because my patch mistook CONFIG_CGROUP
for CONFIG_CGROUPS.

Here is a fixed version (also includes a build fix on sk->classid).

cls_cgroup: Store classid in struct sock

Up until now cls_cgroup has relied on fetching the classid out of
the current executing thread.  This runs into trouble when a packet
processing is delayed in which case it may execute out of another
thread's context.

Furthermore, even when a packet is not delayed we may fail to
classify it if soft IRQs have been disabled, because this scenario
is indistinguishable from one where a packet unrelated to the
current thread is processed by a real soft IRQ.

In fact, the current semantics is inherently broken, as a single
skb may be constructed out of the writes of two different tasks.
A different manifestation of this problem is when the TCP stack
transmits in response of an incoming ACK.  This is currently
unclassified.

As we already have a concept of packet ownership for accounting
purposes in the skb->sk pointer, this is a natural place to store
the classid in a persistent manner.

This patch adds the cls_cgroup classid in struct sock, filling up
an existing hole on 64-bit :)

The value is set at socket creation time.  So all sockets created
via socket(2) automatically gains the ID of the thread creating it.
Whenever another process touches the socket by either reading or
writing to it, we will change the socket classid to that of the
process if it has a valid (non-zero) classid.

For sockets created on inbound connections through accept(2), we
inherit the classid of the original listening socket through
sk_clone, possibly preceding the actual accept(2) call.

In order to minimise risks, I have not made this the authoritative
classid.  For now it is only used as a backup when we execute
with soft IRQs disabled.  Once we're completely happy with its
semantics we can use it as the sole classid.

Footnote: I have rearranged the error path on cls_group module
creation.  If we didn't do this, then there is a window where
someone could create a tc rule using cls_group before the cgroup
subsystem has been registered.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/include/net/cls_cgroup.h b/include/net/cls_cgroup.h
new file mode 100644
index 0000000..f0e244e
--- /dev/null
+++ b/include/net/cls_cgroup.h
@@ -0,0 +1,63 @@
+/*
+ * cls_cgroup.h			Control Group Classifier
+ * 
+ * Authors:	Thomas Graf <tgraf@suug.ch>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option) 
+ * any later version.
+ *
+ */
+
+#ifndef _NET_CLS_CGROUP_H
+#define _NET_CLS_CGROUP_H
+
+#include <linux/cgroup.h>
+#include <linux/hardirq.h>
+#include <linux/rcupdate.h>
+
+struct cgroup_cls_state
+{
+	struct cgroup_subsys_state css;
+	u32 classid;
+};
+
+#ifdef CONFIG_CGROUPS
+#ifdef CONFIG_NET_CLS_CGROUP
+static inline u32 task_cls_classid(struct task_struct *p)
+{
+	if (in_interrupt())
+		return 0;
+
+	return container_of(task_subsys_state(p, net_cls_subsys_id),
+			    struct cgroup_cls_state, css).classid;
+}
+#else
+extern int net_cls_subsys_id;
+
+static inline u32 task_cls_classid(struct task_struct *p)
+{
+	int id;
+	u32 classid;
+
+	if (in_interrupt())
+		return 0;
+
+	rcu_read_lock();
+	id = rcu_dereference(net_cls_subsys_id);
+	if (id >= 0)
+		classid = container_of(task_subsys_state(p, id),
+				       struct cgroup_cls_state, css)->classid;
+	rcu_read_unlock();
+
+	return classid;
+}
+#endif
+#else
+static inline u32 task_cls_classid(struct task_struct *p)
+{
+	return 0;
+}
+#endif
+#endif  /* _NET_CLS_CGROUP_H */
diff --git a/include/net/sock.h b/include/net/sock.h
index 5697caf..d24f382 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -312,7 +312,7 @@ struct sock {
 	void			*sk_security;
 #endif
 	__u32			sk_mark;
-	/* XXX 4 bytes hole on 64 bit */
+	u32			sk_classid;
 	void			(*sk_state_change)(struct sock *sk);
 	void			(*sk_data_ready)(struct sock *sk, int bytes);
 	void			(*sk_write_space)(struct sock *sk);
@@ -1074,6 +1074,14 @@ extern void *sock_kmalloc(struct sock *sk, int size,
 extern void sock_kfree_s(struct sock *sk, void *mem, int size);
 extern void sk_send_sigurg(struct sock *sk);
 
+#ifdef CONFIG_CGROUPS
+extern void sock_update_classid(struct sock *sk);
+#else
+static inline void sock_update_classid(struct sock *sk)
+{
+}
+#endif
+
 /*
  * Functions to fill in entries in struct proto_ops when a protocol
  * does not implement a particular function.
diff --git a/net/core/sock.c b/net/core/sock.c
index bf88a16..a05ae7f 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -123,6 +123,7 @@
 #include <linux/net_tstamp.h>
 #include <net/xfrm.h>
 #include <linux/ipsec.h>
+#include <net/cls_cgroup.h>
 
 #include <linux/filter.h>
 
@@ -217,6 +218,11 @@ __u32 sysctl_rmem_default __read_mostly = SK_RMEM_MAX;
 int sysctl_optmem_max __read_mostly = sizeof(unsigned long)*(2*UIO_MAXIOV+512);
 EXPORT_SYMBOL(sysctl_optmem_max);
 
+#if defined(CONFIG_CGROUPS) && !defined(CONFIG_NET_CLS_CGROUP)
+int net_cls_subsys_id = -1;
+EXPORT_SYMBOL_GPL(net_cls_subsys_id);
+#endif
+
 static int sock_set_timeout(long *timeo_p, char __user *optval, int optlen)
 {
 	struct timeval tv;
@@ -1050,6 +1056,16 @@ static void sk_prot_free(struct proto *prot, struct sock *sk)
 	module_put(owner);
 }
 
+#ifdef CONFIG_CGROUPS
+void sock_update_classid(struct sock *sk)
+{
+	u32 classid = task_cls_classid(current);
+
+	if (classid && classid != sk->sk_classid)
+		sk->sk_classid = classid;
+}
+#endif
+
 /**
  *	sk_alloc - All socket objects are allocated here
  *	@net: the applicable net namespace
@@ -1073,6 +1089,8 @@ struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
 		sock_lock_init(sk);
 		sock_net_set(sk, get_net(net));
 		atomic_set(&sk->sk_wmem_alloc, 1);
+
+		sock_update_classid(sk);
 	}
 
 	return sk;
diff --git a/net/sched/cls_cgroup.c b/net/sched/cls_cgroup.c
index 2211803..766124b 100644
--- a/net/sched/cls_cgroup.c
+++ b/net/sched/cls_cgroup.c
@@ -16,14 +16,11 @@
 #include <linux/errno.h>
 #include <linux/skbuff.h>
 #include <linux/cgroup.h>
+#include <linux/rcupdate.h>
 #include <net/rtnetlink.h>
 #include <net/pkt_cls.h>
-
-struct cgroup_cls_state
-{
-	struct cgroup_subsys_state css;
-	u32 classid;
-};
+#include <net/sock.h>
+#include <net/cls_cgroup.h>
 
 static struct cgroup_subsys_state *cgrp_create(struct cgroup_subsys *ss,
 					       struct cgroup *cgrp);
@@ -112,6 +109,10 @@ static int cls_cgroup_classify(struct sk_buff *skb, struct tcf_proto *tp,
 	struct cls_cgroup_head *head = tp->root;
 	u32 classid;
 
+	rcu_read_lock();
+	classid = task_cls_state(current)->classid;
+	rcu_read_unlock();
+
 	/*
 	 * Due to the nature of the classifier it is required to ignore all
 	 * packets originating from softirq context as accessing `current'
@@ -122,12 +123,12 @@ static int cls_cgroup_classify(struct sk_buff *skb, struct tcf_proto *tp,
 	 * calls by looking at the number of nested bh disable calls because
 	 * softirqs always disables bh.
 	 */
-	if (softirq_count() != SOFTIRQ_OFFSET)
-		return -1;
-
-	rcu_read_lock();
-	classid = task_cls_state(current)->classid;
-	rcu_read_unlock();
+	if (softirq_count() != SOFTIRQ_OFFSET) {
+	 	/* If there is an sk_classid we'll use that. */
+		if (!skb->sk)
+			return -1;
+		classid = skb->sk->sk_classid;
+	}
 
 	if (!classid)
 		return -1;
@@ -289,18 +290,35 @@ static struct tcf_proto_ops cls_cgroup_ops __read_mostly = {
 
 static int __init init_cgroup_cls(void)
 {
-	int ret = register_tcf_proto_ops(&cls_cgroup_ops);
-	if (ret)
-		return ret;
+	int ret;
+
 	ret = cgroup_load_subsys(&net_cls_subsys);
 	if (ret)
-		unregister_tcf_proto_ops(&cls_cgroup_ops);
+		goto out;
+
+#ifndef CONFIG_NET_CLS_CGROUP
+	/* We can't use rcu_assign_pointer because this is an int. */
+	smp_wmb();
+	net_cls_subsys_id = net_cls_subsys.subsys_id;
+#endif
+
+	ret = register_tcf_proto_ops(&cls_cgroup_ops);
+	if (ret)
+		cgroup_unload_subsys(&net_cls_subsys);
+
+out:
 	return ret;
 }
 
 static void __exit exit_cgroup_cls(void)
 {
 	unregister_tcf_proto_ops(&cls_cgroup_ops);
+
+#ifndef CONFIG_NET_CLS_CGROUP
+	net_cls_subsys_id = -1;
+	synchronize_rcu();
+#endif
+
 	cgroup_unload_subsys(&net_cls_subsys);
 }
 
diff --git a/net/socket.c b/net/socket.c
index f9f7d08..367d547 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -94,6 +94,7 @@
 
 #include <net/compat.h>
 #include <net/wext.h>
+#include <net/cls_cgroup.h>
 
 #include <net/sock.h>
 #include <linux/netfilter.h>
@@ -558,6 +559,8 @@ static inline int __sock_sendmsg(struct kiocb *iocb, struct socket *sock,
 	struct sock_iocb *si = kiocb_to_siocb(iocb);
 	int err;
 
+	sock_update_classid(sock->sk);
+
 	si->sock = sock;
 	si->scm = NULL;
 	si->msg = msg;
@@ -684,6 +687,8 @@ static inline int __sock_recvmsg_nosec(struct kiocb *iocb, struct socket *sock,
 {
 	struct sock_iocb *si = kiocb_to_siocb(iocb);
 
+	sock_update_classid(sock->sk);
+
 	si->sock = sock;
 	si->scm = NULL;
 	si->msg = msg;
@@ -777,6 +782,8 @@ static ssize_t sock_splice_read(struct file *file, loff_t *ppos,
 	if (unlikely(!sock->ops->splice_read))
 		return -EINVAL;
 
+	sock_update_classid(sock->sk);
+
 	return sock->ops->splice_read(sock, ppos, pipe, len, flags);
 }
 
@@ -3069,6 +3076,8 @@ int kernel_setsockopt(struct socket *sock, int level, int optname,
 int kernel_sendpage(struct socket *sock, struct page *page, int offset,
 		    size_t size, int flags)
 {
+	sock_update_classid(sock->sk);
+
 	if (sock->ops->sendpage)
 		return sock->ops->sendpage(sock, page, offset, size, flags);

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: cls_cgroup: Store classid in struct sock
  2010-05-22  1:49                                             ` cls_cgroup: Store classid in struct sock Herbert Xu
@ 2010-05-22 12:26                                               ` Neil Horman
  2010-05-24  5:42                                                 ` Herbert Xu
  2010-05-24  6:44                                               ` David Miller
  1 sibling, 1 reply; 55+ messages in thread
From: Neil Horman @ 2010-05-22 12:26 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David Miller, eric.dumazet, bmb, tgraf, nhorman, netdev

On Sat, May 22, 2010 at 11:49:57AM +1000, Herbert Xu wrote:
> On Fri, May 21, 2010 at 12:40:54PM -0400, Neil Horman wrote:
> > 
> > There may also be an issue with the setting of the classid (possible use of the
> > wrong subsys id value when grabbing our cgroup_subsys_state), but I'm checking
> > on that now.
> 
> Actually, I think it's because my patch mistook CONFIG_CGROUP
> for CONFIG_CGROUPS.
> 
Thats definately part of it yes.  But its not all of it, although I think the
rest doesn't have anything to do with your patch.  I fixed that straight away,
but I was still alwasy getting classids of zero in the qemu processes.  Looking
at the cgroup classifier, I'm wondering how this cgroup code ever worked in the
first place.  When we register the cgroup subsystem, we don't register an attach
method, so we never get a chance to assign task_cls_sate(tsk)->classid to any
non-zero value.  I've got a version of the patch that add that, but for some
reason, even though I set task_cls_state(tsk)->classid to the cgroup classid on
attach, task_cls_classid still returns zero.  I'm not sure why that is yet, but
I'm looking into it.

Neil

> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: cls_cgroup: Store classid in struct sock
  2010-05-22 12:26                                               ` Neil Horman
@ 2010-05-24  5:42                                                 ` Herbert Xu
  0 siblings, 0 replies; 55+ messages in thread
From: Herbert Xu @ 2010-05-24  5:42 UTC (permalink / raw)
  To: Neil Horman; +Cc: David Miller, eric.dumazet, bmb, tgraf, nhorman, netdev

On Sat, May 22, 2010 at 08:26:32AM -0400, Neil Horman wrote:
>
> first place.  When we register the cgroup subsystem, we don't register an attach
> method, so we never get a chance to assign task_cls_sate(tsk)->classid to any
> non-zero value.  I've got a version of the patch that add that, but for some

No I don't think you need an attach method.

The task_struct has a pointer to the cgroups, which in turn has
a pointer to the cgroup_subsys_state that we allocated in the
cls_cgroup module.

Did you try building cls_cgroup into the kernel?

Perhaps there is a bug in how we register it at run-time, or
perhaps the cgroups infrastructure itself is buggy.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: tun: Use netif_receive_skb instead of netif_rx
  2010-05-20  6:52                               ` Herbert Xu
  2010-05-20  8:10                                 ` Thomas Graf
@ 2010-05-24  6:44                                 ` David Miller
  1 sibling, 0 replies; 55+ messages in thread
From: David Miller @ 2010-05-24  6:44 UTC (permalink / raw)
  To: herbert; +Cc: eric.dumazet, bmb, tgraf, nhorman, nhorman, netdev

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Thu, 20 May 2010 16:52:42 +1000

> cls_cgroup: Store classid in struct sock
 ...
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Applied.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: tun: Use netif_receive_skb instead of netif_rx
  2010-05-21  1:16                                       ` Herbert Xu
@ 2010-05-24  6:44                                         ` David Miller
  0 siblings, 0 replies; 55+ messages in thread
From: David Miller @ 2010-05-24  6:44 UTC (permalink / raw)
  To: herbert; +Cc: nhorman, eric.dumazet, bmb, tgraf, nhorman, netdev

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Fri, 21 May 2010 11:16:32 +1000

> tun: Update classid on packet injection
> 
> This patch makes tun update its socket classid every time we
> inject a packet into the network stack.  This is so that any
> updates made by the admin to the process writing packets to
> tun is effected.
> 
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Applied.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: cls_cgroup: Store classid in struct sock
  2010-05-22  1:49                                             ` cls_cgroup: Store classid in struct sock Herbert Xu
  2010-05-22 12:26                                               ` Neil Horman
@ 2010-05-24  6:44                                               ` David Miller
  2010-05-24  6:55                                                 ` David Miller
  1 sibling, 1 reply; 55+ messages in thread
From: David Miller @ 2010-05-24  6:44 UTC (permalink / raw)
  To: herbert; +Cc: nhorman, eric.dumazet, bmb, tgraf, nhorman, netdev

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Sat, 22 May 2010 11:49:57 +1000

> On Fri, May 21, 2010 at 12:40:54PM -0400, Neil Horman wrote:
>> 
>> There may also be an issue with the setting of the classid (possible use of the
>> wrong subsys id value when grabbing our cgroup_subsys_state), but I'm checking
>> on that now.
> 
> Actually, I think it's because my patch mistook CONFIG_CGROUP
> for CONFIG_CGROUPS.
> 
> Here is a fixed version (also includes a build fix on sk->classid).
> 
> cls_cgroup: Store classid in struct sock

BTW I made sure to apply this version.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: cls_cgroup: Store classid in struct sock
  2010-05-24  6:44                                               ` David Miller
@ 2010-05-24  6:55                                                 ` David Miller
  2010-05-24  7:07                                                   ` Herbert Xu
  0 siblings, 1 reply; 55+ messages in thread
From: David Miller @ 2010-05-24  6:55 UTC (permalink / raw)
  To: herbert; +Cc: nhorman, eric.dumazet, bmb, tgraf, nhorman, netdev

From: David Miller <davem@davemloft.net>
Date: Sun, 23 May 2010 23:44:41 -0700 (PDT)

> From: Herbert Xu <herbert@gondor.apana.org.au>
> Date: Sat, 22 May 2010 11:49:57 +1000
> 
>> On Fri, May 21, 2010 at 12:40:54PM -0400, Neil Horman wrote:
>>> 
>>> There may also be an issue with the setting of the classid (possible use of the
>>> wrong subsys id value when grabbing our cgroup_subsys_state), but I'm checking
>>> on that now.
>> 
>> Actually, I think it's because my patch mistook CONFIG_CGROUP
>> for CONFIG_CGROUPS.
>> 
>> Here is a fixed version (also includes a build fix on sk->classid).
>> 
>> cls_cgroup: Store classid in struct sock
> 
> BTW I made sure to apply this version.

Actually I had to revert, doesn't build:

  CC [M]  drivers/net/pppoe.o
In file included from net/socket.c:97:
include/net/cls_cgroup.h:22: error: field ‘css’ has incomplete type
make[1]: *** [net/socket.o] Error 1
make: *** [net] Error 2
make: *** Waiting for unfinished jobs....

Probably you only tested the build with cgroups enabled?

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: cls_cgroup: Store classid in struct sock
  2010-05-24  6:55                                                 ` David Miller
@ 2010-05-24  7:07                                                   ` Herbert Xu
  2010-05-24  7:14                                                     ` David Miller
  0 siblings, 1 reply; 55+ messages in thread
From: Herbert Xu @ 2010-05-24  7:07 UTC (permalink / raw)
  To: David Miller; +Cc: nhorman, eric.dumazet, bmb, tgraf, nhorman, netdev

On Sun, May 23, 2010 at 11:55:57PM -0700, David Miller wrote:
>
> Probably you only tested the build with cgroups enabled?

Indeed, that does seem to be the problem here.  It turns out
that their struct is only declared when CONFIG_CGROUPS is defined,
how annoying.

Oh well I guess I'll follow their example :)

cls_cgroup: Store classid in struct sock

Up until now cls_cgroup has relied on fetching the classid out of
the current executing thread.  This runs into trouble when a packet
processing is delayed in which case it may execute out of another
thread's context.

Furthermore, even when a packet is not delayed we may fail to
classify it if soft IRQs have been disabled, because this scenario
is indistinguishable from one where a packet unrelated to the
current thread is processed by a real soft IRQ.

In fact, the current semantics is inherently broken, as a single
skb may be constructed out of the writes of two different tasks.
A different manifestation of this problem is when the TCP stack
transmits in response of an incoming ACK.  This is currently
unclassified.

As we already have a concept of packet ownership for accounting
purposes in the skb->sk pointer, this is a natural place to store
the classid in a persistent manner.

This patch adds the cls_cgroup classid in struct sock, filling up
an existing hole on 64-bit :)

The value is set at socket creation time.  So all sockets created
via socket(2) automatically gains the ID of the thread creating it.
Whenever another process touches the socket by either reading or
writing to it, we will change the socket classid to that of the
process if it has a valid (non-zero) classid.

For sockets created on inbound connections through accept(2), we
inherit the classid of the original listening socket through
sk_clone, possibly preceding the actual accept(2) call.

In order to minimise risks, I have not made this the authoritative
classid.  For now it is only used as a backup when we execute
with soft IRQs disabled.  Once we're completely happy with its
semantics we can use it as the sole classid.

Footnote: I have rearranged the error path on cls_group module
creation.  If we didn't do this, then there is a window where
someone could create a tc rule using cls_group before the cgroup
subsystem has been registered.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/include/net/cls_cgroup.h b/include/net/cls_cgroup.h
new file mode 100644
index 0000000..f9a0b01
--- /dev/null
+++ b/include/net/cls_cgroup.h
@@ -0,0 +1,63 @@
+/*
+ * cls_cgroup.h			Control Group Classifier
+ * 
+ * Authors:	Thomas Graf <tgraf@suug.ch>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option) 
+ * any later version.
+ *
+ */
+
+#ifndef _NET_CLS_CGROUP_H
+#define _NET_CLS_CGROUP_H
+
+#include <linux/cgroup.h>
+#include <linux/hardirq.h>
+#include <linux/rcupdate.h>
+
+#ifdef CONFIG_CGROUPS
+struct cgroup_cls_state
+{
+	struct cgroup_subsys_state css;
+	u32 classid;
+};
+
+#ifdef CONFIG_NET_CLS_CGROUP
+static inline u32 task_cls_classid(struct task_struct *p)
+{
+	if (in_interrupt())
+		return 0;
+
+	return container_of(task_subsys_state(p, net_cls_subsys_id),
+			    struct cgroup_cls_state, css).classid;
+}
+#else
+extern int net_cls_subsys_id;
+
+static inline u32 task_cls_classid(struct task_struct *p)
+{
+	int id;
+	u32 classid;
+
+	if (in_interrupt())
+		return 0;
+
+	rcu_read_lock();
+	id = rcu_dereference(net_cls_subsys_id);
+	if (id >= 0)
+		classid = container_of(task_subsys_state(p, id),
+				       struct cgroup_cls_state, css)->classid;
+	rcu_read_unlock();
+
+	return classid;
+}
+#endif
+#else
+static inline u32 task_cls_classid(struct task_struct *p)
+{
+	return 0;
+}
+#endif
+#endif  /* _NET_CLS_CGROUP_H */
diff --git a/include/net/sock.h b/include/net/sock.h
index 5697caf..d24f382 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -312,7 +312,7 @@ struct sock {
 	void			*sk_security;
 #endif
 	__u32			sk_mark;
-	/* XXX 4 bytes hole on 64 bit */
+	u32			sk_classid;
 	void			(*sk_state_change)(struct sock *sk);
 	void			(*sk_data_ready)(struct sock *sk, int bytes);
 	void			(*sk_write_space)(struct sock *sk);
@@ -1074,6 +1074,14 @@ extern void *sock_kmalloc(struct sock *sk, int size,
 extern void sock_kfree_s(struct sock *sk, void *mem, int size);
 extern void sk_send_sigurg(struct sock *sk);
 
+#ifdef CONFIG_CGROUPS
+extern void sock_update_classid(struct sock *sk);
+#else
+static inline void sock_update_classid(struct sock *sk)
+{
+}
+#endif
+
 /*
  * Functions to fill in entries in struct proto_ops when a protocol
  * does not implement a particular function.
diff --git a/net/core/sock.c b/net/core/sock.c
index bf88a16..a05ae7f 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -123,6 +123,7 @@
 #include <linux/net_tstamp.h>
 #include <net/xfrm.h>
 #include <linux/ipsec.h>
+#include <net/cls_cgroup.h>
 
 #include <linux/filter.h>
 
@@ -217,6 +218,11 @@ __u32 sysctl_rmem_default __read_mostly = SK_RMEM_MAX;
 int sysctl_optmem_max __read_mostly = sizeof(unsigned long)*(2*UIO_MAXIOV+512);
 EXPORT_SYMBOL(sysctl_optmem_max);
 
+#if defined(CONFIG_CGROUPS) && !defined(CONFIG_NET_CLS_CGROUP)
+int net_cls_subsys_id = -1;
+EXPORT_SYMBOL_GPL(net_cls_subsys_id);
+#endif
+
 static int sock_set_timeout(long *timeo_p, char __user *optval, int optlen)
 {
 	struct timeval tv;
@@ -1050,6 +1056,16 @@ static void sk_prot_free(struct proto *prot, struct sock *sk)
 	module_put(owner);
 }
 
+#ifdef CONFIG_CGROUPS
+void sock_update_classid(struct sock *sk)
+{
+	u32 classid = task_cls_classid(current);
+
+	if (classid && classid != sk->sk_classid)
+		sk->sk_classid = classid;
+}
+#endif
+
 /**
  *	sk_alloc - All socket objects are allocated here
  *	@net: the applicable net namespace
@@ -1073,6 +1089,8 @@ struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
 		sock_lock_init(sk);
 		sock_net_set(sk, get_net(net));
 		atomic_set(&sk->sk_wmem_alloc, 1);
+
+		sock_update_classid(sk);
 	}
 
 	return sk;
diff --git a/net/sched/cls_cgroup.c b/net/sched/cls_cgroup.c
index 2211803..766124b 100644
--- a/net/sched/cls_cgroup.c
+++ b/net/sched/cls_cgroup.c
@@ -16,14 +16,11 @@
 #include <linux/errno.h>
 #include <linux/skbuff.h>
 #include <linux/cgroup.h>
+#include <linux/rcupdate.h>
 #include <net/rtnetlink.h>
 #include <net/pkt_cls.h>
-
-struct cgroup_cls_state
-{
-	struct cgroup_subsys_state css;
-	u32 classid;
-};
+#include <net/sock.h>
+#include <net/cls_cgroup.h>
 
 static struct cgroup_subsys_state *cgrp_create(struct cgroup_subsys *ss,
 					       struct cgroup *cgrp);
@@ -112,6 +109,10 @@ static int cls_cgroup_classify(struct sk_buff *skb, struct tcf_proto *tp,
 	struct cls_cgroup_head *head = tp->root;
 	u32 classid;
 
+	rcu_read_lock();
+	classid = task_cls_state(current)->classid;
+	rcu_read_unlock();
+
 	/*
 	 * Due to the nature of the classifier it is required to ignore all
 	 * packets originating from softirq context as accessing `current'
@@ -122,12 +123,12 @@ static int cls_cgroup_classify(struct sk_buff *skb, struct tcf_proto *tp,
 	 * calls by looking at the number of nested bh disable calls because
 	 * softirqs always disables bh.
 	 */
-	if (softirq_count() != SOFTIRQ_OFFSET)
-		return -1;
-
-	rcu_read_lock();
-	classid = task_cls_state(current)->classid;
-	rcu_read_unlock();
+	if (softirq_count() != SOFTIRQ_OFFSET) {
+	 	/* If there is an sk_classid we'll use that. */
+		if (!skb->sk)
+			return -1;
+		classid = skb->sk->sk_classid;
+	}
 
 	if (!classid)
 		return -1;
@@ -289,18 +290,35 @@ static struct tcf_proto_ops cls_cgroup_ops __read_mostly = {
 
 static int __init init_cgroup_cls(void)
 {
-	int ret = register_tcf_proto_ops(&cls_cgroup_ops);
-	if (ret)
-		return ret;
+	int ret;
+
 	ret = cgroup_load_subsys(&net_cls_subsys);
 	if (ret)
-		unregister_tcf_proto_ops(&cls_cgroup_ops);
+		goto out;
+
+#ifndef CONFIG_NET_CLS_CGROUP
+	/* We can't use rcu_assign_pointer because this is an int. */
+	smp_wmb();
+	net_cls_subsys_id = net_cls_subsys.subsys_id;
+#endif
+
+	ret = register_tcf_proto_ops(&cls_cgroup_ops);
+	if (ret)
+		cgroup_unload_subsys(&net_cls_subsys);
+
+out:
 	return ret;
 }
 
 static void __exit exit_cgroup_cls(void)
 {
 	unregister_tcf_proto_ops(&cls_cgroup_ops);
+
+#ifndef CONFIG_NET_CLS_CGROUP
+	net_cls_subsys_id = -1;
+	synchronize_rcu();
+#endif
+
 	cgroup_unload_subsys(&net_cls_subsys);
 }
 
diff --git a/net/socket.c b/net/socket.c
index f9f7d08..367d547 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -94,6 +94,7 @@
 
 #include <net/compat.h>
 #include <net/wext.h>
+#include <net/cls_cgroup.h>
 
 #include <net/sock.h>
 #include <linux/netfilter.h>
@@ -558,6 +559,8 @@ static inline int __sock_sendmsg(struct kiocb *iocb, struct socket *sock,
 	struct sock_iocb *si = kiocb_to_siocb(iocb);
 	int err;
 
+	sock_update_classid(sock->sk);
+
 	si->sock = sock;
 	si->scm = NULL;
 	si->msg = msg;
@@ -684,6 +687,8 @@ static inline int __sock_recvmsg_nosec(struct kiocb *iocb, struct socket *sock,
 {
 	struct sock_iocb *si = kiocb_to_siocb(iocb);
 
+	sock_update_classid(sock->sk);
+
 	si->sock = sock;
 	si->scm = NULL;
 	si->msg = msg;
@@ -777,6 +782,8 @@ static ssize_t sock_splice_read(struct file *file, loff_t *ppos,
 	if (unlikely(!sock->ops->splice_read))
 		return -EINVAL;
 
+	sock_update_classid(sock->sk);
+
 	return sock->ops->splice_read(sock, ppos, pipe, len, flags);
 }
 
@@ -3069,6 +3076,8 @@ int kernel_setsockopt(struct socket *sock, int level, int optname,
 int kernel_sendpage(struct socket *sock, struct page *page, int offset,
 		    size_t size, int flags)
 {
+	sock_update_classid(sock->sk);
+
 	if (sock->ops->sendpage)
 		return sock->ops->sendpage(sock, page, offset, size, flags);
 
Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: cls_cgroup: Store classid in struct sock
  2010-05-24  7:07                                                   ` Herbert Xu
@ 2010-05-24  7:14                                                     ` David Miller
  2010-05-24 11:09                                                       ` Neil Horman
  0 siblings, 1 reply; 55+ messages in thread
From: David Miller @ 2010-05-24  7:14 UTC (permalink / raw)
  To: herbert; +Cc: nhorman, eric.dumazet, bmb, tgraf, nhorman, netdev

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Mon, 24 May 2010 17:07:43 +1000

> On Sun, May 23, 2010 at 11:55:57PM -0700, David Miller wrote:
>>
>> Probably you only tested the build with cgroups enabled?
> 
> Indeed, that does seem to be the problem here.  It turns out
> that their struct is only declared when CONFIG_CGROUPS is defined,
> how annoying.
> 
> Oh well I guess I'll follow their example :)
> 
> cls_cgroup: Store classid in struct sock

Thanks for fixing this up, applied.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: cls_cgroup: Store classid in struct sock
  2010-05-24  7:14                                                     ` David Miller
@ 2010-05-24 11:09                                                       ` Neil Horman
  2010-05-24 11:24                                                         ` Herbert Xu
  0 siblings, 1 reply; 55+ messages in thread
From: Neil Horman @ 2010-05-24 11:09 UTC (permalink / raw)
  To: David Miller; +Cc: herbert, eric.dumazet, bmb, tgraf, nhorman, netdev

On Mon, May 24, 2010 at 12:14:29AM -0700, David Miller wrote:
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Date: Mon, 24 May 2010 17:07:43 +1000
> 
> > On Sun, May 23, 2010 at 11:55:57PM -0700, David Miller wrote:
> >>
> >> Probably you only tested the build with cgroups enabled?
> > 
> > Indeed, that does seem to be the problem here.  It turns out
> > that their struct is only declared when CONFIG_CGROUPS is defined,
> > how annoying.
> > 
> > Oh well I guess I'll follow their example :)
> > 
> > cls_cgroup: Store classid in struct sock
> 
> Thanks for fixing this up, applied.
> 


Excuse all my noise from before, Herberts patch works fine.  Apparently the
problem was mine.  QEMU creates several threads during startup, but adding the
processid to a cgroup doesn't cause its child threads to get added
automatically.  Doing this by hand causes this to work.

Thanks
Neil


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: cls_cgroup: Store classid in struct sock
  2010-05-24 11:09                                                       ` Neil Horman
@ 2010-05-24 11:24                                                         ` Herbert Xu
  0 siblings, 0 replies; 55+ messages in thread
From: Herbert Xu @ 2010-05-24 11:24 UTC (permalink / raw)
  To: Neil Horman; +Cc: David Miller, eric.dumazet, bmb, tgraf, nhorman, netdev

On Mon, May 24, 2010 at 07:09:50AM -0400, Neil Horman wrote:
>
> Excuse all my noise from before, Herberts patch works fine.  Apparently the
> problem was mine.  QEMU creates several threads during startup, but adding the
> processid to a cgroup doesn't cause its child threads to get added
> automatically.  Doing this by hand causes this to work.

Thanks a lot for testing Neil!
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 55+ messages in thread

end of thread, other threads:[~2010-05-24 11:24 UTC | newest]

Thread overview: 55+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-05-19  7:57 tun: Use netif_receive_skb instead of netif_rx Herbert Xu
2010-05-19  8:09 ` Eric Dumazet
2010-05-19  8:18   ` Eric Dumazet
2010-05-19  8:21     ` Herbert Xu
2010-05-19 12:05     ` Neil Horman
2010-05-19 12:55       ` Neil Horman
2010-05-19 18:00         ` Neil Horman
2010-05-19 20:24           ` Neil Horman
2010-05-19 20:49           ` Thomas Graf
2010-05-19 21:00             ` Brian Bloniarz
2010-05-20  2:55               ` David Miller
2010-05-20  2:57                 ` Herbert Xu
2010-05-20  3:05                   ` David Miller
2010-05-20  3:34                     ` Herbert Xu
2010-05-20  3:42                       ` Herbert Xu
2010-05-20  3:46                       ` David Miller
2010-05-20  4:54                       ` Eric Dumazet
2010-05-20  5:01                         ` Herbert Xu
2010-05-20  5:15                       ` Eric Dumazet
2010-05-20  5:20                         ` Herbert Xu
2010-05-20  5:36                           ` Eric Dumazet
2010-05-20  5:46                             ` Herbert Xu
2010-05-20  6:03                               ` Eric Dumazet
2010-05-20  6:11                                 ` Herbert Xu
2010-05-20  6:19                                   ` Herbert Xu
2010-05-20  6:52                               ` Herbert Xu
2010-05-20  8:10                                 ` Thomas Graf
2010-05-20  9:40                                   ` Thomas Graf
2010-05-24  6:44                                 ` David Miller
2010-05-20 17:29                               ` Neil Horman
2010-05-20 23:16                                 ` Herbert Xu
2010-05-21  0:39                                   ` Neil Horman
2010-05-21  1:02                                     ` Herbert Xu
2010-05-21  1:16                                       ` Herbert Xu
2010-05-24  6:44                                         ` David Miller
2010-05-21  5:49                                     ` David Miller
2010-05-21 10:51                                       ` Neil Horman
2010-05-21 11:08                                         ` Herbert Xu
2010-05-21 12:59                                           ` Neil Horman
2010-05-21 16:40                                           ` Neil Horman
2010-05-22  1:49                                             ` cls_cgroup: Store classid in struct sock Herbert Xu
2010-05-22 12:26                                               ` Neil Horman
2010-05-24  5:42                                                 ` Herbert Xu
2010-05-24  6:44                                               ` David Miller
2010-05-24  6:55                                                 ` David Miller
2010-05-24  7:07                                                   ` Herbert Xu
2010-05-24  7:14                                                     ` David Miller
2010-05-24 11:09                                                       ` Neil Horman
2010-05-24 11:24                                                         ` Herbert Xu
2010-05-19 14:10       ` tun: Use netif_receive_skb instead of netif_rx Eric Dumazet
2010-05-19 14:31         ` Neil Horman
2010-05-19  8:20   ` Herbert Xu
2010-05-19  8:27     ` Eric Dumazet
2010-05-19  8:44       ` Herbert Xu
2010-05-19 20:14     ` David Miller

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.