From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S965700AbbJ1Nda (ORCPT ); Wed, 28 Oct 2015 09:33:30 -0400 Received: from mail-io0-f173.google.com ([209.85.223.173]:35975 "EHLO mail-io0-f173.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932528AbbJ1Nd2 (ORCPT ); Wed, 28 Oct 2015 09:33:28 -0400 MIME-Version: 1.0 In-Reply-To: <1445962519-16133-1-git-send-email-dan.streetman@canonical.com> References: <1445962519-16133-1-git-send-email-dan.streetman@canonical.com> From: Dan Streetman Date: Wed, 28 Oct 2015 09:32:47 -0400 Message-ID: Subject: Re: [PATCH] xfrm: dst_entries_init() per-net dst_ops To: Steffen Klassert Cc: Herbert Xu , "David S. Miller" , Alexey Kuznetsov , James Morris , Hideaki YOSHIFUJI , Patrick McHardy , netdev@vger.kernel.org, linux-kernel@vger.kernel.org, Dan Streetman , Dan Streetman Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Oct 27, 2015 at 12:15 PM, wrote: > From: Dan Streetman > > The ipv4 and ipv6 xfrms each create a template dst_ops object, and > perform dst_entries_init() on the template objects. Then each net > namespace has its net.xfrm.xfrm[46]_dst_ops field set to the template > values. The problem with that is the dst_ops.pcpuc_entries field is > a percpu counter and cannot be used correctly by simply copying it to > another object. > > The result of this is a very subtle bug; changes to the dst entries > counter from one net namespace may sometimes get applied to a different > net namespace dst entries counter. This is because of how the percpu > counter works; it has a main count field as well as a pointer to the > percpu variables. Each net namespace maintains its own main count > variable, but all point to one set of percpu variables. When any net > namespace happens to change one of the percpu variables to outside its > small batch range, its count is moved to the net namespace's main count > variable. So with multiple net namespaces operating concurrently, the > dst_ops entries counter can stray from the actual value that it should > be; if counts are consistently moved from one net namespace to another > (which my testing showed is likely), then one net namespace winds up > with a negative dst_ops count (which is reported as 0) while another > winds up with a continually increasing count, eventually reaching its > gc_thresh limit, which causes all new traffic on the net namespace to > fail with -ENOBUFS. > > This removes the dst_entries_init (and dst_entries_destroy) call for > the template dst_ops objects; their counters will never be used. > Instead dst_entries_init is called for each net namespace's dst_ops > object, right after copying its values from the template, and Well I'm not sure why my test kernel booted, while the test robot found the bug of GFP_KERNEL percpu counter alloc during atomic context. Thanks test robot! I'll update the patch and resend. > dst_entries_destroy is called when the net namespace is removed. > > Signed-off-by: Dan Streetman > Signed-off-by: Dan Streetman > --- > net/ipv4/xfrm4_policy.c | 5 +++-- > net/ipv6/xfrm6_policy.c | 10 ++++------ > net/xfrm/xfrm_policy.c | 25 +++++++++++++++++++++++-- > 3 files changed, 30 insertions(+), 10 deletions(-) > > diff --git a/net/ipv4/xfrm4_policy.c b/net/ipv4/xfrm4_policy.c > index f2606b9..5f747ee 100644 > --- a/net/ipv4/xfrm4_policy.c > +++ b/net/ipv4/xfrm4_policy.c > @@ -235,6 +235,9 @@ static void xfrm4_dst_ifdown(struct dst_entry *dst, struct net_device *dev, > xfrm_dst_ifdown(dst, dev); > } > > +/* This is used as a template only; the dst_entries counter is not > + * initialized for this, but must be on per-net copies of this > + */ > static struct dst_ops xfrm4_dst_ops = { > .family = AF_INET, > .gc = xfrm4_garbage_collect, > @@ -325,8 +328,6 @@ static void __init xfrm4_policy_init(void) > > void __init xfrm4_init(void) > { > - dst_entries_init(&xfrm4_dst_ops); > - > xfrm4_state_init(); > xfrm4_policy_init(); > xfrm4_protocol_init(); > diff --git a/net/ipv6/xfrm6_policy.c b/net/ipv6/xfrm6_policy.c > index 2cc5840..b895ec1 100644 > --- a/net/ipv6/xfrm6_policy.c > +++ b/net/ipv6/xfrm6_policy.c > @@ -279,6 +279,9 @@ static void xfrm6_dst_ifdown(struct dst_entry *dst, struct net_device *dev, > xfrm_dst_ifdown(dst, dev); > } > > +/* This is used as a template only; the dst_entries counter is not > + * initialized for this, but must be on per-net copies of this > + */ > static struct dst_ops xfrm6_dst_ops = { > .family = AF_INET6, > .gc = xfrm6_garbage_collect, > @@ -376,13 +379,9 @@ int __init xfrm6_init(void) > { > int ret; > > - dst_entries_init(&xfrm6_dst_ops); > - > ret = xfrm6_policy_init(); > - if (ret) { > - dst_entries_destroy(&xfrm6_dst_ops); > + if (ret) > goto out; > - } > ret = xfrm6_state_init(); > if (ret) > goto out_policy; > @@ -411,5 +410,4 @@ void xfrm6_fini(void) > xfrm6_protocol_fini(); > xfrm6_policy_fini(); > xfrm6_state_fini(); > - dst_entries_destroy(&xfrm6_dst_ops); > } > diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c > index 09bfcba..5381719 100644 > --- a/net/xfrm/xfrm_policy.c > +++ b/net/xfrm/xfrm_policy.c > @@ -2896,12 +2896,32 @@ static void __net_init xfrm_dst_ops_init(struct net *net) > > rcu_read_lock(); > afinfo = rcu_dereference(xfrm_policy_afinfo[AF_INET]); > - if (afinfo) > + if (afinfo) { > net->xfrm.xfrm4_dst_ops = *afinfo->dst_ops; > + dst_entries_init(&net->xfrm.xfrm4_dst_ops); > + } > #if IS_ENABLED(CONFIG_IPV6) > afinfo = rcu_dereference(xfrm_policy_afinfo[AF_INET6]); > - if (afinfo) > + if (afinfo) { > net->xfrm.xfrm6_dst_ops = *afinfo->dst_ops; > + dst_entries_init(&net->xfrm.xfrm6_dst_ops); > + } > +#endif > + rcu_read_unlock(); > +} > + > +static void xfrm_dst_ops_fini(struct net *net) > +{ > + struct xfrm_policy_afinfo *afinfo; > + > + rcu_read_lock(); > + afinfo = rcu_dereference(xfrm_policy_afinfo[AF_INET]); > + if (afinfo) > + dst_entries_destroy(&net->xfrm.xfrm4_dst_ops); > +#if IS_ENABLED(CONFIG_IPV6) > + afinfo = rcu_dereference(xfrm_policy_afinfo[AF_INET6]); > + if (afinfo) > + dst_entries_destroy(&net->xfrm.xfrm6_dst_ops); > #endif > rcu_read_unlock(); > } > @@ -3085,6 +3105,7 @@ static void __net_exit xfrm_net_exit(struct net *net) > { > flow_cache_fini(net); > xfrm_sysctl_fini(net); > + xfrm_dst_ops_fini(net); > xfrm_policy_fini(net); > xfrm_state_fini(net); > xfrm_statistics_fini(net); > -- > 2.5.0 >