All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/5] Long term PMTU/redirect storage in ipv4.
@ 2012-07-17 13:14 David Miller
  2012-07-17 18:03 ` David Miller
  2012-07-17 20:41 ` [PATCH 0/5] Long term PMTU/redirect storage in ipv4 Julian Anastasov
  0 siblings, 2 replies; 17+ messages in thread
From: David Miller @ 2012-07-17 13:14 UTC (permalink / raw)
  To: netdev


These patches implement the final mechanism necessary to really allow
us to go without the route cache in ipv4.

We need a place to have long-term storage of PMTU/redirect information
which is independent of the routes themselves, yet does not get us
back into a situation where we have to write to metrics or anything
like that.

For this we use an "next-hop exception" table in the FIB nexthops.

Currently it is a simple linked list and uses a single global lock
for synchronization, but that can be easily adjusted as-needed.

The one thing I desperately want to avoid is having to create clone
routes in the FIB trie for this purpose, because that is very
expensive.   However, I'm willing to entertain such an idea later
if this current scheme proves to have downsides that the FIB trie
variant would not have.

In order to accomodate this any such scheme, we need to be able to
produce a full flow key at PMTU/redirect time.  That required an
adjustment of the interface call-sites used to propagate these events.

For a PMTU/redirect with a fully specified socket, we pass that socket
and use it to produce the flow key.

Otherwise we use a passed in SKB to formulate the key.  There are two
cases that need to be distinguished, ICMP message processing (in which
case the IP header is at skb->data) and output packet processing
(mostly tunnels, and in all such cases the IP header is at ip_hdr(skb)).

We also have to make the code able to handle the case where the dst
itself passed into the dst_ops->{update_pmtu,redirect} method is
invalidated.  This matters for calls from sockets that have cached
that route.  We provide a inet{,6} helper function for this purpose,
and edit SCTP specially since it caches routes at the transport rather
than socket level.

Signed-off-by: David S. Miller <davem@davemloft.net>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/5] Long term PMTU/redirect storage in ipv4.
  2012-07-17 13:14 [PATCH 0/5] Long term PMTU/redirect storage in ipv4 David Miller
@ 2012-07-17 18:03 ` David Miller
  2012-07-18  4:58   ` net-next and IPv6 Eric Dumazet
  2012-07-17 20:41 ` [PATCH 0/5] Long term PMTU/redirect storage in ipv4 Julian Anastasov
  1 sibling, 1 reply; 17+ messages in thread
From: David Miller @ 2012-07-17 18:03 UTC (permalink / raw)
  To: netdev

From: David Miller <davem@davemloft.net>
Date: Tue, 17 Jul 2012 06:14:18 -0700 (PDT)

> These patches implement the final mechanism necessary to really allow
> us to go without the route cache in ipv4.

Ok I pushed this out to net-next with the v2 of patch #5 and the merge
commit message adjusted to suit.

I think the routing cache will die in net-next for real some time
later this week.

I'll start respinning those patches.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/5] Long term PMTU/redirect storage in ipv4.
  2012-07-17 13:14 [PATCH 0/5] Long term PMTU/redirect storage in ipv4 David Miller
  2012-07-17 18:03 ` David Miller
@ 2012-07-17 20:41 ` Julian Anastasov
  2012-07-17 20:46   ` David Miller
  1 sibling, 1 reply; 17+ messages in thread
From: Julian Anastasov @ 2012-07-17 20:41 UTC (permalink / raw)
  To: David Miller; +Cc: netdev


	Hello,

On Tue, 17 Jul 2012, David Miller wrote:

> These patches implement the final mechanism necessary to really allow
> us to go without the route cache in ipv4.
> 
> We need a place to have long-term storage of PMTU/redirect information
> which is independent of the routes themselves, yet does not get us
> back into a situation where we have to write to metrics or anything
> like that.
> 
> For this we use an "next-hop exception" table in the FIB nexthops.
> 
> Currently it is a simple linked list and uses a single global lock
> for synchronization, but that can be easily adjusted as-needed.
> 
> The one thing I desperately want to avoid is having to create clone
> routes in the FIB trie for this purpose, because that is very
> expensive.   However, I'm willing to entertain such an idea later
> if this current scheme proves to have downsides that the FIB trie
> variant would not have.

	IIRC, struct fib_info was shared by different
prefixes. It saves a lot of memory when thousands of
routes are created to same GW. Now if we end up with 1 or
2 fib_info structures for default routes, the nh_exceptions list
can become very long. May be fib_info is not a good place
to hide such data.

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/5] Long term PMTU/redirect storage in ipv4.
  2012-07-17 20:41 ` [PATCH 0/5] Long term PMTU/redirect storage in ipv4 Julian Anastasov
@ 2012-07-17 20:46   ` David Miller
  2012-07-17 22:14     ` Julian Anastasov
  0 siblings, 1 reply; 17+ messages in thread
From: David Miller @ 2012-07-17 20:46 UTC (permalink / raw)
  To: ja; +Cc: netdev

From: Julian Anastasov <ja@ssi.bg>
Date: Tue, 17 Jul 2012 23:41:16 +0300 (EEST)

> 	IIRC, struct fib_info was shared by different
> prefixes. It saves a lot of memory when thousands of
> routes are created to same GW. Now if we end up with 1 or
> 2 fib_info structures for default routes, the nh_exceptions list
> can become very long. May be fib_info is not a good place
> to hide such data.

Your analysis of what fib_info is and how it's intended to
work is accurate.

But we don't use a linked list for the exceptions in the final
version, we use a reclaiming RCU'd hash table like we use for TCP
metrics.

See the updated version of patch #5 and what I actually committed to
net-next.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/5] Long term PMTU/redirect storage in ipv4.
  2012-07-17 22:14     ` Julian Anastasov
@ 2012-07-17 22:09       ` David Miller
  2012-07-18  1:06         ` Julian Anastasov
  0 siblings, 1 reply; 17+ messages in thread
From: David Miller @ 2012-07-17 22:09 UTC (permalink / raw)
  To: ja; +Cc: netdev

From: Julian Anastasov <ja@ssi.bg>
Date: Wed, 18 Jul 2012 01:14:04 +0300 (EEST)

> 	Aha, I see. Something around fnhe_oldest() and its
> daddr arg does not look good. If the goal is to hijack
> some entry, probably for another daddr and comparing it with
> tcpm_new(), may be we should remove this daddr arg and fully
> reset all parameters such as fnhe_pmtu, fnhe_gw, fnhe_expires
> because the find_or_create_fnhe() callers modify only specific
> fields, we should not end up with wrong gateway inherited from
> another daddr, for example.

Better would be to use a seqlock when reading it's values.

Either way, patches welcome :-)

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/5] Long term PMTU/redirect storage in ipv4.
  2012-07-17 20:46   ` David Miller
@ 2012-07-17 22:14     ` Julian Anastasov
  2012-07-17 22:09       ` David Miller
  0 siblings, 1 reply; 17+ messages in thread
From: Julian Anastasov @ 2012-07-17 22:14 UTC (permalink / raw)
  To: David Miller; +Cc: netdev


	Hello,

On Tue, 17 Jul 2012, David Miller wrote:

> > 	IIRC, struct fib_info was shared by different
> > prefixes. It saves a lot of memory when thousands of
> > routes are created to same GW. Now if we end up with 1 or
> > 2 fib_info structures for default routes, the nh_exceptions list
> > can become very long. May be fib_info is not a good place
> > to hide such data.
> 
> Your analysis of what fib_info is and how it's intended to
> work is accurate.
> 
> But we don't use a linked list for the exceptions in the final
> version, we use a reclaiming RCU'd hash table like we use for TCP
> metrics.
> 
> See the updated version of patch #5 and what I actually committed to
> net-next.

	Aha, I see. Something around fnhe_oldest() and its
daddr arg does not look good. If the goal is to hijack
some entry, probably for another daddr and comparing it with
tcpm_new(), may be we should remove this daddr arg and fully
reset all parameters such as fnhe_pmtu, fnhe_gw, fnhe_expires
because the find_or_create_fnhe() callers modify only specific
fields, we should not end up with wrong gateway inherited from
another daddr, for example.

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/5] Long term PMTU/redirect storage in ipv4.
  2012-07-17 22:09       ` David Miller
@ 2012-07-18  1:06         ` Julian Anastasov
  2012-07-18  3:46           ` Eric Dumazet
  0 siblings, 1 reply; 17+ messages in thread
From: Julian Anastasov @ 2012-07-18  1:06 UTC (permalink / raw)
  To: David Miller; +Cc: netdev


	Hello,

On Tue, 17 Jul 2012, David Miller wrote:

> From: Julian Anastasov <ja@ssi.bg>
> Date: Wed, 18 Jul 2012 01:14:04 +0300 (EEST)
> 
> > 	Aha, I see. Something around fnhe_oldest() and its
> > daddr arg does not look good. If the goal is to hijack
> > some entry, probably for another daddr and comparing it with
> > tcpm_new(), may be we should remove this daddr arg and fully
> > reset all parameters such as fnhe_pmtu, fnhe_gw, fnhe_expires
> > because the find_or_create_fnhe() callers modify only specific
> > fields, we should not end up with wrong gateway inherited from
> > another daddr, for example.
> 
> Better would be to use a seqlock when reading it's values.
> 
> Either way, patches welcome :-)

	I created patch with seqlock usage. This version
is with global seqlock because I'm not sure if 2048 locks
per NH are good idea. This is only compile tested.
After comments may be I have to resubmit in separate message.


Subject: [PATCH] ipv4: use seqlock for nh_exceptions

From: Julian Anastasov <ja@ssi.bg>

	Use global seqlock for the nh_exceptions. Call
fnhe_oldest with the right hash chain. Correct the diff
value for dst_set_expires.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
---
 include/net/ip_fib.h |    2 +-
 net/ipv4/route.c     |  117 ++++++++++++++++++++++++++++++++------------------
 2 files changed, 76 insertions(+), 43 deletions(-)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index e9ee1ca..2daf096 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -51,7 +51,7 @@ struct fib_nh_exception {
 	struct fib_nh_exception __rcu	*fnhe_next;
 	__be32				fnhe_daddr;
 	u32				fnhe_pmtu;
-	u32				fnhe_gw;
+	__be32				fnhe_gw;
 	unsigned long			fnhe_expires;
 	unsigned long			fnhe_stamp;
 };
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index f67e702..e037c73 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1334,8 +1334,9 @@ static void ip_rt_build_flow_key(struct flowi4 *fl4, const struct sock *sk,
 }
 
 static DEFINE_SPINLOCK(fnhe_lock);
+static DEFINE_SEQLOCK(fnhe_seqlock);
 
-static struct fib_nh_exception *fnhe_oldest(struct fnhe_hash_bucket *hash, __be32 daddr)
+static struct fib_nh_exception *fnhe_oldest(struct fnhe_hash_bucket *hash)
 {
 	struct fib_nh_exception *fnhe, *oldest;
 
@@ -1358,47 +1359,76 @@ static inline u32 fnhe_hashfun(__be32 daddr)
 	return hval & (FNHE_HASH_SIZE - 1);
 }
 
-static struct fib_nh_exception *find_or_create_fnhe(struct fib_nh *nh, __be32 daddr)
+static void update_or_create_fnhe(struct fib_nh *nh, __be32 daddr, __be32 gw,
+				  u32 pmtu, unsigned long expires)
 {
 	struct fnhe_hash_bucket *hash = nh->nh_exceptions;
 	struct fib_nh_exception *fnhe;
 	int depth;
 	u32 hval;
 
-	if (!hash) {
-		hash = nh->nh_exceptions = kzalloc(FNHE_HASH_SIZE * sizeof(*hash),
-						   GFP_ATOMIC);
-		if (!hash)
-			return NULL;
-	}
+	if (!hash)
+		goto start;
 
+repeat:
 	hval = fnhe_hashfun(daddr);
 	hash += hval;
 
 	depth = 0;
+	write_seqlock_bh(&fnhe_seqlock);
 	for (fnhe = rcu_dereference(hash->chain); fnhe;
 	     fnhe = rcu_dereference(fnhe->fnhe_next)) {
 		if (fnhe->fnhe_daddr == daddr)
-			goto out;
+			break;
 		depth++;
 	}
 
-	if (depth > FNHE_RECLAIM_DEPTH) {
-		fnhe = fnhe_oldest(hash + hval, daddr);
-		goto out_daddr;
-	}
-	fnhe = kzalloc(sizeof(*fnhe), GFP_ATOMIC);
-	if (!fnhe)
-		return NULL;
+	if (fnhe) {
+		if (gw)
+			fnhe->fnhe_gw = gw;
+		if (pmtu) {
+			fnhe->fnhe_pmtu = pmtu;
+			fnhe->fnhe_expires = expires;
+		}
+	} else {
+		if (depth > FNHE_RECLAIM_DEPTH)
+			fnhe = fnhe_oldest(hash);
+		else {
+			fnhe = kzalloc(sizeof(*fnhe), GFP_ATOMIC);
+			if (!fnhe) {
+				write_sequnlock_bh(&fnhe_seqlock);
+				return;
+			}
 
-	fnhe->fnhe_next = hash->chain;
-	rcu_assign_pointer(hash->chain, fnhe);
+			fnhe->fnhe_next = hash->chain;
+			rcu_assign_pointer(hash->chain, fnhe);
+		}
+		fnhe->fnhe_daddr = daddr;
+		fnhe->fnhe_gw = gw;
+		fnhe->fnhe_pmtu = pmtu;
+		fnhe->fnhe_expires = expires;
+	}
 
-out_daddr:
-	fnhe->fnhe_daddr = daddr;
-out:
 	fnhe->fnhe_stamp = jiffies;
-	return fnhe;
+	write_sequnlock_bh(&fnhe_seqlock);
+	return;
+
+start:
+	spin_lock_bh(&fnhe_lock);
+	hash = nh->nh_exceptions;
+	if (hash) {
+		spin_unlock_bh(&fnhe_lock);
+		goto repeat;
+	}
+	nh->nh_exceptions = kzalloc(FNHE_HASH_SIZE * sizeof(*hash),
+				    GFP_ATOMIC);
+	if (!nh->nh_exceptions) {
+		spin_unlock_bh(&fnhe_lock);
+		return;
+	}
+	hash = nh->nh_exceptions;
+	spin_unlock_bh(&fnhe_lock);
+	goto repeat;
 }
 
 static void __ip_do_redirect(struct rtable *rt, struct sk_buff *skb, struct flowi4 *fl4)
@@ -1452,13 +1482,9 @@ static void __ip_do_redirect(struct rtable *rt, struct sk_buff *skb, struct flow
 		} else {
 			if (fib_lookup(net, fl4, &res) == 0) {
 				struct fib_nh *nh = &FIB_RES_NH(res);
-				struct fib_nh_exception *fnhe;
 
-				spin_lock_bh(&fnhe_lock);
-				fnhe = find_or_create_fnhe(nh, fl4->daddr);
-				if (fnhe)
-					fnhe->fnhe_gw = new_gw;
-				spin_unlock_bh(&fnhe_lock);
+				update_or_create_fnhe(nh, fl4->daddr, new_gw,
+						      0, 0);
 			}
 			rt->rt_gateway = new_gw;
 			rt->rt_flags |= RTCF_REDIRECTED;
@@ -1663,15 +1689,9 @@ static void __ip_rt_update_pmtu(struct rtable *rt, struct flowi4 *fl4, u32 mtu)
 
 	if (fib_lookup(dev_net(rt->dst.dev), fl4, &res) == 0) {
 		struct fib_nh *nh = &FIB_RES_NH(res);
-		struct fib_nh_exception *fnhe;
 
-		spin_lock_bh(&fnhe_lock);
-		fnhe = find_or_create_fnhe(nh, fl4->daddr);
-		if (fnhe) {
-			fnhe->fnhe_pmtu = mtu;
-			fnhe->fnhe_expires = jiffies + ip_rt_mtu_expires;
-		}
-		spin_unlock_bh(&fnhe_lock);
+		update_or_create_fnhe(nh, fl4->daddr, 0, mtu,
+				      jiffies + ip_rt_mtu_expires);
 	}
 	rt->rt_pmtu = mtu;
 	dst_set_expires(&rt->dst, ip_rt_mtu_expires);
@@ -1898,6 +1918,7 @@ static void rt_bind_exception(struct rtable *rt, struct fib_nh *nh, __be32 daddr
 {
 	struct fnhe_hash_bucket *hash = nh->nh_exceptions;
 	struct fib_nh_exception *fnhe;
+	unsigned int seq;
 	u32 hval;
 
 	hval = fnhe_hashfun(daddr);
@@ -1905,17 +1926,29 @@ static void rt_bind_exception(struct rtable *rt, struct fib_nh *nh, __be32 daddr
 	for (fnhe = rcu_dereference(hash[hval].chain); fnhe;
 	     fnhe = rcu_dereference(fnhe->fnhe_next)) {
 		if (fnhe->fnhe_daddr == daddr) {
-			if (fnhe->fnhe_pmtu) {
-				unsigned long expires = fnhe->fnhe_expires;
-				unsigned long diff = jiffies - expires;
+			__be32 fnhe_daddr, gw;
+			u32 pmtu;
+			unsigned long expires;
+
+			do {
+				seq = read_seqbegin(&fnhe_seqlock);
+				fnhe_daddr = fnhe->fnhe_daddr;
+				gw = fnhe->fnhe_gw;
+				pmtu = fnhe->fnhe_pmtu;
+				expires = fnhe->fnhe_expires;
+			} while (read_seqretry(&fnhe_seqlock, seq));
+			if (daddr != fnhe_daddr)
+				break;
+			if (pmtu) {
+				unsigned long diff = expires - jiffies;
 
 				if (time_before(jiffies, expires)) {
-					rt->rt_pmtu = fnhe->fnhe_pmtu;
+					rt->rt_pmtu = pmtu;
 					dst_set_expires(&rt->dst, diff);
 				}
 			}
-			if (fnhe->fnhe_gw)
-				rt->rt_gateway = fnhe->fnhe_gw;
+			if (gw)
+				rt->rt_gateway = gw;
 			fnhe->fnhe_stamp = jiffies;
 			break;
 		}
-- 
1.7.3.4

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/5] Long term PMTU/redirect storage in ipv4.
  2012-07-18  1:06         ` Julian Anastasov
@ 2012-07-18  3:46           ` Eric Dumazet
  2012-07-18  7:28             ` Julian Anastasov
  0 siblings, 1 reply; 17+ messages in thread
From: Eric Dumazet @ 2012-07-18  3:46 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: David Miller, netdev

On Wed, 2012-07-18 at 04:06 +0300, Julian Anastasov wrote:

> 
> 	I created patch with seqlock usage. This version
> is with global seqlock because I'm not sure if 2048 locks
> per NH are good idea. This is only compile tested.
> After comments may be I have to resubmit in separate message.
> 
> 
> Subject: [PATCH] ipv4: use seqlock for nh_exceptions
> 
> From: Julian Anastasov <ja@ssi.bg>
> 
> 	Use global seqlock for the nh_exceptions. Call
> fnhe_oldest with the right hash chain. Correct the diff
> value for dst_set_expires.
> 
> Signed-off-by: Julian Anastasov <ja@ssi.bg>
> ---
>  include/net/ip_fib.h |    2 +-
>  net/ipv4/route.c     |  117 ++++++++++++++++++++++++++++++++------------------
>  2 files changed, 76 insertions(+), 43 deletions(-)
> 

...

> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index f67e702..e037c73 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -1334,8 +1334,9 @@ static void ip_rt_build_flow_key(struct flowi4 *fl4, const struct sock *sk,
>  }
>  
>  static DEFINE_SPINLOCK(fnhe_lock);
> +static DEFINE_SEQLOCK(fnhe_seqlock);

Hi Julian

I find this patch too complex.

You could only change fnhe_lock to a seqlock

 net/ipv4/route.c |   35 +++++++++++++++++++++++------------
 1 file changed, 23 insertions(+), 12 deletions(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index f67e702..a96fc9d 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1333,7 +1333,7 @@ static void ip_rt_build_flow_key(struct flowi4 *fl4, const struct sock *sk,
 		build_sk_flow_key(fl4, sk);
 }
 
-static DEFINE_SPINLOCK(fnhe_lock);
+static DEFINE_SEQLOCK(fnhe_seqlock);
 
 static struct fib_nh_exception *fnhe_oldest(struct fnhe_hash_bucket *hash, __be32 daddr)
 {
@@ -1454,11 +1454,11 @@ static void __ip_do_redirect(struct rtable *rt, struct sk_buff *skb, struct flow
 				struct fib_nh *nh = &FIB_RES_NH(res);
 				struct fib_nh_exception *fnhe;
 
-				spin_lock_bh(&fnhe_lock);
+				write_seqlock_bh(&fnhe_seqlock);
 				fnhe = find_or_create_fnhe(nh, fl4->daddr);
 				if (fnhe)
 					fnhe->fnhe_gw = new_gw;
-				spin_unlock_bh(&fnhe_lock);
+				write_sequnlock_bh(&fnhe_seqlock);
 			}
 			rt->rt_gateway = new_gw;
 			rt->rt_flags |= RTCF_REDIRECTED;
@@ -1665,13 +1665,13 @@ static void __ip_rt_update_pmtu(struct rtable *rt, struct flowi4 *fl4, u32 mtu)
 		struct fib_nh *nh = &FIB_RES_NH(res);
 		struct fib_nh_exception *fnhe;
 
-		spin_lock_bh(&fnhe_lock);
+		write_seqlock_bh(&fnhe_seqlock);
 		fnhe = find_or_create_fnhe(nh, fl4->daddr);
 		if (fnhe) {
 			fnhe->fnhe_pmtu = mtu;
 			fnhe->fnhe_expires = jiffies + ip_rt_mtu_expires;
 		}
-		spin_unlock_bh(&fnhe_lock);
+		write_sequnlock_bh(&fnhe_seqlock);
 	}
 	rt->rt_pmtu = mtu;
 	dst_set_expires(&rt->dst, ip_rt_mtu_expires);
@@ -1904,18 +1904,29 @@ static void rt_bind_exception(struct rtable *rt, struct fib_nh *nh, __be32 daddr
 
 	for (fnhe = rcu_dereference(hash[hval].chain); fnhe;
 	     fnhe = rcu_dereference(fnhe->fnhe_next)) {
-		if (fnhe->fnhe_daddr == daddr) {
-			if (fnhe->fnhe_pmtu) {
-				unsigned long expires = fnhe->fnhe_expires;
-				unsigned long diff = jiffies - expires;
+		unsigned int seq;
+		__be32 fnhe_daddr, gw;
+		u32 pmtu;
+		unsigned long expires;
+
+		do {
+			seq = read_seqbegin(&fnhe_seqlock);
+			fnhe_daddr = fnhe->fnhe_daddr;
+			gw = fnhe->fnhe_gw;
+			pmtu = fnhe->fnhe_pmtu;
+			expires = fnhe->fnhe_expires;
+		} while (read_seqretry(&fnhe_seqlock, seq));
+		if (fnhe_daddr == daddr) {
+			if (pmtu) {
+				unsigned long diff = expires - jiffies;
 
 				if (time_before(jiffies, expires)) {
-					rt->rt_pmtu = fnhe->fnhe_pmtu;
+					rt->rt_pmtu = pmtu;
 					dst_set_expires(&rt->dst, diff);
 				}
 			}
-			if (fnhe->fnhe_gw)
-				rt->rt_gateway = fnhe->fnhe_gw;
+			if (gw)
+				rt->rt_gateway = gw;
 			fnhe->fnhe_stamp = jiffies;
 			break;
 		}

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* net-next and IPv6
  2012-07-17 18:03 ` David Miller
@ 2012-07-18  4:58   ` Eric Dumazet
  2012-07-18  7:04     ` Eric Dumazet
  0 siblings, 1 reply; 17+ messages in thread
From: Eric Dumazet @ 2012-07-18  4:58 UTC (permalink / raw)
  To: David Miller; +Cc: netdev


IPv6 doesnt work anymore for me, at least TCP doesnt work

ssh ::1

ping6 is ok

tcpdump shows garbled source IP and ip-proto

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: net-next and IPv6
  2012-07-18  4:58   ` net-next and IPv6 Eric Dumazet
@ 2012-07-18  7:04     ` Eric Dumazet
  2012-07-18  7:23       ` Eric Dumazet
  0 siblings, 1 reply; 17+ messages in thread
From: Eric Dumazet @ 2012-07-18  7:04 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

On Wed, 2012-07-18 at 06:58 +0200, Eric Dumazet wrote:
> IPv6 doesnt work anymore for me, at least TCP doesnt work
> 
> ssh ::1
> 
> ping6 is ok
> 
> tcpdump shows garbled source IP and ip-proto
> 
> 

Probable bug coming from

commit 35ad9b9cf7d8a2e6259a0d24022e910adb6f3489
Author: David S. Miller <davem@davemloft.net>
Date:   Mon Jul 16 03:44:56 2012 -0700

    ipv6: Add helper inet6_csk_update_pmtu().
    
    This is the ipv6 version of inet_csk_update_pmtu().
    
    Signed-off-by: David S. Miller <davem@davemloft.net>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: net-next and IPv6
  2012-07-18  7:04     ` Eric Dumazet
@ 2012-07-18  7:23       ` Eric Dumazet
  2012-07-18  7:38         ` [PATCH net-next] ipv6: fix inet6_csk_xmit() Eric Dumazet
  0 siblings, 1 reply; 17+ messages in thread
From: Eric Dumazet @ 2012-07-18  7:23 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

On Wed, 2012-07-18 at 09:05 +0200, Eric Dumazet wrote:
> On Wed, 2012-07-18 at 06:58 +0200, Eric Dumazet wrote:
> > IPv6 doesnt work anymore for me, at least TCP doesnt work
> > 
> > ssh ::1
> > 
> > ping6 is ok
> > 
> > tcpdump shows garbled source IP and ip-proto
> > 
> > 
> 
> Probable bug coming from
> 
> commit 35ad9b9cf7d8a2e6259a0d24022e910adb6f3489
> Author: David S. Miller <davem@davemloft.net>
> Date:   Mon Jul 16 03:44:56 2012 -0700
> 
>     ipv6: Add helper inet6_csk_update_pmtu().
>     
>     This is the ipv6 version of inet_csk_update_pmtu().
>     
>     Signed-off-by: David S. Miller <davem@davemloft.net>
> 
> 

OK, I am testing a fix

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/5] Long term PMTU/redirect storage in ipv4.
  2012-07-18  3:46           ` Eric Dumazet
@ 2012-07-18  7:28             ` Julian Anastasov
  2012-07-18  7:30               ` Eric Dumazet
  0 siblings, 1 reply; 17+ messages in thread
From: Julian Anastasov @ 2012-07-18  7:28 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev


	Hello,

On Wed, 18 Jul 2012, Eric Dumazet wrote:

> Hi Julian
> 
> I find this patch too complex.
> 
> You could only change fnhe_lock to a seqlock

	I'll change it, I was not sure if we are going to
use some array of seqlocks and also without adding locks in
the struct fib_nh.

> -static DEFINE_SPINLOCK(fnhe_lock);
> +static DEFINE_SEQLOCK(fnhe_seqlock);
>  
>  static struct fib_nh_exception *fnhe_oldest(struct fnhe_hash_bucket *hash, __be32 daddr)
>  {
> @@ -1454,11 +1454,11 @@ static void __ip_do_redirect(struct rtable *rt, struct sk_buff *skb, struct flow
>  				struct fib_nh *nh = &FIB_RES_NH(res);
>  				struct fib_nh_exception *fnhe;
>  
> -				spin_lock_bh(&fnhe_lock);
> +				write_seqlock_bh(&fnhe_seqlock);
>  				fnhe = find_or_create_fnhe(nh, fl4->daddr);
>  				if (fnhe)
>  					fnhe->fnhe_gw = new_gw;
> -				spin_unlock_bh(&fnhe_lock);
> +				write_sequnlock_bh(&fnhe_seqlock);
>  			}
>  			rt->rt_gateway = new_gw;
>  			rt->rt_flags |= RTCF_REDIRECTED;
> @@ -1665,13 +1665,13 @@ static void __ip_rt_update_pmtu(struct rtable *rt, struct flowi4 *fl4, u32 mtu)
>  		struct fib_nh *nh = &FIB_RES_NH(res);
>  		struct fib_nh_exception *fnhe;
>  
> -		spin_lock_bh(&fnhe_lock);
> +		write_seqlock_bh(&fnhe_seqlock);
>  		fnhe = find_or_create_fnhe(nh, fl4->daddr);
>  		if (fnhe) {
>  			fnhe->fnhe_pmtu = mtu;
>  			fnhe->fnhe_expires = jiffies + ip_rt_mtu_expires;
>  		}
> -		spin_unlock_bh(&fnhe_lock);
> +		write_sequnlock_bh(&fnhe_seqlock);
>  	}
>  	rt->rt_pmtu = mtu;
>  	dst_set_expires(&rt->dst, ip_rt_mtu_expires);
> @@ -1904,18 +1904,29 @@ static void rt_bind_exception(struct rtable *rt, struct fib_nh *nh, __be32 daddr
>  
>  	for (fnhe = rcu_dereference(hash[hval].chain); fnhe;
>  	     fnhe = rcu_dereference(fnhe->fnhe_next)) {
> -		if (fnhe->fnhe_daddr == daddr) {
> -			if (fnhe->fnhe_pmtu) {
> -				unsigned long expires = fnhe->fnhe_expires;
> -				unsigned long diff = jiffies - expires;
> +		unsigned int seq;
> +		__be32 fnhe_daddr, gw;
> +		u32 pmtu;
> +		unsigned long expires;
> +
> +		do {
> +			seq = read_seqbegin(&fnhe_seqlock);
> +			fnhe_daddr = fnhe->fnhe_daddr;
> +			gw = fnhe->fnhe_gw;
> +			pmtu = fnhe->fnhe_pmtu;
> +			expires = fnhe->fnhe_expires;
> +		} while (read_seqretry(&fnhe_seqlock, seq));

	This is going to read all values in the chain
before reaching daddr? Or may be FNHE_RECLAIM_DEPTH is
small and nobody will increase it. May be I can create
some func that searches daddr in chain instead. Do you still
prefer to remove the first daddr check or it is only
that the code is intended too much?

> +		if (fnhe_daddr == daddr) {

	Also, do we need some rcu locking in
__ip_rt_update_pmtu or may be ipv4_update_pmtu is
called always under rcu lock?

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/5] Long term PMTU/redirect storage in ipv4.
  2012-07-18  7:28             ` Julian Anastasov
@ 2012-07-18  7:30               ` Eric Dumazet
  2012-07-18  8:36                 ` Julian Anastasov
  0 siblings, 1 reply; 17+ messages in thread
From: Eric Dumazet @ 2012-07-18  7:30 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: David Miller, netdev

On Wed, 2012-07-18 at 10:28 +0300, Julian Anastasov wrote:

> 	This is going to read all values in the chain
> before reaching daddr? Or may be FNHE_RECLAIM_DEPTH is
> small and nobody will increase it. May be I can create
> some func that searches daddr in chain instead. Do you still
> prefer to remove the first daddr check or it is only
> that the code is intended too much?
> 

I would not bother, since real cost is the initial cache line miss.
Once you read one field, reading others is really fast.

> > +		if (fnhe_daddr == daddr) {
> 
> 	Also, do we need some rcu locking in
> __ip_rt_update_pmtu or may be ipv4_update_pmtu is
> called always under rcu lock?

Sorry, I dont understand, we use the full lock
write_seqlock_bh(&fnhe_seqlock);/write_sequnlock_bh(&fnhe_seqlock);

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH net-next] ipv6: fix inet6_csk_xmit()
  2012-07-18  7:23       ` Eric Dumazet
@ 2012-07-18  7:38         ` Eric Dumazet
  2012-07-18 16:00           ` David Miller
  0 siblings, 1 reply; 17+ messages in thread
From: Eric Dumazet @ 2012-07-18  7:38 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Neal Cardwell, Yuchung Cheng

From: Eric Dumazet <edumazet@google.com>

We should provide to inet6_csk_route_socket a struct flowi6 pointer,
so that net6_csk_xmit() works correctly instead of sending garbage.

Also add some consts 

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
---
 include/linux/ipv6.h             |    4 +-
 include/net/ip6_route.h          |    3 +-
 net/ipv6/inet6_connection_sock.c |   40 +++++++++++++++--------------
 3 files changed, 25 insertions(+), 22 deletions(-)

diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
index bc6c8fd..379e433 100644
--- a/include/linux/ipv6.h
+++ b/include/linux/ipv6.h
@@ -299,9 +299,9 @@ struct ipv6_pinfo {
 	struct in6_addr 	rcv_saddr;
 	struct in6_addr		daddr;
 	struct in6_pktinfo	sticky_pktinfo;
-	struct in6_addr		*daddr_cache;
+	const struct in6_addr		*daddr_cache;
 #ifdef CONFIG_IPV6_SUBTREES
-	struct in6_addr		*saddr_cache;
+	const struct in6_addr		*saddr_cache;
 #endif
 
 	__be32			flow_label;
diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index b6b6f7d..5fa2af0 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -158,7 +158,8 @@ extern void rt6_remove_prefsrc(struct inet6_ifaddr *ifp);
  *	Store a destination cache entry in a socket
  */
 static inline void __ip6_dst_store(struct sock *sk, struct dst_entry *dst,
-				   struct in6_addr *daddr, struct in6_addr *saddr)
+				   const struct in6_addr *daddr,
+				   const struct in6_addr *saddr)
 {
 	struct ipv6_pinfo *np = inet6_sk(sk);
 	struct rt6_info *rt = (struct rt6_info *) dst;
diff --git a/net/ipv6/inet6_connection_sock.c b/net/ipv6/inet6_connection_sock.c
index 4a0c4d2..0251a60 100644
--- a/net/ipv6/inet6_connection_sock.c
+++ b/net/ipv6/inet6_connection_sock.c
@@ -171,7 +171,8 @@ EXPORT_SYMBOL_GPL(inet6_csk_addr2sockaddr);
 
 static inline
 void __inet6_csk_dst_store(struct sock *sk, struct dst_entry *dst,
-			   struct in6_addr *daddr, struct in6_addr *saddr)
+			   const struct in6_addr *daddr,
+			   const struct in6_addr *saddr)
 {
 	__ip6_dst_store(sk, dst, daddr, saddr);
 
@@ -203,31 +204,31 @@ struct dst_entry *__inet6_csk_dst_check(struct sock *sk, u32 cookie)
 	return dst;
 }
 
-static struct dst_entry *inet6_csk_route_socket(struct sock *sk)
+static struct dst_entry *inet6_csk_route_socket(struct sock *sk,
+						struct flowi6 *fl6)
 {
 	struct inet_sock *inet = inet_sk(sk);
 	struct ipv6_pinfo *np = inet6_sk(sk);
 	struct in6_addr *final_p, final;
 	struct dst_entry *dst;
-	struct flowi6 fl6;
 
-	memset(&fl6, 0, sizeof(fl6));
-	fl6.flowi6_proto = sk->sk_protocol;
-	fl6.daddr = np->daddr;
-	fl6.saddr = np->saddr;
-	fl6.flowlabel = np->flow_label;
-	IP6_ECN_flow_xmit(sk, fl6.flowlabel);
-	fl6.flowi6_oif = sk->sk_bound_dev_if;
-	fl6.flowi6_mark = sk->sk_mark;
-	fl6.fl6_sport = inet->inet_sport;
-	fl6.fl6_dport = inet->inet_dport;
-	security_sk_classify_flow(sk, flowi6_to_flowi(&fl6));
+	memset(fl6, 0, sizeof(*fl6));
+	fl6->flowi6_proto = sk->sk_protocol;
+	fl6->daddr = np->daddr;
+	fl6->saddr = np->saddr;
+	fl6->flowlabel = np->flow_label;
+	IP6_ECN_flow_xmit(sk, fl6->flowlabel);
+	fl6->flowi6_oif = sk->sk_bound_dev_if;
+	fl6->flowi6_mark = sk->sk_mark;
+	fl6->fl6_sport = inet->inet_sport;
+	fl6->fl6_dport = inet->inet_dport;
+	security_sk_classify_flow(sk, flowi6_to_flowi(fl6));
 
-	final_p = fl6_update_dst(&fl6, np->opt, &final);
+	final_p = fl6_update_dst(fl6, np->opt, &final);
 
 	dst = __inet6_csk_dst_check(sk, np->dst_cookie);
 	if (!dst) {
-		dst = ip6_dst_lookup_flow(sk, &fl6, final_p, false);
+		dst = ip6_dst_lookup_flow(sk, fl6, final_p, false);
 
 		if (!IS_ERR(dst))
 			__inet6_csk_dst_store(sk, dst, NULL, NULL);
@@ -243,7 +244,7 @@ int inet6_csk_xmit(struct sk_buff *skb, struct flowi *fl_unused)
 	struct dst_entry *dst;
 	int res;
 
-	dst = inet6_csk_route_socket(sk);
+	dst = inet6_csk_route_socket(sk, &fl6);
 	if (IS_ERR(dst)) {
 		sk->sk_err_soft = -PTR_ERR(dst);
 		sk->sk_route_caps = 0;
@@ -265,12 +266,13 @@ EXPORT_SYMBOL_GPL(inet6_csk_xmit);
 
 struct dst_entry *inet6_csk_update_pmtu(struct sock *sk, u32 mtu)
 {
-	struct dst_entry *dst = inet6_csk_route_socket(sk);
+	struct flowi6 fl6;
+	struct dst_entry *dst = inet6_csk_route_socket(sk, &fl6);
 
 	if (IS_ERR(dst))
 		return NULL;
 	dst->ops->update_pmtu(dst, sk, NULL, mtu);
 
-	return inet6_csk_route_socket(sk);
+	return inet6_csk_route_socket(sk, &fl6);
 }
 EXPORT_SYMBOL_GPL(inet6_csk_update_pmtu);

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/5] Long term PMTU/redirect storage in ipv4.
  2012-07-18  7:30               ` Eric Dumazet
@ 2012-07-18  8:36                 ` Julian Anastasov
  2012-07-18 16:07                   ` David Miller
  0 siblings, 1 reply; 17+ messages in thread
From: Julian Anastasov @ 2012-07-18  8:36 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev


	Hello,

On Wed, 18 Jul 2012, Eric Dumazet wrote:

> On Wed, 2012-07-18 at 10:28 +0300, Julian Anastasov wrote:
> 
> > 	This is going to read all values in the chain
> > before reaching daddr? Or may be FNHE_RECLAIM_DEPTH is
> > small and nobody will increase it. May be I can create
> > some func that searches daddr in chain instead. Do you still
> > prefer to remove the first daddr check or it is only
> > that the code is intended too much?
> > 
> 
> I would not bother, since real cost is the initial cache line miss.
> Once you read one field, reading others is really fast.

	Is the cost of read_seqbegin a problem? Here is a
2nd version, I still keep this first check for now.

> > > +		if (fnhe_daddr == daddr) {
> > 
> > 	Also, do we need some rcu locking in
> > __ip_rt_update_pmtu or may be ipv4_update_pmtu is
> > called always under rcu lock?
> 
> Sorry, I dont understand, we use the full lock
> write_seqlock_bh(&fnhe_seqlock);/write_sequnlock_bh(&fnhe_seqlock);

	No, it is not related to the fnhe locking, I mean:

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index c7a40c3..c911caf 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1686,12 +1686,14 @@ static void __ip_rt_update_pmtu(struct rtable *rt, struct flowi4 *fl4, u32 mtu)
 	if (mtu < ip_rt_min_pmtu)
 		mtu = ip_rt_min_pmtu;
 
+	rcu_read_lock();
 	if (fib_lookup(dev_net(rt->dst.dev), fl4, &res) == 0) {
 		struct fib_nh *nh = &FIB_RES_NH(res);
 
 		update_or_create_fnhe(nh, fl4->daddr, 0, mtu,
 				      jiffies + ip_rt_mtu_expires);
 	}
+	rcu_read_unlock();
 	rt->rt_pmtu = mtu;
 	dst_set_expires(&rt->dst, ip_rt_mtu_expires);
 }


	How about the following, is the first daddr check
still a problem?

Subject: [PATCH v2] ipv4: use seqlock for nh_exceptions

From: Julian Anastasov <ja@ssi.bg>

	Use global seqlock for the nh_exceptions. Call
fnhe_oldest with the right hash chain. Correct the diff
value for dst_set_expires.

v2: after suggestions from Eric Dumazet:
* get rid of spin lock fnhe_lock, rearrange update_or_create_fnhe
* continue daddr search in rt_bind_exception

Signed-off-by: Julian Anastasov <ja@ssi.bg>
---
 include/net/ip_fib.h |    2 +-
 net/ipv4/route.c     |  119 +++++++++++++++++++++++++++++---------------------
 2 files changed, 70 insertions(+), 51 deletions(-)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index e9ee1ca..2daf096 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -51,7 +51,7 @@ struct fib_nh_exception {
 	struct fib_nh_exception __rcu	*fnhe_next;
 	__be32				fnhe_daddr;
 	u32				fnhe_pmtu;
-	u32				fnhe_gw;
+	__be32				fnhe_gw;
 	unsigned long			fnhe_expires;
 	unsigned long			fnhe_stamp;
 };
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index f67e702..1485db1 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1333,9 +1333,9 @@ static void ip_rt_build_flow_key(struct flowi4 *fl4, const struct sock *sk,
 		build_sk_flow_key(fl4, sk);
 }
 
-static DEFINE_SPINLOCK(fnhe_lock);
+static DEFINE_SEQLOCK(fnhe_seqlock);
 
-static struct fib_nh_exception *fnhe_oldest(struct fnhe_hash_bucket *hash, __be32 daddr)
+static struct fib_nh_exception *fnhe_oldest(struct fnhe_hash_bucket *hash)
 {
 	struct fib_nh_exception *fnhe, *oldest;
 
@@ -1358,47 +1358,63 @@ static inline u32 fnhe_hashfun(__be32 daddr)
 	return hval & (FNHE_HASH_SIZE - 1);
 }
 
-static struct fib_nh_exception *find_or_create_fnhe(struct fib_nh *nh, __be32 daddr)
+static void update_or_create_fnhe(struct fib_nh *nh, __be32 daddr, __be32 gw,
+				  u32 pmtu, unsigned long expires)
 {
-	struct fnhe_hash_bucket *hash = nh->nh_exceptions;
+	struct fnhe_hash_bucket *hash;
 	struct fib_nh_exception *fnhe;
 	int depth;
-	u32 hval;
+	u32 hval = fnhe_hashfun(daddr);
+
+	write_seqlock_bh(&fnhe_seqlock);
 
+	hash = nh->nh_exceptions;
 	if (!hash) {
-		hash = nh->nh_exceptions = kzalloc(FNHE_HASH_SIZE * sizeof(*hash),
-						   GFP_ATOMIC);
+		hash = kzalloc(FNHE_HASH_SIZE * sizeof(*hash), GFP_ATOMIC);
 		if (!hash)
-			return NULL;
+			goto out_unlock;
+		nh->nh_exceptions = hash;
 	}
 
-	hval = fnhe_hashfun(daddr);
 	hash += hval;
 
 	depth = 0;
 	for (fnhe = rcu_dereference(hash->chain); fnhe;
 	     fnhe = rcu_dereference(fnhe->fnhe_next)) {
 		if (fnhe->fnhe_daddr == daddr)
-			goto out;
+			break;
 		depth++;
 	}
 
-	if (depth > FNHE_RECLAIM_DEPTH) {
-		fnhe = fnhe_oldest(hash + hval, daddr);
-		goto out_daddr;
+	if (fnhe) {
+		if (gw)
+			fnhe->fnhe_gw = gw;
+		if (pmtu) {
+			fnhe->fnhe_pmtu = pmtu;
+			fnhe->fnhe_expires = expires;
+		}
+	} else {
+		if (depth > FNHE_RECLAIM_DEPTH)
+			fnhe = fnhe_oldest(hash);
+		else {
+			fnhe = kzalloc(sizeof(*fnhe), GFP_ATOMIC);
+			if (!fnhe)
+				goto out_unlock;
+
+			fnhe->fnhe_next = hash->chain;
+			rcu_assign_pointer(hash->chain, fnhe);
+		}
+		fnhe->fnhe_daddr = daddr;
+		fnhe->fnhe_gw = gw;
+		fnhe->fnhe_pmtu = pmtu;
+		fnhe->fnhe_expires = expires;
 	}
-	fnhe = kzalloc(sizeof(*fnhe), GFP_ATOMIC);
-	if (!fnhe)
-		return NULL;
-
-	fnhe->fnhe_next = hash->chain;
-	rcu_assign_pointer(hash->chain, fnhe);
 
-out_daddr:
-	fnhe->fnhe_daddr = daddr;
-out:
 	fnhe->fnhe_stamp = jiffies;
-	return fnhe;
+
+out_unlock:
+	write_sequnlock_bh(&fnhe_seqlock);
+	return;
 }
 
 static void __ip_do_redirect(struct rtable *rt, struct sk_buff *skb, struct flowi4 *fl4)
@@ -1452,13 +1468,9 @@ static void __ip_do_redirect(struct rtable *rt, struct sk_buff *skb, struct flow
 		} else {
 			if (fib_lookup(net, fl4, &res) == 0) {
 				struct fib_nh *nh = &FIB_RES_NH(res);
-				struct fib_nh_exception *fnhe;
 
-				spin_lock_bh(&fnhe_lock);
-				fnhe = find_or_create_fnhe(nh, fl4->daddr);
-				if (fnhe)
-					fnhe->fnhe_gw = new_gw;
-				spin_unlock_bh(&fnhe_lock);
+				update_or_create_fnhe(nh, fl4->daddr, new_gw,
+						      0, 0);
 			}
 			rt->rt_gateway = new_gw;
 			rt->rt_flags |= RTCF_REDIRECTED;
@@ -1663,15 +1675,9 @@ static void __ip_rt_update_pmtu(struct rtable *rt, struct flowi4 *fl4, u32 mtu)
 
 	if (fib_lookup(dev_net(rt->dst.dev), fl4, &res) == 0) {
 		struct fib_nh *nh = &FIB_RES_NH(res);
-		struct fib_nh_exception *fnhe;
 
-		spin_lock_bh(&fnhe_lock);
-		fnhe = find_or_create_fnhe(nh, fl4->daddr);
-		if (fnhe) {
-			fnhe->fnhe_pmtu = mtu;
-			fnhe->fnhe_expires = jiffies + ip_rt_mtu_expires;
-		}
-		spin_unlock_bh(&fnhe_lock);
+		update_or_create_fnhe(nh, fl4->daddr, 0, mtu,
+				      jiffies + ip_rt_mtu_expires);
 	}
 	rt->rt_pmtu = mtu;
 	dst_set_expires(&rt->dst, ip_rt_mtu_expires);
@@ -1904,21 +1910,34 @@ static void rt_bind_exception(struct rtable *rt, struct fib_nh *nh, __be32 daddr
 
 	for (fnhe = rcu_dereference(hash[hval].chain); fnhe;
 	     fnhe = rcu_dereference(fnhe->fnhe_next)) {
-		if (fnhe->fnhe_daddr == daddr) {
-			if (fnhe->fnhe_pmtu) {
-				unsigned long expires = fnhe->fnhe_expires;
-				unsigned long diff = jiffies - expires;
-
-				if (time_before(jiffies, expires)) {
-					rt->rt_pmtu = fnhe->fnhe_pmtu;
-					dst_set_expires(&rt->dst, diff);
-				}
+		__be32 fnhe_daddr, gw;
+		u32 pmtu;
+		unsigned long expires;
+		unsigned int seq;
+
+		if (fnhe->fnhe_daddr != daddr)
+			continue;
+		do {
+			seq = read_seqbegin(&fnhe_seqlock);
+			fnhe_daddr = fnhe->fnhe_daddr;
+			gw = fnhe->fnhe_gw;
+			pmtu = fnhe->fnhe_pmtu;
+			expires = fnhe->fnhe_expires;
+		} while (read_seqretry(&fnhe_seqlock, seq));
+		if (daddr != fnhe_daddr)
+			continue;
+		if (pmtu) {
+			unsigned long diff = expires - jiffies;
+
+			if (time_before(jiffies, expires)) {
+				rt->rt_pmtu = pmtu;
+				dst_set_expires(&rt->dst, diff);
 			}
-			if (fnhe->fnhe_gw)
-				rt->rt_gateway = fnhe->fnhe_gw;
-			fnhe->fnhe_stamp = jiffies;
-			break;
 		}
+		if (gw)
+			rt->rt_gateway = gw;
+		fnhe->fnhe_stamp = jiffies;
+		break;
 	}
 }
 
-- 
1.7.3.4

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH net-next] ipv6: fix inet6_csk_xmit()
  2012-07-18  7:38         ` [PATCH net-next] ipv6: fix inet6_csk_xmit() Eric Dumazet
@ 2012-07-18 16:00           ` David Miller
  0 siblings, 0 replies; 17+ messages in thread
From: David Miller @ 2012-07-18 16:00 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, ncardwell, ycheng

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 18 Jul 2012 09:38:04 +0200

> From: Eric Dumazet <edumazet@google.com>
> 
> We should provide to inet6_csk_route_socket a struct flowi6 pointer,
> so that net6_csk_xmit() works correctly instead of sending garbage.
> 
> Also add some consts 
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reported-by: Yuchung Cheng <ycheng@google.com>

Thanks a lot for fixing this Eric.

Applied.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/5] Long term PMTU/redirect storage in ipv4.
  2012-07-18  8:36                 ` Julian Anastasov
@ 2012-07-18 16:07                   ` David Miller
  0 siblings, 0 replies; 17+ messages in thread
From: David Miller @ 2012-07-18 16:07 UTC (permalink / raw)
  To: ja; +Cc: eric.dumazet, netdev

From: Julian Anastasov <ja@ssi.bg>
Date: Wed, 18 Jul 2012 11:36:08 +0300 (EEST)

> 	Is the cost of read_seqbegin a problem? Here is a
> 2nd version, I still keep this first check for now.

No, the read side of seqlocks are extremely cheap, it's just a plain
read and compare of a read-mostly integer.

> Subject: [PATCH v2] ipv4: use seqlock for nh_exceptions
> 
> From: Julian Anastasov <ja@ssi.bg>
> 
> 	Use global seqlock for the nh_exceptions. Call
> fnhe_oldest with the right hash chain. Correct the diff
> value for dst_set_expires.
> 
> v2: after suggestions from Eric Dumazet:
> * get rid of spin lock fnhe_lock, rearrange update_or_create_fnhe
> * continue daddr search in rt_bind_exception
> 
> Signed-off-by: Julian Anastasov <ja@ssi.bg>
> ---

I think if you get a seqlock mis-compare, you will need to branch back
to rescan the hash chain from the beginning.

Otherwise I like these changes a lot.

We should perhaps consider doing something similar in the TCP metrics
code.

Thanks!

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2012-07-18 16:07 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-07-17 13:14 [PATCH 0/5] Long term PMTU/redirect storage in ipv4 David Miller
2012-07-17 18:03 ` David Miller
2012-07-18  4:58   ` net-next and IPv6 Eric Dumazet
2012-07-18  7:04     ` Eric Dumazet
2012-07-18  7:23       ` Eric Dumazet
2012-07-18  7:38         ` [PATCH net-next] ipv6: fix inet6_csk_xmit() Eric Dumazet
2012-07-18 16:00           ` David Miller
2012-07-17 20:41 ` [PATCH 0/5] Long term PMTU/redirect storage in ipv4 Julian Anastasov
2012-07-17 20:46   ` David Miller
2012-07-17 22:14     ` Julian Anastasov
2012-07-17 22:09       ` David Miller
2012-07-18  1:06         ` Julian Anastasov
2012-07-18  3:46           ` Eric Dumazet
2012-07-18  7:28             ` Julian Anastasov
2012-07-18  7:30               ` Eric Dumazet
2012-07-18  8:36                 ` Julian Anastasov
2012-07-18 16:07                   ` David Miller

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.