From: "Timo Teräs" <timo.teras@iki.fi>
To: Herbert Xu <herbert@gondor.apana.org.au>
Cc: netdev@vger.kernel.org, "David S. Miller" <davem@davemloft.net>
Subject: Re: [PATCH] xfrm: cache bundle lookup results in flow cache
Date: Sat, 20 Mar 2010 18:26:00 +0200
Message-ID: <4BA4F718.3020700@iki.fi>
In-Reply-To: <20100320151751.GB2950@gondor.apana.org.au>

Herbert Xu wrote:
> On Fri, Mar 19, 2010 at 11:53:44AM +0200, Timo Teräs wrote:
>> But now, thinking more, it would probably make more sense to
>> just cache xfrm_dst's and keep a ref to the policy on them. So
>> in general I agree with your recommendation. The only immediate
>> problem I can think of now is that the resolver would need to
>> atomically check whether the xfrm_dst is valid and, if not,
>> resolve a new one. But creating a new xfrm_dst involves taking
>> locks and can sleep, so it cannot be inside the main resolver.
> 
> OK this brings out my favourite topic :)
> 
> The reason we have to sleep is to resolve the template.  So if
> we had queueing for pending xfrm_dst objects, we wouldn't have
> to sleep at all when creating the top-level xfrm_dst.
> 
> Packets using that xfrm_dst can wait in the queue until it is
> fully resolved.  Now obviously this code doesn't exist so this
> is all just a wet dream.

Right. That sounds useful.

> Setting my favourite topic aside, I have to come to the conclusion
> that your patch still doesn't fully resolve the problem you set out
> to fix.
> 
> The crux of the issue is the linked list of all bundles in a
> policy and the obvious problems stemming from walking a linked
> list that is unbounded.
> 
> The reason I think it doesn't fully resolve this is because of
> the flow cache.  Being a per-cpu cache, when you create the xfrm
> dst the first time around, you'll at most put it in one CPU's
> cache.
> 
> The next CPU that comes along will still have to walk that same
> bundle linked list.  So we're back to square one.

Not exactly: each CPU does one slow lookup, after which it
finds the bundle fast. But yes, it's not a perfect solution,
especially if the CPU happens to get switched between the
initial lookup and the cache update.

> Now Dave, my impression is that we picked the per-cpu design
> because it was the best data structure we had back in 2002,
> right?
> 
> If so I'd like us to think about the possibility of switching
> over to a different design, in particular, an RCU-based hash
> table, similar to the one I just used for bridge multicasting.
> 
> This would eliminate the need for walking the bundle list apart
> from the case when we're destroying the policy, which can be
> done in process context.

Right. This would speed up the bundle lookup in all cases.

Except... we can have an override policy on a per-socket basis.
We should include the per-socket override in the flow lookups
so that those sockets also get a boost from the cache. Though
the usual use case is to disable all policies (so that e.g. IKE
can be negotiated without policies applying).
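
For what it's worth, a minimal sketch of what the fast path of such
an RCU-based bundle lookup could look like (bundle_hash_find() and
the surrounding naming are invented here purely for illustration,
not taken from any posted patch):

	/* Sketch only: assumes a hypothetical per-netns hash of bundles
	 * whose read side is protected by RCU. */
	static struct xfrm_dst *bundle_lookup(struct net *net,
					      const struct flowi *fl,
					      unsigned short family)
	{
		struct xfrm_dst *xdst;

		rcu_read_lock();
		/* bundle_hash_find() is an invented helper that walks one
		 * hash chain with hlist_for_each_entry_rcu() and returns
		 * the bundle whose selector matches the flow, or NULL. */
		xdst = bundle_hash_find(net, fl, family);
		if (xdst)
			dst_hold(&xdst->u.dst);	/* take a ref before leaving RCU */
		rcu_read_unlock();

		return xdst;
	}

Real code would of course also have to validate the bundle and cope
with an entry being freed concurrently, but the point is that the
read side takes no locks at all.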

> Actually I just realised that the other way we can fix this is
> to make xfrm_dst objects per-cpu just like IPv4 routes.  That
> is, when you fail to find an xfrm_dst object in the per-cpu
> cache, you don't bother calling xfrm_find_bundle but just make
> a new bundle.
> 
> This is probably much easier than replacing the whole flow cache.
> Can any one think of any problems with duplicate xfrm_dst objects?

Sounds like a very good idea. If we instantiate a new xfrm_dst,
all that it shares with the others is the xfrm_state and xfrm_policy
(the inner objects will be unique). Since those are shared anyway,
I don't see any problem with this.

So we should go ahead and:
1. modify the flow cache to be more generic (have virtual get and
   put for each object, and remove the atomic_t pointer; see the
   sketch below)
2. modify the flow cache to have slow and fast resolvers so we can
   cope with the current sleeping requirement
3. cache bundles instead of policies for outgoing traffic
4. kill find_bundle and just instantiate a new bundle if we get a
   cache miss
5. put all bundles on a global hlist (since the only place that
   walks through them is GC, and a stale bundle can be dst_free'd
   right away); use genids on the policy to flush old bundles
6. dst_free and unlink a bundle immediately if it is found to be
   stale
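
As a rough illustration of step 1, the cache could store a small
embedded object with its own ops instead of a raw policy pointer.
The naming below is my own, just to show the shape of the interface,
not code from an actual patch:

	/* Generic flow cache entry (step 1, sketch only). Whatever the
	 * cache points to - a policy or a bundle - embeds this struct
	 * and supplies the ops. */
	struct flow_cache_object;

	struct flow_cache_ops {
		/* return the object with a reference taken, or NULL if
		 * it is stale so the caller falls back to the resolver */
		struct flow_cache_object *(*get)(struct flow_cache_object *flo);
		/* release the reference taken by ->get() */
		void (*put)(struct flow_cache_object *flo);
	};

	struct flow_cache_object {
		const struct flow_cache_ops *ops;
	};

The fast path would then just call ->get() on a cache hit; when that
returns NULL (stale entry or miss), the sleeping slow resolver of
step 2 builds a new bundle and the per-cpu slot is refreshed.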

- Timo

