Re: [PATCH nf] x_tables: Properly close read section with read_seqcount_retry

From: Will Deacon <will@kernel.org>
To: Florian Westphal <fw@strlen.de>
Cc: subashab@codeaurora.org, pablo@netfilter.org,
	Sean Tranchetti <stranche@codeaurora.org>,
	netfilter-devel@vger.kernel.org, peterz@infradead.org,
	tglx@linutronix.de
Subject: Re: [PATCH nf] x_tables: Properly close read section with read_seqcount_retry
Date: Wed, 18 Nov 2020 12:54:06 +0000	[thread overview]
Message-ID: <20201118125406.GA2029@willie-the-truck> (raw)
In-Reply-To: <20201118124228.GJ22792@breakpoint.cc>

On Wed, Nov 18, 2020 at 01:42:28PM +0100, Florian Westphal wrote:
> Will Deacon <will@kernel.org> wrote:
> > On Mon, Nov 16, 2020 at 07:20:28PM +0100, Florian Westphal wrote:
> > > subashab@codeaurora.org <subashab@codeaurora.org> wrote:
> > > > > > Unfortunately we are seeing it on ARM64 regression systems which
> > > > > > runs a
> > > > > > variety of
> > > > > > usecases so the exact steps are not known.
> > > > > 
> > > > > Ok.  Would you be willing to run some of those with your suggested
> > > > > change to see if that resolves the crashes or is that so rare that this
> > > > > isn't practical?
> > > > 
> > > > I can try that out. Let me know if you have any other suggestions as well
> > > > and I can try that too.
> > > > 
> > > > I assume we cant add locks here as it would be in the packet processing
> > > > path.
> > > 
> > > Yes.  We can add a synchronize_net() in xt_replace_table if needed
> > > though, before starting to put the references on the old ruleset
> > > This would avoid the free of the jumpstack while skbs are still
> > > in-flight.
> > 
> > I tried to make sense of what's going on here, and it looks to me like
> > the interaction between xt_replace_table() and ip6t_do_table() relies on
> > store->load order that is _not_ enforced in the code.
> > 
> > xt_replace_table() does:
> > 
> > 	table->private = newinfo;
> > 
> > 	/* make sure all cpus see new ->private value */
> > 	smp_wmb();
> > 
> > 	/*
> > 	 * Even though table entries have now been swapped, other CPU's
> > 	 * may still be using the old entries...
> > 	 */
> > 	local_bh_enable();
> > 
> > 	/* ... so wait for even xt_recseq on all cpus */
> > 	for_each_possible_cpu(cpu) {
> > 		seqcount_t *s = &per_cpu(xt_recseq, cpu);
> > 		u32 seq = raw_read_seqcount(s);
> > 
> > 		if (seq & 1) {
> > 			do {
> > 				cond_resched();
> > 				cpu_relax();
> > 			} while (seq == raw_read_seqcount(s));
> > 		}
> > 	}
> > 
> > and I think the idea here is that we swizzle the table->private pointer
> > to point to the new data, and then wait for all pre-existing readers to
> > finish with the old table (isn't this what RCU is for?).
> 
> Yes and yes.   Before the current 'for_each_possible_cpu() + seqcount
> abuse the abuse was just implicit; xt_replace_table did NOT wait for
> readers to go away and relied on arp, eb and ip(6)tables to store the
> counters back to userspace after xt_replace_table().

Thanks for the background.

> This meant a read_seqcount_begin/retry for each counter & eventually
> gave complaits from k8s users that x_tables rule replacement was too
> slow.
> 
> [..]
> 
> > Given that this appears to be going wrong in practice, I've included a quick
> > hack below which should fix the ordering. However, I think it would be
> > better if we could avoid abusing the seqcount code for this sort of thing.
> 
> I'd be ok with going via the simpler solution & wait if k8s users
> complain that its too slow.  Those memory blobs can be huge so I would
> not use call_rcu here.

If you can stomach the cost of synchronize_rcu() then this at least
gets rid of that for_each_possible_cpu() loop! Also...

> diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
> index af22dbe85e2c..b5911985d1eb 100644
> --- a/net/netfilter/x_tables.c
> +++ b/net/netfilter/x_tables.c
> @@ -1384,7 +1384,7 @@ xt_replace_table(struct xt_table *table,
>          * private.
>          */
>         smp_wmb();
> -       table->private = newinfo;
> +       WRITE_ONCE(table->private, newinfo);

... you could make this rcu_assign_pointer() and get rid of the preceding
smp_wmb()...

>  
>         /* make sure all cpus see new ->private value */
>         smp_wmb();

... and this smp_wmb() is no longer needed because synchronize_rcu()
takes care of the ordering.

> @@ -1394,19 +1394,7 @@ xt_replace_table(struct xt_table *table,
>          * may still be using the old entries...
>          */
>         local_bh_enable();
> -
> -       /* ... so wait for even xt_recseq on all cpus */
> -       for_each_possible_cpu(cpu) {
> -               seqcount_t *s = &per_cpu(xt_recseq, cpu);
> -               u32 seq = raw_read_seqcount(s);
> -
> -               if (seq & 1) {
> -                       do {
> -                               cond_resched();
> -                               cpu_relax();
> -                       } while (seq == raw_read_seqcount(s));
> -               }
> -       }

Do we still need xt_write_recseq_{begin,end}() with this removed?

Cheers,

Will