Re: [RFC net-next 0/2] prevent sync issues with hw offload of flower

From: Vlad Buslov <vladbu@mellanox.com>
To: John Hurley <john.hurley@netronome.com>
Cc: Vlad Buslov <vladbu@mellanox.com>, Jiri Pirko <jiri@mellanox.com>,
	"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
	"simon.horman@netronome.com" <simon.horman@netronome.com>,
	"jakub.kicinski@netronome.com" <jakub.kicinski@netronome.com>,
	"oss-drivers@netronome.com" <oss-drivers@netronome.com>
Subject: Re: [RFC net-next 0/2] prevent sync issues with hw offload of flower
Date: Fri, 4 Oct 2019 15:58:32 +0000
Message-ID: <vbf36g8jyeh.fsf@mellanox.com>
In-Reply-To: <CAK+XE=nrzH8B=2GRcvmgOus67HSh_QXfBsawO_qicp8nSyZ_FA@mail.gmail.com>

On Fri 04 Oct 2019 at 18:39, John Hurley <john.hurley@netronome.com> wrote:
> On Thu, Oct 3, 2019 at 6:19 PM Vlad Buslov <vladbu@mellanox.com> wrote:
>>
>>
>> On Thu 03 Oct 2019 at 19:59, John Hurley <john.hurley@netronome.com> wrote:
>> > On Thu, Oct 3, 2019 at 5:26 PM Vlad Buslov <vladbu@mellanox.com> wrote:
>> >>
>> >>
>> >> On Thu 03 Oct 2019 at 02:14, John Hurley <john.hurley@netronome.com> wrote:
>> >> > Hi,
>> >> >
>> >> > Putting this out as an RFC built on net-next. It fixes some issues
>> >> > discovered in testing when using the TC API of OvS to generate flower
>> >> > rules and subsequently offloading them to HW. Rules seen contain the same
>> >> > match fields or may be rule modifications run as a delete plus an add.
>> >> > We're seeing race conditions whereby the rules present in kernel flower
>> >> > are out of sync with those offloaded. Note that there are some issues
>> >> > that will need to be fixed in the RFC before it becomes a patch such as
>> >> > potential races between releasing locks and re-taking them. However, I'm
>> >> > putting this out for comments or potential alternative solutions.
>> >> >
>> >> > The main cause of the races seems to be in the chain table of cls_api. If
>> >> > a tcf_proto is destroyed then it is removed from its chain. If a new
>> >> > filter is then added to the same chain with the same priority and protocol
>> >> > a new tcf_proto will be created - this may happen before the first is
>> >> > fully removed and the hw offload message sent to the driver. In cls_flower
>> >> > this means that the fl_ht_insert_unique() function can pass as its
>> >> > hashtable is associated with the tcf_proto. We are then in a position
>> >> > where the 'delete' and the 'add' are in a race to get offloaded. We also
>> >> > noticed that doing an offload add, then checking if a tcf_proto is
>> >> > concurrently deleting, then removing the offload if it is, can extend the
>> >> > out of order messages. Drivers do not expect to get duplicate rules.
>> >> > However, in the kernel TC datapath they are not duplicates so we can get out
>> >> > of sync here.
>> >> >
>> >> > The RFC fixes this by adding a pre_destroy hook to cls_api that is called
>> >> > when a tcf_proto is signaled to be destroyed but before it is removed from
>> >> > its chain (which is essentially the lock for allowing duplicates in
>> >> > flower). Flower then uses this new hook to send the hw delete messages
>> >> > from tcf_proto destroys, preventing them racing with duplicate adds. It
>> >> > also moves the check for 'deleting' to before sending the hw add
>> >> > message.
>> >> >
>> >> > John Hurley (2):
>> >> >   net: sched: add tp_op for pre_destroy
>> >> >   net: sched: fix tp destroy race conditions in flower
>> >> >
>> >> >  include/net/sch_generic.h |  3 +++
>> >> >  net/sched/cls_api.c       | 29 ++++++++++++++++++++++++-
>> >> >  net/sched/cls_flower.c    | 55 ++++++++++++++++++++++++++---------------------
>> >> >  3 files changed, 61 insertions(+), 26 deletions(-)
>> >>
>> >> Hi John,
>> >>
>> >> Thanks for working on this!
>> >>
>> >> Are there any other sources for race conditions described in this
>> >> letter? When you describe tcf_proto deletion you say "main cause" but
>> >> don't provide any others. If tcf_proto is the only problematic part,
>> >
>> > Hi Vlad,
>> > Thanks for the input.
>> > The tcf_proto deletion was the cause from the tests we ran. That's not
>> > to say there are not more I wasn't seeing in my analysis.
>> >
>> >> then it might be worth looking into alternative ways to force concurrent
>> >> users to wait for proto deletion/destruction to be properly finished.
>> >> Maybe having some table that maps chain id + prio to a completion would be
>> >> a simpler approach? With such infra tcf_proto_create() can wait for
>> >> previous proto with same prio and chain to be fully destroyed (including
>> >> offloads) before creating a new one.
>> >
>> > I think a problem with this is that the chain removal functions call
>> > tcf_proto_put() (which calls destroy when ref is 0) so, if other
>> > concurrent processes (like a dump) have references to the tcf_proto
>> > then we may not get the hw offload even by the time the chain deletion
>> > function has finished. We would need to make sure this was tracked -
>> > say after the tcf_proto_destroy function has completed.
>> > How would you suggest doing the wait? With a replay flag as happens in
>> > some other places?
>> >
>> > To me it seems the main problem is that the tcf_proto being in a chain
>> > almost acts like the lock to prevent duplicate filters getting to the
>> > driver. We need some mechanism to ensure a delete has made it to HW
>> > before we release this 'lock'.
>>
>> Maybe something like:
>
> Ok, I'll need to give this more thought.
> Initially it does sound like overkill.
>
>>
>> 1. Extend block with a hash table whose key is chain id and prio
>> combined and whose value is some structure that contains a struct
>> completion (completed in tcf_proto_destroy(), where we are sure that all
>> rules were removed from hw) and a reference counter.
>>
>
> Maybe it could live in each chain rather than block to be more fine grained?
> Or would this potentially cause a similar issue on deletion of chains?

IMO just having one per block is straightforward enough, unless there is
a reason to make it per chain.

>
>> 2. When cls API wants to delete a proto instance
>> (tcf_chain_tp_delete_empty(), chain flush, etc.), a new member is added to
>> the table from 1. with the chain+prio of the proto that is being deleted
>> (atomically with detaching the proto from the chain).
>>
>> 3. When inserting a new proto, verify that there is no corresponding
>> entry in the hash table with the same chain+prio. If there is, increment
>> the reference counter and wait for completion. Release the reference
>> when completed.
>
> How would the 'wait' work? Loop back via replay flag?

What do you mean by "loop back via replay flag"?
Anyway, I was suggesting using struct completion from completion.h,
which has the following functions in its API:

/**
 * wait_for_completion: - waits for completion of a task
 * @x:  holds the state of this particular completion
 *
 * This waits to be signaled for completion of a specific task. It is NOT
 * interruptible and there is no timeout.
 *
 * See also similar routines (i.e. wait_for_completion_timeout()) with timeout
 * and interrupt capability. Also see complete().
 */
void __sched wait_for_completion(struct completion *x)

/**
 * complete_all: - signals all threads waiting on this completion
 * @x:  holds the state of this particular completion
 *
 * This will wake up all threads waiting on this particular completion event.
 *
 * If this function wakes up a task, it executes a full memory barrier before
 * accessing the task state.
 *
 * Since complete_all() sets the completion of @x permanently to done
 * to allow multiple waiters to finish, a call to reinit_completion()
 * must be used on @x if @x is to be used again. The code must make
 * sure that all waiters have woken and finished before reinitializing
 * @x. Also note that the function completion_done() can not be used
 * to know if there are still waiters after complete_all() has been called.
 */
void complete_all(struct completion *x)
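
To illustrate the semantics, here is a rough userspace model of the two
calls above. It is only a sketch: the kernel's struct completion is
re-implemented here with pthreads so that it compiles outside the kernel,
and destroy_proto() and insert_after_destroy() are hypothetical stand-ins
for the destroy and insert paths, not existing cls_api functions.

```c
/* Userspace model only: names mirror the kernel completion API but are
 * re-implemented with pthreads so the sketch builds outside the kernel. */
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct completion {
	pthread_mutex_t lock;
	pthread_cond_t  cond;
	bool            done;
};

static void init_completion(struct completion *x)
{
	pthread_mutex_init(&x->lock, NULL);
	pthread_cond_init(&x->cond, NULL);
	x->done = false;
}

/* Block until complete_all() has been called; no timeout. */
static void wait_for_completion(struct completion *x)
{
	pthread_mutex_lock(&x->lock);
	while (!x->done)
		pthread_cond_wait(&x->cond, &x->lock);
	pthread_mutex_unlock(&x->lock);
}

/* Mark the completion done permanently and wake every waiter. */
static void complete_all(struct completion *x)
{
	pthread_mutex_lock(&x->lock);
	x->done = true;
	pthread_cond_broadcast(&x->cond);
	pthread_mutex_unlock(&x->lock);
}

/* Hypothetical flag standing in for "hw delete has reached the driver". */
static int hw_rules_removed;

/* Hypothetical destroy path: remove rules from hw, then release waiters. */
static void *destroy_proto(void *arg)
{
	struct completion *x = arg;

	hw_rules_removed = 1;	/* stands in for the hw delete message */
	complete_all(x);
	return NULL;
}

/* Hypothetical insert path: waits for the concurrent destroy to finish
 * and returns the hw state it observed after the wait. */
static int insert_after_destroy(void)
{
	struct completion done;
	pthread_t t;
	int seen;

	init_completion(&done);
	hw_rules_removed = 0;
	pthread_create(&t, NULL, destroy_proto, &done);
	wait_for_completion(&done);	/* new proto creation would block here */
	seen = hw_rules_removed;
	pthread_join(t, NULL);
	return seen;
}
```

The point is the ordering: the inserter cannot proceed until the destroy
path has signaled, so by then the hw delete has already been issued.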

> It feels a bit like we are adding a lot more complexity to this and
> almost hacking something in to work around a (relatively) newly
> introduced problem.

I'm not insisting on any particular approach, just suggesting something
which in my opinion is easier to implement than reshuffling locking in
flower and directly targets the problem you described in the cover letter:
new filters can be inserted while concurrent destruction of a tp with the
same chain id and prio is in progress.
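
As a very rough illustration of the bookkeeping side of that idea, the
sketch below models a per-block table keyed by chain id and prio. All
names (block_pending, pending_add() and so on) are hypothetical, the
fixed-size array stands in for a real hash table, and locking, the
reference counter and the completion itself are left out, so it only shows
how an inserter could detect that it has to wait.

```c
#include <stddef.h>
#include <stdint.h>

#define PENDING_SLOTS 16	/* arbitrary size for the sketch */

/* Hypothetical entry: one per tp that is detached but not yet destroyed. */
struct pending_destroy {
	uint64_t key;		/* chain id and prio combined */
	int      in_use;
};

/* Hypothetical per-block table; a real version would need locking. */
struct block_pending {
	struct pending_destroy slots[PENDING_SLOTS];
};

/* Combine chain id and prio into a single lookup key. */
static uint64_t pending_key(uint32_t chain, uint32_t prio)
{
	return ((uint64_t)chain << 32) | prio;
}

static struct pending_destroy *
pending_lookup(struct block_pending *b, uint32_t chain, uint32_t prio)
{
	uint64_t key = pending_key(chain, prio);
	size_t i;

	for (i = 0; i < PENDING_SLOTS; i++)
		if (b->slots[i].in_use && b->slots[i].key == key)
			return &b->slots[i];
	return NULL;
}

/* Record a tp that is being deleted. Returns 0 on success, -1 if full. */
static int pending_add(struct block_pending *b, uint32_t chain, uint32_t prio)
{
	size_t i;

	for (i = 0; i < PENDING_SLOTS; i++) {
		if (!b->slots[i].in_use) {
			b->slots[i].key = pending_key(chain, prio);
			b->slots[i].in_use = 1;
			return 0;
		}
	}
	return -1;
}

/* Drop the entry once destruction (including hw removal) has finished. */
static void pending_remove(struct block_pending *b, uint32_t chain,
			   uint32_t prio)
{
	struct pending_destroy *p = pending_lookup(b, chain, prio);

	if (p)
		p->in_use = 0;
}

/* Nonzero if inserting a proto with this chain+prio would have to wait. */
static int must_wait(struct block_pending *b, uint32_t chain, uint32_t prio)
{
	return pending_lookup(b, chain, prio) != NULL;
}
```

Only an insert with the same chain id and prio as a tp under destruction
is held back; other inserts on the block are not affected.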

>
>>
>> >
>> >>
>> >> Regards,
>> >> Vlad