From: Andrii Nakryiko
Date: Thu, 28 Apr 2022 11:19:01 -0700
Subject: Re: [PATCH bpf-next 2/5] bpf: implement sleepable uprobes by chaining tasks and normal rcu
To: Delyan Kratunov
Cc: daniel@iogearbox.net, ast@kernel.org, andrii@kernel.org, bpf@vger.kernel.org, paulmck@kernel.org
In-Reply-To: <972caeb1e9338721bb719b118e0e40705f860f50.1651103126.git.delyank@fb.com>

On Thu, Apr 28, 2022 at 9:54 AM Delyan Kratunov wrote:
>
> uprobes work by raising a trap, setting a task flag from within the
> interrupt handler, and processing the actual work for the uprobe on the
> way back to userspace. As a result, uprobe handlers already execute in a
> user context. The primary obstacle to sleepable bpf uprobe programs is
> therefore on the bpf side.
>
> Namely, the bpf_prog_array attached to the uprobe is protected by normal
> rcu and runs with disabled preemption. In order for uprobe bpf programs
> to become actually sleepable, we need it to be protected by the
> tasks_trace rcu flavor instead (and kfree() called after a corresponding
> grace period).
>
> One way to achieve this is by tracking an
> array-has-contained-sleepable-prog flag in bpf_prog_array and switching
> rcu flavors based on it. However, this is deemed somewhat unwieldy and
> the rcu flavor transition would be hard to reason about.
>
> Instead, based on Alexei's proposal, we change the free path for
> bpf_prog_array to chain tasks_trace and normal grace periods one after
> the other. Users who iterate under a tasks_trace read section would be
> safe, as would users who iterate under normal read sections (from
> non-sleepable locations). The downside is that we take the tasks_trace
> latency unconditionally, but that's deemed acceptable under expected
> workloads.

One example where this can actually become a problem is cgroup BPF
programs. There you can make a single attachment to the root cgroup, but
it will create one "effective" prog_array for each descendant (and will
keep destroying and creating them as child cgroups are created). So
there is an invisible multiplier here which we can't really control.

So paying the (however small) price of call_rcu_tasks_trace() in
bpf_prog_array_free(), which is used for purely non-sleepable cases,
seems unfortunate. But I think an alternative to tracking this "has
sleepable" bit on a per-bpf_prog_array basis is to have a
bpf_prog_array_free_sleepable() implementation that is independent of
bpf_prog_array_free() itself, and to call that sleepable variant from
the uprobe detach handler (see the sketch at the end of this mail). That
would limit the impact to things that actually might be running as
sleepable and which most likely won't churn through a huge number of
arrays. WDYT?

Otherwise everything looks good and is surprisingly straightforward,
thanks to the fact that uprobes already run in a sleepable context.
Awesome!

>
> The other interesting implication is wrt non-sleepable uprobe programs.
> Because they need access to dynamically sized rcu-protected maps, we
> conditionally disable preemption and take an rcu read section around
> them, in addition to the overarching tasks_trace section.
>
> Signed-off-by: Delyan Kratunov
> ---
>  include/linux/bpf.h          | 60 ++++++++++++++++++++++++++++++++++++
>  include/linux/trace_events.h |  1 +
>  kernel/bpf/core.c            | 10 +++++-
>  kernel/trace/bpf_trace.c     | 23 ++++++++++++++
>  kernel/trace/trace_uprobe.c  |  4 +--
>  5 files changed, 94 insertions(+), 4 deletions(-)

[...]
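
P.S. To make the free-path suggestion above concrete, here is a rough,
untested sketch of what a standalone sleepable variant could look like,
chaining a tasks_trace grace period into a normal one. It assumes
struct bpf_prog_array keeps the embedded rcu_head that this patch adds;
names are just for illustration:

/* Sketch only, not the actual patch. */
static void __bpf_prog_array_free_sleepable_cb(struct rcu_head *rcu)
{
	struct bpf_prog_array *progs;

	progs = container_of(rcu, struct bpf_prog_array, rcu);
	/* the tasks_trace grace period has elapsed; now chain a normal
	 * RCU grace period before the actual kfree()
	 */
	kfree_rcu(progs, rcu);
}

void bpf_prog_array_free_sleepable(struct bpf_prog_array *progs)
{
	/* the real thing would also need to skip the shared empty
	 * prog_array dummy, like bpf_prog_array_free() does
	 */
	if (!progs)
		return;
	call_rcu_tasks_trace(&progs->rcu, __bpf_prog_array_free_sleepable_cb);
}

This way bpf_prog_array_free() stays exactly as it is today, and only
the uprobe detach path pays the extra tasks_trace latency.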
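
And for the non-sleepable uprobe programs quoted above, my reading of
the commit message is that the run path would look roughly like the
simplified, untested sketch below (bpf_prog_run_array_sleepable is an
illustrative name, run_ctx/cookie handling is omitted, and the patch's
actual helper may well differ):

static __always_inline u32
bpf_prog_run_array_sleepable(const struct bpf_prog_array __rcu *array_rcu,
			     const void *ctx, bpf_prog_run_fn run_prog)
{
	const struct bpf_prog_array_item *item;
	const struct bpf_prog *prog;
	const struct bpf_prog_array *array;
	u32 ret = 1;

	might_fault();

	/* the overarching tasks_trace section protects the array itself */
	rcu_read_lock_trace();
	migrate_disable();

	array = rcu_dereference_check(array_rcu, rcu_read_lock_trace_held());
	if (unlikely(!array))
		goto out;
	item = &array->items[0];
	while ((prog = READ_ONCE(item->prog))) {
		/* non-sleepable programs additionally need a normal rcu
		 * read section around them, for access to dynamically
		 * sized rcu-protected maps
		 */
		if (!prog->aux->sleepable)
			rcu_read_lock();

		ret &= run_prog(prog, ctx);

		if (!prog->aux->sleepable)
			rcu_read_unlock();
		item++;
	}
out:
	migrate_enable();
	rcu_read_unlock_trace();
	return ret;
}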