From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-2.6 required=3.0 tests=DKIM_SIGNED,DKIM_VALID, DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_PASS,USER_AGENT_NEOMUTT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id A04BEC00449 for ; Wed, 3 Oct 2018 17:05:27 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 50A872098A for ; Wed, 3 Oct 2018 17:05:27 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="PrnCeOT+" DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 50A872098A Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=gmail.com Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727234AbeJCXyk (ORCPT ); Wed, 3 Oct 2018 19:54:40 -0400 Received: from mail-pg1-f195.google.com ([209.85.215.195]:38311 "EHLO mail-pg1-f195.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726847AbeJCXyk (ORCPT ); Wed, 3 Oct 2018 19:54:40 -0400 Received: by mail-pg1-f195.google.com with SMTP id r77-v6so1851704pgr.5; Wed, 03 Oct 2018 10:05:25 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to:user-agent; bh=nIy8VN3aC0hVK/rpqOLNZuCzP9jGevjG6bDpsCC2T+U=; b=PrnCeOT+odfUd2W5o7DtOLTUaGDeraxNrzF3Rd3ovnJBQLr5GT4o7CGShB8RHrFT/c WcNDRFKrl3Ngqi2wFG+F6ytL1ofSN6Mlf5EXcJKqbh4tFlnfYu9ZuZDWTdloDBW30jgc YsLFUJST+tDE6ZvXdXSMIh34VXemRfX9gLmZAbE7tE74e749BJEO30Shm/wupDQ41oIa R3QgeRIrbf8wtxSgcv3pIdA+hcyz2O52vhWK6bCeziaR6JubMGd47EN37240NUQ/csib 4AqM3WFtupgXnw0ZJxIiDw6qVsisRnW5hEHAn26hdsd99nSWCGkW0jcCcO+jB9SPLPSY EN0Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to:user-agent; bh=nIy8VN3aC0hVK/rpqOLNZuCzP9jGevjG6bDpsCC2T+U=; b=A2iqbF02B01q7ddv6oU02bvaLlvcA+f1nbDLr8pT+jUfnrGywsRGtP0Rdu2s+qr+g0 68Wj0K1nlRq264W/C52z1oHG4liUacpjIVDSWbaA/puYLnRdgT6CD8/jqKYLXyRW7yVp hIVRwk4ChCbv4AMbPNQ2fTRi9oZqspO7DDpYXAEvBr63iw4UDKahJr4cu27g05B9qide uu4UDJ+q/HD99ctWZ7VM7qNxWP8RAR+5aX6/Mrq0mydxphwSAONZc5erRzmqTHQImOSO aCew7hAfUNEeRMkzLUgQ7CRJUdOSCT9TL70v4ADI1VcodnfOWTpEsaSXUZr1VPIsx0hw ifzA== X-Gm-Message-State: ABuFfoi6Vdgzm/gYWSKwvIQG2RMR09Nbpp0tgUc6DhDFDozfTFlNGivI obVSjEkCSscRhU8MVha8uC8b8qlx X-Google-Smtp-Source: ACcGV63NlYlf1y2tiCWQPRmzXnYBjWguWbo2yyRICSuXZsmtVISxVLuk46tXeuFP4kB+900/cS78mg== X-Received: by 2002:a63:5d03:: with SMTP id r3-v6mr2192525pgb.445.1538586324721; Wed, 03 Oct 2018 10:05:24 -0700 (PDT) Received: from ast-mbp.dhcp.thefacebook.com ([2620:10d:c090:180::1:4f27]) by smtp.gmail.com with ESMTPSA id e134-v6sm3505772pfh.26.2018.10.03.10.05.23 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 03 Oct 2018 10:05:23 -0700 (PDT) Date: Wed, 3 Oct 2018 10:05:22 -0700 From: Alexei Starovoitov To: Jann Horn Cc: Alexei Starovoitov , Daniel Borkmann , Network Development , kernel list , Michael Kerrisk-manpages , linux-man Subject: Re: BPF: RCU use-after-reallocation of hash table elements? Message-ID: <20181003170520.gc43zeffylshz5k2@ast-mbp.dhcp.thefacebook.com> References: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: NeoMutt/20180223 Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Oct 03, 2018 at 06:15:08PM +0200, Jann Horn wrote: > Hi! > > Note: I haven't tested any of this; feel free to tell me that I've > completely misunderstood how all this works. > > The BPF manpage, at the moment, states about BPF hash tables: unfortunately man pages have been obsolete for long time. the only doc that is mostly up to date is http://docs.cilium.io/en/latest/bpf/ > Unless a BPF hash table is created with the (undocumented) flag > BPF_F_NO_PREALLOC, the kernel now actually pre-allocates the hash > table elements. Hash table elements can be freed and reused for new > allocations (!) without waiting for an RCU grace period: Freed > elements are immediately pushed on the percpu freelist, and can be > immediately reused from there. The most obvious consequence of this is > that if a BPF program looks up a hash table entry and then reads the > value, the value can be replaced with a new value in between. A more > subtle consequence is that BPF map lookups can return false-positive > results: If the first half of the lookup key matches the old key, and > the second half of the lookup key matches the new key, then a BPF map > lookup can return a false-positive result, as far as I can tell. Correct. That is the case when one cpu deletes the pre-allocated element and immediately reuses the slot in update_elem while another cpu is still looking into it. > If what I'm saying is correct, I'm not sure what the best fix is. > > Add a grace period when freeing hash map entries, and add a new -EBUSY > return value for attempts to create hash map entries when all free > entries are waiting for the end of an RCU grace period? > > Add a grace period when freeing hash map entries, and use > rcu_synchronize() when inserting BPF hashmap entries from userspace > and all free entries are waiting for RCU? But that still leaves the > bpf_map_update_elem_proto helper that can be called from BPF. > Deprecate that helper for access to hash maps? > > Document the race, and advise people who use BPF for > non-performance-tracing purposes (where occasional false positives > actually matter) to use BPF_F_NO_PREALLOC? Definitely our fault that it's not documented clearly. There are subtle issues with kmalloc style as well. See free_htab_elem(): atomic_dec(&htab->count); l->htab = htab; call_rcu(&l->rcu, htab_elem_free_rcu); which means that for short time the total memory consumption of the no_prealloc hash map can be higher than max_entries limit while rcu callbacks are working their way through. For bpf prog that is invoked million times a second and constantly doing delete/update the rcu callback latency can be an issue. Hence the use of no_prealloc needs to be carefully considered as well. we tried to atomic_dec in rcu callback, but that caused even more issues, since already written bpf progs expected max_entries limit to be accurate. > Add some sort of sequence lock to BPF (yuck)? why yuck reaction? ;) Locks are mandatory primitives in concurrent programming and the lock support is coming to bpf as well. bpf progs themselves will be able to do lock/unlock of objects to maintain atomicity of multi field updates. Right now updating two counters atomically is pretty much impossible. hash map update can replace the element with a copy, but that's not usable for counters. Soon the program will be able to do: struct value_t { u32 var1, var2; bpf_lock_t lock; }; struct value_t * value = bpf_map_lookup_elem(map, &key); bpf_lock(&value->lock); value->var1++; value->var2++; bpf_unlock(&value->lock); and verifier will be smart enough to make sure locks don't escape. Also alloc/free will be available to bpf progs and new program types won't have to run as one giant rcu_lock+preempt_disable section. Progs will be able to do rcu_lock/unlock and preempt_enable/disable from inside the program.