From mboxrd@z Thu Jan 1 00:00:00 1970
From: Mauricio Vasquez
Subject: Re: [RFC PATCH bpf-next v2 0/4] Implement bpf queue/stack maps
Date: Fri, 7 Sep 2018 15:40:57 -0500
Message-ID: <4b3edda0-16ba-8689-e5ff-ef2bdfb9316b@polito.it>
References: <153575074884.30050.17670029209466860207.stgit@kernel>
 <20180907001317.fj7f6fg6ihljompp@ast-mbp.dhcp.thefacebook.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Cc: Alexei Starovoitov, Daniel Borkmann, netdev@vger.kernel.org,
 joe@wand.net.nz
To: Alexei Starovoitov
Return-path:
Received: from fm2nodo5.polito.it ([130.192.180.19]:45064 "EHLO
 fm2nodo5.polito.it" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
 with ESMTP id S1725986AbeIHBYD (ORCPT ); Fri, 7 Sep 2018 21:24:03 -0400
In-Reply-To: <20180907001317.fj7f6fg6ihljompp@ast-mbp.dhcp.thefacebook.com>
Content-Language: en-US
Sender: netdev-owner@vger.kernel.org
List-ID:

On 09/06/2018 07:13 PM, Alexei Starovoitov wrote:
> On Fri, Aug 31, 2018 at 11:25:48PM +0200, Mauricio Vasquez B wrote:
>> In some applications it is necessary to have a pool of free elements, like for
>> example the list of free L4 ports in a SNAT. None of the current maps allow
>> this, as it is not possible to get an element without having the key
>> it is associated with.
>>
>> This patchset implements two new kinds of eBPF maps: queue and stack.
>> Those maps provide eBPF programs with the peek, push and pop operations, and for
>> userspace applications a new bpf_map_lookup_and_delete_elem() is added.
>>
>> Signed-off-by: Mauricio Vasquez B
>>
>> ---
>>
>> I am sending this as an RFC because there is still an issue I am not sure how
>> to solve.
>>
>> The queue/stack maps have a linked list for saving the nodes, and a
>> preallocation scheme based on the pcpu_freelist already implemented and used
>> in the htabmap. Each time an element is pushed into the map, a node from the
>> pcpu_freelist is taken and then added to the linked list.
>>
>> The pop operation takes and *removes* the first node from the linked list, then
>> it uses *call_rcu* to postpone freeing the node, i.e., the node is only returned
>> to the pcpu_freelist when the rcu callback is executed. This is needed because
>> an element returned by the pop() operation should remain valid for the whole
>> duration of the eBPF program.
>>
>> The problem is that elements are not immediately returned to the free list, so
>> in some cases the push operation could fail because there are no free nodes
>> in the pcpu_freelist.
>>
>> The following code snippet exposes that problem.
>>
>> 	...
>> 	/* Push MAP_SIZE elements */
>> 	for (i = 0; i < MAP_SIZE; i++)
>> 		assert(bpf_map_update_elem(fd, NULL, &vals[i], 0) == 0);
>>
>> 	/* Pop all elements */
>> 	for (i = 0; i < MAP_SIZE; i++)
>> 		assert(bpf_map_lookup_and_delete_elem(fd, NULL, &val) == 0 &&
>> 		       val == vals[i]);
>>
>> 	// sleep(1) <-- If I put this sleep, everything works.
>> 	/* Push MAP_SIZE elements */
>> 	for (i = 0; i < MAP_SIZE; i++)
>> 		assert(bpf_map_update_elem(fd, NULL, &vals[i], 0) == 0);
>> 		^^^
>> 	This fails because there are no available elements in pcpu_freelist
>> 	...
>>
>> I think a possible solution is to oversize the pcpu_freelist (no idea by how
>> much, maybe double it, or make it 1.5 times the max elements in the map?)
>> I also have concerns about it: it would waste that memory in many cases, and
>> it probably doesn't solve the issue either, because that code snippet
>> is pushing and popping elements too fast, so even if the pcpu_freelist is much
>> larger, at a certain time instant all the elements could be in use.
>>
>> Is this really an important issue?
>> Any idea of how to solve it?
> It is an important issue indeed and a difficult one to solve.
> We have the same issue with hash map.
> If the prog is doing:
>   value = lookup(key);
>   delete(key);
>   // here the prog shouldn't be accessing the value anymore, since the memory
>   // could have been reused, but value pointer is still valid and points to
>   // allocated memory

Just to note that for the queue map it is a little bit worse, because
there isn't a way to mark an element to be reused, hence in some cases
the pool of free elements could be exhausted.

> bpf_map_pop_elem() is trying to do lookup_and_delete and preserve
> validity of value without races.
> With pcpu_freelist I don't think there is a solution.
> We can have this queue/stack map without prealloc and use kmalloc/kfree
> back and forth. Performance will not be as great, but for your use case,
> I suspect, it will be good enough.

I agree. For our use case we are not that worried about performance:
it is still in the dataplane, but let's say it is not in the "hot" path.

> The key issue with kmalloc/kfree is unbounded time of rcu callbacks.
> If somebody starts doing push/pop for every packet, the rcu subsystem
> will struggle and nothing we can do about it.
>
> The only way I could think of to resolve this problem is to reuse
> the logic that Joe is working on for socket lookups inside the program.
> Joe,
> how is that going? Could you repost the latest patches?
>
> In such case the api for stack map will look like:
>
>   elem = bpf_map_pop_elem(stack);
>   // access elem
>   bpf_map_free_elem(elem);
>   // here prog is not allowed to access elem and verifier will catch that
>
>   elem = bpf_map_alloc_elem(stack);
>   // populate elem
>   bpf_map_push_elem(elem);
>   // here prog is not allowed to access elem and verifier will catch that
>
> Then both pre-allocated elems and kmalloc/kfree will work fine
> and no unbounded rcu issues in both cases.
>
>

I read Joe's proposal, and using it for this problem looks like a nice
solution.
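
To make the alloc/push and pop/free flow quoted above a bit more concrete,
here is a rough user-space model of it in plain C. It is only a sketch under
assumptions: the stack_*() names, the struct layout and the single free list
are hypothetical stand-ins (not the proposed BPF helpers and not
pcpu_freelist), and nothing here enforces the "no access after free/push"
rule that the verifier would enforce in a real program. What it does show is
that, once the program hands the element back explicitly, the node returns to
the pool right away, so the push / pop / push sequence from the cover letter
no longer runs out of nodes.

```c
#include <assert.h>
#include <stddef.h>

#define MAP_SIZE 4 /* hypothetical capacity, mirrors MAP_SIZE in the snippet */

struct elem {
	struct elem *next;
	int value;
};

struct stack_map {
	struct elem nodes[MAP_SIZE]; /* preallocated storage */
	struct elem *free_list;      /* nodes available for allocation */
	struct elem *head;           /* top of the stack */
};

static void stack_init(struct stack_map *m)
{
	m->free_list = NULL;
	m->head = NULL;
	for (size_t i = 0; i < MAP_SIZE; i++) {
		m->nodes[i].next = m->free_list;
		m->free_list = &m->nodes[i];
	}
}

/* Take a node from the free list; NULL when the pool is exhausted. */
static struct elem *stack_alloc_elem(struct stack_map *m)
{
	struct elem *e = m->free_list;

	if (e)
		m->free_list = e->next;
	return e;
}

/* Hand ownership of a populated node to the map. */
static void stack_push_elem(struct stack_map *m, struct elem *e)
{
	e->next = m->head;
	m->head = e;
}

/* Detach the top node; the caller owns it until stack_free_elem(). */
static struct elem *stack_pop_elem(struct stack_map *m)
{
	struct elem *e = m->head;

	if (e)
		m->head = e->next;
	return e;
}

/* Return a node to the free list immediately, with no deferred callback. */
static void stack_free_elem(struct stack_map *m, struct elem *e)
{
	assert(e != NULL);
	e->next = m->free_list;
	m->free_list = e;
}

/* Push MAP_SIZE values, pop them all, push again; returns 0 on success. */
int push_pop_push(void)
{
	struct stack_map m;
	struct elem *e;
	int i;

	stack_init(&m);
	for (i = 0; i < MAP_SIZE; i++) {
		e = stack_alloc_elem(&m);
		if (!e)
			return -1;
		e->value = i;
		stack_push_elem(&m, e);
	}
	if (stack_alloc_elem(&m) != NULL) /* the pool must be empty here */
		return -1;
	for (i = MAP_SIZE - 1; i >= 0; i--) {
		e = stack_pop_elem(&m);
		if (!e || e->value != i)
			return -1;
		stack_free_elem(&m, e); /* node is reusable from here on */
	}
	for (i = 0; i < MAP_SIZE; i++) {
		e = stack_alloc_elem(&m);
		if (!e)
			return -1;
		e->value = i;
		stack_push_elem(&m, e);
	}
	return 0;
}
```

In this model the second round of pushes succeeds without any sleep, because
the pop path never parks nodes behind a grace period.
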
I think a good trade-off for now would be to go ahead with a queue/stack
map without preallocation support (or maybe include it, always keeping in
mind that this issue has to be solved in the near future) and then, as
separate work, try to use Joe's proposal in the map helpers.

What do you think?

Thanks,
Mauricio.
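
As a rough illustration of the non-preallocated variant, the following
user-space C sketch models a stack whose nodes are allocated on push and
whose popped nodes are parked until a simulated grace period ends. All names
(dyn_push, dyn_pop, dyn_reclaim, the pending list) are hypothetical;
malloc/free stand in for kmalloc/kfree and dyn_reclaim() stands in for the
rcu callback. The capacity check counts live elements rather than free
nodes, so a push can succeed while popped nodes are still waiting to be
freed.

```c
#include <assert.h>
#include <stdlib.h>

struct node {
	struct node *next;
	int value;
};

struct dyn_stack {
	struct node *head;        /* live elements */
	struct node *pending;     /* popped, waiting for the simulated grace period */
	unsigned int count;       /* number of live elements */
	unsigned int max_entries; /* capacity limit on live elements */
};

/* Allocate a node on demand; fails only when the map is full or malloc fails. */
int dyn_push(struct dyn_stack *s, int value)
{
	struct node *n;

	assert(s != NULL);
	if (s->count >= s->max_entries)
		return -1;
	n = malloc(sizeof(*n));
	if (!n)
		return -1;
	n->value = value;
	n->next = s->head;
	s->head = n;
	s->count++;
	return 0;
}

/* Detach the top node and park it, so the pointer stays valid until reclaim. */
struct node *dyn_pop(struct dyn_stack *s)
{
	struct node *n = s->head;

	if (!n)
		return NULL;
	s->head = n->next;
	s->count--;
	n->next = s->pending;
	s->pending = n;
	return n;
}

/* Stand-in for the rcu callback: free all parked nodes, return how many. */
unsigned int dyn_reclaim(struct dyn_stack *s)
{
	unsigned int freed = 0;

	while (s->pending) {
		struct node *n = s->pending;

		s->pending = n->next;
		free(n);
		freed++;
	}
	return freed;
}

/* Fill, drain, refill without reclaiming; returns the parked count or -1. */
int dyn_demo(void)
{
	struct dyn_stack s = { NULL, NULL, 0, 3 };
	unsigned int i;
	int parked;

	for (i = 0; i < 3; i++)
		if (dyn_push(&s, (int)i) != 0)
			return -1;
	if (dyn_push(&s, 99) == 0) /* the map is full, this must fail */
		return -1;
	for (i = 0; i < 3; i++)
		if (!dyn_pop(&s))
			return -1;
	for (i = 0; i < 3; i++) /* succeeds while three nodes are still parked */
		if (dyn_push(&s, (int)i) != 0)
			return -1;
	parked = (int)dyn_reclaim(&s);
	while (dyn_pop(&s)) /* drain and free the remaining live nodes */
		;
	dyn_reclaim(&s);
	return parked;
}
```

The pending list is also where the cost shows up: if pops arrive faster than
reclaim runs, it keeps growing, which is the unbounded rcu callback concern
raised earlier in the thread.
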