From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [RFC Patch bpf-next] bpf: introduce bpf timer
To: Joe Stringer, Cong Wang
Cc: Alexei Starovoitov, Linux Kernel Network Developers, bpf,
 Xiongchun Duan, Dongdong Wang, Muchun Song, Cong Wang,
 Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
 Martin KaFai Lau, Song Liu, Yonghong Song, Pedro Tammela
References: <20210402192823.bqwgipmky3xsucs5@ast-mbp>
 <20210402234500.by3wigegeluy5w7j@ast-mbp>
 <20210412230151.763nqvaadrrg77kd@ast-mbp.dhcp.thefacebook.com>
 <20210427020159.hhgyfkjhzjk3lxgs@ast-mbp.dhcp.thefacebook.com>
From: Jamal Hadi Salim
Date: Wed, 12 May 2021 18:43:53 -0400
X-Mailing-List: bpf@vger.kernel.org

On 2021-05-11 1:05 a.m., Joe Stringer wrote:
> Hi Cong,
>
>> and let me quote the original report here:
>>
>> "The current implementation (as of v1.2) for managing the contents of
>> the datapath connection tracking map leaves something to be desired:
>> Once per minute, the userspace cilium-agent makes a series of calls to
>> the bpf() syscall to fetch all of the entries in the map to determine
>> whether they should be deleted. For each entry in the map, 2-3 calls
>> must be made: One to fetch the next key, one to fetch the value, and
>> perhaps one to delete the entry. The maximum size of the map is 1
>> million entries, and if the current count approaches this size then
>> the garbage collection goroutine may spend a significant number of CPU
>> cycles iterating and deleting elements from the conntrack map."
>
> I'm also curious to hear more details as I haven't seen any recent
> discussion in the common Cilium community channels (GitHub / Slack)
> around deficiencies in the conntrack garbage collection since we
> addressed the LRU issues upstream and switched back to LRU maps.

For our use case we can't use LRU. We need to account for every entry,
i.e. we don't want an entry to be garbage collected without our
consent; we want to control the GC. Your PR pointed to LRU deleting
some TCP flow entries that were just idling, for example.

> There's an update to the report quoted from the same link above:
>
> "In recent releases, we've moved back to LRU for management of the CT
> maps so the core problem is not as bad; furthermore we have
> implemented a backoff for GC depending on the size and number of
> entries in the conntrack table, so that in active environments the
> userspace GC is frequent enough to prevent issues but in relatively
> passive environments the userspace GC is only rarely run (to minimize
> CPU impact)."
>
> By "core problem is not as bad", I would have been referring to the
> way that failing to garbage collect hashtables in a timely manner can
> lead to rejecting new connections due to lack of available map space.
> Switching back to LRU mitigated this concern. With a reduced frequency
> of running the garbage collection logic, the CPU impact is lower as
> well. I don't think we've explored batched map ops for this use case
> yet either, which would already serve to improve the CPU usage
> situation without extending the kernel.
>

Will run some tests tomorrow to see the effect of batching vs. no
batching and capture the cost of syscalls and CPU.

Note: even then, it is not a good general solution. Our entries can go
as high as 16M.
Our workflow is: 1) every 1-5 seconds you dump, 2) process for
what needs to be deleted etc., then do updates (another 1-3 seconds
worth of time).
There is a point, depending on the number of entries, where your time
cost of processing exceeds your polling period. The likelihood of entry
state loss is high for even a 1/2 sec loss of sync.

> The main outstanding issue I'm aware of is that we will often have a
> 1:1 mapping of entries in the CT map and the NAT map, and ideally we'd
> like them to have tied fates but currently we have no mechanism to do
> this with LRU. When LRU eviction occurs, the entries can get out of
> sync until the next GC.

Yes, this ties in with our use case as well (not NAT for us, but a
semantically similar challenge). It goes the other way too: if
userspace decides to adjust your NAT table, you need to purge the
related entries from the cache.

> I could imagine timers helping with this if we

Yes, timers would solve this.

I am not even arguing that we need timers to solve these issues. I am
just saying it seems timers are fundamental infra that is needed
even outside the scope of this.

cheers,
jamal
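
A back-of-envelope sketch of the scaling argument in this thread, using
only the figures quoted above (2-3 bpf() calls per entry, 1M and 16M
entries, a 1-5 second polling period). The per-call cost of 1
microsecond and every function name below are assumptions made for
illustration, not measurements and not part of the patch. The model
counts syscall time only and ignores the userspace processing step.

```python
# Rough cost model for a userspace GC pass over a BPF map.
# PER_CALL_NS is an ASSUMED placeholder cost, not a measured value.
# All arithmetic is done in integer nanoseconds to avoid float rounding.

SECOND_NS = 1_000_000_000
PER_CALL_NS = 1_000  # assumption: ~1 microsecond per bpf() syscall


def syscalls_per_pass(entries, calls_per_entry):
    """Total bpf() calls needed to walk every entry once."""
    return entries * calls_per_entry


def pass_ns(entries, calls_per_entry, per_call_ns=PER_CALL_NS):
    """Estimated duration of one full walk, syscall cost only."""
    return syscalls_per_pass(entries, calls_per_entry) * per_call_ns


def max_entries_within(period_ns, calls_per_entry, per_call_ns=PER_CALL_NS):
    """Largest map size whose walk still fits in one polling period."""
    return period_ns // (calls_per_entry * per_call_ns)


def overruns(entries, calls_per_entry, period_ns, per_call_ns=PER_CALL_NS):
    """True when one walk takes longer than the polling period."""
    return pass_ns(entries, calls_per_entry, per_call_ns) > period_ns


if __name__ == "__main__":
    for entries in (1_000_000, 16_000_000):
        for calls in (2, 3):
            seconds = pass_ns(entries, calls) / SECOND_NS
            late = overruns(entries, calls, 5 * SECOND_NS)
            print(entries, calls, syscalls_per_pass(entries, calls), seconds, late)
```

Under the assumed per-call cost, a 16M-entry map at 3 calls per entry
needs 48M syscalls per pass, which already exceeds a 5 second polling
period before any processing is counted. Batched map operations would
lower the call count per entry, which is what the planned batch vs.
no-batch test should show; the real per-call cost has to come from
that measurement.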