All of lore.kernel.org
 help / color / mirror / Atom feed
From: Daniel Borkmann <daniel@iogearbox.net>
To: Cong Wang <xiyou.wangcong@gmail.com>, netdev@vger.kernel.org
Cc: bpf@vger.kernel.org, jhs@mojatatu.com, andrii@kernel.org,
	ast@kernel.org, kafai@fb.com, Cong Wang <cong.wang@bytedance.com>,
	Andrii Nakryiko <andrii.nakryiko@gmail.com>,
	Dongdong Wang <wangdongdong.6@bytedance.com>
Subject: Re: [Patch bpf-next v5 1/3] bpf: introduce timeout hash map
Date: Tue, 26 Jan 2021 23:04:57 +0100	[thread overview]
Message-ID: <d69d44ca-206c-d818-1177-c8f14d8be8d1@iogearbox.net> (raw)
In-Reply-To: <20210122205415.113822-2-xiyou.wangcong@gmail.com>

On 1/22/21 9:54 PM, Cong Wang wrote:
> From: Cong Wang <cong.wang@bytedance.com>
> 
> This borrows the idea from conntrack and will be used for conntrack in
> ebpf too. Each element in a timeout map has a user-specified timeout
> in msecs, after it expires it will be automatically removed from the
> map. Cilium already does the same thing, it uses a regular map or LRU
> map to track connections and has its own GC in user-space. This does
> not scale well when we have millions of connections, as each removal
> needs a syscall. Even if we could batch the operations, it still needs
> to copy a lot of data between kernel and user space.
> 
> There are two cases to consider here:
> 
> 1. When the timeout map is idle, i.e. no one updates or accesses it,
>     we rely on the delayed work to scan the whole hash table and remove
>     these expired. The delayed work is scheduled every 1 sec when idle,
>     which is also what conntrack uses. It is fine to scan the whole
>     table as we do not actually remove elements during this scan,
>     instead we simply queue them to the lockless list and defer all the
>     removals to the next schedule.
> 
> 2. When the timeout map is actively accessed, we could reach expired
>     elements before the idle work automatically scans them, we can
>     simply skip them and schedule the delayed work immediately to do
>     the actual removals. We have to avoid taking locks on fast path.
> 
> The timeout of an element can be set or updated via bpf_map_update_elem()
> and we reuse the upper 32-bit of the 64-bit flag for the timeout value,
> as there are only a few bits are used currently. Note, a zero timeout
> means to expire immediately.
> 
> To avoid adding memory overhead to regular map, we have to reuse some
> field in struct htab_elem, that is, lru_node. Otherwise we would have
> to rewrite a lot of code.
> 
> For now, batch ops is not supported, we can add it later if needed.

Back in earlier conversation [0], I mentioned also LRU map flavors and to look
into adding a flag, so we wouldn't need new BPF_MAP_TYPE_TIMEOUT_HASH/*LRU types
that replicate existing types once again just with the timeout in addition, so
UAPI wise new map type is not great.

Given you mention Cilium above, only for kernels where there is no LRU hash map,
that is < 4.10, we rely on plain hash, everything else LRU + prealloc to mitigate
ddos by refusing to add new entries when full whereas less active ones will be
purged instead. Timeout /only/ for plain hash is less useful overall, did you
sketch a more generic approach in the meantime that would work for all the htab/lru
flavors (and ideally as non-delayed_work based)?

   [0] https://lore.kernel.org/bpf/20201214201118.148126-1-xiyou.wangcong@gmail.com/

[...]
> @@ -1012,6 +1081,8 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
>   			copy_map_value_locked(map,
>   					      l_old->key + round_up(key_size, 8),
>   					      value, false);
> +			if (timeout_map)
> +				l_old->expires = msecs_to_expire(timeout);
>   			return 0;
>   		}
>   		/* fall through, grab the bucket lock and lookup again.
> @@ -1020,6 +1091,7 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
>   		 */
>   	}
>   
> +again:
>   	ret = htab_lock_bucket(htab, b, hash, &flags);
>   	if (ret)
>   		return ret;
> @@ -1040,26 +1112,41 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
>   		copy_map_value_locked(map,
>   				      l_old->key + round_up(key_size, 8),
>   				      value, false);
> +		if (timeout_map)
> +			l_old->expires = msecs_to_expire(timeout);
>   		ret = 0;
>   		goto err;
>   	}
>   
>   	l_new = alloc_htab_elem(htab, key, value, key_size, hash, false, false,
> -				l_old);
> +				timeout_map, l_old);
>   	if (IS_ERR(l_new)) {
> -		/* all pre-allocated elements are in use or memory exhausted */
>   		ret = PTR_ERR(l_new);
> +		if (ret == -EAGAIN) {
> +			htab_unlock_bucket(htab, b, hash, flags);
> +			htab_gc_elem(htab, l_old);
> +			mod_delayed_work(system_unbound_wq, &htab->gc_work, 0);
> +			goto again;

Also this one looks rather worrying, so the BPF prog is stalled here, loop-waiting
in (e.g. XDP) hot path for system_unbound_wq to kick in to make forward progress?

> +		}
> +		/* all pre-allocated elements are in use or memory exhausted */
>   		goto err;
>   	}
>   
> +	if (timeout_map)
> +		l_new->expires = msecs_to_expire(timeout);
> +

  reply	other threads:[~2021-01-26 22:56 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-01-22 20:54 [Patch bpf-next v5 0/3] bpf: introduce timeout hash map Cong Wang
2021-01-22 20:54 ` [Patch bpf-next v5 1/3] " Cong Wang
2021-01-26 22:04   ` Daniel Borkmann [this message]
2021-01-27  6:59     ` Cong Wang
2021-01-27 18:00       ` Alexei Starovoitov
2021-01-27 22:48         ` Daniel Borkmann
2021-01-28  2:45           ` Alexei Starovoitov
2021-01-28  6:28         ` Cong Wang
2021-01-29  2:54           ` Alexei Starovoitov
2021-01-29  5:57             ` Cong Wang
2021-01-29  6:21               ` Yonghong Song
     [not found]             ` <f7bc5873-7722-e359-b450-4db7dc3656d6@mojatatu.com>
     [not found]               ` <dc5ddf32-2d65-15a9-9448-5f2d3a10d227@mojatatu.com>
2021-01-30  3:14                 ` Alexei Starovoitov
2021-01-31 20:35                   ` Jamal Hadi Salim
2021-01-22 20:54 ` [Patch bpf-next v5 2/3] selftests/bpf: add test cases for bpf timeout map Cong Wang
2021-01-22 20:54 ` [Patch bpf-next v5 3/3] selftests/bpf: add timeout map check in map_ptr tests Cong Wang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=d69d44ca-206c-d818-1177-c8f14d8be8d1@iogearbox.net \
    --to=daniel@iogearbox.net \
    --cc=andrii.nakryiko@gmail.com \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=cong.wang@bytedance.com \
    --cc=jhs@mojatatu.com \
    --cc=kafai@fb.com \
    --cc=netdev@vger.kernel.org \
    --cc=wangdongdong.6@bytedance.com \
    --cc=xiyou.wangcong@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.