From: Jakub Sitnicki <jakub@cloudflare.com>
To: John Fastabend <john.fastabend@gmail.com>
Cc: bpf@vger.kernel.org, netdev@vger.kernel.org, kernel-team@cloudflare.com
Subject: Re: [PATCH bpf 0/3] Fix locking order and synchronization on sockmap/sockhash tear-down
Date: Wed, 11 Mar 2020 12:51:32 +0100 [thread overview]
Message-ID: <87zhcnxg6z.fsf@cloudflare.com> (raw)
In-Reply-To: <5e67c3e83fb25_1e8a2b0e88e0a5bc84@john-XPS-13-9370.notmuch>
On Tue, Mar 10, 2020 at 05:44 PM CET, John Fastabend wrote:
> Jakub Sitnicki wrote:
>> On Thu, Feb 06, 2020 at 08:43 PM CET, John Fastabend wrote:
>> > Jakub Sitnicki wrote:
>> >> Couple of fixes that came from recent discussion [0] on commit
>> >> 7e81a3530206 ("bpf: Sockmap, ensure sock lock held during tear down").
>> >>
>> >> This series doesn't address the sleeping while holding a spinlock
>> >> problem. We're still trying to decide how to fix that [1].
>> >>
>> >> Until then sockmap users might see the following warnings:
>> >>
>> >> | BUG: sleeping function called from invalid context at net/core/sock.c:2935
>>
>
> [...]
>
>> Hey John,
>
> Patch sent.
Thanks!
>
>>
>> > Untested at the moment, but this should also be fine per your suggestion
>> > (if I read it correctly). The reason we have stab->lock and bucket->locks
>> > here is to handle checking EEXIST in update/delete cases. We need to
>> > be careful that when an update happens and we check for EEXIST that the
>> > socket is added/removed during this check. So both map_update_common and
>> > sock_map_delete need to guard from being run together potentially deleting
>> > an entry we are checking, etc.
>>
>> Okay, thanks for explanation. IOW, we're serializing map writers.
>>
>> > But by the time we get here we just did a synchronize_rcu() in the
>> > line above so no updates/deletes should be in flight. So it seems safe
>> > to drop these locks because of the condition no updates in flight.
>>
>> This part is not clear to me. I might be missing something.
>>
>> Here's my thinking - for any map writes (update/delete) to start,
>> map->refcnt needs to be > 0, and the ref is not dropped until the write
>> operation has finished.
>>
>> Map FDs hold a ref to map until the FD gets released. And BPF progs hold
>> refs to maps until the prog gets unloaded.
>>
>> This would mean that map_free will get scheduled from __bpf_map_put only
>> when no one is holding a map ref, and could start a write that would be
>> happening concurrently with sock_{map,hash}_free:
>
> Sorry bringing back this old thread I'm not sure I followed the couple
> paragraphs here. Is this with regards to the lock or the rcu? II didn't
> want to just drop this thanks.
>
> We can't have new updates/lookups/deletes happening while we are free'ing
> a map that would cause all sorts of problems, use after free's, etc.
Happy to pick up the discussion back up.
Sorry for the delay in my reply. I wanted to take another hard look at
the code and make sure I'm not getting ahead of myself here.
Let me back up a little and try to organize the access paths to sockmap
we have, and when they happen in relation to sock_map_free.
A) Access via bpf_map_ops
When bpf_map, and its backing object - bpf_stab, is accessed via map ops
(map_update_elem, map_delete_elem, map_lookup_elem), either (i) a
process has an FD for the map, or (ii) a loaded BPF prog holds a map
reference. Also, we always grab a map ref when creating an FD for it.
This means that map->refcnt is > 0 while a call to one of the map_ops is
in progress.
Hence, bpf_map_free_deferred -> sock_map_free won't get called during
these operations. This fact allowed us to get rid of locking the stab in
sock_map_free.
B) Access via bpf_{sk|msg}_redirect_map
Similar to previous case. BPF prog invoking these helpers must hold a
map reference, so we know that map->refcnt is > 0, and sock_map_free
can't be in progress the same time.
C) Access via sk_psock_link
sk_psock_link has a pointer to bpf_map (link->map) and to an entry in
stab->sks (link->link_raw), but doesn't hold a ref to the map.
We need to ensure bpf_stab doesn't go away, while tcp_bpf_remove ->
sk_psock_unlink -> sock_{map|hash}_delete_from_link call chain is in
progress.
That explains why in sock_map_free, after walking the map and destroying
all links, we wait for the RCU grace period to end with a call to
synchronize_rcu before freeing the map:
/* wait for psock readers accessing its map link */
synchronize_rcu();
bpf_map_area_free(stab->sks);
kfree(stab);
What is tripping me up, however, is that we also have another call to
synchronize_rcu before walking the map:
/* After the sync no updates or deletes will be in-flight so it
* is safe to walk map and remove entries without risking a race
* in EEXIST update case.
*/
synchronize_rcu(); // <-- Is it needed?
for (i = 0; i < stab->map.max_entries; i++) {
// ...
}
/* wait for psock readers accessing its map link */
synchronize_rcu();
I'm not grasping what purpose the 1st synchronize_rcu call serves.
New readers can start accessing the map after the 1st synchronize_rcu,
and this seems fine since the map will not be freed until after the 2nd
synchronize_rcu call.
Okay, so we can have deletes in-flight, which the explanatory comment
for the 1st synchronize_rcu mentions. What about updates in-flight?
I don't think they can happen with (A) being the only case I know of
when we update the map.
Sorry this was a bit long. So the question is what am I missing?
Can updates happen despite no refs to the map being held?
Thanks,
-jkbs
>
>>
>> /* decrement map refcnt and schedule it for freeing via workqueue
>> * (unrelying map implementation ops->map_free() might sleep)
>> */
>> static void __bpf_map_put(struct bpf_map *map, bool do_idr_lock)
>> {
>> if (atomic64_dec_and_test(&map->refcnt)) {
>> /* bpf_map_free_id() must be called first */
>> bpf_map_free_id(map, do_idr_lock);
>> btf_put(map->btf);
>> INIT_WORK(&map->work, bpf_map_free_deferred);
>> schedule_work(&map->work);
>> }
>> }
>>
>> > So with patch below we keep the sync rcu but that is fine IMO these
>> > map free's are rare. Take a look and make sure it seems sane to you
>> > as well.
>>
>> I can't vouch for the need to keep synchronize_rcu here because I don't
>> understand that part, but otherwise the change LGTM.
>>
>> -jkbs
>>
>
> [...]
next prev parent reply other threads:[~2020-03-11 11:51 UTC|newest]
Thread overview: 19+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-02-06 11:16 [PATCH bpf 0/3] Fix locking order and synchronization on sockmap/sockhash tear-down Jakub Sitnicki
2020-02-06 11:16 ` [PATCH bpf 1/3] bpf, sockmap: Don't sleep while holding RCU lock on tear-down Jakub Sitnicki
2020-02-06 18:59 ` John Fastabend
2020-02-06 11:16 ` [PATCH bpf 2/3] bpf, sockhash: synchronize_rcu before free'ing map Jakub Sitnicki
2020-02-06 19:01 ` John Fastabend
2020-02-06 11:16 ` [PATCH bpf 3/3] selftests/bpf: Test freeing sockmap/sockhash with a socket in it Jakub Sitnicki
2020-02-06 19:03 ` John Fastabend
2020-02-09 2:41 ` Alexei Starovoitov
2020-02-09 4:12 ` Yonghong Song
2020-02-09 15:29 ` Jakub Sitnicki
2020-02-10 3:55 ` John Fastabend
2020-02-10 11:52 ` Jakub Sitnicki
2020-02-10 23:38 ` Daniel Borkmann
2020-02-06 19:43 ` [PATCH bpf 0/3] Fix locking order and synchronization on sockmap/sockhash tear-down John Fastabend
2020-02-07 10:45 ` Jakub Sitnicki
2020-03-10 16:44 ` John Fastabend
2020-03-11 11:51 ` Jakub Sitnicki [this message]
2020-03-10 11:30 ` Jakub Sitnicki
2020-02-07 21:56 ` Daniel Borkmann
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87zhcnxg6z.fsf@cloudflare.com \
--to=jakub@cloudflare.com \
--cc=bpf@vger.kernel.org \
--cc=john.fastabend@gmail.com \
--cc=kernel-team@cloudflare.com \
--cc=netdev@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).