All of lore.kernel.org
 help / color / mirror / Atom feed
From: Yonghong Song <yhs@fb.com>
To: Martin KaFai Lau <kafai@fb.com>
Cc: <bpf@vger.kernel.org>, Alexei Starovoitov <ast@fb.com>,
	Daniel Borkmann <daniel@iogearbox.net>, <kernel-team@fb.com>
Subject: Re: [PATCH bpf v2] bpf: fix a potential deadlock with bpf_map_do_batch
Date: Wed, 19 Feb 2020 10:15:22 -0800	[thread overview]
Message-ID: <084e8e50-152b-8df3-006f-6fa953b3a923@fb.com> (raw)
In-Reply-To: <20200219171826.xwxcivb5hb6z6vst@kafai-mbp.dhcp.thefacebook.com>



On 2/19/20 9:18 AM, Martin KaFai Lau wrote:
> On Wed, Feb 19, 2020 at 08:23:01AM -0800, Yonghong Song wrote:
>> Commit 057996380a42 ("bpf: Add batch ops to all htab bpf map")
>> added lookup_and_delete batch operation for hash table.
>> The current implementation has bpf_lru_push_free() inside
>> the bucket lock, which may cause a deadlock.
>>
>> syzbot reports:
>>     -> #2 (&htab->buckets[i].lock#2){....}:
>>         __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
>>         _raw_spin_lock_irqsave+0x95/0xcd kernel/locking/spinlock.c:159
>>         htab_lru_map_delete_node+0xce/0x2f0 kernel/bpf/hashtab.c:593
>>         __bpf_lru_list_shrink_inactive kernel/bpf/bpf_lru_list.c:220 [inline]
>>         __bpf_lru_list_shrink+0xf9/0x470 kernel/bpf/bpf_lru_list.c:266
>>         bpf_lru_list_pop_free_to_local kernel/bpf/bpf_lru_list.c:340 [inline]
>>         bpf_common_lru_pop_free kernel/bpf/bpf_lru_list.c:447 [inline]
>>         bpf_lru_pop_free+0x87c/0x1670 kernel/bpf/bpf_lru_list.c:499
>>         prealloc_lru_pop+0x2c/0xa0 kernel/bpf/hashtab.c:132
>>         __htab_lru_percpu_map_update_elem+0x67e/0xa90 kernel/bpf/hashtab.c:1069
>>         bpf_percpu_hash_update+0x16e/0x210 kernel/bpf/hashtab.c:1585
>>         bpf_map_update_value.isra.0+0x2d7/0x8e0 kernel/bpf/syscall.c:181
>>         generic_map_update_batch+0x41f/0x610 kernel/bpf/syscall.c:1319
>>         bpf_map_do_batch+0x3f5/0x510 kernel/bpf/syscall.c:3348
>>         __do_sys_bpf+0x9b7/0x41e0 kernel/bpf/syscall.c:3460
>>         __se_sys_bpf kernel/bpf/syscall.c:3355 [inline]
>>         __x64_sys_bpf+0x73/0xb0 kernel/bpf/syscall.c:3355
>>         do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
>>         entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>
>>     -> #0 (&loc_l->lock){....}:
>>         check_prev_add kernel/locking/lockdep.c:2475 [inline]
>>         check_prevs_add kernel/locking/lockdep.c:2580 [inline]
>>         validate_chain kernel/locking/lockdep.c:2970 [inline]
>>         __lock_acquire+0x2596/0x4a00 kernel/locking/lockdep.c:3954
>>         lock_acquire+0x190/0x410 kernel/locking/lockdep.c:4484
>>         __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
>>         _raw_spin_lock_irqsave+0x95/0xcd kernel/locking/spinlock.c:159
>>         bpf_common_lru_push_free kernel/bpf/bpf_lru_list.c:516 [inline]
>>         bpf_lru_push_free+0x250/0x5b0 kernel/bpf/bpf_lru_list.c:555
>>         __htab_map_lookup_and_delete_batch+0x8d4/0x1540 kernel/bpf/hashtab.c:1374
>>         htab_lru_map_lookup_and_delete_batch+0x34/0x40 kernel/bpf/hashtab.c:1491
>>         bpf_map_do_batch+0x3f5/0x510 kernel/bpf/syscall.c:3348
>>         __do_sys_bpf+0x1f7d/0x41e0 kernel/bpf/syscall.c:3456
>>         __se_sys_bpf kernel/bpf/syscall.c:3355 [inline]
>>         __x64_sys_bpf+0x73/0xb0 kernel/bpf/syscall.c:3355
>>         do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
>>         entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>
>>      Possible unsafe locking scenario:
>>
>>            CPU0                    CPU2
>>            ----                    ----
>>       lock(&htab->buckets[i].lock#2);
>>                                    lock(&l->lock);
>>                                    lock(&htab->buckets[i].lock#2);
>>       lock(&loc_l->lock);
>>
>>      *** DEADLOCK ***
>>
>> To fix the issue, for htab_lru_map_lookup_and_delete_batch() in CPU0,
>> let us do bpf_lru_push_free() out of the htab bucket lock. This can
>> avoid the above deadlock scenario.
> Patch looks good.  Some minor comments.
> 
>>
>> Fixes: 057996380a42 ("bpf: Add batch ops to all htab bpf map")
>> Reported-by: syzbot+a38ff3d9356388f2fb83@syzkaller.appspotmail.com
>> Reported-by: syzbot+122b5421d14e68f29cd1@syzkaller.appspotmail.com
>> Suggested-by: Hillf Danton <hdanton@sina.com>
>> Suggested-by: Martin KaFai Lau <kafai@fb.com>
>> Acked-by: Brian Vazquez <brianvv@google.com>
>> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
>> Signed-off-by: Yonghong Song <yhs@fb.com>
>> ---
>>   kernel/bpf/hashtab.c | 20 +++++++++++++++++---
>>   1 file changed, 17 insertions(+), 3 deletions(-)
>>
>> Changelog:
>>   v1 -> v2:
>>      . coding style fix to have braces in both then and else
>>        branch, from Jakub.
>>
>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
>> index 2d182c4ee9d9..a6e0d6aace62 100644
>> --- a/kernel/bpf/hashtab.c
>> +++ b/kernel/bpf/hashtab.c
>> @@ -56,6 +56,7 @@ struct htab_elem {
>>   			union {
>>   				struct bpf_htab *htab;
>>   				struct pcpu_freelist_node fnode;
>> +				struct htab_elem *link;
> Considering this usage is very specific, the name "link" sounds
> a bit general.  May be "batch_free_link" or "batch_flink"?

Okay, will use batch_flink, a little bit shorter.
All of "htab", "fnode" and "link" are used during element free.
So batch_flink should reasonably imply batch_free_link.

> 
>>   			};
>>   		};
>>   	};
>> @@ -1255,6 +1256,7 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
>>   	void __user *uvalues = u64_to_user_ptr(attr->batch.values);
>>   	void __user *ukeys = u64_to_user_ptr(attr->batch.keys);
>>   	void *ubatch = u64_to_user_ptr(attr->batch.in_batch);
>> +	struct htab_elem *node_to_free = NULL;
> Reverse xmas tree ordering.

Will fix.

> 
>>   	u32 batch, max_count, size, bucket_size;
>>   	u64 elem_map_flags, map_flags;
>>   	struct hlist_nulls_head *head;
>> @@ -1370,16 +1372,28 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
>>   		}
>>   		if (do_delete) {
>>   			hlist_nulls_del_rcu(&l->hash_node);
>> -			if (is_lru_map)
>> -				bpf_lru_push_free(&htab->lru, &l->lru_node);
>> -			else
>> +			if (is_lru_map) {
>> +				/* link to-be-freed elements together so
>> +				 * they can freed outside bucket lock region.
>> +				 */
> Thanks for the comments here.  I think a bit more details will be
> useful in the future.
> 
> May be adding the details to the existing htab_lru_map_delete_elem()
> which is also using a similar lock strategy, e.g.:
> /* The LRU list has a lock (lru_lock).  Each bucket of htab has a
>   * lock (buck_lock).  If both locks need to be acquired together,
>   * the lock order is always lru_lock -> buck_lock and this only
>   * happens in the bpf_lru_list.c logic.
>   *
>   * In hashtab.c, to avoid deadlock casued by lock ordering,
>   * both locks are not acquired together (i.e. one lock is always
>   * released first before acquiring another lock).
>   */
> static int htab_lru_map_delete_elem(struct bpf_map *map, void *key)
> {
> 	/* ... */
> }
>      
> and refer them here from here (__htab_map_lookup_and_delete_batch).

Okay. Let me add more detailed comments here.

> 
>> +				l->link = node_to_free;
>> +				node_to_free = l;
>> +			} else {
>>   				free_htab_elem(htab, l);
>> +			}
>>   		}
> [ ... ]
> 

      reply	other threads:[~2020-02-19 18:15 UTC|newest]

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-02-19 16:23 [PATCH bpf v2] bpf: fix a potential deadlock with bpf_map_do_batch Yonghong Song
2020-02-19 17:18 ` Martin KaFai Lau
2020-02-19 18:15   ` Yonghong Song [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=084e8e50-152b-8df3-006f-6fa953b3a923@fb.com \
    --to=yhs@fb.com \
    --cc=ast@fb.com \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=kafai@fb.com \
    --cc=kernel-team@fb.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.