From: Alexei Starovoitov
Subject: Re: [PATCH net-next] bpf: expose sk_priority through struct bpf_sock_ops
Date: Tue, 14 Nov 2017 06:51:00 +0800
Message-ID:
References: <20171110191709.13051-1-vlad@dumitrescu.ro>
 <497dcbc0-57b7-69d8-6b2f-690ad4a7d4ba@iogearbox.net>
 <2ceada02-073e-6885-3837-e274903db80e@fb.com>
 <4997dac4-c80a-8f99-c053-ee9499cf5caf@iogearbox.net>
Mime-Version: 1.0
Content-Type: text/plain; charset="utf-8"; format=flowed
Content-Transfer-Encoding: 7bit
Cc: "netdev@vger.kernel.org" , Craig Gallek
To: Daniel Borkmann , Lawrence Brakmo , Vlad Dumitrescu , "davem@davemloft.net"
In-Reply-To: <4997dac4-c80a-8f99-c053-ee9499cf5caf@iogearbox.net>

On 11/14/17 4:20 AM, Daniel Borkmann wrote:
> On 11/13/2017 09:09 PM, Lawrence Brakmo wrote:
>> On 11/13/17, 11:01 AM, "Vlad Dumitrescu" wrote:
>>
>> On Sat, Nov 11, 2017 at 2:38 PM, Alexei Starovoitov wrote:
>> >
>> > On 11/12/17 4:46 AM, Daniel Borkmann wrote:
>> >>
>> >> On 11/11/2017 05:06 AM, Alexei Starovoitov wrote:
>> >>>
>> >>> On 11/11/17 6:07 AM, Daniel Borkmann wrote:
>> >>>>
>> >>>> On 11/10/2017 08:17 PM, Vlad Dumitrescu wrote:
>> >>>>>
>> >>>>> From: Vlad Dumitrescu
>> >>>>>
>> >>>>> Allows BPF_PROG_TYPE_SOCK_OPS programs to read sk_priority.
>> >>>>>
>> >>>>> Signed-off-by: Vlad Dumitrescu
>> >>>>> ---
>> >>>>>  include/uapi/linux/bpf.h       |  1 +
>> >>>>>  net/core/filter.c              | 11 +++++++++++
>> >>>>>  tools/include/uapi/linux/bpf.h |  1 +
>> >>>>>  3 files changed, 13 insertions(+)
>> >>>>>
>> >>>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> >>>>> index e880ae6434ee..9757a2002513 100644
>> >>>>> --- a/include/uapi/linux/bpf.h
>> >>>>> +++ b/include/uapi/linux/bpf.h
>> >>>>> @@ -947,6 +947,7 @@ struct bpf_sock_ops {
>> >>>>>  	__u32 local_ip6[4];	/* Stored in network byte order */
>> >>>>>  	__u32 remote_port;	/* Stored in network byte order */
>> >>>>>  	__u32 local_port;	/* stored in host byte order */
>> >>>>> +	__u32 priority;
>> >>>>>  };
>> >>>>>
>> >>>>>  /* List of known BPF sock_ops operators.
>> >>>>> diff --git a/net/core/filter.c b/net/core/filter.c
>> >>>>> index 61c791f9f628..a6329642d047 100644
>> >>>>> --- a/net/core/filter.c
>> >>>>> +++ b/net/core/filter.c
>> >>>>> @@ -4449,6 +4449,17 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
>> >>>>>  		*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->dst_reg,
>> >>>>>  				      offsetof(struct sock_common, skc_num));
>> >>>>>  		break;
>> >>>>> +
>> >>>>> +	case offsetof(struct bpf_sock_ops, priority):
>> >>>>> +		BUILD_BUG_ON(FIELD_SIZEOF(struct sock, sk_priority) != 4);
>> >>>>> +
>> >>>>> +		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
>> >>>>> +					struct bpf_sock_ops_kern, sk),
>> >>>>> +				      si->dst_reg, si->src_reg,
>> >>>>> +				      offsetof(struct bpf_sock_ops_kern, sk));
>> >>>>> +		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
>> >>>>> +				      offsetof(struct sock, sk_priority));
>> >>>>> +		break;
>> >>>>
>> >>>>
>> >>>> Hm, I don't think this would work, I actually think your initial patch
>> >>>> was ok.
>> >>>> bpf_setsockopt() as well as bpf_getsockopt() check for sk_fullsock(sk) right
>> >>>> before accessing options on either socket or TCP level, and bail out with an
>> >>>> error otherwise; in such cases we'd read something else here and assume it's
>> >>>> sk_priority.
>> >>>
>> >>>
>> >>> even if it's not a fullsock, it will just read zero, no? what's the problem
>> >>> with that?
>> >>> In non-fullsock hooks like BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB
>> >>> the program author will know that it's meaningless to read sk_priority,
>> >>> so returning zero with minimal checks is fine.
>> >>> Meanwhile, adding an extra runtime if (sk_fullsock(sk)) check is unnecessary,
>> >>> since safety is not compromised.
>> >>
>> >>
>> >> Hm, on my kernel, struct sock has the 4 bytes of sk_priority at offset 440,
>> >> struct request_sock itself is only 232 bytes long in total, and struct
>> >> inet_timewait_sock is 208 bytes long, so you'd be accessing out of bounds
>> >> that way, so it cannot be ignored and assumed zero.
>> >
>> >
>> > I thought we always passed a fully allocated sock that is just technically
>> > not a fullsock yet. My mistake. We do: tcp_timeout_init((struct sock *)req),
>> > so yeah, the ctx rewrite approach won't work.
>> > Let's go back to access via a helper.
>> >
>>
>> TIL. Thanks!
>>
>> Is there anything else needed from me to get the helper approach accepted?
>>
>> I plan to add access to TCP state variables (cwnd, rtt, etc.) and I have been thinking
>> about this issue. I think it is possible to access them directly as long as we use a value
>> like 0xffffffff to represent an invalid value (e.g. not a fullsock). The ctx rewrite just
>> needs to add a conditional to determine what to return. I would probably add a
>> field to the internal kernel struct to indicate whether it is a fullsock or not (we should
>> know when we call tcp_call_bpf whether it is a fullsock or not based on context).
>>
>> Let me do a sample patch that I can send for review and get feedback from
>> Alexei and Daniel.
>
> Agree, if the mov op from the ctx rewrite to read(/write) a sk member, for
> example, is just a BPF_W, then we know the upper reg bits are zero anyway in the
> success case, so we might be able to utilize this for writing a signed error
> back to the user if !fullsk.

It could be a __u64 in bpf_sock_ops too; since the real read is 32 bits or less,
there is guaranteed to be no conflict if we return (s64)-ENOENT or (s64)-EINVAL
in the !fullsock case.
I like the idea of copying the boolean value of sk_fullsock() into a hidden part
of bpf_sock_ops_kern, since it is already accessed by tcp_call_bpf() anyway.
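
To make that last point concrete, here is a rough sketch of how the hidden flag
could be wired up. The is_fullsock name and its placement are only illustrative
(nothing below is from the patch under review), and the tcp_call_bpf() body is
paraphrased from memory; the point is just that sk_fullsock() is already
evaluated there, so recording its result is essentially free:

/* Sketch only: a flag that BPF programs cannot see directly, but that
 * the ctx rewrite can branch on instead of dereferencing a request or
 * timewait sock as if it were a full struct sock.
 */
struct bpf_sock_ops_kern {
	struct	sock *sk;
	u32	op;
	union {
		u32 reply;
		u32 replylong[4];
	};
	u32	is_fullsock;	/* hypothetical: 1 if sk is a full socket */
};

static inline int tcp_call_bpf(struct sock *sk, int op)
{
	struct bpf_sock_ops_kern sock_ops;
	int ret;

	if (sk_fullsock(sk))
		sock_owned_by_me(sk);

	memset(&sock_ops, 0, sizeof(sock_ops));
	sock_ops.sk = sk;
	sock_ops.is_fullsock = sk_fullsock(sk) ? 1 : 0;	/* the only new line */
	sock_ops.op = op;

	ret = BPF_CGROUP_RUN_PROG_SOCK_OPS(&sock_ops);
	if (ret == 0)
		ret = sock_ops.reply;
	else
		ret = -1;

	return ret;
}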
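
The sock_ops_convert_ctx_access() case for priority could then test that flag
and write back an agreed-upon marker (0xffffffff here, per the suggestion above)
when the socket is not a full one. Again, this is only a sketch of the shape such
a rewrite could take, slotting into the same switch as the hunk quoted earlier,
not the final patch:

	case offsetof(struct bpf_sock_ops, priority):
		BUILD_BUG_ON(FIELD_SIZEOF(struct sock, sk_priority) != 4);

		/* dst_reg = sock_ops_kern->is_fullsock (hypothetical field) */
		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sock_ops_kern,
						       is_fullsock),
				      si->dst_reg, si->src_reg,
				      offsetof(struct bpf_sock_ops_kern,
					       is_fullsock));
		/* if (!is_fullsock) skip the two loads and the jump below */
		*insn++ = BPF_JMP_IMM(BPF_JEQ, si->dst_reg, 0, 3);
		/* dst_reg = sock_ops_kern->sk */
		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sock_ops_kern,
						       sk),
				      si->dst_reg, si->src_reg,
				      offsetof(struct bpf_sock_ops_kern, sk));
		/* dst_reg = sk->sk_priority */
		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
				      offsetof(struct sock, sk_priority));
		/* jump over the error marker */
		*insn++ = BPF_JMP_A(1);
		/* !fullsock: report 0xffffffff as "no valid value here" */
		*insn++ = BPF_MOV32_IMM(si->dst_reg, 0xffffffff);
		break;

A program would then treat 0xffffffff (or a negative s64, if the field were
widened to __u64 as suggested above) as "not available" in the non-fullsock
callbacks.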