[Question]: BPF_CGROUP_{GET,SET}SOCKOPT handling when optlen > PAGE_SIZE

bpf.vger.kernel.org archive mirror

* [Question]: BPF_CGROUP_{GET,SET}SOCKOPT handling when optlen > PAGE_SIZE
@ 2022-10-27  1:14 Martin KaFai Lau
  2022-10-27  2:03 ` Stanislav Fomichev
  2022-10-27 20:46 ` Andrii Nakryiko
  0 siblings, 2 replies; 12+ messages in thread
From: Martin KaFai Lau @ 2022-10-27  1:14 UTC (permalink / raw)
  To: Stanislav Fomichev, bpf

The cgroup-bpf {get,set}sockopt prog is useful for changing the behavior of
an optname.  The bpf prog usually handles only a few specific optnames and
ignores most others.  For the optnames that it ignores, it usually does not
need to change the optlen.  The exception is when optlen > PAGE_SIZE (or
optval_end - optval).  In this case the bpf prog needs to set the optlen to 0,
or else the kernel will return -EFAULT to userspace.  This is usually not what
the bpf prog wants, because the bpf prog only expects an error to be returned
to userspace when it has explicitly done 'return 0;' or used bpf_set_retval().
If a bpf prog always sets optlen to 0 for the optnames that it does not care
about, it risks breaking a later bpf prog in the same cgroup that may want to
change or look at it.

I would like to explore whether there is an easier way for the bpf prog to
handle it, e.g. does it make sense to track whether the bpf prog has changed
ctx->optlen before returning -EFAULT to userspace when ctx.optlen > max_optlen?
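
To make the behavior described above concrete, here is a minimal userspace
model of how the setsockopt hook interprets ctx.optlen after the bpf prog has
run.  It is a sketch, not the kernel code: the enum and function names are
hypothetical, and max_optlen stands for the size of the buffer exposed to the
bpf prog (optval_end - optval).

```c
/* Hypothetical model of the post-run optlen check; not kernel code. */
enum sockopt_action {
	USE_ORIGINAL_BUF,	/* optlen == 0: kernel handler sees the user's data */
	USE_BPF_BUF,		/* 0 < optlen <= max_optlen: handler sees the bpf buffer */
	BYPASS_KERNEL,		/* optlen == -1: kernel handler is skipped */
	FAIL_EFAULT		/* optlen out of bounds: -EFAULT goes to userspace */
};

static enum sockopt_action current_setsockopt_action(int ctx_optlen,
						     int max_optlen)
{
	if (ctx_optlen == -1)
		return BYPASS_KERNEL;
	if (ctx_optlen > max_optlen || ctx_optlen < -1)
		return FAIL_EFAULT;
	if (ctx_optlen == 0)
		return USE_ORIGINAL_BUF;
	return USE_BPF_BUF;
}
```

With an 8k optlen and a 4k buffer, a prog that leaves optlen untouched lands
in FAIL_EFAULT, which is the surprise this question is about.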


* Re: [Question]: BPF_CGROUP_{GET,SET}SOCKOPT handling when optlen > PAGE_SIZE
  2022-10-27  1:14 [Question]: BPF_CGROUP_{GET,SET}SOCKOPT handling when optlen > PAGE_SIZE Martin KaFai Lau
@ 2022-10-27  2:03 ` Stanislav Fomichev
  2022-10-27  6:15   ` Martin KaFai Lau
  2022-10-27 20:46 ` Andrii Nakryiko
  1 sibling, 1 reply; 12+ messages in thread
From: Stanislav Fomichev @ 2022-10-27  2:03 UTC (permalink / raw)
  To: Martin KaFai Lau; +Cc: bpf

On Wed, Oct 26, 2022 at 6:14 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> The cgroup-bpf {get,set}sockopt prog is useful to change the optname behavior.
> The bpf prog usually just handles a few specific optnames and ignores most
> others.  For the optnames that it ignores, it usually does not need to change
> the optlen.  The exception is when optlen > PAGE_SIZE (or optval_end - optval).
> The bpf prog needs to set the optlen to 0 for this case or else the kernel will
> return -EFAULT to the userspace.  It is usually not what the bpf prog wants
> because the bpf prog only expects error returning to userspace when it has
> explicitly 'return 0;' or used bpf_set_retval().  If a bpf prog always changes
> optlen for optnames that it does not care to 0,  it may risk if the latter bpf
> prog in the same cgroup may want to change/look-at it.
>
> Would like to explore if there is an easier way for the bpf prog to handle it.
> eg. does it make sense to track if the bpf prog has changed the ctx->optlen
> before returning -EFAULT to the user space when ctx.optlen > max_optlen?

Good point on chaining being broken because of this requirement :-/

With tracking, we need to be careful, because the following situation
might be problematic:
Suppose the setsockopt value is larger than 4k; the program can rewrite
some bytes in the first 4k, not touch optlen, and expect this to work.
Currently, optlen=0 explicitly means "ignore whatever is in the bpf
buffer and use the original one".
If we can have tracking that catches situations like this, we
should be able to drop that optlen=0 requirement.
IIRC, that's the only tricky part.


* Re: [Question]: BPF_CGROUP_{GET,SET}SOCKOPT handling when optlen > PAGE_SIZE
  2022-10-27  2:03 ` Stanislav Fomichev
@ 2022-10-27  6:15   ` Martin KaFai Lau
  2022-10-27 16:23     ` Stanislav Fomichev
  0 siblings, 1 reply; 12+ messages in thread
From: Martin KaFai Lau @ 2022-10-27  6:15 UTC (permalink / raw)
  To: Stanislav Fomichev; +Cc: bpf

On 10/26/22 7:03 PM, Stanislav Fomichev wrote:
> On Wed, Oct 26, 2022 at 6:14 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>
>> The cgroup-bpf {get,set}sockopt prog is useful to change the optname behavior.
>> The bpf prog usually just handles a few specific optnames and ignores most
>> others.  For the optnames that it ignores, it usually does not need to change
>> the optlen.  The exception is when optlen > PAGE_SIZE (or optval_end - optval).
>> The bpf prog needs to set the optlen to 0 for this case or else the kernel will
>> return -EFAULT to the userspace.  It is usually not what the bpf prog wants
>> because the bpf prog only expects error returning to userspace when it has
>> explicitly 'return 0;' or used bpf_set_retval().  If a bpf prog always changes
>> optlen for optnames that it does not care to 0,  it may risk if the latter bpf
>> prog in the same cgroup may want to change/look-at it.
>>
>> Would like to explore if there is an easier way for the bpf prog to handle it.
>> eg. does it make sense to track if the bpf prog has changed the ctx->optlen
>> before returning -EFAULT to the user space when ctx.optlen > max_optlen?
> 
> Good point on chaining being broken because of this requirement :-/
> 
> With tracking, we need to be careful, because the following situation
> might be problematic:
> Suppose setsockopt is larger than 4k, the program can rewrite some
> byte in the first 4k, not touch optlen and expect this to work.

If the bpf prog rewrites the first 4k, it must change the ctx.optlen to get it
to work.  Otherwise, the kernel will return -EFAULT because the ctx.optlen is
larger than the max_optlen (or optval_end - optval).

> Currently, optlen=0 explicitly means "ignore whatever is in the bpf
> buffer and use the original one".
> If we can have a tracking that catches situations like this - we
> should be able to drop that optlen=0 requirement.
> IIRC, that's the only tricky part.

Ah, I meant: in __cgroup_bpf_run_filter_setsockopt(), use the test
"!ctx.optlen_changed && ctx.optlen > max_optlen" to imply "ignore whatever is
in the bpf buffer and use the original one".  Add 'bool optlen_changed' to
'struct bpf_sockopt_kern' and set ctx.optlen_changed to true in
cg_sockopt_convert_ctx_access() whenever there is a BPF_WRITE to ctx.optlen.
Would that work, or am I still missing something in the "writing the first 4k"
case above?
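
A rough userspace sketch of this proposal, to show the intended decision only.
The struct and helper names below are hypothetical stand-ins (the real change
would live in 'struct bpf_sockopt_kern' and cg_sockopt_convert_ctx_access());
bpf_write_optlen() models the instrumented BPF_WRITE to ctx.optlen.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the relevant bpf_sockopt_kern state. */
struct sockopt_model {
	int optlen;
	int max_optlen;		/* size of the buffer exposed to the bpf prog */
	bool optlen_changed;	/* proposed: set on any write to ctx.optlen */
};

/* Models a BPF_WRITE to ctx.optlen with the proposed tracking added. */
static void bpf_write_optlen(struct sockopt_model *ctx, int optlen)
{
	ctx->optlen = optlen;
	ctx->optlen_changed = true;
}

/* True when the kernel should ignore the bpf buffer and use the original. */
static bool use_original_buf(const struct sockopt_model *ctx)
{
	if (ctx->optlen == 0)
		return true;	/* existing explicit convention */
	/* proposed: an untouched, oversized optlen means "not interested" */
	return !ctx->optlen_changed && ctx->optlen > ctx->max_optlen;
}

/* True when -EFAULT would still be returned to userspace. */
static bool is_efault(const struct sockopt_model *ctx)
{
	if (use_original_buf(ctx))
		return false;
	return ctx->optlen > ctx->max_optlen || ctx->optlen < -1;
}
```

Under this model a prog that never writes optlen no longer causes -EFAULT for
an oversized optval, while a prog that writes an out-of-bounds optlen still
does.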




* Re: [Question]: BPF_CGROUP_{GET,SET}SOCKOPT handling when optlen > PAGE_SIZE
  2022-10-27  6:15   ` Martin KaFai Lau
@ 2022-10-27 16:23     ` Stanislav Fomichev
  2022-10-27 17:28       ` Martin KaFai Lau
  0 siblings, 1 reply; 12+ messages in thread
From: Stanislav Fomichev @ 2022-10-27 16:23 UTC (permalink / raw)
  To: Martin KaFai Lau; +Cc: bpf

On Wed, Oct 26, 2022 at 11:15 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 10/26/22 7:03 PM, Stanislav Fomichev wrote:
> > On Wed, Oct 26, 2022 at 6:14 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >>
> >> The cgroup-bpf {get,set}sockopt prog is useful to change the optname behavior.
> >> The bpf prog usually just handles a few specific optnames and ignores most
> >> others.  For the optnames that it ignores, it usually does not need to change
> >> the optlen.  The exception is when optlen > PAGE_SIZE (or optval_end - optval).
> >> The bpf prog needs to set the optlen to 0 for this case or else the kernel will
> >> return -EFAULT to the userspace.  It is usually not what the bpf prog wants
> >> because the bpf prog only expects error returning to userspace when it has
> >> explicitly 'return 0;' or used bpf_set_retval().  If a bpf prog always changes
> >> optlen for optnames that it does not care to 0,  it may risk if the latter bpf
> >> prog in the same cgroup may want to change/look-at it.
> >>
> >> Would like to explore if there is an easier way for the bpf prog to handle it.
> >> eg. does it make sense to track if the bpf prog has changed the ctx->optlen
> >> before returning -EFAULT to the user space when ctx.optlen > max_optlen?
> >
> > Good point on chaining being broken because of this requirement :-/
> >
> > With tracking, we need to be careful, because the following situation
> > might be problematic:
> > Suppose setsockopt is larger than 4k, the program can rewrite some
> > byte in the first 4k, not touch optlen and expect this to work.
>
> If the bpf prog rewrites the first 4k, it must change the ctx.optlen to get it
> work.  Otherwise, the kernel will return -EFAULT because the ctx.optlen is
> larger than the max_optlen (or optval_end - optval).
>
> > Currently, optlen=0 explicitly means "ignore whatever is in the bpf
> > buffer and use the original one".
> > If we can have a tracking that catches situations like this - we
> > should be able to drop that optlen=0 requirement.
> > IIRC, that's the only tricky part.
>
> Ah, I meant, in __cgroup_bpf_run_filter_setsockopt, use "!ctx.optlen_changed &&
> ctx.optlen > max_optlen" test to imply "ignore whatever is in the bpf
> buffer and use the original one".  Add 'bool optlen_changed' to 'struct
> bpf_sockopt_kern' and set ctx.optlen_changed to true in
> cg_sockopt_convert_ctx_access() whenever there is BPF_WRITE to ctx.optlen.
> Would it work or may be I am still missing something in the writing first 4k
> case above?

What if the program wants to keep optlen as is? Here is the
hypothetical case: ctx->optlen is 8k, we allocate/expose only the
first 4k, the program does ctx->optval[0] = 0xff and doesn't change
the optlen. It wants the rest of the payload to be passed as is with
only the first byte changed.
The condition "!ctx.optlen_changed && ctx.optlen > max_optlen" is
true, so, if we treat this as explicit optlen=0, we ignore the
program's changes.
But this is not what the program has intended, right? It wants to
amend something and pass the rest as is.

It seems like we need to have both optlen_changed and optval_changed.
If both are false, we should be able to safely do optlen=0 equivalent.
Tracking only optlen seems to be problematic?
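
The two-flag idea can be sketched as a small decision function.  This is a
userspace illustration only: the names are hypothetical, negative optlen
values are left out for brevity, and whether optval writes can be tracked at
all is an open question in this thread.

```c
#include <stdbool.h>

enum outcome { ORIGINAL_BUF, BPF_BUF, EFAULT_ERR };

/*
 * Hypothetical resolution with both flags tracked.  An oversized optlen is
 * treated like an explicit optlen=0 only if the prog touched neither the
 * length nor the buffer; otherwise the existing -EFAULT is kept.
 */
static enum outcome resolve(int optlen, int max_optlen,
			    bool optlen_changed, bool optval_changed)
{
	if (optlen == 0)
		return ORIGINAL_BUF;
	if (optlen > max_optlen) {
		if (!optlen_changed && !optval_changed)
			return ORIGINAL_BUF;
		return EFAULT_ERR;
	}
	return BPF_BUF;
}
```

The 8k case above, where the prog writes optval[0] but leaves optlen alone,
maps to resolve(8192, 4096, false, true) and still fails with -EFAULT
instead of silently dropping the program's change.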


* Re: [Question]: BPF_CGROUP_{GET,SET}SOCKOPT handling when optlen > PAGE_SIZE
  2022-10-27 16:23     ` Stanislav Fomichev
@ 2022-10-27 17:28       ` Martin KaFai Lau
  2022-10-27 17:40         ` Stanislav Fomichev
  0 siblings, 1 reply; 12+ messages in thread
From: Martin KaFai Lau @ 2022-10-27 17:28 UTC (permalink / raw)
  To: Stanislav Fomichev; +Cc: bpf

On 10/27/22 9:23 AM, Stanislav Fomichev wrote:
> On Wed, Oct 26, 2022 at 11:15 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>
>> On 10/26/22 7:03 PM, Stanislav Fomichev wrote:
>>> On Wed, Oct 26, 2022 at 6:14 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>>>
>>>> The cgroup-bpf {get,set}sockopt prog is useful to change the optname behavior.
>>>> The bpf prog usually just handles a few specific optnames and ignores most
>>>> others.  For the optnames that it ignores, it usually does not need to change
>>>> the optlen.  The exception is when optlen > PAGE_SIZE (or optval_end - optval).
>>>> The bpf prog needs to set the optlen to 0 for this case or else the kernel will
>>>> return -EFAULT to the userspace.  It is usually not what the bpf prog wants
>>>> because the bpf prog only expects error returning to userspace when it has
>>>> explicitly 'return 0;' or used bpf_set_retval().  If a bpf prog always changes
>>>> optlen for optnames that it does not care to 0,  it may risk if the latter bpf
>>>> prog in the same cgroup may want to change/look-at it.
>>>>
>>>> Would like to explore if there is an easier way for the bpf prog to handle it.
>>>> eg. does it make sense to track if the bpf prog has changed the ctx->optlen
>>>> before returning -EFAULT to the user space when ctx.optlen > max_optlen?
>>>
>>> Good point on chaining being broken because of this requirement :-/
>>>
>>> With tracking, we need to be careful, because the following situation
>>> might be problematic:
>>> Suppose setsockopt is larger than 4k, the program can rewrite some
>>> byte in the first 4k, not touch optlen and expect this to work.
>>
>> If the bpf prog rewrites the first 4k, it must change the ctx.optlen to get it
>> work.  Otherwise, the kernel will return -EFAULT because the ctx.optlen is
>> larger than the max_optlen (or optval_end - optval).
>>
>>> Currently, optlen=0 explicitly means "ignore whatever is in the bpf
>>> buffer and use the original one".
>>> If we can have a tracking that catches situations like this - we
>>> should be able to drop that optlen=0 requirement.
>>> IIRC, that's the only tricky part.
>>
>> Ah, I meant, in __cgroup_bpf_run_filter_setsockopt, use "!ctx.optlen_changed &&
>> ctx.optlen > max_optlen" test to imply "ignore whatever is in the bpf
>> buffer and use the original one".  Add 'bool optlen_changed' to 'struct
>> bpf_sockopt_kern' and set ctx.optlen_changed to true in
>> cg_sockopt_convert_ctx_access() whenever there is BPF_WRITE to ctx.optlen.
>> Would it work or may be I am still missing something in the writing first 4k
>> case above?
> 
> What if the program wants to keep optlen as is? Here is the
> hypothetical case: ctx->optlen is 8k, we allocate/expose only the
> first 4k, the program does ctx->optval[0] = 0xff and doesn't change
> the optlen. It wants the rest of the payload to be passed as is with
> only the first byte changed.

I think we are talking about the same case, but we may have different
understandings of how the current __cgroup_bpf_run_filter_setsockopt()
handles it.

I don't see that the current kernel supports this.  If the bpf prog does not
change ctx->optlen from 8k to something that is <= 4k, the kernel will just
return -EFAULT here, no?
	else if (ctx.optlen /* 8k */ > max_optlen /* 4k */ || ctx.optlen < -1) {
		/* optlen is out of bounds */
		ret = -EFAULT;
	}

Or did you mean that the future change needs to consider this new case and
also support gluing the first 4k (that was exposed to the bpf prog) with the
second 4k (that was not exposed to the bpf prog)?

> The condition "!ctx.optlen_changed && ctx.optlen > max_optlen" is
> true, so, if we treat this as explicit optlen=0, we ignore the
> program's changes.
> But this is not what the program has intended, right? It wants to
> amend something and pass the rest as is.
> 
> It seems like we need to have both optlen_changed and optval_changed.
> If both are false, we should be able to safely do optlen=0 equivalent.
> Tracking only optlen seems to be problematic?



* Re: [Question]: BPF_CGROUP_{GET,SET}SOCKOPT handling when optlen > PAGE_SIZE
  2022-10-27 17:28       ` Martin KaFai Lau
@ 2022-10-27 17:40         ` Stanislav Fomichev
  2022-10-27 18:48           ` Martin KaFai Lau
  0 siblings, 1 reply; 12+ messages in thread
From: Stanislav Fomichev @ 2022-10-27 17:40 UTC (permalink / raw)
  To: Martin KaFai Lau; +Cc: bpf

On Thu, Oct 27, 2022 at 10:29 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 10/27/22 9:23 AM, Stanislav Fomichev wrote:
> > On Wed, Oct 26, 2022 at 11:15 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >>
> >> On 10/26/22 7:03 PM, Stanislav Fomichev wrote:
> >>> On Wed, Oct 26, 2022 at 6:14 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >>>>
> >>>> The cgroup-bpf {get,set}sockopt prog is useful to change the optname behavior.
> >>>> The bpf prog usually just handles a few specific optnames and ignores most
> >>>> others.  For the optnames that it ignores, it usually does not need to change
> >>>> the optlen.  The exception is when optlen > PAGE_SIZE (or optval_end - optval).
> >>>> The bpf prog needs to set the optlen to 0 for this case or else the kernel will
> >>>> return -EFAULT to the userspace.  It is usually not what the bpf prog wants
> >>>> because the bpf prog only expects error returning to userspace when it has
> >>>> explicitly 'return 0;' or used bpf_set_retval().  If a bpf prog always changes
> >>>> optlen for optnames that it does not care to 0,  it may risk if the latter bpf
> >>>> prog in the same cgroup may want to change/look-at it.
> >>>>
> >>>> Would like to explore if there is an easier way for the bpf prog to handle it.
> >>>> eg. does it make sense to track if the bpf prog has changed the ctx->optlen
> >>>> before returning -EFAULT to the user space when ctx.optlen > max_optlen?
> >>>
> >>> Good point on chaining being broken because of this requirement :-/
> >>>
> >>> With tracking, we need to be careful, because the following situation
> >>> might be problematic:
> >>> Suppose setsockopt is larger than 4k, the program can rewrite some
> >>> byte in the first 4k, not touch optlen and expect this to work.
> >>
> >> If the bpf prog rewrites the first 4k, it must change the ctx.optlen to get it
> >> work.  Otherwise, the kernel will return -EFAULT because the ctx.optlen is
> >> larger than the max_optlen (or optval_end - optval).
> >>
> >>> Currently, optlen=0 explicitly means "ignore whatever is in the bpf
> >>> buffer and use the original one".
> >>> If we can have a tracking that catches situations like this - we
> >>> should be able to drop that optlen=0 requirement.
> >>> IIRC, that's the only tricky part.
> >>
> >> Ah, I meant, in __cgroup_bpf_run_filter_setsockopt, use "!ctx.optlen_changed &&
> >> ctx.optlen > max_optlen" test to imply "ignore whatever is in the bpf
> >> buffer and use the original one".  Add 'bool optlen_changed' to 'struct
> >> bpf_sockopt_kern' and set ctx.optlen_changed to true in
> >> cg_sockopt_convert_ctx_access() whenever there is BPF_WRITE to ctx.optlen.
> >> Would it work or may be I am still missing something in the writing first 4k
> >> case above?
> >
> > What if the program wants to keep optlen as is? Here is the
> > hypothetical case: ctx->optlen is 8k, we allocate/expose only the
> > first 4k, the program does ctx->optval[0] = 0xff and doesn't change
> > the optlen. It wants the rest of the payload to be passed as is with
> > only the first byte changed.
>
> I think we are talking about the same case but we may have different
> understanding on how the current __cgroup_bpf_run_filter_setsockopt() is
> handling it.
>
> I don't see the current kernel supports this now.  If the bpf prog does not
> change the ctx->optlen from 8k to something that is <= 4k, the kernel will just
> return -EFAULT in here, no?
>         else if (ctx.optlen /* 8k */ > max_optlen /* 4k */ || ctx.optlen < -1) {
>                 /* optlen is out of bounds */
>                 ret = -EFAULT;
>         }
>
> or you meant the future change needs to consider this new case and also support
> gluing the first 4k (that was exposed to the bpf prog) with the second 4k (that
> was not exposed to the bpf prog)?
>
> > The condition "!ctx.optlen_changed && ctx.optlen > max_optlen" is
> > true, so, if we treat this as explicit optlen=0, we ignore the
> > program's changes.
> > But this is not what the program has intended, right? It wants to
> > amend something and pass the rest as is.

Right, I'm not talking about how it's handled now. Now optlen >
max_optlen triggers EFAULT.
But in the future, if we add tracking, we want 'optlen > max_optlen'
to behave as explicit 'optlen = 0' as long as the user hasn't changed
the optlen _and_ also hasn't changed anything in the buffer.

> > It seems like we need to have both optlen_changed and optval_changed.
> > If both are false, we should be able to safely do optlen=0 equivalent.
> > Tracking only optlen seems to be problematic?
>


* Re: [Question]: BPF_CGROUP_{GET,SET}SOCKOPT handling when optlen > PAGE_SIZE
  2022-10-27 17:40         ` Stanislav Fomichev
@ 2022-10-27 18:48           ` Martin KaFai Lau
  2022-10-27 19:11             ` Stanislav Fomichev
  0 siblings, 1 reply; 12+ messages in thread
From: Martin KaFai Lau @ 2022-10-27 18:48 UTC (permalink / raw)
  To: Stanislav Fomichev; +Cc: bpf

On 10/27/22 10:40 AM, Stanislav Fomichev wrote:
> On Thu, Oct 27, 2022 at 10:29 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>
>> On 10/27/22 9:23 AM, Stanislav Fomichev wrote:
>>> On Wed, Oct 26, 2022 at 11:15 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>>>
>>>> On 10/26/22 7:03 PM, Stanislav Fomichev wrote:
>>>>> On Wed, Oct 26, 2022 at 6:14 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>>>>>
>>>>>> The cgroup-bpf {get,set}sockopt prog is useful to change the optname behavior.
>>>>>> The bpf prog usually just handles a few specific optnames and ignores most
>>>>>> others.  For the optnames that it ignores, it usually does not need to change
>>>>>> the optlen.  The exception is when optlen > PAGE_SIZE (or optval_end - optval).
>>>>>> The bpf prog needs to set the optlen to 0 for this case or else the kernel will
>>>>>> return -EFAULT to the userspace.  It is usually not what the bpf prog wants
>>>>>> because the bpf prog only expects error returning to userspace when it has
>>>>>> explicitly 'return 0;' or used bpf_set_retval().  If a bpf prog always changes
>>>>>> optlen for optnames that it does not care to 0,  it may risk if the latter bpf
>>>>>> prog in the same cgroup may want to change/look-at it.
>>>>>>
>>>>>> Would like to explore if there is an easier way for the bpf prog to handle it.
>>>>>> eg. does it make sense to track if the bpf prog has changed the ctx->optlen
>>>>>> before returning -EFAULT to the user space when ctx.optlen > max_optlen?
>>>>>
>>>>> Good point on chaining being broken because of this requirement :-/
>>>>>
>>>>> With tracking, we need to be careful, because the following situation
>>>>> might be problematic:
>>>>> Suppose setsockopt is larger than 4k, the program can rewrite some
>>>>> byte in the first 4k, not touch optlen and expect this to work.
>>>>
>>>> If the bpf prog rewrites the first 4k, it must change the ctx.optlen to get it
>>>> work.  Otherwise, the kernel will return -EFAULT because the ctx.optlen is
>>>> larger than the max_optlen (or optval_end - optval).
>>>>
>>>>> Currently, optlen=0 explicitly means "ignore whatever is in the bpf
>>>>> buffer and use the original one".
>>>>> If we can have a tracking that catches situations like this - we
>>>>> should be able to drop that optlen=0 requirement.
>>>>> IIRC, that's the only tricky part.
>>>>
>>>> Ah, I meant, in __cgroup_bpf_run_filter_setsockopt, use "!ctx.optlen_changed &&
>>>> ctx.optlen > max_optlen" test to imply "ignore whatever is in the bpf
>>>> buffer and use the original one".  Add 'bool optlen_changed' to 'struct
>>>> bpf_sockopt_kern' and set ctx.optlen_changed to true in
>>>> cg_sockopt_convert_ctx_access() whenever there is BPF_WRITE to ctx.optlen.
>>>> Would it work or may be I am still missing something in the writing first 4k
>>>> case above?
>>>
>>> What if the program wants to keep optlen as is? Here is the
>>> hypothetical case: ctx->optlen is 8k, we allocate/expose only the
>>> first 4k, the program does ctx->optval[0] = 0xff and doesn't change
>>> the optlen. It wants the rest of the payload to be passed as is with
>>> only the first byte changed.
>>
>> I think we are talking about the same case but we may have different
>> understanding on how the current __cgroup_bpf_run_filter_setsockopt() is
>> handling it.
>>
>> I don't see the current kernel supports this now.  If the bpf prog does not
>> change the ctx->optlen from 8k to something that is <= 4k, the kernel will just
>> return -EFAULT in here, no?
>>         else if (ctx.optlen /* 8k */ > max_optlen /* 4k */ || ctx.optlen < -1) {
>>                 /* optlen is out of bounds */
>>                 ret = -EFAULT;
>>         }
>>
>> or you meant the future change needs to consider this new case and also support
>> gluing the first 4k (that was exposed to the bpf prog) with the second 4k (that
>> was not exposed to the bpf prog)?
>>
>>> The condition "!ctx.optlen_changed && ctx.optlen > max_optlen" is
>>> true, so, if we treat this as explicit optlen=0, we ignore the
>>> program's changes.
>>> But this is not what the program has intended, right? It wants to
>>> amend something and pass the rest as is.
> 
> Right, I'm not talking about how it's handled now. Now optlen >
> max_optlen triggers EFAULT.
> But in the future, if we add tracking, we want 'optlen > max_optlen'
> to behave as explicit 'optlen = 0' as long as the user hasn't changed
> the optlen _and_ also hasn't changed anything in the buffer.

Ah, ic.

Tracking runtime buffer changes made through ctx->optval will be hard in the
current state.  I don't think we need to track that either.  If an existing
bpf prog wants the changed buf to be used, it must have changed the optlen
already.  Thus, tracking optlen only should be just as good.

If the bpf prog depends on the kernel to do an implicit -EFAULT like this,
then yes, it will break even if the buffer change is tracked.

if (ctx->optlen > ctx->optval_end - ctx->optval)
	/* 0 would be -EPERM, so return 1 to make the kernel return -EFAULT for us */
	return 1;

I would argue that it is more of a surprise than a feature if the bpf prog
depends on ctx.optlen > max_optlen (only for the > 4k case though) to do an
implicit reject (through EFAULT) instead of directly using 'return 0' or
bpf_set_retval(), which is exactly how it should be done to reject other
"normal" integer optvals.

I am also not sure how useful it is to expose partial data to the bpf prog and
have a way for the bpf prog to tell the kernel to join the remainder.  It
would be more useful to have an API that gives the bpf prog access to the
whole data.

>>> It seems like we need to have both optlen_changed and optval_changed.
>>> If both are false, we should be able to safely do optlen=0 equivalent.
>>> Tracking only optlen seems to be problematic?
>>



* Re: [Question]: BPF_CGROUP_{GET,SET}SOCKOPT handling when optlen > PAGE_SIZE
  2022-10-27 18:48           ` Martin KaFai Lau
@ 2022-10-27 19:11             ` Stanislav Fomichev
  2022-10-27 20:04               ` Martin KaFai Lau
  0 siblings, 1 reply; 12+ messages in thread
From: Stanislav Fomichev @ 2022-10-27 19:11 UTC (permalink / raw)
  To: Martin KaFai Lau; +Cc: bpf

On Thu, Oct 27, 2022 at 11:48 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 10/27/22 10:40 AM, Stanislav Fomichev wrote:
> > On Thu, Oct 27, 2022 at 10:29 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >>
> >> On 10/27/22 9:23 AM, Stanislav Fomichev wrote:
> >>> On Wed, Oct 26, 2022 at 11:15 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >>>>
> >>>> On 10/26/22 7:03 PM, Stanislav Fomichev wrote:
> >>>>> On Wed, Oct 26, 2022 at 6:14 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >>>>>>
> >>>>>> The cgroup-bpf {get,set}sockopt prog is useful to change the optname behavior.
> >>>>>> The bpf prog usually just handles a few specific optnames and ignores most
> >>>>>> others.  For the optnames that it ignores, it usually does not need to change
> >>>>>> the optlen.  The exception is when optlen > PAGE_SIZE (or optval_end - optval).
> >>>>>> The bpf prog needs to set the optlen to 0 for this case or else the kernel will
> >>>>>> return -EFAULT to the userspace.  It is usually not what the bpf prog wants
> >>>>>> because the bpf prog only expects error returning to userspace when it has
> >>>>>> explicitly 'return 0;' or used bpf_set_retval().  If a bpf prog always changes
> >>>>>> optlen for optnames that it does not care to 0,  it may risk if the latter bpf
> >>>>>> prog in the same cgroup may want to change/look-at it.
> >>>>>>
> >>>>>> Would like to explore if there is an easier way for the bpf prog to handle it.
> >>>>>> eg. does it make sense to track if the bpf prog has changed the ctx->optlen
> >>>>>> before returning -EFAULT to the user space when ctx.optlen > max_optlen?
> >>>>>
> >>>>> Good point on chaining being broken because of this requirement :-/
> >>>>>
> >>>>> With tracking, we need to be careful, because the following situation
> >>>>> might be problematic:
> >>>>> Suppose setsockopt is larger than 4k, the program can rewrite some
> >>>>> byte in the first 4k, not touch optlen and expect this to work.
> >>>>
> >>>> If the bpf prog rewrites the first 4k, it must change the ctx.optlen to get it
> >>>> work.  Otherwise, the kernel will return -EFAULT because the ctx.optlen is
> >>>> larger than the max_optlen (or optval_end - optval).
> >>>>
> >>>>> Currently, optlen=0 explicitly means "ignore whatever is in the bpf
> >>>>> buffer and use the original one".
> >>>>> If we can have a tracking that catches situations like this - we
> >>>>> should be able to drop that optlen=0 requirement.
> >>>>> IIRC, that's the only tricky part.
> >>>>
> >>>> Ah, I meant, in __cgroup_bpf_run_filter_setsockopt, use "!ctx.optlen_changed &&
> >>>> ctx.optlen > max_optlen" test to imply "ignore whatever is in the bpf
> >>>> buffer and use the original one".  Add 'bool optlen_changed' to 'struct
> >>>> bpf_sockopt_kern' and set ctx.optlen_changed to true in
> >>>> cg_sockopt_convert_ctx_access() whenever there is BPF_WRITE to ctx.optlen.
> >>>> Would it work or may be I am still missing something in the writing first 4k
> >>>> case above?
> >>>
> >>> What if the program wants to keep optlen as is? Here is the
> >>> hypothetical case: ctx->optlen is 8k, we allocate/expose only the
> >>> first 4k, the program does ctx->optval[0] = 0xff and doesn't change
> >>> the optlen. It wants the rest of the payload to be passed as is with
> >>> only the first byte changed.
> >>
> >> I think we are talking about the same case but we may have different
> >> understanding on how the current __cgroup_bpf_run_filter_setsockopt() is
> >> handling it.
> >>
> >> I don't see the current kernel supports this now.  If the bpf prog does not
> >> change the ctx->optlen from 8k to something that is <= 4k, the kernel will just
> >> return -EFAULT in here, no?
> >>         else if (ctx.optlen /* 8k */ > max_optlen /* 4k */ || ctx.optlen < -1) {
> >>                 /* optlen is out of bounds */
> >>                 ret = -EFAULT;
> >>         }
> >>
> >> or you meant the future change needs to consider this new case and also support
> >> gluing the first 4k (that was exposed to the bpf prog) with the second 4k (that
> >> was not exposed to the bpf prog)?
> >>
> >>> The condition "!ctx.optlen_changed && ctx.optlen > max_optlen" is
> >>> true, so, if we treat this as explicit optlen=0, we ignore the
> >>> program's changes.
> >>> But this is not what the program has intended, right? It wants to
> >>> amend something and pass the rest as is.
> >
> > Right, I'm not talking about how it's handled now. Now optlen >
> > max_optlen triggers EFAULT.
> > But in the future, if we add tracking, we want 'optlen > max_optlen'
> > to behave as explicit 'optlen = 0' as long as the user hasn't changed
> > the optlen _and_ also hasn't changed anything in the buffer.
>
> Ah, ic.
>
> Tracking the runtime buffer change will be hard as of the current state through
> the ctx->optval.  I don't think we need to track that either.  If the existing
> bpf prog wants the changed buf to be used, it must have changed the optlen
> already.  Thus, tracking optlen only should be as good.

I might be still missing something on why tracking optlen is enough?

Consider this BPF program:

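#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("cgroup/setsockopt")
int _setsockopt(struct bpf_sockopt *ctx)
{
    __u8 *optval_end = ctx->optval_end;
    __u8 *optval = ctx->optval;

    /* Bounds check for the verifier; returning 0 rejects the call. */
    if (optval + 1 > optval_end)
        return 0;

    /* Rewrite the first byte and leave ctx->optlen untouched. */
    optval[0] = 0xff;
    return 1;
}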

And the userspace doing the following:

__u32 buf[4096*2] = {};
ret = setsockopt(fd, SOME_LEVEL, SOME_OPTNAME, &buf, sizeof(buf));

Right now, without an explicit 'optlen = 0' in the BPF program, we get
-1/EFAULT here (admittedly a bad interface, but still better than
ignoring the program's buf?).
If we track only optlen in the program, the call succeeds, but the
kernel ignores the changed buffer. (What am I missing here?)
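
To make the two outcomes concrete, here is a small userspace model of the
check that runs after the prog returns. It is only a sketch, not kernel code:
the struct, the enum, the function names and the optlen_changed flag are made
up for illustration. Only the 'optlen > max_optlen' test, the special -1
value and the "optlen == 0 means use the original buffer" rule are taken from
the setsockopt path discussed here.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two ctx fields that matter here. */
struct sockopt_model {
	int optlen;          /* ctx.optlen as left by the prog */
	bool optlen_changed; /* proposed flag: prog wrote to ctx.optlen */
};

enum { USE_ORIGINAL_BUF = 0, USE_BPF_BUF = 1, BYPASS_KERNEL = 2 };

/* Model of the current behaviour: an untouched optlen that is larger
 * than the exposed window ends in -EFAULT. */
static int check_current(const struct sockopt_model *ctx, int max_optlen)
{
	if (ctx->optlen == -1)
		return BYPASS_KERNEL;
	if (ctx->optlen > max_optlen || ctx->optlen < -1)
		return -EFAULT;
	if (ctx->optlen == 0)
		return USE_ORIGINAL_BUF;
	return USE_BPF_BUF;
}

/* Model of optlen-only tracking: an untouched, too-large optlen is
 * treated like an explicit optlen = 0, so a write the prog made to the
 * exposed window is silently dropped. */
static int check_optlen_tracking(const struct sockopt_model *ctx,
				 int max_optlen)
{
	if (!ctx->optlen_changed && ctx->optlen > max_optlen)
		return USE_ORIGINAL_BUF;
	return check_current(ctx, max_optlen);
}
```

With optlen = 8192, max_optlen = 4096 and optlen_changed = false, the first
function gives -EFAULT and the second gives USE_ORIGINAL_BUF, which is the
"success, but the changed buffer is ignored" case.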

> If the bpf prog is depending on the kernel to do implicit -EFAULT like this,
> yes, it will break even the buffer change is tracked.
>
> if (ctx->optlen > ctx->optval_end - ctx->optval)
>      return 1;  /* 0 will be -EPERM, so 1 here to make kernel return -EFAULT for
> us */

[..]

> I would argue that it is more like a surprise than a feature if the bpf prog
> depends on ctx.optlen > max_optlen (only for the > 4k case though) to do an
> implicit reject (through EFAULT) instead of directly using the 'return 0' or
> bpf_set_retval() which is exactly how it should be done to reject other "normal"
> integer optval.

That all comes from the issue above. We want to have a contract with
the bpf program: when optlen>4k, it has to do something with the
optlen (set it to 0 to ignore, set it to <4096 to pass to the kernel).
It can't just change something in the 4k of the exposed buffer and
assume this data will be passed to the kernel.
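
As a rough illustration of that contract, here is what a prog that does not
care about an optname has to do today, written as a plain userspace model so
it can be compiled anywhere. The struct and the function name are hypothetical;
in a real prog the same comparison would be made on struct bpf_sockopt.

```c
#include <stddef.h>

/* Hypothetical userspace mirror of the ctx fields the prog looks at. */
struct sockopt_view {
	int optlen;
	unsigned char *optval;
	unsigned char *optval_end;
};

/* If the user buffer did not fit into the exposed window, set optlen to 0
 * so that the kernel falls back to the original buffer instead of
 * returning -EFAULT. Otherwise leave optlen alone. */
static int ignore_optname(struct sockopt_view *ctx)
{
	ptrdiff_t exposed = ctx->optval_end - ctx->optval;

	if (ctx->optlen > exposed)
		ctx->optlen = 0;

	return 1; /* 1 lets the syscall proceed; 0 would mean -EPERM */
}
```

The downside raised at the start of the thread still applies: once optlen is
0, a later prog in the same cgroup no longer sees the original length.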

> I am also not sure how useful it is to expose partial data to the bpf prog and
> have a way for the bpf prog to tell the kernel to join the remaining.  Instead,
> it would be more useful to have API for the bpf prog to have access to the whole
> data instead.

That seems like a better way to go? We didn't do that initially
because the data is in the __user memory and we can't pass it to bpf;
we had to do this extra copy/allocation :-( I think we decided against
copying everything because this can be abused due to no sane limit on
the setsockopt value size. Nothing prevents userspace from passing a
huge buffer when doing, say, SO_MARK; the kernel will read the first
int and be happy with it.

> >>> It seems like we need to have both optlen_changed and optval_changed.
> >>> If both are false, we should be able to safely do optlen=0 equivalent.
> >>> Tracking only optlen seems to be problematic?
> >>
>


* Re: [Question]: BPF_CGROUP_{GET,SET}SOCKOPT handling when optlen > PAGE_SIZE
  2022-10-27 19:11             ` Stanislav Fomichev
@ 2022-10-27 20:04               ` Martin KaFai Lau
  2022-10-27 20:14                 ` Stanislav Fomichev
  0 siblings, 1 reply; 12+ messages in thread
From: Martin KaFai Lau @ 2022-10-27 20:04 UTC (permalink / raw)
  To: Stanislav Fomichev; +Cc: bpf

On 10/27/22 12:11 PM, Stanislav Fomichev wrote:
> On Thu, Oct 27, 2022 at 11:48 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>
>> On 10/27/22 10:40 AM, Stanislav Fomichev wrote:
>>> On Thu, Oct 27, 2022 at 10:29 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>>>
>>>> On 10/27/22 9:23 AM, Stanislav Fomichev wrote:
>>>>> On Wed, Oct 26, 2022 at 11:15 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>>>>>
>>>>>> On 10/26/22 7:03 PM, Stanislav Fomichev wrote:
>>>>>>> On Wed, Oct 26, 2022 at 6:14 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>>>>>>>
>>>>>>>> The cgroup-bpf {get,set}sockopt prog is useful to change the optname behavior.
>>>>>>>> The bpf prog usually just handles a few specific optnames and ignores most
>>>>>>>> others.  For the optnames that it ignores, it usually does not need to change
>>>>>>>> the optlen.  The exception is when optlen > PAGE_SIZE (or optval_end - optval).
>>>>>>>> The bpf prog needs to set the optlen to 0 for this case or else the kernel will
>>>>>>>> return -EFAULT to the userspace.  It is usually not what the bpf prog wants
>>>>>>>> because the bpf prog only expects error returning to userspace when it has
>>>>>>>> explicitly 'return 0;' or used bpf_set_retval().  If a bpf prog always changes
>>>>>>>> optlen for optnames that it does not care to 0,  it may risk if the latter bpf
>>>>>>>> prog in the same cgroup may want to change/look-at it.
>>>>>>>>
>>>>>>>> Would like to explore if there is an easier way for the bpf prog to handle it.
>>>>>>>> eg. does it make sense to track if the bpf prog has changed the ctx->optlen
>>>>>>>> before returning -EFAULT to the user space when ctx.optlen > max_optlen?
>>>>>>>
>>>>>>> Good point on chaining being broken because of this requirement :-/
>>>>>>>
>>>>>>> With tracking, we need to be careful, because the following situation
>>>>>>> might be problematic:
>>>>>>> Suppose setsockopt is larger than 4k, the program can rewrite some
>>>>>>> byte in the first 4k, not touch optlen and expect this to work.
>>>>>>
>>>>>> If the bpf prog rewrites the first 4k, it must change the ctx.optlen to get it
>>>>>> work.  Otherwise, the kernel will return -EFAULT because the ctx.optlen is
>>>>>> larger than the max_optlen (or optval_end - optval).
>>>>>>
>>>>>>> Currently, optlen=0 explicitly means "ignore whatever is in the bpf
>>>>>>> buffer and use the original one"
>>>>>>> If we can have a tracking that catches situations like this - we
>>>>>>> should be able to drop that optlen=0 requirement.
>>>>>>> IIRC, that's the only tricky part.
>>>>>>
>>>>>> Ah, I meant, in __cgroup_bpf_run_filter_setsockopt, use "!ctx.optlen_changed &&
>>>>>> ctx.optlen > max_optlen" test to imply "ignore whatever is in the bpf
>>>>>> buffer and use the original one".  Add 'bool optlen_changed' to 'struct
>>>>>> bpf_sockopt_kern' and set ctx.optlen_changed to true in
>>>>>> cg_sockopt_convert_ctx_access() whenever there is BPF_WRITE to ctx.optlen.
>>>>>> Would it work or may be I am still missing something in the writing first 4k
>>>>>> case above?
>>>>>
>>>>> What if the program wants to keep optlen as is? Here is the
>>>>> hypothetical case: ctx->optlen is 8k, we allocate/expose only the
>>>>> first 4k, the program does ctx->optval[0] = 0xff and doesn't change
>>>>> the optlen. It wants the rest of the payload to be passed as is with
>>>>> only the first byte changed.
>>>>
>>>> I think we are talking about the same case but we may have different
>>>> understanding on how the current __cgroup_bpf_run_filter_setsockopt() is
>>>> handling it.
>>>>
>>>> I don't see the current kernel supports this now.  If the bpf prog does not
>>>> change the ctx->optlen from 8k to something that is <= 4k, the kernel will just
>>>> return -EFAULT in here, no?
>>>>           else if (ctx.optlen /* 8k */ > max_optlen /* 4k */ || ctx.optlen < -1) {
>>>>                   /* optlen is out of bounds */
>>>>                    ret = -EFAULT;
>>>>            }
>>>>
>>>> or you meant the future change needs to consider this new case and also support
>>>> gluing the first 4k (that was exposed to the bpf prog) with the second 4k (that
>>>> was not exposed to the bpf prog)?
>>>>
>>>>> The condition "!ctx.optlen_changed && ctx.optlen > max_optlen" is
>>>>> true, so, if we treat this as explicit optlen=0, we ignore the
>>>>> program's changes.
>>>>> But this is not what the program has intended, right? It wants to
>>>>> amend something and pass the rest as is.
>>>
>>> Right, I'm not talking about how it's handled now. Now optlen >
>>> max_optlen triggers EFAULT.
>>> But in the future, if we add tracking, we want 'optlen > max_optlen'
>>> to behave as explicit 'optlen = 0' as long as the user hasn't changed
>>> the optlen _and_ also hasn't changed anything in the buffer.
>>
>> Ah, ic.
>>
>> Tracking the runtime buffer change will be hard as of the current state through
>> the ctx->optval.  I don't think we need to track that either.  If the existing
>> bpf prog wants the changed buf to be used, it must have changed the optlen
>> already.  Thus, tracking optlen only should be as good.
> 
> I might be still missing something on why tracking optlen is enough?
> 
> Consider this BPF program:
> 
> SEC("cgroup/setsockopt")
> int _setsockopt(struct bpf_sockopt *ctx)
> {
>      __u8 *optval_end = ctx->optval_end;
>      __u8 *optval = ctx->optval;
> 
>      if (optval + 1 > optval_end) return 0;
> 
>      optval[0] = 0xff;
>      return 1;
> }
> 
> And the userspace doing the following:
> 
> __u32 buf[4096*2] = {};
> ret = setsockopt(fd, SOME_LEVEL, SOME_OPTNAME, &buf, sizeof(buf));
> 
> Right now, without explicit 'optlen = 0' in the BPF program, we'll get
> -1/EFAULT here (unarguably, this is a bad interface, but still better
> than ignoring program's buf?).
> If we track only optlen in the program, we'd get success, but the
> changed buffer will be ignored by the kernel. (what am I missing
> here?)

Right, this will break if the bpf prog depends on this -EFAULT behavior in
any way.  Similar to my example below, tracking the buffer change still won't be
enough because we don't know the intention of the bpf prog (it changed the buffer
but forgot to update optlen, or it does want to return -EFAULT).

After these few examples in the thread, I think this optlen and buffer tracking
does not seem to be a viable path to solve it.  It seems like it is only
papering over it.

> 
>> If the bpf prog is depending on the kernel to do implicit -EFAULT like this,
>> yes, it will break even the buffer change is tracked.
>>
>> if (ctx->optlen > ctx->optval_end - ctx->optval)
>>       return 1;  /* 0 will be -EPERM, so 1 here to make kernel return -EFAULT for
>> us */
> 
> [..]
> 
>> I would argue that it is more like a surprise than a feature if the bpf prog
>> depends on ctx.optlen > max_optlen (only for the > 4k case though) to do an
>> implicit reject (through EFAULT) instead of directly using the 'return 0' or
>> bpf_set_retval() which is exactly how it should be done to reject other "normal"
>> integer optval.
> 
> That all comes from the issue above. We want to have a contract with
> the bpf program: when optlen>4k, it has to do something with the
> optlen (set it to 0 to ignore, set it to <4096 to pass to the kernel).
> It can't just change something in the 4k of the exposed buffer and
> assume this data will be passed to the kernel.
> 
>> I am also not sure how useful it is to expose partial data to the bpf prog and
>> have a way for the bpf prog to tell the kernel to join the remaining.  Instead,
>> it would be more useful to have API for the bpf prog to have access to the whole
>> data instead.
> 
> That seems like a better way to go? We didn't do that initially
> because the data is in the __user memory and we can't pass it to bpf;
> we had to do this extra copy/allocation :-( I think we decided against
> copying everything because this can be abused due to no sane limit on
> the setsockopt value size. Nothing prevents userspace from passing a
> huge buffer when doing, say, SO_MARK; the kernel will read the first
> int and be happy with it.

Yeah, maybe one thing for the future API is to avoid the pre-allocation.  There
is bpf_copy_from_user(), but the prog needs to be sleepable to use it.

> 
>>>>> It seems like we need to have both optlen_changed and optval_changed.
>>>>> If both are false, we should be able to safely do optlen=0 equivalent.
>>>>> Tracking only optlen seems to be problematic?
>>>>
>>



* Re: [Question]: BPF_CGROUP_{GET,SET}SOCKOPT handling when optlen > PAGE_SIZE
  2022-10-27 20:04               ` Martin KaFai Lau
@ 2022-10-27 20:14                 ` Stanislav Fomichev
  0 siblings, 0 replies; 12+ messages in thread
From: Stanislav Fomichev @ 2022-10-27 20:14 UTC (permalink / raw)
  To: Martin KaFai Lau; +Cc: bpf

On Thu, Oct 27, 2022 at 1:04 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 10/27/22 12:11 PM, Stanislav Fomichev wrote:
> > On Thu, Oct 27, 2022 at 11:48 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >>
> >> On 10/27/22 10:40 AM, Stanislav Fomichev wrote:
> >>> On Thu, Oct 27, 2022 at 10:29 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >>>>
> >>>> On 10/27/22 9:23 AM, Stanislav Fomichev wrote:
> >>>>> On Wed, Oct 26, 2022 at 11:15 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >>>>>>
> >>>>>> On 10/26/22 7:03 PM, Stanislav Fomichev wrote:
> >>>>>>> On Wed, Oct 26, 2022 at 6:14 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >>>>>>>>
> >>>>>>>> The cgroup-bpf {get,set}sockopt prog is useful to change the optname behavior.
> >>>>>>>> The bpf prog usually just handles a few specific optnames and ignores most
> >>>>>>>> others.  For the optnames that it ignores, it usually does not need to change
> >>>>>>>> the optlen.  The exception is when optlen > PAGE_SIZE (or optval_end - optval).
> >>>>>>>> The bpf prog needs to set the optlen to 0 for this case or else the kernel will
> >>>>>>>> return -EFAULT to the userspace.  It is usually not what the bpf prog wants
> >>>>>>>> because the bpf prog only expects error returning to userspace when it has
> >>>>>>>> explicitly 'return 0;' or used bpf_set_retval().  If a bpf prog always changes
> >>>>>>>> optlen for optnames that it does not care to 0,  it may risk if the latter bpf
> >>>>>>>> prog in the same cgroup may want to change/look-at it.
> >>>>>>>>
> >>>>>>>> Would like to explore if there is an easier way for the bpf prog to handle it.
> >>>>>>>> eg. does it make sense to track if the bpf prog has changed the ctx->optlen
> >>>>>>>> before returning -EFAULT to the user space when ctx.optlen > max_optlen?
> >>>>>>>
> >>>>>>> Good point on chaining being broken because of this requirement :-/
> >>>>>>>
> >>>>>>> With tracking, we need to be careful, because the following situation
> >>>>>>> might be problematic:
> >>>>>>> Suppose setsockopt is larger than 4k, the program can rewrite some
> >>>>>>> byte in the first 4k, not touch optlen and expect this to work.
> >>>>>>
> >>>>>> If the bpf prog rewrites the first 4k, it must change the ctx.optlen to get it
> >>>>>> work.  Otherwise, the kernel will return -EFAULT because the ctx.optlen is
> >>>>>> larger than the max_optlen (or optval_end - optval).
> >>>>>>
> >>>>>>> Currently, optlen=0 explicitly means "ignore whatever is in the bpf
> >>>>>>> buffer and use the original one"
> >>>>>>> If we can have a tracking that catches situations like this - we
> >>>>>>> should be able to drop that optlen=0 requirement.
> >>>>>>> IIRC, that's the only tricky part.
> >>>>>>
> >>>>>> Ah, I meant, in __cgroup_bpf_run_filter_setsockopt, use "!ctx.optlen_changed &&
> >>>>>> ctx.optlen > max_optlen" test to imply "ignore whatever is in the bpf
> >>>>>> buffer and use the original one".  Add 'bool optlen_changed' to 'struct
> >>>>>> bpf_sockopt_kern' and set ctx.optlen_changed to true in
> >>>>>> cg_sockopt_convert_ctx_access() whenever there is BPF_WRITE to ctx.optlen.
> >>>>>> Would it work or may be I am still missing something in the writing first 4k
> >>>>>> case above?
> >>>>>
> >>>>> What if the program wants to keep optlen as is? Here is the
> >>>>> hypothetical case: ctx->optlen is 8k, we allocate/expose only the
> >>>>> first 4k, the program does ctx->optval[0] = 0xff and doesn't change
> >>>>> the optlen. It wants the rest of the payload to be passed as is with
> >>>>> only the first byte changed.
> >>>>
> >>>> I think we are talking about the same case but we may have different
> >>>> understanding on how the current __cgroup_bpf_run_filter_setsockopt() is
> >>>> handling it.
> >>>>
> >>>> I don't see the current kernel supports this now.  If the bpf prog does not
> >>>> change the ctx->optlen from 8k to something that is <= 4k, the kernel will just
> >>>> return -EFAULT in here, no?
> >>>>           else if (ctx.optlen /* 8k */ > max_optlen /* 4k */ || ctx.optlen < -1) {
> >>>>                   /* optlen is out of bounds */
> >>>>                    ret = -EFAULT;
> >>>>            }
> >>>>
> >>>> or you meant the future change needs to consider this new case and also support
> >>>> gluing the first 4k (that was exposed to the bpf prog) with the second 4k (that
> >>>> was not exposed to the bpf prog)?
> >>>>
> >>>>> The condition "!ctx.optlen_changed && ctx.optlen > max_optlen" is
> >>>>> true, so, if we treat this as explicit optlen=0, we ignore the
> >>>>> program's changes.
> >>>>> But this is not what the program has intended, right? It wants to
> >>>>> amend something and pass the rest as is.
> >>>
> >>> Right, I'm not talking about how it's handled now. Now optlen >
> >>> max_optlen triggers EFAULT.
> >>> But in the future, if we add tracking, we want 'optlen > max_optlen'
> >>> to behave as explicit 'optlen = 0' as long as the user hasn't changed
> >>> the optlen _and_ also hasn't changed anything in the buffer.
> >>
> >> Ah, ic.
> >>
> >> Tracking the runtime buffer change will be hard as of the current state through
> >> the ctx->optval.  I don't think we need to track that either.  If the existing
> >> bpf prog wants the changed buf to be used, it must have changed the optlen
> >> already.  Thus, tracking optlen only should be as good.
> >
> > I might be still missing something on why tracking optlen is enough?
> >
> > Consider this BPF program:
> >
> > SEC("cgroup/setsockopt")
> > int _setsockopt(struct bpf_sockopt *ctx)
> > {
> >      __u8 *optval_end = ctx->optval_end;
> >      __u8 *optval = ctx->optval;
> >
> >      if (optval + 1 > optval_end) return 0;
> >
> >      optval[0] = 0xff;
> >      return 1;
> > }
> >
> > And the userspace doing the following:
> >
> > __u32 buf[4096*2] = {};
> > ret = setsockopt(fd, SOME_LEVEL, SOME_OPTNAME, &buf, sizeof(buf));
> >
> > Right now, without explicit 'optlen = 0' in the BPF program, we'll get
> > -1/EFAULT here (unarguably, this is a bad interface, but still better
> > than ignoring program's buf?).
> > If we track only optlen in the program, we'd get success, but the
> > changed buffer will be ignored by the kernel. (what am I missing
> > here?)
>
> Right, this will break if the bpf prog depends on this -EFAULT behavior in
> anyway.  Similar to my example below, tracking the buffer change still won't be
> enough because we don't know the intention of the bpf prog (changed but forgot
> to update optlen or it does want to return -EFAULT).
>
> After these few examples in the thread, I think this optlen and buffer tracking
> does not seem to be a tangible path to solve it.  It seems like it is only
> papering around it.
>
> >
> >> If the bpf prog is depending on the kernel to do implicit -EFAULT like this,
> >> yes, it will break even the buffer change is tracked.
> >>
> >> if (ctx->optlen > ctx->optval_end - ctx->optval)
> >>       return 1;  /* 0 will be -EPERM, so 1 here to make kernel return -EFAULT for
> >> us */
> >
> > [..]
> >
> >> I would argue that it is more like a surprise than a feature if the bpf prog
> >> depends on ctx.optlen > max_optlen (only for the > 4k case though) to do an
> >> implicit reject (through EFAULT) instead of directly using the 'return 0' or
> >> bpf_set_retval() which is exactly how it should be done to reject other "normal"
> >> integer optval.
> >
> > That all comes from the issue above. We want to have a contract with
> > the bpf program: when optlen>4k, it has to do something with the
> > optlen (set it to 0 to ignore, set it to <4096 to pass to the kernel).
> > It can't just change something in the 4k of the exposed buffer and
> > assume this data will be passed to the kernel.
> >
> >> I am also not sure how useful it is to expose partial data to the bpf prog and
> >> have a way for the bpf prog to tell the kernel to join the remaining.  Instead,
> >> it would be more useful to have API for the bpf prog to have access to the whole
> >> data instead.
> >
> > That seems like a better way to go? We didn't do that initially
> > because the data is in the __user memory and we can't pass it to bpf;
> > we had to do this extra copy/allocation :-( I think we decided against
> > copying everything because this can be abused due to no sane limit on
> > the setsockopt value size. Nothing prevents userspace from passing a
> > huge buffer when doing, say, SO_MARK; the kernel will read the first
> > int and be happy with it.
>
> yeah, may be one thing for the future API is to avoid the pre allocation.  There
> is bpf_copy_from_user but it needs to be sleepable.

+1, having a sleepable version might be a cleaner alternative.

> >
> >>>>> It seems like we need to have both optlen_changed and optval_changed.
> >>>>> If both are false, we should be able to safely do optlen=0 equivalent.
> >>>>> Tracking only optlen seems to be problematic?
> >>>>
> >>
>


* Re: [Question]: BPF_CGROUP_{GET,SET}SOCKOPT handling when optlen > PAGE_SIZE
  2022-10-27  1:14 [Question]: BPF_CGROUP_{GET,SET}SOCKOPT handling when optlen > PAGE_SIZE Martin KaFai Lau
  2022-10-27  2:03 ` Stanislav Fomichev
@ 2022-10-27 20:46 ` Andrii Nakryiko
  2022-10-27 23:46   ` Martin KaFai Lau
  1 sibling, 1 reply; 12+ messages in thread
From: Andrii Nakryiko @ 2022-10-27 20:46 UTC (permalink / raw)
  To: Martin KaFai Lau; +Cc: Stanislav Fomichev, bpf

On Wed, Oct 26, 2022 at 6:17 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> The cgroup-bpf {get,set}sockopt prog is useful to change the optname behavior.
> The bpf prog usually just handles a few specific optnames and ignores most
> others.  For the optnames that it ignores, it usually does not need to change
> the optlen.  The exception is when optlen > PAGE_SIZE (or optval_end - optval).
> The bpf prog needs to set the optlen to 0 for this case or else the kernel will
> return -EFAULT to the userspace.  It is usually not what the bpf prog wants
> because the bpf prog only expects error returning to userspace when it has
> explicitly 'return 0;' or used bpf_set_retval().  If a bpf prog always changes
> optlen for optnames that it does not care to 0,  it may risk if the latter bpf
> prog in the same cgroup may want to change/look-at it.
>
> Would like to explore if there is an easier way for the bpf prog to handle it.
> eg. does it make sense to track if the bpf prog has changed the ctx->optlen
> before returning -EFAULT to the user space when ctx.optlen > max_optlen?

Maybe optlen + optval/optval_end could be replaced with dynptr? If we
do a new type of dynptr (DYNPTR_CTXBUF or something like that), we can
implement tracking of whether it was ever modified through
bpf_dynptr_write() or if direct memory access was ever used (was
bpf_dynptr_data() called). Not sure how you'd know if
bpf_dynptr_data() was used to modify data, though (this is where
bpf_dynptr_data_rdonly() vs bpf_dynptr_data() would be helpful,
perhaps). But just a seed of an idea, maybe you guys can somehow fit
it here?
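
A very rough userspace model of the tracking I have in mind. None of this is
an existing kernel API: struct ctxbuf_dynptr and the ctxbuf_*() helpers are
hypothetical names that just mirror the roles of bpf_dynptr_write(),
bpf_dynptr_data() and a possible bpf_dynptr_data_rdonly().

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of a ctx-buffer dynptr with modification tracking. */
struct ctxbuf_dynptr {
	unsigned char *data;
	size_t size;
	bool written;     /* set by ctxbuf_write() */
	bool slice_taken; /* set by ctxbuf_data(): a write can't be ruled out */
};

/* Bounds-checked write; the only path that modifies the buffer for sure. */
static int ctxbuf_write(struct ctxbuf_dynptr *p, size_t off,
			const void *src, size_t len)
{
	if (off > p->size || len > p->size - off)
		return -1;
	memcpy(p->data + off, src, len);
	p->written = true;
	return 0;
}

/* Writable slice: once handed out, the buffer must be assumed modified. */
static void *ctxbuf_data(struct ctxbuf_dynptr *p, size_t off, size_t len)
{
	if (off > p->size || len > p->size - off)
		return NULL;
	p->slice_taken = true;
	return p->data + off;
}

/* Read-only slice: does not affect the tracking state. */
static const void *ctxbuf_data_rdonly(const struct ctxbuf_dynptr *p,
				      size_t off, size_t len)
{
	if (off > p->size || len > p->size - off)
		return NULL;
	return p->data + off;
}

static bool ctxbuf_maybe_modified(const struct ctxbuf_dynptr *p)
{
	return p->written || p->slice_taken;
}
```

The open question from above shows up in ctxbuf_data(): taking a writable
slice can only be recorded as "maybe modified", which is why a separate
read-only accessor would help.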


* Re: [Question]: BPF_CGROUP_{GET,SET}SOCKOPT handling when optlen > PAGE_SIZE
  2022-10-27 20:46 ` Andrii Nakryiko
@ 2022-10-27 23:46   ` Martin KaFai Lau
  0 siblings, 0 replies; 12+ messages in thread
From: Martin KaFai Lau @ 2022-10-27 23:46 UTC (permalink / raw)
  To: Andrii Nakryiko; +Cc: Stanislav Fomichev, bpf

On 10/27/22 1:46 PM, Andrii Nakryiko wrote:
> On Wed, Oct 26, 2022 at 6:17 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>
>> The cgroup-bpf {get,set}sockopt prog is useful to change the optname behavior.
>> The bpf prog usually just handles a few specific optnames and ignores most
>> others.  For the optnames that it ignores, it usually does not need to change
>> the optlen.  The exception is when optlen > PAGE_SIZE (or optval_end - optval).
>> The bpf prog needs to set the optlen to 0 for this case or else the kernel will
>> return -EFAULT to the userspace.  It is usually not what the bpf prog wants
>> because the bpf prog only expects error returning to userspace when it has
>> explicitly 'return 0;' or used bpf_set_retval().  If a bpf prog always changes
>> optlen for optnames that it does not care to 0,  it may risk if the latter bpf
>> prog in the same cgroup may want to change/look-at it.
>>
>> Would like to explore if there is an easier way for the bpf prog to handle it.
>> eg. does it make sense to track if the bpf prog has changed the ctx->optlen
>> before returning -EFAULT to the user space when ctx.optlen > max_optlen?
> 
> Maybe optlen + optval/optval_end could be replaced with dynptr? If we
> do a new type of dynptr (DYNPTR_CTXBUF or something like that), we can
> implement tracking of whether it was ever modified through
> bpf_dynptr_write() or if direct memory access was ever used (was
> bpf_dynptr_data() called). Not sure how you'd know if
> bpf_dynptr_data() was used to modify data, though (this is where
> bpf_dynptr_data_rdonly() vs bpf_dynptr_data() would be helpful,
> perhaps). But just a seed of an idea, maybe you guys can somehow fit
> it here?

Yep, dynptr can be used here.  Maybe one dynptr specifically for sockptr_t.  The
dynptr will hold the __user pointer first to avoid a big (and potentially bogus)
kernel buffer pre-allocation.  The prog then reads and writes it only through
bpf_dynptr_read() and bpf_dynptr_write().  Since it is a __user pointer, it won't
be directly accessible as a data slice through bpf_dynptr_data().  It should not
be a perf issue; most of the common optvals are an integer.  The worst common case
is the 16-byte tcp-cc name.  The bpf prog has to be sleepable first though, which
I think will be a useful thing to have.
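
A minimal userspace sketch of that shape, under loud assumptions: struct
user_dynptr and the user_dynptr_*() helpers are hypothetical, the 'user'
member stands in for a __user pointer, and memcpy() stands in for
copy_from_user()/copy_to_user() (which is where the sleepable requirement
comes from).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical dynptr that only remembers where the user buffer is. */
struct user_dynptr {
	unsigned char *user; /* stand-in for a __user pointer */
	size_t size;         /* optlen as passed by userspace */
};

/* Copy just the bytes the prog asks for; nothing is pre-allocated. */
static int user_dynptr_read(void *dst, size_t len,
			    const struct user_dynptr *p, size_t off)
{
	if (off > p->size || len > p->size - off)
		return -1;
	memcpy(dst, p->user + off, len); /* copy_from_user() in the kernel */
	return 0;
}

static int user_dynptr_write(struct user_dynptr *p, size_t off,
			     const void *src, size_t len)
{
	if (off > p->size || len > p->size - off)
		return -1;
	memcpy(p->user + off, src, len); /* copy_to_user() in the kernel */
	return 0;
}

/* No data slice: user memory can't be handed out as a direct pointer. */
static void *user_dynptr_data(struct user_dynptr *p, size_t off, size_t len)
{
	(void)p;
	(void)off;
	(void)len;
	return NULL;
}
```

A prog interested in an integer optval would read sizeof(int) bytes at offset
0 and never touch the rest, no matter how large the user buffer is.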


Thread overview: 12+ messages
2022-10-27  1:14 [Question]: BPF_CGROUP_{GET,SET}SOCKOPT handling when optlen > PAGE_SIZE Martin KaFai Lau
2022-10-27  2:03 ` Stanislav Fomichev
2022-10-27  6:15   ` Martin KaFai Lau
2022-10-27 16:23     ` Stanislav Fomichev
2022-10-27 17:28       ` Martin KaFai Lau
2022-10-27 17:40         ` Stanislav Fomichev
2022-10-27 18:48           ` Martin KaFai Lau
2022-10-27 19:11             ` Stanislav Fomichev
2022-10-27 20:04               ` Martin KaFai Lau
2022-10-27 20:14                 ` Stanislav Fomichev
2022-10-27 20:46 ` Andrii Nakryiko
2022-10-27 23:46   ` Martin KaFai Lau
