Support for loading firewall rules with cgroup(v2) expressions early

* Support for loading firewall rules with cgroup(v2) expressions early
@ 2022-03-26 10:09 Topi Miettinen
  2022-03-27 21:31 ` Pablo Neira Ayuso
  0 siblings, 1 reply; 17+ messages in thread
From: Topi Miettinen @ 2022-03-26 10:09 UTC (permalink / raw)
  To: netfilter-devel

Hi,

I'd like to use cgroupv2 expressions in firewall rules. But because the
rules are loaded very early during boot, the expressions are rejected,
since the target cgroups are not realized until much later.

Would it be possible to add new cgroupv2 expressions which defer the 
check until actual use? For example, 'cgroupv2name' (like iifname etc.) 
would check the cgroup path string at rule use time?

Another possibility would be to hook into cgroup directory creation 
logic in kernel so that when the cgroup is created, part of the path 
checks are performed or something else which would allow non-existent 
cgroups to be used. Then the NFT syntax would not need changing, but the 
expressions would "just work" even when loaded early.

Indirection through sets ('socket cgroupv2 level @lvl @cgname drop') 
might work in some cases, but it would need support from cgroup manager 
like systemd which would manage the sets. This would also probably not 
be scalable to unprivileged users or containers.

This also applies to old cgroup (v1) expression but that's probably not 
worth improving anymore.

Related work on systemd side:
https://github.com/systemd/systemd/issues/22527

-Topi


* Re: Support for loading firewall rules with cgroup(v2) expressions early
  2022-03-26 10:09 Support for loading firewall rules with cgroup(v2) expressions early Topi Miettinen
@ 2022-03-27 21:31 ` Pablo Neira Ayuso
  2022-03-28 14:08   ` Topi Miettinen
  0 siblings, 1 reply; 17+ messages in thread
From: Pablo Neira Ayuso @ 2022-03-27 21:31 UTC (permalink / raw)
  To: Topi Miettinen; +Cc: netfilter-devel

Hi,

On Sat, Mar 26, 2022 at 12:09:26PM +0200, Topi Miettinen wrote:
> Hi,
> 
> I'd like to use cgroupv2 expressions in firewall rules. But since the rules
> are loaded very early in the boot, the expressions are rejected since the
> target cgroups are not realized until much later.
> 
> Would it be possible to add new cgroupv2 expressions which defer the check
> until actual use? For example, 'cgroupv2name' (like iifname etc.) would
> check the cgroup path string at rule use time?
> 
> Another possibility would be to hook into cgroup directory creation logic in
> kernel so that when the cgroup is created, part of the path checks are
> performed or something else which would allow non-existent cgroups to be
> used. Then the NFT syntax would not need changing, but the expressions would
> "just work" even when loaded early.

Could you use inotify/dnotify/eventfd to track these updates from
userspace and update the nftables sets accordingly? AFAIK, this is
available to cgroupsv2.

> Indirection through sets ('socket cgroupv2 level @lvl @cgname drop') might
> work in some cases, but it would need support from cgroup manager like
> systemd which would manage the sets. This would also probably not be
> scalable to unprivileged users or containers.
> 
> This also applies to old cgroup (v1) expression but that's probably not
> worth improving anymore.
> 
> Related work on systemd side:
> https://github.com/systemd/systemd/issues/22527
> 
> -Topi


* Re: Support for loading firewall rules with cgroup(v2) expressions early
  2022-03-27 21:31 ` Pablo Neira Ayuso
@ 2022-03-28 14:08   ` Topi Miettinen
  2022-03-28 15:05     ` Pablo Neira Ayuso
  0 siblings, 1 reply; 17+ messages in thread
From: Topi Miettinen @ 2022-03-28 14:08 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel

On 28.3.2022 0.31, Pablo Neira Ayuso wrote:
> Hi,
> 
> On Sat, Mar 26, 2022 at 12:09:26PM +0200, Topi Miettinen wrote:
>> Hi,
>>
>> I'd like to use cgroupv2 expressions in firewall rules. But since the rules
>> are loaded very early in the boot, the expressions are rejected since the
>> target cgroups are not realized until much later.
>>
>> Would it be possible to add new cgroupv2 expressions which defer the check
>> until actual use? For example, 'cgroupv2name' (like iifname etc.) would
>> check the cgroup path string at rule use time?
>>
>> Another possibility would be to hook into cgroup directory creation logic in
>> kernel so that when the cgroup is created, part of the path checks are
>> performed or something else which would allow non-existent cgroups to be
>> used. Then the NFT syntax would not need changing, but the expressions would
>> "just work" even when loaded early.
> 
> Could you use inotify/dnotify/eventfd to track these updates from
> userspace and update the nftables sets accordingly? AFAIK, this is
> available to cgroupsv2.

It's possible; there are, for example:
https://github.com/mk-fg/systemd-cgroup-nftables-policy-manager
https://github.com/helsinki-systems/nft_cgroupv2/

But I think that with this approach, depending on system load, there 
could be a vulnerable time window where the rules aren't loaded yet but 
the process which is supposed to be protected by the rules has already 
started running. This isn't desirable for firewalls, so I'd like to have 
a way for loading the firewall rules as early as possible.

-Topi

> 
>> Indirection through sets ('socket cgroupv2 level @lvl @cgname drop') might
>> work in some cases, but it would need support from cgroup manager like
>> systemd which would manage the sets. This would also probably not be
>> scalable to unprivileged users or containers.
>>
>> This also applies to old cgroup (v1) expression but that's probably not
>> worth improving anymore.
>>
>> Related work on systemd side:
>> https://github.com/systemd/systemd/issues/22527
>>
>> -Topi



* Re: Support for loading firewall rules with cgroup(v2) expressions early
  2022-03-28 14:08   ` Topi Miettinen
@ 2022-03-28 15:05     ` Pablo Neira Ayuso
  2022-03-28 17:46       ` Topi Miettinen
  2022-03-29 18:20       ` Topi Miettinen
  0 siblings, 2 replies; 17+ messages in thread
From: Pablo Neira Ayuso @ 2022-03-28 15:05 UTC (permalink / raw)
  To: Topi Miettinen; +Cc: netfilter-devel

On Mon, Mar 28, 2022 at 05:08:32PM +0300, Topi Miettinen wrote:
> On 28.3.2022 0.31, Pablo Neira Ayuso wrote:
> > On Sat, Mar 26, 2022 at 12:09:26PM +0200, Topi Miettinen wrote:
[...]
> > > Another possibility would be to hook into cgroup directory creation logic in
> > > kernel so that when the cgroup is created, part of the path checks are
> > > performed or something else which would allow non-existent cgroups to be
> > > used. Then the NFT syntax would not need changing, but the expressions would
> > > "just work" even when loaded early.
> > 
> > Could you use inotify/dnotify/eventfd to track these updates from
> > userspace and update the nftables sets accordingly? AFAIK, this is
> > available to cgroupsv2.
> 
> It's possible, there's for example:
> https://github.com/mk-fg/systemd-cgroup-nftables-policy-manager

This one seems to add one rule per cgroupv2. It would be better to
use a map for this purpose, for scalability reasons.

> https://github.com/helsinki-systems/nft_cgroupv2/

The approach above takes us back to the linear ruleset evaluation
problem: this basically looks like iptables and does not scale up.

> But I think that with this approach, depending on system load, there could
> be a vulnerable time window where the rules aren't loaded yet but the
> process which is supposed to be protected by the rules has already started
> running. This isn't desirable for firewalls, so I'd like to have a way for
> loading the firewall rules as early as possible.

You could define a static ruleset which creates the table, basechain
and the cgroupv2 verdict map. Then, systemd updates this map with new
entries to match on cgroupsv2 and apply the corresponding policy for
this process, and deletes them when they are no longer needed. You
have to define one non-basechain for each cgroupv2 policy.

To address the vulnerable time window, the static ruleset defines a
default policy that allows nothing until an explicit policy based on
cgroupv2 for this process is in place.

The cgroupv2 support for nftables was designed to be used with maps.
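
The map updates described above could be driven by a small helper in
the cgroup manager. The following is only a sketch under assumptions:
the function name vmap_element_cmd and its parameters are hypothetical,
the table and map names are placeholders, and cgroup names containing
quotes or other characters that need escaping are simply rejected
rather than handled. It only builds the nft command line; running it
(for example with subprocess.run) is left to the caller.

```python
def vmap_element_cmd(action, table, vmap, cgroup_path, chain=None, family="inet"):
    """Build the argv that adds or deletes one cgroupv2 entry of a verdict map.

    Hypothetical helper: all names are placeholders, nothing is executed here.
    """
    if action not in ("add", "delete"):
        raise ValueError("action must be 'add' or 'delete'")
    if '"' in cgroup_path or "\\" in cgroup_path:
        # Escaping rules are out of scope for this sketch.
        raise ValueError("unsupported character in cgroup path")
    if action == "add":
        if chain is None:
            raise ValueError("a target chain is required when adding")
        element = '{ "%s" : jump %s }' % (cgroup_path, chain)
    else:
        # Deleting a map element only needs the key.
        element = '{ "%s" }' % cgroup_path
    return ["nft", action, "element", family, table, vmap, element]
```

For example, vmap_element_cmd("add", "filter", "dict", "system.slice",
"system_slice") yields the arguments of
nft add element inet filter dict { "system.slice" : jump system_slice }.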


* Re: Support for loading firewall rules with cgroup(v2) expressions early
  2022-03-28 15:05     ` Pablo Neira Ayuso
@ 2022-03-28 17:46       ` Topi Miettinen
  2022-03-29 18:20       ` Topi Miettinen
  1 sibling, 0 replies; 17+ messages in thread
From: Topi Miettinen @ 2022-03-28 17:46 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel

On 28.3.2022 18.05, Pablo Neira Ayuso wrote:
> On Mon, Mar 28, 2022 at 05:08:32PM +0300, Topi Miettinen wrote:
>> On 28.3.2022 0.31, Pablo Neira Ayuso wrote:
>>> On Sat, Mar 26, 2022 at 12:09:26PM +0200, Topi Miettinen wrote:
> [...]
>>>> Another possibility would be to hook into cgroup directory creation logic in
>>>> kernel so that when the cgroup is created, part of the path checks are
>>>> performed or something else which would allow non-existent cgroups to be
>>>> used. Then the NFT syntax would not need changing, but the expressions would
>>>> "just work" even when loaded early.
>>>
>>> Could you use inotify/dnotify/eventfd to track these updates from
>>> userspace and update the nftables sets accordingly? AFAIK, this is
>>> available to cgroupsv2.
>>
>> It's possible, there's for example:
>> https://github.com/mk-fg/systemd-cgroup-nftables-policy-manager
> 
> This one seems to be adding one rule per cgroupv2, it would be better
> to use a map for this purpose for scalability reasons.
> 
>> https://github.com/helsinki-systems/nft_cgroupv2/
> 
> This approach above takes us back to the linear ruleset evaluation
> problem, this is basically looking like iptables, this does not scale up.
> 
>> But I think that with this approach, depending on system load, there could
>> be a vulnerable time window where the rules aren't loaded yet but the
>> process which is supposed to be protected by the rules has already started
>> running. This isn't desirable for firewalls, so I'd like to have a way for
>> loading the firewall rules as early as possible.
> 
> You could define a static ruleset which creates the table, basechain
> and the cgroupv2 verdict map. Then, systemd updates this map with new
> entries to match on cgroupsv2 and apply the corresponding policy for
> this process, and delete it when not needed anymore. You have to
> define one non-basechain for each cgroupv2 policy.

So something like this:
table inet x {
	map dict {
		type string : verdict;
	}

	chain y {
		socket cgroupv2 level 4 vmap @dict
	}
}
and then systemd would add an entry like { 
"app-local\x2dfirefox\x2desr-01d5fcc2f9114e509e992cdaef3d84c3.scope" : 
accept } to the vmap "dict" when realizing the cgroup?

-Topi

> To address the vulnerable time window, the static ruleset defines a
> default policy to allow nothing until an explicit policy based on
> cgroupv2 for this process is in place.
> 
> The cgroupv2 support for nftables was designed to be used with maps.


* Re: Support for loading firewall rules with cgroup(v2) expressions early
  2022-03-28 15:05     ` Pablo Neira Ayuso
  2022-03-28 17:46       ` Topi Miettinen
@ 2022-03-29 18:20       ` Topi Miettinen
  2022-03-29 22:25         ` Pablo Neira Ayuso
  1 sibling, 1 reply; 17+ messages in thread
From: Topi Miettinen @ 2022-03-29 18:20 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel

On 28.3.2022 18.05, Pablo Neira Ayuso wrote:
> On Mon, Mar 28, 2022 at 05:08:32PM +0300, Topi Miettinen wrote:
>> On 28.3.2022 0.31, Pablo Neira Ayuso wrote:
>>> On Sat, Mar 26, 2022 at 12:09:26PM +0200, Topi Miettinen wrote:
> [...]
>>>> Another possibility would be to hook into cgroup directory creation logic in
>>>> kernel so that when the cgroup is created, part of the path checks are
>>>> performed or something else which would allow non-existent cgroups to be
>>>> used. Then the NFT syntax would not need changing, but the expressions would
>>>> "just work" even when loaded early.
>>>
>>> Could you use inotify/dnotify/eventfd to track these updates from
>>> userspace and update the nftables sets accordingly? AFAIK, this is
>>> available to cgroupsv2.
>>
>> It's possible, there's for example:
>> https://github.com/mk-fg/systemd-cgroup-nftables-policy-manager
> 
> This one seems to be adding one rule per cgroupv2, it would be better
> to use a map for this purpose for scalability reasons.
> 
>> https://github.com/helsinki-systems/nft_cgroupv2/
> 
> This approach above takes us back to the linear ruleset evaluation
> problem, this is basically looking like iptables, this does not scale up.
> 
>> But I think that with this approach, depending on system load, there could
>> be a vulnerable time window where the rules aren't loaded yet but the
>> process which is supposed to be protected by the rules has already started
>> running. This isn't desirable for firewalls, so I'd like to have a way for
>> loading the firewall rules as early as possible.
> 
> You could define a static ruleset which creates the table, basechain
> and the cgroupv2 verdict map. Then, systemd updates this map with new
> entries to match on cgroupsv2 and apply the corresponding policy for
> this process, and delete it when not needed anymore. You have to
> define one non-basechain for each cgroupv2 policy.

Actually this seems to work:

table inet filter {
         set cg {
                 typeof socket cgroupv2 level 0
         }

         chain y {
                 socket cgroupv2 level 2 @cg accept
		counter drop
         }
}

Simulating systemd adding the cgroup of a service to the set:
# nft add element inet filter cg "system.slice/systemd-resolved.service"

Cgroup ID (inode number of the cgroup) has been successfully added:
# nft list set inet filter cg
         set cg {
                 typeof socket cgroupv2 level 0
                 elements = { 6032 }
         }
# ls -id /sys/fs/cgroup/system.slice/systemd-resolved.service
6032 /sys/fs/cgroup/system.slice/systemd-resolved.service/

-Topi

> To address the vulnerable time window, the static ruleset defines a
> default policy to allow nothing until an explicit policy based on
> cgroupv2 for this process is in place.
> 
> The cgroupv2 support for nftables was designed to be used with maps.



* Re: Support for loading firewall rules with cgroup(v2) expressions early
  2022-03-29 18:20       ` Topi Miettinen
@ 2022-03-29 22:25         ` Pablo Neira Ayuso
  2022-03-30  2:53           ` Pablo Neira Ayuso
  2022-03-30 16:37           ` Topi Miettinen
  0 siblings, 2 replies; 17+ messages in thread
From: Pablo Neira Ayuso @ 2022-03-29 22:25 UTC (permalink / raw)
  To: Topi Miettinen; +Cc: netfilter-devel

On Tue, Mar 29, 2022 at 09:20:25PM +0300, Topi Miettinen wrote:
> On 28.3.2022 18.05, Pablo Neira Ayuso wrote:
> > On Mon, Mar 28, 2022 at 05:08:32PM +0300, Topi Miettinen wrote:
> > > On 28.3.2022 0.31, Pablo Neira Ayuso wrote:
> > > > On Sat, Mar 26, 2022 at 12:09:26PM +0200, Topi Miettinen wrote:
> > [...]
> > > But I think that with this approach, depending on system load, there could
> > > be a vulnerable time window where the rules aren't loaded yet but the
> > > process which is supposed to be protected by the rules has already started
> > > running. This isn't desirable for firewalls, so I'd like to have a way for
> > > loading the firewall rules as early as possible.
> > 
> > You could define a static ruleset which creates the table, basechain
> > and the cgroupv2 verdict map. Then, systemd updates this map with new
> > entries to match on cgroupsv2 and apply the corresponding policy for
> > this process, and delete it when not needed anymore. You have to
> > define one non-basechain for each cgroupv2 policy.
> 
> Actually this seems to work:
> 
> table inet filter {
>         set cg {
>                 typeof socket cgroupv2 level 0
>         }
> 
>         chain y {
>                 socket cgroupv2 level 2 @cg accept
> 		counter drop
>         }
> }
> 
> Simulating systemd adding the cgroup of a service to the set:
> # nft add element inet filter cg "system.slice/systemd-resolved.service"
> 
> Cgroup ID (inode number of the cgroup) has been successfully added:
> # nft list set inet filter cg
>         set cg {
>                 typeof socket cgroupv2 level 0
>                 elements = { 6032 }
>         }
> # ls -id /sys/fs/cgroup/system.slice/systemd-resolved.service
> 6032 /sys/fs/cgroup/system.slice/systemd-resolved.service/

You could define a ruleset that describes the policy following the
cgroupsv2 hierarchy. Something like this:

 table inet filter {
        map dict_cgroup_level_1 {
                type cgroupsv2 : verdict;
                elements = { "system.slice" : jump system_slice }
        }

        map dict_cgroup_level_2 {
                type cgroupsv2 : verdict;
                elements = { "system.slice/systemd-timesyncd.service" : jump systemd_timesyncd }
        }

        chain systemd_timesyncd {
                # systemd-timesyncd policy
        }

        chain system_slice {
                socket cgroupv2 level 2 vmap @dict_cgroup_level_2
                # policy for system.slice process
        }

        chain input {
                type filter hook input priority filter; policy drop;
                socket cgroupv2 level 1 vmap @dict_cgroup_level_1
        }
 }

The per-level dictionaries allow you to mimic the cgroupsv2 tree
hierarchy.

This allows you to attach a default policy for processes that belong
to "system.slice" (at level 1). This might also be useful in case
there is a process in the "system.slice" group which does not yet
have an explicit level 2 policy, so the level 1 policy applies in
that case.

You might want to apply the level 1 policy before the level 2 policy
(i.e. aggregate policies per level as you move searching for an exact
cgroup match), or instead you might prefer to search for an exact
match at level 2 and otherwise backtrack to the closest matching
cgroupsv2 for this process.

There are also the jump and goto semantics for chains, which can be
combined in this chain tree.
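
The "exact match, otherwise backtrack" lookup can be modelled outside
the kernel to reason about which chain a given cgroup would end up in.
This is a rough sketch, not how nftables evaluates rules internally:
the function names are hypothetical, levels are counted from 1 as in
the ruleset above, and the per-level dictionaries are plain Python
dicts standing in for the verdict maps.

```python
def ancestor_at_level(cgroup_path, level):
    """Return the ancestor of cgroup_path at the given depth, or None.

    Level 1 is the first directory below the cgroup2 root, level 2 the next
    one, and so on. A cgroup shallower than 'level' has no such ancestor.
    """
    parts = [p for p in cgroup_path.strip("/").split("/") if p]
    if level < 1 or level > len(parts):
        return None
    return "/".join(parts[:level])


def resolve_chain(cgroup_path, level_maps):
    """Pick the chain of the deepest level whose map has a matching key.

    level_maps is {level: {ancestor_path: chain_name}}; None means that no
    level matched, so only the base chain policy would apply.
    """
    for level in sorted(level_maps, reverse=True):
        key = ancestor_at_level(cgroup_path, level)
        if key is not None and key in level_maps[level]:
            return level_maps[level][key]
    return None
```

With the two maps from the ruleset above, a socket owned by
"system.slice/systemd-timesyncd.service" resolves to systemd_timesyncd,
while any other service below "system.slice" falls back to system_slice.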

BTW, what nftables version are you using? My listing does not show
i-nodes, instead it shows the path.

 # nft list map inet filter dict_cgroup_level_1
 table inet x {
        map dict_cgroup_level_1 {
                type cgroupsv2 : verdict
                elements = { "system.slice" : jump system_slice }
        }
 }

Another side note: beware I'm setting the default policy to drop at
the 'input' chain in case you use this test ruleset. This is a
skeleton ruleset, more rules are likely needed to define what to do
with packets matching the described cgroupsv2.


* Re: Support for loading firewall rules with cgroup(v2) expressions early
  2022-03-29 22:25         ` Pablo Neira Ayuso
@ 2022-03-30  2:53           ` Pablo Neira Ayuso
  2022-04-02  8:12             ` Topi Miettinen
  2022-03-30 16:37           ` Topi Miettinen
  1 sibling, 1 reply; 17+ messages in thread
From: Pablo Neira Ayuso @ 2022-03-30  2:53 UTC (permalink / raw)
  To: Topi Miettinen; +Cc: netfilter-devel

On Wed, Mar 30, 2022 at 12:25:25AM +0200, Pablo Neira Ayuso wrote:
> On Tue, Mar 29, 2022 at 09:20:25PM +0300, Topi Miettinen wrote:
[...]
> You could define a ruleset that describes the policy following the
> cgroupsv2 hierarchy. Something like this:
> 
>  table inet filter {
>         map dict_cgroup_level_1 {
>                 type cgroupsv2 : verdict;
>                 elements = { "system.slice" : jump system_slice }
>         }
> 
>         map dict_cgroup_level_2 {
>                 type cgroupsv2 : verdict;
>                 elements = { "system.slice/systemd-timesyncd.service" : jump systemd_timesyncd }
>         }
> 
>         chain systemd_timesyncd {
>                 # systemd-timesyncd policy
>         }
> 
>         chain system_slice {
>                 socket cgroupv2 level 2 vmap @dict_cgroup_level_2
>                 # policy for system.slice process
>         }
> 
>         chain input {
>                 type filter hook input priority filter; policy drop;

This example should use the output chain instead:

          chain output {
                  type filter hook output priority filter; policy drop;

From the input chain, the packet relies on early demux to have access
to the socket.

The idea would be to filter outgoing traffic and rely on conntrack
for (established) input traffic.

>                 socket cgroupv2 level 1 vmap @dict_cgroup_level_1
>         }
>  }


* Re: Support for loading firewall rules with cgroup(v2) expressions early
  2022-03-29 22:25         ` Pablo Neira Ayuso
  2022-03-30  2:53           ` Pablo Neira Ayuso
@ 2022-03-30 16:37           ` Topi Miettinen
  2022-03-30 21:47             ` Pablo Neira Ayuso
  1 sibling, 1 reply; 17+ messages in thread
From: Topi Miettinen @ 2022-03-30 16:37 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel

On 30.3.2022 1.25, Pablo Neira Ayuso wrote:
> On Tue, Mar 29, 2022 at 09:20:25PM +0300, Topi Miettinen wrote:
>> On 28.3.2022 18.05, Pablo Neira Ayuso wrote:
>>> On Mon, Mar 28, 2022 at 05:08:32PM +0300, Topi Miettinen wrote:
>>>> On 28.3.2022 0.31, Pablo Neira Ayuso wrote:
>>>>> On Sat, Mar 26, 2022 at 12:09:26PM +0200, Topi Miettinen wrote:
>>> [...]
>>>> But I think that with this approach, depending on system load, there could
>>>> be a vulnerable time window where the rules aren't loaded yet but the
>>>> process which is supposed to be protected by the rules has already started
>>>> running. This isn't desirable for firewalls, so I'd like to have a way for
>>>> loading the firewall rules as early as possible.
>>>
>>> You could define a static ruleset which creates the table, basechain
>>> and the cgroupv2 verdict map. Then, systemd updates this map with new
>>> entries to match on cgroupsv2 and apply the corresponding policy for
>>> this process, and delete it when not needed anymore. You have to
>>> define one non-basechain for each cgroupv2 policy.
>>
>> Actually this seems to work:
>>
>> table inet filter {
>>          set cg {
>>                  typeof socket cgroupv2 level 0
>>          }
>>
>>          chain y {
>>                  socket cgroupv2 level 2 @cg accept
>> 		counter drop
>>          }
>> }
>>
>> Simulating systemd adding the cgroup of a service to the set:
>> # nft add element inet filter cg "system.slice/systemd-resolved.service"
>>
>> Cgroup ID (inode number of the cgroup) has been successfully added:
>> # nft list set inet filter cg
>>          set cg {
>>                  typeof socket cgroupv2 level 0
>>                  elements = { 6032 }
>>          }
>> # ls -id /sys/fs/cgroup/system.slice/systemd-resolved.service
>> 6032 /sys/fs/cgroup/system.slice/systemd-resolved.service/
> 
> You could define a ruleset that describes the policy following the
> cgroupsv2 hierarchy. Something like this:
> 
>   table inet filter {
>          map dict_cgroup_level_1 {
>                  type cgroupsv2 : verdict;
>                  elements = { "system.slice" : jump system_slice }
>          }
> 
>          map dict_cgroup_level_2 {
>                  type cgroupsv2 : verdict;
>                  elements = { "system.slice/systemd-timesyncd.service" : jump systemd_timesyncd }
>          }
> 
>          chain systemd_timesyncd {
>                  # systemd-timesyncd policy
>          }
> 
>          chain system_slice {
>                  socket cgroupv2 level 2 vmap @dict_cgroup_level_2
>                  # policy for system.slice process
>          }
> 
>          chain input {
>                  type filter hook input priority filter; policy drop;
>                  socket cgroupv2 level 1 vmap @dict_cgroup_level_1
>          }
>   }
> 
> The dictionaries per level allows you to mimic the cgroupsv2 tree
> hierarchy
> 
> This allows you to attach a default policy for processes that belong
> to the "system_slice" (at level 1). This might also be useful in case
> that there is a process in the group "system_slice" which does not yet
> have an explicit level 2 policy, so level 1 policy applies in such
> case.
> 
> You might want to apply the level 1 policy before the level 2 policy
> (ie. aggregate policies per level as you move searching for an exact
> cgroup match), or instead you might prefer to search for an exact
> match at level 2, otherwise backtrack to closest matching cgroupsv2
> for this process.

Nice ideas, but the rules can't be loaded before the cgroups are 
realized at early boot:

Mar 30 19:14:45 systemd[1]: Starting nftables...
Mar 30 19:14:46 nft[1018]: /etc/nftables.conf:305:5-44: Error: cgroupv2 
path fails: Permission denied
Mar 30 19:14:46 nft[1018]: 
"system.slice/systemd-timesyncd.service" : jump systemd_timesyncd
Mar 30 19:14:46 nft[1018]: 
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Mar 30 19:14:46 systemd[1]: nftables.service: Main process exited, 
code=exited, status=1/FAILURE
Mar 30 19:14:46 systemd[1]: nftables.service: Failed with result 
'exit-code'.
Mar 30 19:14:46 systemd[1]: Failed to start nftables.

> There is also the jump and goto semantics for chains that can be
> combined in this chain tree.
> 
> BTW, what nftables version are you using? My listing does not show
> i-nodes, instead it shows the path.

Debian version: 1.0.2-1. The inode numbers seem to be caused by my 
SELinux policy. Disabling it shows the paths:

         map dict_cgroup_level_2_sys {
                 type cgroupsv2 : verdict
                 elements = { 5132 : jump systemd_timesyncd }
         }

         map dict_cgroup_level_1 {
                 type cgroupsv2 : verdict
                 elements = { "system.slice" : jump system_slice,
                              "user.slice" : jump user_slice }
         }

Above, "system.slice/systemd-timesyncd.service" is shown as a number
because the cgroup ID became stale when I restarted the service. I
think the policy no longer works after that.

-Topi
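
A stale entry of this kind could be detected by comparing the ID stored
in the map with the current inode number of the cgroup directory, the
same value that "ls -id" prints. A minimal sketch, assuming the cgroup2
hierarchy is mounted at /sys/fs/cgroup; the function names are
hypothetical, and the recorded ID would have to be read back from the
ruleset by other means.

```python
import os


def cgroup_id(cgroup_path, root="/sys/fs/cgroup"):
    """Return the inode number of a cgroup directory, or None if it is gone."""
    try:
        return os.stat(os.path.join(root, cgroup_path)).st_ino
    except FileNotFoundError:
        return None


def is_stale(recorded_id, cgroup_path, root="/sys/fs/cgroup"):
    """True if the cgroup no longer exists or now has a different ID."""
    return cgroup_id(cgroup_path, root) != recorded_id
```

An entry for which is_stale() returns True would need to be deleted and
added again so that it refers to the current cgroup.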


* Re: Support for loading firewall rules with cgroup(v2) expressions early
  2022-03-30 16:37           ` Topi Miettinen
@ 2022-03-30 21:47             ` Pablo Neira Ayuso
  2022-03-31 15:10               ` Topi Miettinen
  0 siblings, 1 reply; 17+ messages in thread
From: Pablo Neira Ayuso @ 2022-03-30 21:47 UTC (permalink / raw)
  To: Topi Miettinen; +Cc: netfilter-devel

On Wed, Mar 30, 2022 at 07:37:00PM +0300, Topi Miettinen wrote:
> On 30.3.2022 1.25, Pablo Neira Ayuso wrote:
> > On Tue, Mar 29, 2022 at 09:20:25PM +0300, Topi Miettinen wrote:
> > > On 28.3.2022 18.05, Pablo Neira Ayuso wrote:
> > > > On Mon, Mar 28, 2022 at 05:08:32PM +0300, Topi Miettinen wrote:
> > > > > On 28.3.2022 0.31, Pablo Neira Ayuso wrote:
> > > > > > On Sat, Mar 26, 2022 at 12:09:26PM +0200, Topi Miettinen wrote:
> > > > [...]
> > > > > But I think that with this approach, depending on system load, there could
> > > > > be a vulnerable time window where the rules aren't loaded yet but the
> > > > > process which is supposed to be protected by the rules has already started
> > > > > running. This isn't desirable for firewalls, so I'd like to have a way for
> > > > > loading the firewall rules as early as possible.
> > > > 
> > > > You could define a static ruleset which creates the table, basechain
> > > > and the cgroupv2 verdict map. Then, systemd updates this map with new
> > > > entries to match on cgroupsv2 and apply the corresponding policy for
> > > > this process, and delete it when not needed anymore. You have to
> > > > define one non-basechain for each cgroupv2 policy.
> > > 
> > > Actually this seems to work:
> > > 
> > > table inet filter {
> > >          set cg {
> > >                  typeof socket cgroupv2 level 0
> > >          }
> > > 
> > >          chain y {
> > >                  socket cgroupv2 level 2 @cg accept
> > > 		counter drop
> > >          }
> > > }
> > > 
> > > Simulating systemd adding the cgroup of a service to the set:
> > > # nft add element inet filter cg "system.slice/systemd-resolved.service"
> > > 
> > > Cgroup ID (inode number of the cgroup) has been successfully added:
> > > # nft list set inet filter cg
> > >          set cg {
> > >                  typeof socket cgroupv2 level 0
> > >                  elements = { 6032 }
> > >          }
> > > # ls -id /sys/fs/cgroup/system.slice/systemd-resolved.service
> > > 6032 /sys/fs/cgroup/system.slice/systemd-resolved.service/
> > 
> > You could define a ruleset that describes the policy following the
> > cgroupsv2 hierarchy. Something like this:
> > 
> >   table inet filter {
> >          map dict_cgroup_level_1 {
> >                  type cgroupsv2 : verdict;
> >                  elements = { "system.slice" : jump system_slice }
> >          }
> > 
> >          map dict_cgroup_level_2 {
> >                  type cgroupsv2 : verdict;
> >                  elements = { "system.slice/systemd-timesyncd.service" : jump systemd_timesyncd }
> >          }
> > 
> >          chain systemd_timesyncd {
> >                  # systemd-timesyncd policy
> >          }
> > 
> >          chain system_slice {
> >                  socket cgroupv2 level 2 vmap @dict_cgroup_level_2
> >                  # policy for system.slice process
> >          }
> > 
> >          chain input {
> >                  type filter hook input priority filter; policy drop;
> >                  socket cgroupv2 level 1 vmap @dict_cgroup_level_1
> >          }
> >   }
> > 
> > The dictionaries per level allows you to mimic the cgroupsv2 tree
> > hierarchy
> > 
> > This allows you to attach a default policy for processes that belong
> > to the "system_slice" (at level 1). This might also be useful in case
> > that there is a process in the group "system_slice" which does not yet
> > have an explicit level 2 policy, so level 1 policy applies in such
> > case.
> > 
> > You might want to apply the level 1 policy before the level 2 policy
> > (ie. aggregate policies per level as you move searching for an exact
> > cgroup match), or instead you might prefer to search for an exact
> > match at level 2, otherwise backtrack to closest matching cgroupsv2
> > for this process.
> 
> Nice ideas, but the rules can't be loaded before the cgroups are realized at
> early boot:
> 
> Mar 30 19:14:45 systemd[1]: Starting nftables...
> Mar 30 19:14:46 nft[1018]: /etc/nftables.conf:305:5-44: Error: cgroupv2 path
> fails: Permission denied
> Mar 30 19:14:46 nft[1018]: "system.slice/systemd-timesyncd.service" : jump
> systemd_timesyncd
> Mar 30 19:14:46 nft[1018]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> Mar 30 19:14:46 systemd[1]: nftables.service: Main process exited,
> code=exited, status=1/FAILURE
> Mar 30 19:14:46 systemd[1]: nftables.service: Failed with result
> 'exit-code'.
> Mar 30 19:14:46 systemd[1]: Failed to start nftables.

I guess this unit file performs nft -f on cgroupsv2 that do not exist
yet.

Could you just load the base policy with empty dictionaries instead,
then track and register the cgroups into the ruleset as they are being
created/removed?
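
The tracking step could be approximated as follows. This sketch polls
the cgroup tree and diffs two snapshots instead of using inotify, so it
is a stand-in for the notification mechanisms mentioned earlier in the
thread and will react more slowly than they would. The function names
and the /sys/fs/cgroup mount point are assumptions.

```python
import os


def list_cgroups(root="/sys/fs/cgroup"):
    """Return the set of cgroup paths below root, relative to root."""
    found = set()
    for dirpath, dirnames, _filenames in os.walk(root):
        for name in dirnames:
            found.add(os.path.relpath(os.path.join(dirpath, name), root))
    return found


def diff_cgroups(before, after):
    """Return (created, removed) as sorted lists of relative cgroup paths."""
    return sorted(after - before), sorted(before - after)
```

Each created path would be registered in the dictionary for its level,
and each removed path deleted from it.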

> > There is also the jump and goto semantics for chains that can be
> > combined in this chain tree.
> > 
> > BTW, what nftables version are you using? My listing does not show
> > i-nodes, instead it shows the path.
> 
> Debian version: 1.0.2-1. The inode numbers seem to be caused by my SELinux
> policy. Disabling it shows the paths:
> 
>         map dict_cgroup_level_2_sys {
>                 type cgroupsv2 : verdict
>                 elements = { 5132 : jump systemd_timesyncd }
>         }
> 
>         map dict_cgroup_level_1 {
>                 type cgroupsv2 : verdict
>                 elements = { "system.slice" : jump system_slice,
>                              "user.slice" : jump user_slice }
>         }
> 
> Above "system.slice/systemd-timesyncd.service" is a number because the
> cgroup ID became stale when I restarted the service. I think the policy
> doesn't work then anymore.

Yes, you have to refresh your policy on cgroupsv2 updates.
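
A minimal sketch of how a refresh could be triggered, assuming (as the
listings in this thread suggest) that the cgroup ID stored in the set is
the inode number of the cgroup directory. The function names are
hypothetical and the check is only a heuristic: it reports an entry as
stale when the directory is gone or its inode number has changed.

```python
import os


def cgroup_id(path: str) -> int:
    """Return the inode number of a cgroup directory (assumed to be its ID)."""
    return os.stat(path).st_ino


def is_stale(stored_id: int, path: str) -> bool:
    """True if the ID recorded in the ruleset no longer matches the path."""
    try:
        return cgroup_id(path) != stored_id
    except FileNotFoundError:
        # The cgroup has been removed, e.g. because the service stopped.
        return True
```

For example, is_stale(5132, "/sys/fs/cgroup/system.slice/systemd-timesyncd.service")
would tell a tracker whether the element has to be re-added after the
service was restarted.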


* Re: Support for loading firewall rules with cgroup(v2) expressions early
  2022-03-30 21:47             ` Pablo Neira Ayuso
@ 2022-03-31 15:10               ` Topi Miettinen
  2022-04-05 22:18                 ` Pablo Neira Ayuso
  0 siblings, 1 reply; 17+ messages in thread
From: Topi Miettinen @ 2022-03-31 15:10 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel

On 31.3.2022 0.47, Pablo Neira Ayuso wrote:
> On Wed, Mar 30, 2022 at 07:37:00PM +0300, Topi Miettinen wrote:
>> On 30.3.2022 1.25, Pablo Neira Ayuso wrote:
>>> On Tue, Mar 29, 2022 at 09:20:25PM +0300, Topi Miettinen wrote:
>>>> On 28.3.2022 18.05, Pablo Neira Ayuso wrote:
>>>>> On Mon, Mar 28, 2022 at 05:08:32PM +0300, Topi Miettinen wrote:
>>>>>> On 28.3.2022 0.31, Pablo Neira Ayuso wrote:
>>>>>>> On Sat, Mar 26, 2022 at 12:09:26PM +0200, Topi Miettinen wrote:
>>>>> [...]
>>>>>> But I think that with this approach, depending on system load, there could
>>>>>> be a vulnerable time window where the rules aren't loaded yet but the
>>>>>> process which is supposed to be protected by the rules has already started
>>>>>> running. This isn't desirable for firewalls, so I'd like to have a way for
>>>>>> loading the firewall rules as early as possible.
>>>>>
>>>>> You could define a static ruleset which creates the table, basechain
>>>>> and the cgroupv2 verdict map. Then, systemd updates this map with new
>>>>> entries to match on cgroupsv2 and apply the corresponding policy for
>>>>> this process, and delete it when not needed anymore. You have to
>>>>> define one non-basechain for each cgroupv2 policy.
>>>>
>>>> Actually this seems to work:
>>>>
>>>> table inet filter {
>>>>           set cg {
>>>>                   typeof socket cgroupv2 level 0
>>>>           }
>>>>
>>>>           chain y {
>>>>                   socket cgroupv2 level 2 @cg accept
>>>> 		counter drop
>>>>           }
>>>> }
>>>>
>>>> Simulating systemd adding the cgroup of a service to the set:
>>>> # nft add element inet filter cg "system.slice/systemd-resolved.service"
>>>>
>>>> Cgroup ID (inode number of the cgroup) has been successfully added:
>>>> # nft list set inet filter cg
>>>>           set cg {
>>>>                   typeof socket cgroupv2 level 0
>>>>                   elements = { 6032 }
>>>>           }
>>>> # ls -id /sys/fs/cgroup/system.slice/systemd-resolved.service
>>>> 6032 /sys/fs/cgroup/system.slice/systemd-resolved.service/
>>>
>>> You could define a ruleset that describes the policy following the
>>> cgroupsv2 hierarchy. Something like this:
>>>
>>>    table inet filter {
>>>           map dict_cgroup_level_1 {
>>>                   type cgroupsv2 : verdict;
>>>                   elements = { "system.slice" : jump system_slice }
>>>           }
>>>
>>>           map dict_cgroup_level_2 {
>>>                   type cgroupsv2 : verdict;
>>>                   elements = { "system.slice/systemd-timesyncd.service" : jump systemd_timesyncd }
>>>           }
>>>
>>>           chain systemd_timesyncd {
>>>                   # systemd-timesyncd policy
>>>           }
>>>
>>>           chain system_slice {
>>>                   socket cgroupv2 level 2 vmap @dict_cgroup_level_2
>>>                   # policy for system.slice process
>>>           }
>>>
>>>           chain input {
>>>                   type filter hook input priority filter; policy drop;
>>>                   socket cgroupv2 level 1 vmap @dict_cgroup_level_1
>>>           }
>>>    }
>>>
>>> The dictionaries per level allows you to mimic the cgroupsv2 tree
>>> hierarchy
>>>
>>> This allows you to attach a default policy for processes that belong
>>> to the "system_slice" (at level 1). This might also be useful in case
>>> that there is a process in the group "system_slice" which does not yet
>>> have an explicit level 2 policy, so level 1 policy applies in such
>>> case.
>>>
>>> You might want to apply the level 1 policy before the level 2 policy
>>> (ie. aggregate policies per level as you move searching for an exact
>>> cgroup match), or instead you might prefer to search for an exact
>>> match at level 2, otherwise backtrack to closest matching cgroupsv2
>>> for this process.
>>
>> Nice ideas, but the rules can't be loaded before the cgroups are realized at
>> early boot:
>>
>> Mar 30 19:14:45 systemd[1]: Starting nftables...
>> Mar 30 19:14:46 nft[1018]: /etc/nftables.conf:305:5-44: Error: cgroupv2 path
>> fails: Permission denied
>> Mar 30 19:14:46 nft[1018]: "system.slice/systemd-timesyncd.service" : jump
>> systemd_timesyncd
>> Mar 30 19:14:46 nft[1018]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> Mar 30 19:14:46 systemd[1]: nftables.service: Main process exited,
>> code=exited, status=1/FAILURE
>> Mar 30 19:14:46 systemd[1]: nftables.service: Failed with result
>> 'exit-code'.
>> Mar 30 19:14:46 systemd[1]: Failed to start nftables.
> 
> I guess this unit file performs nft -f on cgroupsv2 that do not exist
> yet.

Yes, that's the case. Being able to do so with, for example,
"cgroupsv2name" would be nice.

> Could you just load the base policy with empty dictionaries instead,
> then track and register the cgroups into the ruleset as they are being
> created/removed?

That's possible, and I'll probably make a PR for systemd to add such a
feature. But I don't think it's the best solution: if the NFT rules are
loaded from an initrd where systemd is not running (i.e. the initrd is
not built by dracut), the rules won't work, not even for the top-level
"system.slice" and "user.slice". Network connectivity in the initrd
could then be a problem. I also don't know whether that model would
scale to unprivileged user services or containers. A userspace daemon
feeding the kernel information that the kernel already knows seems a bit
inelegant.

-Topi

>>> There is also the jump and goto semantics for chains that can be
>>> combined in this chain tree.
>>>
>>> BTW, what nftables version are you using? My listing does not show
>>> i-nodes, instead it shows the path.
>>
>> Debian version: 1.0.2-1. The inode numbers seem to be caused by my SELinux
>> policy. Disabling it shows the paths:
>>
>>          map dict_cgroup_level_2_sys {
>>                  type cgroupsv2 : verdict
>>                  elements = { 5132 : jump systemd_timesyncd }
>>          }
>>
>>          map dict_cgroup_level_1 {
>>                  type cgroupsv2 : verdict
>>                  elements = { "system.slice" : jump system_slice,
>>                               "user.slice" : jump user_slice }
>>          }
>>
>> Above "system.slice/systemd-timesyncd.service" is a number because the
>> cgroup ID became stale when I restarted the service. I think the policy
>> doesn't work then anymore.
> 
> Yes, you have to refresh your policy on cgroupsv2 updates.



* Re: Support for loading firewall rules with cgroup(v2) expressions early
  2022-03-30  2:53           ` Pablo Neira Ayuso
@ 2022-04-02  8:12             ` Topi Miettinen
  2022-04-03 18:32               ` Topi Miettinen
  0 siblings, 1 reply; 17+ messages in thread
From: Topi Miettinen @ 2022-04-02  8:12 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel

On 30.3.2022 5.53, Pablo Neira Ayuso wrote:
> On Wed, Mar 30, 2022 at 12:25:25AM +0200, Pablo Neira Ayuso wrote:
>> On Tue, Mar 29, 2022 at 09:20:25PM +0300, Topi Miettinen wrote:
> [...]
>> You could define a ruleset that describes the policy following the
>> cgroupsv2 hierarchy. Something like this:
>>
>>   table inet filter {
>>          map dict_cgroup_level_1 {
>>                  type cgroupsv2 : verdict;
>>                  elements = { "system.slice" : jump system_slice }
>>          }
>>
>>          map dict_cgroup_level_2 {
>>                  type cgroupsv2 : verdict;
>>                  elements = { "system.slice/systemd-timesyncd.service" : jump systemd_timesyncd }
>>          }
>>
>>          chain systemd_timesyncd {
>>                  # systemd-timesyncd policy
>>          }
>>
>>          chain system_slice {
>>                  socket cgroupv2 level 2 vmap @dict_cgroup_level_2
>>                  # policy for system.slice process
>>          }
>>
>>          chain input {
>>                  type filter hook input priority filter; policy drop;
> 
> This example should use the output chain instead:
> 
>            chain output {
>                    type filter hook output priority filter; policy drop;
> 
>  From the input chain, the packet relies on early demux to have access
> to the socket.
> 
> The idea would be to filter out outgoing traffic and rely on conntrack
> for (established) input traffic.

Is it really the case that 'socket cgroupv2' can't be used on the input
side at all? At least 'ss' can display the cgroup for listening sockets
correctly, so the cgroup information should be available somewhere:

$ ss -lt --cgroup
State    Recv-Q   Send-Q       Local Address:Port       Peer Address:Port   Process
LISTEN   0        4096                  *%lo:ssh                   *:*       cgroup:/system.slice/ssh.socket

-Topi


* Re: Support for loading firewall rules with cgroup(v2) expressions early
  2022-04-02  8:12             ` Topi Miettinen
@ 2022-04-03 18:32               ` Topi Miettinen
  2022-04-05 22:00                 ` Pablo Neira Ayuso
  0 siblings, 1 reply; 17+ messages in thread
From: Topi Miettinen @ 2022-04-03 18:32 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel

On 2.4.2022 11.12, Topi Miettinen wrote:
> On 30.3.2022 5.53, Pablo Neira Ayuso wrote:
>> On Wed, Mar 30, 2022 at 12:25:25AM +0200, Pablo Neira Ayuso wrote:
>>> On Tue, Mar 29, 2022 at 09:20:25PM +0300, Topi Miettinen wrote:
>> [...]
>>> You could define a ruleset that describes the policy following the
>>> cgroupsv2 hierarchy. Something like this:
>>>
>>>   table inet filter {
>>>          map dict_cgroup_level_1 {
>>>                  type cgroupsv2 : verdict;
>>>                  elements = { "system.slice" : jump system_slice }
>>>          }
>>>
>>>          map dict_cgroup_level_2 {
>>>                  type cgroupsv2 : verdict;
>>>                  elements = { 
>>> "system.slice/systemd-timesyncd.service" : jump systemd_timesyncd }
>>>          }
>>>
>>>          chain systemd_timesyncd {
>>>                  # systemd-timesyncd policy
>>>          }
>>>
>>>          chain system_slice {
>>>                  socket cgroupv2 level 2 vmap @dict_cgroup_level_2
>>>                  # policy for system.slice process
>>>          }
>>>
>>>          chain input {
>>>                  type filter hook input priority filter; policy drop;
>>
>> This example should use the output chain instead:
>>
>>            chain output {
>>                    type filter hook output priority filter; policy drop;
>>
>>  From the input chain, the packet relies on early demux to have access
>> to the socket.
>>
>> The idea would be to filter out outgoing traffic and rely on conntrack
>> for (established) input traffic.
> 
> Is it really so that 'socket cgroupv2' can't be used on input side at 
> all? At least 'ss' can display the cgroup for listening sockets 
> correctly, so the cgroup information should be available somewhere:
> 
> $ ss -lt --cgroup
> State    Recv-Q   Send-Q       Local Address:Port       Peer 
> Address:Port   Process
> LISTEN   0        4096                  *%lo:ssh                   *:* 
>       cgroup:/system.slice/ssh.socket

Also 'meta skuid' doesn't seem to work in input filters. It would have 
been simple to use 'meta skuid < 1000' to simulate 'system.slice' vs. 
'user.slice' cgroups.

If this is intentional, the manual page should make this much clearer. 
There's no warning and the kernel doesn't reject the useless input rules.

I think it should be possible to filter on the input side based on the
socket properties (UID, GID, cgroup). Especially with UDP, it should be
possible to drop all packets if the listening process is not an allowed
one.

My use case is that I need to open ports for Steam games (TCP and UDP 
ports 27015-27030) but I don't want to make them available for system 
services or any other apps besides Steam games. SELinux SECMARKs and TE 
rules for sockets help me here but there are other problems.

-Topi


* Re: Support for loading firewall rules with cgroup(v2) expressions early
  2022-04-03 18:32               ` Topi Miettinen
@ 2022-04-05 22:00                 ` Pablo Neira Ayuso
  2022-04-06 13:57                   ` Topi Miettinen
  0 siblings, 1 reply; 17+ messages in thread
From: Pablo Neira Ayuso @ 2022-04-05 22:00 UTC (permalink / raw)
  To: Topi Miettinen; +Cc: netfilter-devel

On Sun, Apr 03, 2022 at 09:32:11PM +0300, Topi Miettinen wrote:
> On 2.4.2022 11.12, Topi Miettinen wrote:
> > On 30.3.2022 5.53, Pablo Neira Ayuso wrote:
> > > On Wed, Mar 30, 2022 at 12:25:25AM +0200, Pablo Neira Ayuso wrote:
> > > > On Tue, Mar 29, 2022 at 09:20:25PM +0300, Topi Miettinen wrote:
> > > [...]
> > > > You could define a ruleset that describes the policy following the
> > > > cgroupsv2 hierarchy. Something like this:
> > > > 
> > > >   table inet filter {
> > > >          map dict_cgroup_level_1 {
> > > >                  type cgroupsv2 : verdict;
> > > >                  elements = { "system.slice" : jump system_slice }
> > > >          }
> > > > 
> > > >          map dict_cgroup_level_2 {
> > > >                  type cgroupsv2 : verdict;
> > > >                  elements = {
> > > > "system.slice/systemd-timesyncd.service" : jump
> > > > systemd_timesyncd }
> > > >          }
> > > > 
> > > >          chain systemd_timesyncd {
> > > >                  # systemd-timesyncd policy
> > > >          }
> > > > 
> > > >          chain system_slice {
> > > >                  socket cgroupv2 level 2 vmap @dict_cgroup_level_2
> > > >                  # policy for system.slice process
> > > >          }
> > > > 
> > > >          chain input {
> > > >                  type filter hook input priority filter; policy drop;
> > > 
> > > This example should use the output chain instead:
> > > 
> > >            chain output {
> > >                    type filter hook output priority filter; policy drop;
> > > 
> > >  From the input chain, the packet relies on early demux to have access
> > > to the socket.
> > > 
> > > The idea would be to filter out outgoing traffic and rely on conntrack
> > > for (established) input traffic.
> > 
> > Is it really so that 'socket cgroupv2' can't be used on input side at
> > all? At least 'ss' can display the cgroup for listening sockets
> > correctly, so the cgroup information should be available somewhere:
> > 
> > $ ss -lt --cgroup
> > State    Recv-Q   Send-Q       Local Address:Port       Peer
> > Address:Port   Process
> > LISTEN   0        4096                  *%lo:ssh                   *:*
> >      cgroup:/system.slice/ssh.socket
> 
> Also 'meta skuid' doesn't seem to work in input filters. It would have been
> simple to use 'meta skuid < 1000' to simulate 'system.slice' vs.
> 'user.slice' cgroups.
> 
> If this is intentional, the manual page should make this much clearer.

It is not yet described in nft(8) unfortunately, but
iptables-extensions(8) says:

 IMPORTANT: when being used in the INPUT chain, the cgroup matcher is currently only
       of limited functionality, meaning it will only match on packets that are processed
       for local sockets through early socket demuxing. Therefore, general usage on the INPUT
       chain is not advised unless the implications are well understood.
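
For reference, early demux is governed by the ip_early_demux,
tcp_early_demux and udp_early_demux sysctls under net.ipv4. The sketch
below only reads their current values; the function name and the `base`
parameter are hypothetical and exist just for illustration. Note that,
per the kernel documentation, early demux covers established TCP and
connected UDP sockets, so having these knobs enabled is not expected to
make input rules match traffic for listening sockets.

```python
import os

EARLY_DEMUX_KNOBS = ("ip_early_demux", "tcp_early_demux", "udp_early_demux")


def read_early_demux(base: str = "/proc/sys/net/ipv4") -> dict:
    """Map each knob to True/False, or None if it cannot be read."""
    state = {}
    for name in EARLY_DEMUX_KNOBS:
        try:
            with open(os.path.join(base, name)) as f:
                state[name] = f.read().strip() == "1"
        except OSError:
            # Knob missing on this kernel or not readable.
            state[name] = None
    return state


if __name__ == "__main__":
    for knob, enabled in read_early_demux().items():
        print(knob, enabled)
```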

> There's no warning and the kernel doesn't reject the useless input rules.
> 
> I think it should be possible to do filtering on input side based on the
> socket properties (UID, GID, cgroup). Especially with UDP, it should be
> possible to drop all packets if the listening process is not OK.

Everything is possible; it's just not implemented yet.

> My use case is that I need to open ports for Steam games (TCP and UDP ports
> 27015-27030) but I don't want to make them available for system services or
> any other apps besides Steam games. SELinux SECMARKs and TE rules for
> sockets help me here but there are other problems.


* Re: Support for loading firewall rules with cgroup(v2) expressions early
  2022-03-31 15:10               ` Topi Miettinen
@ 2022-04-05 22:18                 ` Pablo Neira Ayuso
  2022-04-06 14:02                   ` Topi Miettinen
  0 siblings, 1 reply; 17+ messages in thread
From: Pablo Neira Ayuso @ 2022-04-05 22:18 UTC (permalink / raw)
  To: Topi Miettinen; +Cc: netfilter-devel

On Thu, Mar 31, 2022 at 06:10:19PM +0300, Topi Miettinen wrote:
> On 31.3.2022 0.47, Pablo Neira Ayuso wrote:
> > On Wed, Mar 30, 2022 at 07:37:00PM +0300, Topi Miettinen wrote:
[...]
> > > Nice ideas, but the rules can't be loaded before the cgroups are realized at
> > > early boot:
> > > 
> > > Mar 30 19:14:45 systemd[1]: Starting nftables...
> > > Mar 30 19:14:46 nft[1018]: /etc/nftables.conf:305:5-44: Error: cgroupv2 path
> > > fails: Permission denied
> > > Mar 30 19:14:46 nft[1018]: "system.slice/systemd-timesyncd.service" : jump
> > > systemd_timesyncd
> > > Mar 30 19:14:46 nft[1018]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > Mar 30 19:14:46 systemd[1]: nftables.service: Main process exited,
> > > code=exited, status=1/FAILURE
> > > Mar 30 19:14:46 systemd[1]: nftables.service: Failed with result
> > > 'exit-code'.
> > > Mar 30 19:14:46 systemd[1]: Failed to start nftables.
> > 
> > I guess this unit file performs nft -f on cgroupsv2 that do not exist
> > yet.
> 
> Yes, that's the case. Being able to do so with for example "cgroupsv2name"
> would be nice.

Cgroupsv2 names might be arbitrarily large, correct? I.e. up to PATH_MAX.
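
To put the size concern in numbers, here is a small sketch comparing the
two key sizes. It assumes the cgroup ID used in set elements is a 64-bit
value, and it asks the C library for PATH_MAX, falling back to 4096 (the
usual Linux value) if that is unavailable. The names CGROUP_ID_BYTES and
path_key_limit are made up for this example.

```python
import os
import struct

# Size of a 64-bit cgroup ID, i.e. the key a set element carries today.
CGROUP_ID_BYTES = struct.calcsize("=Q")


def path_key_limit(root: str = "/") -> int:
    """Upper bound on a path-based key: PATH_MAX, or 4096 as a fallback."""
    try:
        limit = os.pathconf(root, "PC_PATH_MAX")
    except (OSError, ValueError, AttributeError):
        return 4096
    return limit if limit > 0 else 4096


if __name__ == "__main__":
    print("ID key bytes:", CGROUP_ID_BYTES)
    print("path key bytes (max):", path_key_limit())
```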


* Re: Support for loading firewall rules with cgroup(v2) expressions early
  2022-04-05 22:00                 ` Pablo Neira Ayuso
@ 2022-04-06 13:57                   ` Topi Miettinen
  0 siblings, 0 replies; 17+ messages in thread
From: Topi Miettinen @ 2022-04-06 13:57 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel

On 6.4.2022 1.00, Pablo Neira Ayuso wrote:
> On Sun, Apr 03, 2022 at 09:32:11PM +0300, Topi Miettinen wrote:
>> On 2.4.2022 11.12, Topi Miettinen wrote:
>>> On 30.3.2022 5.53, Pablo Neira Ayuso wrote:
>>>> On Wed, Mar 30, 2022 at 12:25:25AM +0200, Pablo Neira Ayuso wrote:
>>>>> On Tue, Mar 29, 2022 at 09:20:25PM +0300, Topi Miettinen wrote:
>>>> [...]
>>>>> You could define a ruleset that describes the policy following the
>>>>> cgroupsv2 hierarchy. Something like this:
>>>>>
>>>>>    table inet filter {
>>>>>           map dict_cgroup_level_1 {
>>>>>                   type cgroupsv2 : verdict;
>>>>>                   elements = { "system.slice" : jump system_slice }
>>>>>           }
>>>>>
>>>>>           map dict_cgroup_level_2 {
>>>>>                   type cgroupsv2 : verdict;
>>>>>                   elements = {
>>>>> "system.slice/systemd-timesyncd.service" : jump
>>>>> systemd_timesyncd }
>>>>>           }
>>>>>
>>>>>           chain systemd_timesyncd {
>>>>>                   # systemd-timesyncd policy
>>>>>           }
>>>>>
>>>>>           chain system_slice {
>>>>>                   socket cgroupv2 level 2 vmap @dict_cgroup_level_2
>>>>>                   # policy for system.slice process
>>>>>           }
>>>>>
>>>>>           chain input {
>>>>>                   type filter hook input priority filter; policy drop;
>>>>
>>>> This example should use the output chain instead:
>>>>
>>>>             chain output {
>>>>                     type filter hook output priority filter; policy drop;
>>>>
>>>>   From the input chain, the packet relies on early demux to have access
>>>> to the socket.
>>>>
>>>> The idea would be to filter out outgoing traffic and rely on conntrack
>>>> for (established) input traffic.
>>>
>>> Is it really so that 'socket cgroupv2' can't be used on input side at
>>> all? At least 'ss' can display the cgroup for listening sockets
>>> correctly, so the cgroup information should be available somewhere:
>>>
>>> $ ss -lt --cgroup
>>> State    Recv-Q   Send-Q       Local Address:Port       Peer
>>> Address:Port   Process
>>> LISTEN   0        4096                  *%lo:ssh                   *:*
>>>       cgroup:/system.slice/ssh.socket
>>
>> Also 'meta skuid' doesn't seem to work in input filters. It would have been
>> simple to use 'meta skuid < 1000' to simulate 'system.slice' vs.
>> 'user.slice' cgroups.
>>
>> If this is intentional, the manual page should make this much clearer.
> 
> It is not yet described in nft(8) unfortunately, but
> iptables-extensions(8) says:
> 
>   IMPORTANT: when being used in the INPUT chain, the cgroup matcher is currently only
>         of limited functionality, meaning it will only match on packets that are processed
>         for local sockets through early socket demuxing. Therefore, general usage on the INPUT
>         chain is not advised unless the implications are well understood.

Something like this would be nice to add to nft(8). The concept of
'early socket demuxing' isn't very obvious (at least to me). Could the
nft user control this somehow, for example by forcing early demuxing
with a sysctl?

-Topi

> 
>> There's no warning and the kernel doesn't reject the useless input rules.
>>
>> I think it should be possible to do filtering on input side based on the
>> socket properties (UID, GID, cgroup). Especially with UDP, it should be
>> possible to drop all packets if the listening process is not OK.
> 
> Everything is possible, it's not yet implemented though.
> 
>> My use case is that I need to open ports for Steam games (TCP and UDP ports
>> 27015-27030) but I don't want to make them available for system services or
>> any other apps besides Steam games. SELinux SECMARKs and TE rules for
>> sockets help me here but there are other problems.



* Re: Support for loading firewall rules with cgroup(v2) expressions early
  2022-04-05 22:18                 ` Pablo Neira Ayuso
@ 2022-04-06 14:02                   ` Topi Miettinen
  0 siblings, 0 replies; 17+ messages in thread
From: Topi Miettinen @ 2022-04-06 14:02 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel

On 6.4.2022 1.18, Pablo Neira Ayuso wrote:
> On Thu, Mar 31, 2022 at 06:10:19PM +0300, Topi Miettinen wrote:
>> On 31.3.2022 0.47, Pablo Neira Ayuso wrote:
>>> On Wed, Mar 30, 2022 at 07:37:00PM +0300, Topi Miettinen wrote:
> [...]
>>>> Nice ideas, but the rules can't be loaded before the cgroups are realized at
>>>> early boot:
>>>>
>>>> Mar 30 19:14:45 systemd[1]: Starting nftables...
>>>> Mar 30 19:14:46 nft[1018]: /etc/nftables.conf:305:5-44: Error: cgroupv2 path
>>>> fails: Permission denied
>>>> Mar 30 19:14:46 nft[1018]: "system.slice/systemd-timesyncd.service" : jump
>>>> systemd_timesyncd
>>>> Mar 30 19:14:46 nft[1018]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>> Mar 30 19:14:46 systemd[1]: nftables.service: Main process exited,
>>>> code=exited, status=1/FAILURE
>>>> Mar 30 19:14:46 systemd[1]: nftables.service: Failed with result
>>>> 'exit-code'.
>>>> Mar 30 19:14:46 systemd[1]: Failed to start nftables.
>>>
>>> I guess this unit file performs nft -f on cgroupsv2 that do not exist
>>> yet.
>>
>> Yes, that's the case. Being able to do so with for example "cgroupsv2name"
>> would be nice.
> 
> Cgroupsv2 names might be arbitrarily large, correct? ie. PATH_MAX.

I think so. Could this be a problem?

-Topi


Thread overview: 17+ messages
2022-03-26 10:09 Support for loading firewall rules with cgroup(v2) expressions early Topi Miettinen
2022-03-27 21:31 ` Pablo Neira Ayuso
2022-03-28 14:08   ` Topi Miettinen
2022-03-28 15:05     ` Pablo Neira Ayuso
2022-03-28 17:46       ` Topi Miettinen
2022-03-29 18:20       ` Topi Miettinen
2022-03-29 22:25         ` Pablo Neira Ayuso
2022-03-30  2:53           ` Pablo Neira Ayuso
2022-04-02  8:12             ` Topi Miettinen
2022-04-03 18:32               ` Topi Miettinen
2022-04-05 22:00                 ` Pablo Neira Ayuso
2022-04-06 13:57                   ` Topi Miettinen
2022-03-30 16:37           ` Topi Miettinen
2022-03-30 21:47             ` Pablo Neira Ayuso
2022-03-31 15:10               ` Topi Miettinen
2022-04-05 22:18                 ` Pablo Neira Ayuso
2022-04-06 14:02                   ` Topi Miettinen
