Re: [PATCH bpf-next v4 1/7] netkit, bpf: Add bpf programmable net device

From: Daniel Borkmann <daniel@iogearbox.net>
To: Jiri Pirko <jiri@resnulli.us>
Cc: bpf@vger.kernel.org, netdev@vger.kernel.org,
	martin.lau@linux.dev, razor@blackwall.org, ast@kernel.org,
	andrii@kernel.org, john.fastabend@gmail.com, sdf@google.com,
	toke@kernel.org, kuba@kernel.org, andrew@lunn.ch,
	"Toke Høiland-Jørgensen" <toke@redhat.com>
Subject: Re: [PATCH bpf-next v4 1/7] netkit, bpf: Add bpf programmable net device
Date: Wed, 25 Oct 2023 19:20:01 +0200	[thread overview]
Message-ID: <7dcf130e-db64-34bc-5207-15e4a4963bc0@iogearbox.net> (raw)
In-Reply-To: <ZTk4ec8CBh92PZvs@nanopsycho>

On 10/25/23 5:47 PM, Jiri Pirko wrote:
> Tue, Oct 24, 2023 at 11:48:58PM CEST, daniel@iogearbox.net wrote:
[...]
>> comes with a primary and a peer device. Only the primary device, typically
>> residing in hostns, can manage BPF programs for itself and its peer. The
>> peer device is designated for containers/Pods and cannot attach/detach
>> BPF programs. Upon the device creation, the user can set the default policy
>> to 'pass' or 'drop' for the case when no BPF program is attached.
> 
> It looks to me that you only need the host (primary) netdevice to be
> used as a handle to attach the bpf programs. Because the bpf program
> can (and probably in real use case will) redirect to uplink/another
> pod netdevice skipping the host (primary) netdevice, correct?
> 
> If so, why can't you do just single device mode from start finding a
> different sort of bpf attach handle? (not sure which)

The first point where we switch netns from a K8s Pod is out of the netdevice.
For the CNI case the vast majority has one but there could also be multi-
homing for Pods where there may be two or more, and from a troubleshooting
PoV aka tcpdump et al, it is the most natural point. Other attach handle
inside the Pod doesn't really fit given from policy PoV it also must be
unreachable for apps inside the Pod itself.

>> Additionally, the device can be operated in L3 (default) or L2 mode. The
>> management of BPF programs is done via bpf_mprog, so that multi-attach is
>> supported right from the beginning with similar API and dependency controls
>> as tcx. For details on the latter see commit 053c8e1f235d ("bpf: Add generic
>> attach/detach/query API for multi-progs"). tc BPF compatibility is provided,
>> so that existing programs can be easily migrated.
>>
>> Going forward, we plan to use netkit devices in Cilium as the main device
>> type for connecting Pods. They will be operated in L3 mode in order to
>> simplify a Pod's neighbor management and the peer will operate in default
>> drop mode, so that no traffic is leaving between the time when a Pod is
>> brought up by the CNI plugin and programs attached by the agent.
>> Additionally, the programs we attach via tcx on the physical devices are
>> using bpf_redirect_peer() for inbound traffic into netkit device, hence the
>> latter is also supporting the ndo_get_peer_dev callback. Similarly, we use
>> bpf_redirect_neigh() for the way out, pushing from netkit peer to phys device
>> directly. Also, BIG TCP is supported on netkit device. For the follow-up
>> work in single device mode, we plan to convert Cilium's cilium_host/_net
>> devices into a single one.
>>
>> An extensive test suite for checking device operations and the BPF program
>> and link management API comes as BPF selftests in this series.
> 
> Couple of nitpicks below:

Thanks for looking into those, I'll look into it and send a small cleanup on
the two.

>> +static int netkit_check_policy(int policy, struct nlattr *tb,
>> +			       struct netlink_ext_ack *extack)
>> +{
>> +	switch (policy) {
>> +	case NETKIT_PASS:
>> +	case NETKIT_DROP:
>> +		return 0;
>> +	default:
> 
> Isn't this job for netlink policy?
> 
> 
>> +		NL_SET_ERR_MSG_ATTR(extack, tb,
>> +				    "Provided default xmit policy not supported");
>> +		return -EINVAL;
>> +	}
>> +}
>> +
>> +static int netkit_check_mode(int mode, struct nlattr *tb,
>> +			     struct netlink_ext_ack *extack)
>> +{
>> +	switch (mode) {
>> +	case NETKIT_L2:
>> +	case NETKIT_L3:
>> +		return 0;
>> +	default:
> 
> Isn't this job for netlink policy?
> 
> 
>> +		NL_SET_ERR_MSG_ATTR(extack, tb,
>> +				    "Provided device mode can only be L2 or L3");
>> +		return -EINVAL;
>> +	}
>> +}
>> +
>> +static int netkit_validate(struct nlattr *tb[], struct nlattr *data[],
>> +			   struct netlink_ext_ack *extack)
>> +{
>> +	struct nlattr *attr = tb[IFLA_ADDRESS];
>> +
>> +	if (!attr)
>> +		return 0;
>> +	NL_SET_ERR_MSG_ATTR(extack, attr,
>> +			    "Setting Ethernet address is not supported");
>> +	return -EOPNOTSUPP;
>> +}
>> +
>> +static struct rtnl_link_ops netkit_link_ops;
>> +
>> +static int netkit_new_link(struct net *src_net, struct net_device *dev,
>> +			   struct nlattr *tb[], struct nlattr *data[],
>> +			   struct netlink_ext_ack *extack)
>> +{
>> +	struct nlattr *peer_tb[IFLA_MAX + 1], **tbp = tb, *attr;
>> +	enum netkit_action default_prim = NETKIT_PASS;
>> +	enum netkit_action default_peer = NETKIT_PASS;
>> +	enum netkit_mode mode = NETKIT_L3;
>> +	unsigned char ifname_assign_type;
>> +	struct ifinfomsg *ifmp = NULL;
>> +	struct net_device *peer;
>> +	char ifname[IFNAMSIZ];
>> +	struct netkit *nk;
>> +	struct net *net;
>> +	int err;
>> +
>> +	if (data) {
>> +		if (data[IFLA_NETKIT_MODE]) {
>> +			attr = data[IFLA_NETKIT_MODE];
>> +			mode = nla_get_u32(attr);
>> +			err = netkit_check_mode(mode, attr, extack);
>> +			if (err < 0)
>> +				return err;
>> +		}
>> +		if (data[IFLA_NETKIT_PEER_INFO]) {
>> +			attr = data[IFLA_NETKIT_PEER_INFO];
>> +			ifmp = nla_data(attr);
>> +			err = rtnl_nla_parse_ifinfomsg(peer_tb, attr, extack);
>> +			if (err < 0)
>> +				return err;
>> +			err = netkit_validate(peer_tb, NULL, extack);
>> +			if (err < 0)
>> +				return err;
>> +			tbp = peer_tb;
>> +		}
>> +		if (data[IFLA_NETKIT_POLICY]) {
>> +			attr = data[IFLA_NETKIT_POLICY];
>> +			default_prim = nla_get_u32(attr);
>> +			err = netkit_check_policy(default_prim, attr, extack);
>> +			if (err < 0)
>> +				return err;
>> +		}
>> +		if (data[IFLA_NETKIT_PEER_POLICY]) {
>> +			attr = data[IFLA_NETKIT_PEER_POLICY];
>> +			default_peer = nla_get_u32(attr);
>> +			err = netkit_check_policy(default_peer, attr, extack);
>> +			if (err < 0)
>> +				return err;
>> +		}
>> +	}
>> +
>> +	if (ifmp && tbp[IFLA_IFNAME]) {
>> +		nla_strscpy(ifname, tbp[IFLA_IFNAME], IFNAMSIZ);
>> +		ifname_assign_type = NET_NAME_USER;
>> +	} else {
>> +		strscpy(ifname, "nk%d", IFNAMSIZ);
>> +		ifname_assign_type = NET_NAME_ENUM;
>> +	}
>> +
>> +	net = rtnl_link_get_net(src_net, tbp);
>> +	if (IS_ERR(net))
>> +		return PTR_ERR(net);
>> +
>> +	peer = rtnl_create_link(net, ifname, ifname_assign_type,
>> +				&netkit_link_ops, tbp, extack);
>> +	if (IS_ERR(peer)) {
>> +		put_net(net);
>> +		return PTR_ERR(peer);
>> +	}
>> +
>> +	netif_inherit_tso_max(peer, dev);
>> +
>> +	if (mode == NETKIT_L2)
>> +		eth_hw_addr_random(peer);
>> +	if (ifmp && dev->ifindex)
>> +		peer->ifindex = ifmp->ifi_index;
>> +
>> +	nk = netkit_priv(peer);
>> +	nk->primary = false;
>> +	nk->policy = default_peer;
>> +	nk->mode = mode;
>> +	bpf_mprog_bundle_init(&nk->bundle);
>> +	RCU_INIT_POINTER(nk->active, NULL);
>> +	RCU_INIT_POINTER(nk->peer, NULL);
> 
> Aren't these already 0?
> 
> 
>> +
>> +	err = register_netdevice(peer);
>> +	put_net(net);
>> +	if (err < 0)
>> +		goto err_register_peer;
>> +	netif_carrier_off(peer);
>> +	if (mode == NETKIT_L2)
>> +		dev_change_flags(peer, peer->flags & ~IFF_NOARP, NULL);
>> +
>> +	err = rtnl_configure_link(peer, NULL, 0, NULL);
>> +	if (err < 0)
>> +		goto err_configure_peer;
>> +
>> +	if (mode == NETKIT_L2)
>> +		eth_hw_addr_random(dev);
>> +	if (tb[IFLA_IFNAME])
>> +		nla_strscpy(dev->name, tb[IFLA_IFNAME], IFNAMSIZ);
>> +	else
>> +		strscpy(dev->name, "nk%d", IFNAMSIZ);
>> +
>> +	nk = netkit_priv(dev);
>> +	nk->primary = true;
>> +	nk->policy = default_prim;
>> +	nk->mode = mode;
>> +	bpf_mprog_bundle_init(&nk->bundle);
>> +	RCU_INIT_POINTER(nk->active, NULL);
>> +	RCU_INIT_POINTER(nk->peer, NULL);
>> +
>> +	err = register_netdevice(dev);
>> +	if (err < 0)
>> +		goto err_configure_peer;
>> +	netif_carrier_off(dev);
>> +	if (mode == NETKIT_L2)
>> +		dev_change_flags(dev, dev->flags & ~IFF_NOARP, NULL);
>> +
>> +	rcu_assign_pointer(netkit_priv(dev)->peer, peer);
>> +	rcu_assign_pointer(netkit_priv(peer)->peer, dev);
>> +	return 0;
>> +err_configure_peer:
>> +	unregister_netdevice(peer);
>> +	return err;
>> +err_register_peer:
>> +	free_netdev(peer);
>> +	return err;
>> +}
>> +
> 
> [..]
>