* [PATCH net-next v2 0/3] VGT+ support
@ 2019-10-31 19:47 Ariel Levkovich
  2019-10-31 19:47 ` [PATCH net-next v2 1/3] net: Support querying specific VF properties Ariel Levkovich
                   ` (4 more replies)
  0 siblings, 5 replies; 24+ messages in thread
From: Ariel Levkovich @ 2019-10-31 19:47 UTC (permalink / raw)
  To: netdev
  Cc: Saeed Mahameed, sd, sbrivio, nikolay, Jiri Pirko, dsahern,
	stephen, Ariel Levkovich

The following series introduces VGT+ support for SRIOV vf
devices.

VGT+ is an extension of VGT (Virtual Guest Tagging), where the
guest is in charge of vlan tagging the packets. With VGT+, the
admin can limit the vlan ids the guest may use to a specific
vlan trunk list.

The patches introduce the API for admin users to set and
query these vlan trunk lists on the vfs using netlink
commands.

Changes from v1 to v2:
- Fixed indentation of RTEXT_FILTER_SKIP_STATS.
- Changed vf_ext param to bool.
- Check if VF num exceeds the opened VFs range and return without
adding the vfinfo.

Ariel Levkovich (3):
  net: Support querying specific VF properties
  net: Add SRIOV VGT+ support
  net/mlx5: Add SRIOV VGT+ support

 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |  30 ++
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.c  | 600 ++++++++++++++++-----
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.h  |  27 +-
 .../ethernet/mellanox/mlx5/core/eswitch_offloads.c |   8 +-
 include/linux/if_link.h                            |   3 +
 include/linux/netdevice.h                          |  12 +
 include/uapi/linux/if_link.h                       |  35 ++
 include/uapi/linux/rtnetlink.h                     |   3 +-
 net/core/rtnetlink.c                               | 173 ++++--
 9 files changed, 717 insertions(+), 174 deletions(-)

-- 
1.8.3.1



* [PATCH net-next v2 1/3] net: Support querying specific VF properties
  2019-10-31 19:47 [PATCH net-next v2 0/3] VGT+ support Ariel Levkovich
@ 2019-10-31 19:47 ` Ariel Levkovich
  2019-10-31 19:47 ` [PATCH net-next v2 2/3] net: Add SRIOV VGT+ support Ariel Levkovich
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 24+ messages in thread
From: Ariel Levkovich @ 2019-10-31 19:47 UTC (permalink / raw)
  To: netdev
  Cc: Saeed Mahameed, sd, sbrivio, nikolay, Jiri Pirko, dsahern,
	stephen, Ariel Levkovich

Querying the link with its VFs information involves putting a
vfinfo struct per VF in the netlink message under the
IFLA_VFINFO_LIST attribute.

Since the attribute's length is limited by its definition to a u16,
this introduces a problem when we want to add new fields to the
vfinfo attribute.
If we increase the vfinfo attribute size and run in an environment
with a large number of VFs, we may overflow the IFLA_VFINFO_LIST
attribute length.

To avoid that, this patch introduces a single VF query.
With a single VF query, the kernel may include extended VF
information and fields that take up a significant amount of memory
in the vfinfo attribute.
This information is not included in the VF list query, which
prevents attribute length overflow.

The admin will be able to query the link and get extended VF info
using the ip tool with the following command:
ip link show dev <ifname> vf <vf_num>
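The motivation above is arithmetic: a nested netlink attribute's length field is a u16, so the whole IFLA_VFINFO_LIST payload is capped at 65535 bytes. The userspace C sketch below models that cap; the per-VF sizes used are illustrative assumptions, not the exact kernel struct sizes.

```c
#include <assert.h>
#include <stdint.h>

/* nlattr->nla_len is a __u16, so a nested attribute's total payload
 * (headers included) cannot exceed 65535 bytes. */
#define NLA_MAX_LEN UINT16_MAX

/* How many per-VF nests of a given size fit under IFLA_VFINFO_LIST.
 * The per-VF sizes fed to this in practice are assumptions for
 * illustration, not the kernel's exact nla_total_size() sums. */
static unsigned int max_vfs_in_list(unsigned int per_vf_bytes)
{
	return NLA_MAX_LEN / per_vf_bytes;
}
```

With an assumed ~1 KB per vfinfo nest the list holds 63 VFs; adding two 512-byte trunk bitmaps roughly halves that, which is why the extended fields are reported only on a single-VF query.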

Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
---
 include/uapi/linux/if_link.h   |  1 +
 include/uapi/linux/rtnetlink.h |  3 ++-
 net/core/rtnetlink.c           | 53 +++++++++++++++++++++++++++++++++---------
 3 files changed, 45 insertions(+), 12 deletions(-)

diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 8aec876..797e214 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -169,6 +169,7 @@ enum {
 	IFLA_MAX_MTU,
 	IFLA_PROP_LIST,
 	IFLA_ALT_IFNAME, /* Alternative ifname */
+	IFLA_VF_NUM, /* Get extended information for specific VF */
 	__IFLA_MAX
 };
 
diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index 1418a83..8825ede 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -759,7 +759,8 @@ enum {
 #define RTEXT_FILTER_VF		(1 << 0)
 #define RTEXT_FILTER_BRVLAN	(1 << 1)
 #define RTEXT_FILTER_BRVLAN_COMPRESSED	(1 << 2)
-#define	RTEXT_FILTER_SKIP_STATS	(1 << 3)
+#define RTEXT_FILTER_SKIP_STATS	(1 << 3)
+#define RTEXT_FILTER_VF_EXT	(1 << 4)
 
 /* End of information exported to user level */
 
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 49fa910..4dd5939 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -906,9 +906,14 @@ static void copy_rtnl_link_stats(struct rtnl_link_stats *a,
 static inline int rtnl_vfinfo_size(const struct net_device *dev,
 				   u32 ext_filter_mask)
 {
-	if (dev->dev.parent && (ext_filter_mask & RTEXT_FILTER_VF)) {
+	if (dev->dev.parent &&
+	    (ext_filter_mask & (RTEXT_FILTER_VF | RTEXT_FILTER_VF_EXT))) {
 		int num_vfs = dev_num_vf(dev->dev.parent);
 		size_t size = nla_total_size(0);
+
+		if (num_vfs && (ext_filter_mask & RTEXT_FILTER_VF_EXT))
+			num_vfs = 1;
+
 		size += num_vfs *
 			(nla_total_size(0) +
 			 nla_total_size(sizeof(struct ifla_vf_mac)) +
@@ -1022,7 +1027,8 @@ static noinline size_t if_nlmsg_size(const struct net_device *dev,
 	       + nla_total_size(4) /* IFLA_LINK_NETNSID */
 	       + nla_total_size(4) /* IFLA_GROUP */
 	       + nla_total_size(ext_filter_mask
-			        & RTEXT_FILTER_VF ? 4 : 0) /* IFLA_NUM_VF */
+				& (RTEXT_FILTER_VF | RTEXT_FILTER_VF_EXT) ?
+				4 : 0) /* IFLA_NUM_VF */
 	       + rtnl_vfinfo_size(dev, ext_filter_mask) /* IFLA_VFINFO_LIST */
 	       + rtnl_port_size(dev, ext_filter_mask) /* IFLA_VF_PORTS + IFLA_PORT_SELF */
 	       + rtnl_link_get_size(dev) /* IFLA_LINKINFO */
@@ -1203,7 +1209,8 @@ static noinline_for_stack int rtnl_fill_stats(struct sk_buff *skb,
 static noinline_for_stack int rtnl_fill_vfinfo(struct sk_buff *skb,
 					       struct net_device *dev,
 					       int vfs_num,
-					       struct nlattr *vfinfo)
+					       struct nlattr *vfinfo,
+					       bool vf_ext)
 {
 	struct ifla_vf_rss_query_en vf_rss_query_en;
 	struct nlattr *vf, *vfstats, *vfvlanlist;
@@ -1332,15 +1339,25 @@ static noinline_for_stack int rtnl_fill_vfinfo(struct sk_buff *skb,
 
 static noinline_for_stack int rtnl_fill_vf(struct sk_buff *skb,
 					   struct net_device *dev,
-					   u32 ext_filter_mask)
+					   u32 ext_filter_mask,
+					   int vf)
 {
+	bool vf_ext = (ext_filter_mask & RTEXT_FILTER_VF_EXT) && (vf >= 0);
 	struct nlattr *vfinfo;
 	int i, num_vfs;
 
-	if (!dev->dev.parent || ((ext_filter_mask & RTEXT_FILTER_VF) == 0))
+	if (!dev->dev.parent ||
+	    ((ext_filter_mask & (RTEXT_FILTER_VF | RTEXT_FILTER_VF_EXT)) == 0))
 		return 0;
 
 	num_vfs = dev_num_vf(dev->dev.parent);
+	if (vf_ext && num_vfs) {
+		if (vf >= num_vfs)
+			return 0;
+
+		num_vfs = 1;
+	}
+
 	if (nla_put_u32(skb, IFLA_NUM_VF, num_vfs))
 		return -EMSGSIZE;
 
@@ -1352,7 +1369,7 @@ static noinline_for_stack int rtnl_fill_vf(struct sk_buff *skb,
 		return -EMSGSIZE;
 
 	for (i = 0; i < num_vfs; i++) {
-		if (rtnl_fill_vfinfo(skb, dev, i, vfinfo))
+		if (rtnl_fill_vfinfo(skb, dev, vf_ext ? vf : i, vfinfo, vf_ext))
 			return -EMSGSIZE;
 	}
 
@@ -1639,7 +1656,7 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb,
 			    int type, u32 pid, u32 seq, u32 change,
 			    unsigned int flags, u32 ext_filter_mask,
 			    u32 event, int *new_nsid, int new_ifindex,
-			    int tgt_netnsid)
+			    int tgt_netnsid, int vf)
 {
 	struct ifinfomsg *ifm;
 	struct nlmsghdr *nlh;
@@ -1717,7 +1734,7 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb,
 	if (rtnl_fill_stats(skb, dev))
 		goto nla_put_failure;
 
-	if (rtnl_fill_vf(skb, dev, ext_filter_mask))
+	if (rtnl_fill_vf(skb, dev, ext_filter_mask, vf))
 		goto nla_put_failure;
 
 	if (rtnl_port_fill(skb, dev, ext_filter_mask))
@@ -1806,6 +1823,7 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb,
 	[IFLA_PROP_LIST]	= { .type = NLA_NESTED },
 	[IFLA_ALT_IFNAME]	= { .type = NLA_STRING,
 				    .len = ALTIFNAMSIZ - 1 },
+	[IFLA_VF_NUM]		= { .type = NLA_U32 },
 };
 
 static const struct nla_policy ifla_info_policy[IFLA_INFO_MAX+1] = {
@@ -2057,7 +2075,7 @@ static int rtnl_dump_ifinfo(struct sk_buff *skb, struct netlink_callback *cb)
 					       NETLINK_CB(cb->skb).portid,
 					       nlh->nlmsg_seq, 0, flags,
 					       ext_filter_mask, 0, NULL, 0,
-					       netnsid);
+					       netnsid, -1);
 
 			if (err < 0) {
 				if (likely(skb->len))
@@ -3365,6 +3383,7 @@ static int rtnl_valid_getlink_req(struct sk_buff *skb,
 		case IFLA_ALT_IFNAME:
 		case IFLA_EXT_MASK:
 		case IFLA_TARGET_NETNSID:
+		case IFLA_VF_NUM:
 			break;
 		default:
 			NL_SET_ERR_MSG(extack, "Unsupported attribute in get link request");
@@ -3385,6 +3404,7 @@ static int rtnl_getlink(struct sk_buff *skb, struct nlmsghdr *nlh,
 	struct net_device *dev = NULL;
 	struct sk_buff *nskb;
 	int netnsid = -1;
+	int vf = -1;
 	int err;
 	u32 ext_filter_mask = 0;
 
@@ -3407,6 +3427,17 @@ static int rtnl_getlink(struct sk_buff *skb, struct nlmsghdr *nlh,
 		ext_filter_mask = nla_get_u32(tb[IFLA_EXT_MASK]);
 
 	err = -EINVAL;
+	if ((ext_filter_mask & RTEXT_FILTER_VF) &&
+	    (ext_filter_mask & RTEXT_FILTER_VF_EXT))
+		goto out;
+
+	if (ext_filter_mask & RTEXT_FILTER_VF_EXT) {
+		if (tb[IFLA_VF_NUM])
+			vf = nla_get_u32(tb[IFLA_VF_NUM]);
+		else
+			goto out;
+	}
+
 	ifm = nlmsg_data(nlh);
 	if (ifm->ifi_index > 0)
 		dev = __dev_get_by_index(tgt_net, ifm->ifi_index);
@@ -3428,7 +3459,7 @@ static int rtnl_getlink(struct sk_buff *skb, struct nlmsghdr *nlh,
 	err = rtnl_fill_ifinfo(nskb, dev, net,
 			       RTM_NEWLINK, NETLINK_CB(skb).portid,
 			       nlh->nlmsg_seq, 0, 0, ext_filter_mask,
-			       0, NULL, 0, netnsid);
+			       0, NULL, 0, netnsid, vf);
 	if (err < 0) {
 		/* -EMSGSIZE implies BUG in if_nlmsg_size */
 		WARN_ON(err == -EMSGSIZE);
@@ -3634,7 +3665,7 @@ struct sk_buff *rtmsg_ifinfo_build_skb(int type, struct net_device *dev,
 
 	err = rtnl_fill_ifinfo(skb, dev, dev_net(dev),
 			       type, 0, 0, change, 0, 0, event,
-			       new_nsid, new_ifindex, -1);
+			       new_nsid, new_ifindex, -1, -1);
 	if (err < 0) {
 		/* -EMSGSIZE implies BUG in if_nlmsg_size() */
 		WARN_ON(err == -EMSGSIZE);
-- 
1.8.3.1



* [PATCH net-next v2 2/3] net: Add SRIOV VGT+ support
  2019-10-31 19:47 [PATCH net-next v2 0/3] VGT+ support Ariel Levkovich
  2019-10-31 19:47 ` [PATCH net-next v2 1/3] net: Support querying specific VF properties Ariel Levkovich
@ 2019-10-31 19:47 ` Ariel Levkovich
  2019-10-31 19:47 ` [PATCH net-next v2 3/3] net/mlx5: " Ariel Levkovich
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 24+ messages in thread
From: Ariel Levkovich @ 2019-10-31 19:47 UTC (permalink / raw)
  To: netdev
  Cc: Saeed Mahameed, sd, sbrivio, nikolay, Jiri Pirko, dsahern,
	stephen, Ariel Levkovich

VGT+ is a security feature that gives the administrator the ability
to control the list of allowed vlan-ids that can be transmitted/received
from/to the VF.
The allowed vlan-id list is called the "trunk".
The admin can add/remove a range of allowed vlan-ids via the ip tool.
Example:
After the following sequence of commands:
1) ip link set eth3 vf 0 trunk add 10 100 (allow vlan-id 10-100, default tpid 0x8100)
2) ip link set eth3 vf 0 trunk add 105 proto 802.1q (allow vlan-id 105 tpid 0x8100)
3) ip link set eth3 vf 0 trunk add 105 proto 802.1ad (allow vlan-id 105 tpid 0x88a8)
4) ip link set eth3 vf 0 trunk rem 90 (block vlan-id 90)
5) ip link set eth3 vf 0 trunk rem 50 60 (block vlan-ids 50-60)

VF 0 can then only communicate on vlan-ids 10-49,61-89,91-100,105 with
tpid 0x8100, and on vlan-id 105 with tpid 0x88a8.
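The trunk semantics above can be modeled in plain userspace C as a 4096-bit allow-map per TPID. This is a sketch of the intended behavior, not the kernel implementation:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define VLAN_N_VID 4096
#define TRUNK_WORDS (VLAN_N_VID / 64)

/* One allowed-vlan set per TPID, as a 4096-bit map. */
struct trunk { unsigned long long bm[TRUNK_WORDS]; };

/* "trunk add start end" sets bits [start, end]; "trunk rem" clears them. */
static void trunk_set_range(struct trunk *t, int start, int end, bool allow)
{
	for (int vid = start; vid <= end; vid++) {
		if (allow)
			t->bm[vid / 64] |= 1ULL << (vid % 64);
		else
			t->bm[vid / 64] &= ~(1ULL << (vid % 64));
	}
}

static bool trunk_allows(const struct trunk *t, int vid)
{
	return t->bm[vid / 64] & (1ULL << (vid % 64));
}

/* Replays the five example commands above on a 802.1Q/802.1ad pair
 * and reports whether @vid ends up allowed for the given TPID. */
static bool example_allows(int vid, bool proto_8021ad)
{
	struct trunk q, ad;

	memset(&q, 0, sizeof(q));
	memset(&ad, 0, sizeof(ad));
	trunk_set_range(&q, 10, 100, true);   /* 1) trunk add 10 100 */
	trunk_set_range(&q, 105, 105, true);  /* 2) trunk add 105 proto 802.1q */
	trunk_set_range(&ad, 105, 105, true); /* 3) trunk add 105 proto 802.1ad */
	trunk_set_range(&q, 90, 90, false);   /* 4) trunk rem 90 */
	trunk_set_range(&q, 50, 60, false);   /* 5) trunk rem 50 60 */

	return proto_8021ad ? trunk_allows(&ad, vid) : trunk_allows(&q, vid);
}
```

Replaying the five commands yields exactly the allowed set stated above: 10-49, 61-89, 91-100 and 105 for 0x8100, plus 105 for 0x88a8.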

For this purpose we add the following netlink SR-IOV commands:

1) IFLA_VF_VLAN_RANGE: used to add/remove an allowed vlan-id range.
We added the ifla_vf_vlan_range struct so userspace can specify the
range to add/remove.
We added the ndo_add_vf_vlan_trunk_range and ndo_del_vf_vlan_trunk_range
netdev ops to add/remove an allowed vlan-id range in the netdev.

2) IFLA_VF_VLAN_TRUNK: used to query the allowed vlan-id trunk.
We added a trunk bitmap to the ifla_vf_info struct to get the current
allowed vlan-id trunk from the netdev.
We added the ifla_vf_vlan_trunk struct for sending the allowed vlan-id
trunk to userspace.
Since the trunk bitmap needs a bit per possible vlan id, the size
addition to ifla_vf_info is significant and may cause an attribute
length overrun when querying all the VFs.

Therefore, returning the full bitmap is limited to the case where the
admin queries a specific VF. For the VF list query we introduce a new
vf_info attribute, ifla_vf_vlan_mode, that reports the current VF
tagging mode - VGT, VST or VGT+ (trunk).
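For scale, the size math behind the "significant addition" is easy to check in userspace; the macros below mirror the ones the patch adds to the uapi header:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same math as the uapi header's helpers, redefined here so the
 * sketch is self-contained. */
#define BITS_PER_BYTE 8
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))
#define VF_VLAN_N_VID 4096
#define VF_VLAN_BITMAP \
	DIV_ROUND_UP(VF_VLAN_N_VID, sizeof(uint64_t) * BITS_PER_BYTE)

/* Bytes the two per-TPID trunk bitmaps (802.1Q + 802.1ad) add to
 * one vfinfo entry. */
static size_t trunk_bytes_per_vf(void)
{
	return 2 * VF_VLAN_BITMAP * sizeof(uint64_t);
}
```

4096 bits pack into 64 u64 words, i.e. 512 bytes per TPID and about 1 KB per VF: large enough to overrun the u16 list-attribute length with many VFs, hence the single-VF-only reporting.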

Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
---
 include/linux/if_link.h      |   3 ++
 include/linux/netdevice.h    |  12 +++++
 include/uapi/linux/if_link.h |  34 ++++++++++++
 net/core/rtnetlink.c         | 122 ++++++++++++++++++++++++++++++++-----------
 4 files changed, 140 insertions(+), 31 deletions(-)

diff --git a/include/linux/if_link.h b/include/linux/if_link.h
index 622658d..7146181 100644
--- a/include/linux/if_link.h
+++ b/include/linux/if_link.h
@@ -28,6 +28,9 @@ struct ifla_vf_info {
 	__u32 max_tx_rate;
 	__u32 rss_query_en;
 	__u32 trusted;
+	__u32 vlan_mode;
+	__u64 trunk_8021q[VF_VLAN_BITMAP];
+	__u64 trunk_8021ad[VF_VLAN_BITMAP];
 	__be16 vlan_proto;
 };
 #endif /* _LINUX_IF_LINK_H */
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 3207e0b..da79976 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1067,6 +1067,10 @@ struct netdev_name_node {
  *      Hash Key. This is needed since on some devices VF share this information
  *      with PF and querying it may introduce a theoretical security risk.
  * int (*ndo_set_vf_rss_query_en)(struct net_device *dev, int vf, bool setting);
+ * int (*ndo_add_vf_vlan_trunk_range)(struct net_device *dev, int vf,
+ *				      u16 start_vid, u16 end_vid, __be16 proto);
+ * int (*ndo_del_vf_vlan_trunk_range)(struct net_device *dev, int vf,
+ *				      u16 start_vid, u16 end_vid, __be16 proto);
  * int (*ndo_get_vf_port)(struct net_device *dev, int vf, struct sk_buff *skb);
  * int (*ndo_setup_tc)(struct net_device *dev, enum tc_setup_type type,
  *		       void *type_data);
@@ -1332,6 +1336,14 @@ struct net_device_ops {
 	int			(*ndo_set_vf_rss_query_en)(
 						   struct net_device *dev,
 						   int vf, bool setting);
+	int			(*ndo_add_vf_vlan_trunk_range)(
+						   struct net_device *dev,
+						   int vf, u16 start_vid,
+						   u16 end_vid, __be16 proto);
+	int			(*ndo_del_vf_vlan_trunk_range)(
+						   struct net_device *dev,
+						   int vf, u16 start_vid,
+						   u16 end_vid, __be16 proto);
 	int			(*ndo_setup_tc)(struct net_device *dev,
 						enum tc_setup_type type,
 						void *type_data);
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 797e214..35ab210 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -180,6 +180,8 @@ enum {
 #ifndef __KERNEL__
 #define IFLA_RTA(r)  ((struct rtattr*)(((char*)(r)) + NLMSG_ALIGN(sizeof(struct ifinfomsg))))
 #define IFLA_PAYLOAD(n) NLMSG_PAYLOAD(n,sizeof(struct ifinfomsg))
+#define BITS_PER_BYTE 8
+#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))
 #endif
 
 enum {
@@ -699,6 +701,9 @@ enum {
 	IFLA_VF_IB_PORT_GUID,	/* VF Infiniband port GUID */
 	IFLA_VF_VLAN_LIST,	/* nested list of vlans, option for QinQ */
 	IFLA_VF_BROADCAST,	/* VF broadcast */
+	IFLA_VF_VLAN_MODE,	/* vlan tagging mode */
+	IFLA_VF_VLAN_RANGE,	/* add/delete vlan range filtering */
+	IFLA_VF_VLAN_TRUNK,	/* vlan trunk filtering */
 	__IFLA_VF_MAX,
 };
 
@@ -713,6 +718,19 @@ struct ifla_vf_broadcast {
 	__u8 broadcast[32];
 };
 
+enum {
+	IFLA_VF_VLAN_MODE_UNSPEC,
+	IFLA_VF_VLAN_MODE_VGT,
+	IFLA_VF_VLAN_MODE_VST,
+	IFLA_VF_VLAN_MODE_TRUNK,
+	__IFLA_VF_VLAN_MODE_MAX,
+};
+
+struct ifla_vf_vlan_mode {
+	__u32 vf;
+	__u32 mode; /* The VLAN tagging mode */
+};
+
 struct ifla_vf_vlan {
 	__u32 vf;
 	__u32 vlan; /* 0 - 4095, 0 disables VLAN filter */
@@ -727,6 +745,7 @@ enum {
 
 #define IFLA_VF_VLAN_INFO_MAX (__IFLA_VF_VLAN_INFO_MAX - 1)
 #define MAX_VLAN_LIST_LEN 1
+#define VF_VLAN_N_VID 4096
 
 struct ifla_vf_vlan_info {
 	__u32 vf;
@@ -735,6 +754,21 @@ struct ifla_vf_vlan_info {
 	__be16 vlan_proto; /* VLAN protocol either 802.1Q or 802.1ad */
 };
 
+struct ifla_vf_vlan_range {
+	__u32 vf;
+	__u32 start_vid;   /* 1 - 4095 */
+	__u32 end_vid;     /* 1 - 4095 */
+	__u32 setting;
+	__be16 vlan_proto; /* VLAN protocol either 802.1Q or 802.1ad */
+};
+
+#define VF_VLAN_BITMAP	DIV_ROUND_UP(VF_VLAN_N_VID, sizeof(__u64) * BITS_PER_BYTE)
+struct ifla_vf_vlan_trunk {
+	__u32 vf;
+	__u64 allowed_vlans_8021q_bm[VF_VLAN_BITMAP];
+	__u64 allowed_vlans_8021ad_bm[VF_VLAN_BITMAP];
+};
+
 struct ifla_vf_tx_rate {
 	__u32 vf;
 	__u32 rate; /* Max TX bandwidth in Mbps, 0 disables throttling */
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 4dd5939..460de44 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -911,8 +911,10 @@ static inline int rtnl_vfinfo_size(const struct net_device *dev,
 		int num_vfs = dev_num_vf(dev->dev.parent);
 		size_t size = nla_total_size(0);
 
-		if (num_vfs && (ext_filter_mask & RTEXT_FILTER_VF_EXT))
+		if (num_vfs && (ext_filter_mask & RTEXT_FILTER_VF_EXT)) {
 			num_vfs = 1;
+			size += nla_total_size(sizeof(struct ifla_vf_vlan_trunk));
+		}
 
 		size += num_vfs *
 			(nla_total_size(0) +
@@ -927,6 +929,7 @@ static inline int rtnl_vfinfo_size(const struct net_device *dev,
 			 nla_total_size(sizeof(struct ifla_vf_rate)) +
 			 nla_total_size(sizeof(struct ifla_vf_link_state)) +
 			 nla_total_size(sizeof(struct ifla_vf_rss_query_en)) +
+			 nla_total_size(sizeof(struct ifla_vf_vlan_mode)) +
 			 nla_total_size(0) + /* nest IFLA_VF_STATS */
 			 /* IFLA_VF_STATS_RX_PACKETS */
 			 nla_total_size_64bit(sizeof(__u64)) +
@@ -1216,7 +1219,9 @@ static noinline_for_stack int rtnl_fill_vfinfo(struct sk_buff *skb,
 	struct nlattr *vf, *vfstats, *vfvlanlist;
 	struct ifla_vf_link_state vf_linkstate;
 	struct ifla_vf_vlan_info vf_vlan_info;
+	struct ifla_vf_vlan_mode vf_vlan_mode;
 	struct ifla_vf_spoofchk vf_spoofchk;
+	struct ifla_vf_vlan_trunk *vf_trunk;
 	struct ifla_vf_tx_rate vf_tx_rate;
 	struct ifla_vf_stats vf_stats;
 	struct ifla_vf_trust vf_trust;
@@ -1224,25 +1229,36 @@ static noinline_for_stack int rtnl_fill_vfinfo(struct sk_buff *skb,
 	struct ifla_vf_rate vf_rate;
 	struct ifla_vf_mac vf_mac;
 	struct ifla_vf_broadcast vf_broadcast;
-	struct ifla_vf_info ivi;
+	struct ifla_vf_info *ivi;
+
+	ivi = kzalloc(sizeof(*ivi), GFP_KERNEL);
+	if (!ivi)
+		return -ENOMEM;
 
-	memset(&ivi, 0, sizeof(ivi));
+	vf_trunk = kzalloc(sizeof(*vf_trunk), GFP_KERNEL);
+	if (!vf_trunk) {
+		kfree(ivi);
+		return -ENOMEM;
+	}
 
 	/* Not all SR-IOV capable drivers support the
 	 * spoofcheck and "RSS query enable" query.  Preset to
 	 * -1 so the user space tool can detect that the driver
 	 * didn't report anything.
 	 */
-	ivi.spoofchk = -1;
-	ivi.rss_query_en = -1;
-	ivi.trusted = -1;
+	ivi->spoofchk = -1;
+	ivi->rss_query_en = -1;
+	ivi->trusted = -1;
+	memset(ivi->mac, 0, sizeof(ivi->mac));
+	memset(ivi->trunk_8021q, 0, sizeof(ivi->trunk_8021q));
+	memset(ivi->trunk_8021ad, 0, sizeof(ivi->trunk_8021ad));
 	/* The default value for VF link state is "auto"
 	 * IFLA_VF_LINK_STATE_AUTO which equals zero
 	 */
-	ivi.linkstate = 0;
+	ivi->linkstate = 0;
 	/* VLAN Protocol by default is 802.1Q */
-	ivi.vlan_proto = htons(ETH_P_8021Q);
-	if (dev->netdev_ops->ndo_get_vf_config(dev, vfs_num, &ivi))
+	ivi->vlan_proto = htons(ETH_P_8021Q);
+	if (dev->netdev_ops->ndo_get_vf_config(dev, vfs_num, ivi)) {
+		kfree(vf_trunk);
+		kfree(ivi);
 		return 0;
+	}
 
 	memset(&vf_vlan_info, 0, sizeof(vf_vlan_info));
@@ -1255,22 +1271,26 @@ static noinline_for_stack int rtnl_fill_vfinfo(struct sk_buff *skb,
 		vf_spoofchk.vf =
 		vf_linkstate.vf =
 		vf_rss_query_en.vf =
-		vf_trust.vf = ivi.vf;
-
-	memcpy(vf_mac.mac, ivi.mac, sizeof(ivi.mac));
-	memcpy(vf_broadcast.broadcast, dev->broadcast, dev->addr_len);
-	vf_vlan.vlan = ivi.vlan;
-	vf_vlan.qos = ivi.qos;
-	vf_vlan_info.vlan = ivi.vlan;
-	vf_vlan_info.qos = ivi.qos;
-	vf_vlan_info.vlan_proto = ivi.vlan_proto;
-	vf_tx_rate.rate = ivi.max_tx_rate;
-	vf_rate.min_tx_rate = ivi.min_tx_rate;
-	vf_rate.max_tx_rate = ivi.max_tx_rate;
-	vf_spoofchk.setting = ivi.spoofchk;
-	vf_linkstate.link_state = ivi.linkstate;
-	vf_rss_query_en.setting = ivi.rss_query_en;
-	vf_trust.setting = ivi.trusted;
+		vf_vlan_mode.vf =
+		vf_trunk->vf =
+		vf_trust.vf = ivi->vf;
+
+	memcpy(vf_mac.mac, ivi->mac, sizeof(ivi->mac));
+	memcpy(vf_broadcast.broadcast, dev->broadcast, dev->addr_len);
+	memcpy(vf_trunk->allowed_vlans_8021q_bm, ivi->trunk_8021q, sizeof(ivi->trunk_8021q));
+	memcpy(vf_trunk->allowed_vlans_8021ad_bm, ivi->trunk_8021ad, sizeof(ivi->trunk_8021ad));
+	vf_vlan_mode.mode = ivi->vlan_mode;
+	vf_vlan.vlan = ivi->vlan;
+	vf_vlan.qos = ivi->qos;
+	vf_vlan_info.vlan = ivi->vlan;
+	vf_vlan_info.qos = ivi->qos;
+	vf_vlan_info.vlan_proto = ivi->vlan_proto;
+	vf_tx_rate.rate = ivi->max_tx_rate;
+	vf_rate.min_tx_rate = ivi->min_tx_rate;
+	vf_rate.max_tx_rate = ivi->max_tx_rate;
+	vf_spoofchk.setting = ivi->spoofchk;
+	vf_linkstate.link_state = ivi->linkstate;
+	vf_rss_query_en.setting = ivi->rss_query_en;
+	vf_trust.setting = ivi->trusted;
 	vf = nla_nest_start_noflag(skb, IFLA_VF_INFO);
 	if (!vf)
 		goto nla_put_vfinfo_failure;
@@ -1289,7 +1309,11 @@ static noinline_for_stack int rtnl_fill_vfinfo(struct sk_buff *skb,
 		    sizeof(vf_rss_query_en),
 		    &vf_rss_query_en) ||
 	    nla_put(skb, IFLA_VF_TRUST,
-		    sizeof(vf_trust), &vf_trust))
+		    sizeof(vf_trust), &vf_trust) ||
+	    nla_put(skb, IFLA_VF_VLAN_MODE,
+		    sizeof(vf_vlan_mode), &vf_vlan_mode) ||
+	    (vf_ext && nla_put(skb, IFLA_VF_VLAN_TRUNK,
+			       sizeof(*vf_trunk), vf_trunk)))
 		goto nla_put_vf_failure;
 	vfvlanlist = nla_nest_start_noflag(skb, IFLA_VF_VLAN_LIST);
 	if (!vfvlanlist)
@@ -1328,12 +1352,16 @@ static noinline_for_stack int rtnl_fill_vfinfo(struct sk_buff *skb,
 	}
 	nla_nest_end(skb, vfstats);
 	nla_nest_end(skb, vf);
+	kfree(vf_trunk);
+	kfree(ivi);
 	return 0;
 
 nla_put_vf_failure:
 	nla_nest_cancel(skb, vf);
 nla_put_vfinfo_failure:
 	nla_nest_cancel(skb, vfinfo);
+	kfree(vf_trunk);
+	kfree(ivi);
 	return -EMSGSIZE;
 }
 
@@ -1847,6 +1875,9 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb,
 	[IFLA_VF_TRUST]		= { .len = sizeof(struct ifla_vf_trust) },
 	[IFLA_VF_IB_NODE_GUID]	= { .len = sizeof(struct ifla_vf_guid) },
 	[IFLA_VF_IB_PORT_GUID]	= { .len = sizeof(struct ifla_vf_guid) },
+	[IFLA_VF_VLAN_MODE]	= { .len = sizeof(struct ifla_vf_vlan_mode) },
+	[IFLA_VF_VLAN_RANGE]	= { .len = sizeof(struct ifla_vf_vlan_range) },
+	[IFLA_VF_VLAN_TRUNK]	= { .len = sizeof(struct ifla_vf_vlan_trunk) },
 };
 
 static const struct nla_policy ifla_port_policy[IFLA_PORT_MAX+1] = {
@@ -2289,6 +2320,26 @@ static int do_setvfinfo(struct net_device *dev, struct nlattr **tb)
 			return err;
 	}
 
+	if (tb[IFLA_VF_VLAN_RANGE]) {
+		struct ifla_vf_vlan_range *ivvr =
+					nla_data(tb[IFLA_VF_VLAN_RANGE]);
+		bool add = !!ivvr->setting;
+
+		err = -EOPNOTSUPP;
+		if (add && ops->ndo_add_vf_vlan_trunk_range)
+			err = ops->ndo_add_vf_vlan_trunk_range(dev, ivvr->vf,
+							       ivvr->start_vid,
+							       ivvr->end_vid,
+							       ivvr->vlan_proto);
+		else if (!add && ops->ndo_del_vf_vlan_trunk_range)
+			err = ops->ndo_del_vf_vlan_trunk_range(dev, ivvr->vf,
+							       ivvr->start_vid,
+							       ivvr->end_vid,
+							       ivvr->vlan_proto);
+		if (err < 0)
+			return err;
+	}
+
 	if (tb[IFLA_VF_VLAN_LIST]) {
 		struct ifla_vf_vlan_info *ivvl[MAX_VLAN_LIST_LEN];
 		struct nlattr *attr;
@@ -2320,21 +2371,30 @@ static int do_setvfinfo(struct net_device *dev, struct nlattr **tb)
 
 	if (tb[IFLA_VF_TX_RATE]) {
 		struct ifla_vf_tx_rate *ivt = nla_data(tb[IFLA_VF_TX_RATE]);
-		struct ifla_vf_info ivf;
+		struct ifla_vf_info *ivf;
+
+		ivf = kzalloc(sizeof(*ivf), GFP_KERNEL);
+		if (!ivf)
+			return -ENOMEM;
 
 		err = -EOPNOTSUPP;
 		if (ops->ndo_get_vf_config)
-			err = ops->ndo_get_vf_config(dev, ivt->vf, &ivf);
-		if (err < 0)
+			err = ops->ndo_get_vf_config(dev, ivt->vf, ivf);
+		if (err < 0) {
+			kfree(ivf);
 			return err;
+		}
 
 		err = -EOPNOTSUPP;
 		if (ops->ndo_set_vf_rate)
 			err = ops->ndo_set_vf_rate(dev, ivt->vf,
-						   ivf.min_tx_rate,
+						   ivf->min_tx_rate,
 						   ivt->rate);
-		if (err < 0)
+		if (err < 0) {
+			kfree(ivf);
 			return err;
+		}
+		kfree(ivf);
 	}
 
 	if (tb[IFLA_VF_RATE]) {
-- 
1.8.3.1



* [PATCH net-next v2 3/3] net/mlx5: Add SRIOV VGT+ support
  2019-10-31 19:47 [PATCH net-next v2 0/3] VGT+ support Ariel Levkovich
  2019-10-31 19:47 ` [PATCH net-next v2 1/3] net: Support querying specific VF properties Ariel Levkovich
  2019-10-31 19:47 ` [PATCH net-next v2 2/3] net: Add SRIOV VGT+ support Ariel Levkovich
@ 2019-10-31 19:47 ` Ariel Levkovich
  2019-10-31 20:31 ` [PATCH net-next v2 0/3] " David Miller
  2019-11-01  0:23 ` Jakub Kicinski
  4 siblings, 0 replies; 24+ messages in thread
From: Ariel Levkovich @ 2019-10-31 19:47 UTC (permalink / raw)
  To: netdev
  Cc: Saeed Mahameed, sd, sbrivio, nikolay, Jiri Pirko, dsahern,
	stephen, Ariel Levkovich

Implement the VGT+ feature via ACL tables.
The ACL tables hold only the rules that are actually needed: the
intersection of the vlan-id list requested by the guest and the
vlan-id list allowed by the administrator.
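The intersection can be modeled in userspace with a word-wise bitmap AND, the same shape as the kernel's bitmap_and() used on the vport trunk bitmaps; the vlan numbers below are hypothetical, for illustration only:

```c
#include <assert.h>
#include <stdbool.h>

#define VLAN_N_VID 4096
#define TRUNK_WORDS (VLAN_N_VID / 64)

/* Word-wise AND over 4096-bit vlan maps, modeling bitmap_and(). */
static void acl_intersect(unsigned long long *dst,
			  const unsigned long long *requested,
			  const unsigned long long *allowed)
{
	for (int i = 0; i < TRUNK_WORDS; i++)
		dst[i] = requested[i] & allowed[i];
}

/* Would an ACL rule for @vid be programmed when the guest requests
 * vlans 5 and 10 but the admin trunk only allows 10-20?
 * (Hypothetical numbers, chosen to show both outcomes.) */
static bool rule_installed(int vid)
{
	unsigned long long req[TRUNK_WORDS] = { 0 };
	unsigned long long allowed[TRUNK_WORDS] = { 0 };
	unsigned long long acl[TRUNK_WORDS];

	req[5 / 64] |= 1ULL << 5;
	req[10 / 64] |= 1ULL << 10;
	for (int v = 10; v <= 20; v++)
		allowed[v / 64] |= 1ULL << (v % 64);

	acl_intersect(acl, req, allowed);
	return acl[vid / 64] & (1ULL << (vid % 64));
}
```

A rule exists only for vlans that appear in both sets: requested-but-not-allowed (vlan 5) and allowed-but-not-requested (vlan 15) vlans get no ACL entry.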

Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |  30 ++
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.c  | 600 ++++++++++++++++-----
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.h  |  27 +-
 .../ethernet/mellanox/mlx5/core/eswitch_offloads.c |   8 +-
 4 files changed, 533 insertions(+), 132 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 7569287..9253bfd 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -4061,6 +4061,34 @@ static int mlx5e_set_vf_vlan(struct net_device *dev, int vf, u16 vlan, u8 qos,
 					   vlan, qos);
 }
 
+static int mlx5e_add_vf_vlan_trunk_range(struct net_device *dev, int vf,
+					 u16 start_vid, u16 end_vid,
+					 __be16 vlan_proto)
+{
+	struct mlx5e_priv *priv = netdev_priv(dev);
+	struct mlx5_core_dev *mdev = priv->mdev;
+
+	if (vlan_proto != htons(ETH_P_8021Q))
+		return -EPROTONOSUPPORT;
+
+	return mlx5_eswitch_add_vport_trunk_range(mdev->priv.eswitch, vf + 1,
+						  start_vid, end_vid);
+}
+
+static int mlx5e_del_vf_vlan_trunk_range(struct net_device *dev, int vf,
+					 u16 start_vid, u16 end_vid,
+					 __be16 vlan_proto)
+{
+	struct mlx5e_priv *priv = netdev_priv(dev);
+	struct mlx5_core_dev *mdev = priv->mdev;
+
+	if (vlan_proto != htons(ETH_P_8021Q))
+		return -EPROTONOSUPPORT;
+
+	return mlx5_eswitch_del_vport_trunk_range(mdev->priv.eswitch, vf + 1,
+						  start_vid, end_vid);
+}
+
 static int mlx5e_set_vf_spoofchk(struct net_device *dev, int vf, bool setting)
 {
 	struct mlx5e_priv *priv = netdev_priv(dev);
@@ -4589,6 +4617,8 @@ static int mlx5e_bridge_setlink(struct net_device *dev, struct nlmsghdr *nlh,
 	/* SRIOV E-Switch NDOs */
 	.ndo_set_vf_mac          = mlx5e_set_vf_mac,
 	.ndo_set_vf_vlan         = mlx5e_set_vf_vlan,
+	.ndo_add_vf_vlan_trunk_range = mlx5e_add_vf_vlan_trunk_range,
+	.ndo_del_vf_vlan_trunk_range = mlx5e_del_vf_vlan_trunk_range,
 	.ndo_set_vf_spoofchk     = mlx5e_set_vf_spoofchk,
 	.ndo_set_vf_trust        = mlx5e_set_vf_trust,
 	.ndo_set_vf_rate         = mlx5e_set_vf_rate,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
index 7baade9..911421e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
@@ -58,6 +58,11 @@ struct vport_addr {
 	bool mc_promisc;
 };
 
+struct mlx5_acl_vlan {
+	struct mlx5_flow_handle	*acl_vlan_rule;
+	struct list_head	list;
+};
+
 static void esw_destroy_legacy_fdb_table(struct mlx5_eswitch *esw);
 static void esw_cleanup_vepa_rules(struct mlx5_eswitch *esw);
 
@@ -452,6 +457,7 @@ static void esw_destroy_legacy_table(struct mlx5_eswitch *esw)
 
 #define MLX5_LEGACY_SRIOV_VPORT_EVENTS (MLX5_VPORT_UC_ADDR_CHANGE | \
 					MLX5_VPORT_MC_ADDR_CHANGE | \
+					MLX5_VPORT_VLAN_CHANGE | \
 					MLX5_VPORT_PROMISC_CHANGE)
 
 static int esw_legacy_enable(struct mlx5_eswitch *esw)
@@ -793,6 +799,94 @@ static void esw_update_vport_addr_list(struct mlx5_eswitch *esw,
 	kfree(mac_list);
 }
 
+static void esw_update_acl_trunk_bitmap(struct mlx5_eswitch *esw, u32 vport_num)
+{
+	struct mlx5_vport *vport = &esw->vports[vport_num];
+
+	bitmap_and(vport->acl_vlan_8021q_bitmap, vport->req_vlan_bitmap,
+		   vport->info.vlan_trunk_8021q_bitmap, VLAN_N_VID);
+}
+
+static int mlx5_query_nic_vport_vlans(struct mlx5_core_dev *dev, u32 vport,
+				      unsigned long *vlans)
+{
+	u32 in[MLX5_ST_SZ_DW(query_nic_vport_context_in)];
+	void *nic_vport_ctx;
+	int req_list_size;
+	int out_sz;
+	void *out;
+	int err;
+	int i;
+
+	req_list_size = 1 << MLX5_CAP_GEN(dev, log_max_vlan_list);
+	out_sz = MLX5_ST_SZ_BYTES(modify_nic_vport_context_in) +
+		req_list_size * MLX5_ST_SZ_BYTES(vlan_layout);
+
+	memset(in, 0, sizeof(in));
+	out = kzalloc(out_sz, GFP_KERNEL);
+	if (!out)
+		return -ENOMEM;
+
+	MLX5_SET(query_nic_vport_context_in, in, opcode,
+		 MLX5_CMD_OP_QUERY_NIC_VPORT_CONTEXT);
+	MLX5_SET(query_nic_vport_context_in, in, allowed_list_type,
+		 MLX5_NVPRT_LIST_TYPE_VLAN);
+	MLX5_SET(query_nic_vport_context_in, in, vport_number, vport);
+
+	if (vport)
+		MLX5_SET(query_nic_vport_context_in, in, other_vport, 1);
+
+	err = mlx5_cmd_exec(dev, in, sizeof(in), out, out_sz);
+	if (err)
+		goto out;
+
+	nic_vport_ctx = MLX5_ADDR_OF(query_nic_vport_context_out, out,
+				     nic_vport_context);
+	req_list_size = MLX5_GET(nic_vport_context, nic_vport_ctx,
+				 allowed_list_size);
+
+	for (i = 0; i < req_list_size; i++) {
+		void *vlan_addr = MLX5_ADDR_OF(nic_vport_context,
+					       nic_vport_ctx,
+					       current_uc_mac_address[i]);
+		bitmap_set(vlans, MLX5_GET(vlan_layout, vlan_addr, vlan), 1);
+	}
+out:
+	kfree(out);
+	return err;
+}
+
+static int esw_vport_egress_config(struct mlx5_eswitch *esw,
+				   struct mlx5_vport *vport);
+static int esw_vport_ingress_config(struct mlx5_eswitch *esw,
+				    struct mlx5_vport *vport);
+
+/* Sync vport vlan list from vport context */
+static void esw_update_vport_vlan_list(struct mlx5_eswitch *esw, u32 vport_num)
+{
+	struct mlx5_vport *vport = &esw->vports[vport_num];
+	DECLARE_BITMAP(prev_vlans_bitmap, VLAN_N_VID);
+	int err;
+
+	bitmap_copy(prev_vlans_bitmap, vport->req_vlan_bitmap, VLAN_N_VID);
+	bitmap_zero(vport->req_vlan_bitmap, VLAN_N_VID);
+
+	if (!vport->enabled)
+		return;
+
+	err = mlx5_query_nic_vport_vlans(esw->dev, vport_num, vport->req_vlan_bitmap);
+	if (err)
+		return;
+
+	bitmap_xor(prev_vlans_bitmap, prev_vlans_bitmap, vport->req_vlan_bitmap, VLAN_N_VID);
+	if (!bitmap_weight(prev_vlans_bitmap, VLAN_N_VID))
+		return;
+
+	esw_update_acl_trunk_bitmap(esw, vport_num);
+	esw_vport_egress_config(esw, vport);
+	esw_vport_ingress_config(esw, vport);
+}
+
 /* Sync vport UC/MC list from vport context
  * Must be called after esw_update_vport_addr_list
  */
@@ -920,6 +1014,9 @@ static void esw_vport_change_handle_locked(struct mlx5_vport *vport)
 	if (vport->enabled_events & MLX5_VPORT_MC_ADDR_CHANGE)
 		esw_update_vport_addr_list(esw, vport, MLX5_NVPRT_LIST_TYPE_MC);
 
+	if (vport->enabled_events & MLX5_VPORT_VLAN_CHANGE)
+		esw_update_vport_vlan_list(esw, vport->vport);
+
 	if (vport->enabled_events & MLX5_VPORT_PROMISC_CHANGE) {
 		esw_update_vport_rx_mode(esw, vport);
 		if (!IS_ERR_OR_NULL(vport->allmulti_rule))
@@ -950,18 +1047,20 @@ int esw_vport_enable_egress_acl(struct mlx5_eswitch *esw,
 				struct mlx5_vport *vport)
 {
 	int inlen = MLX5_ST_SZ_BYTES(create_flow_group_in);
+	struct mlx5_flow_group *untagged_grp = NULL;
 	struct mlx5_flow_group *vlan_grp = NULL;
 	struct mlx5_flow_group *drop_grp = NULL;
 	struct mlx5_core_dev *dev = esw->dev;
 	struct mlx5_flow_namespace *root_ns;
 	struct mlx5_flow_table *acl;
+	/* The egress acl table contains 3 groups:
+	 * 1)Allow tagged traffic with vlan_tag=vst_vlan_id/vgt+_vlan_id
+	 * 2)Allow untagged traffic
+	 * 3)Drop all other traffic
+	 */
+	int table_size = VLAN_N_VID + 2;
 	void *match_criteria;
 	u32 *flow_group_in;
-	/* The egress acl table contains 2 rules:
-	 * 1)Allow traffic with vlan_tag=vst_vlan_id
-	 * 2)Drop all other traffic.
-	 */
-	int table_size = 2;
 	int err = 0;
 
 	if (!MLX5_CAP_ESW_EGRESS_ACL(dev, ft_support))
@@ -994,11 +1093,25 @@ int esw_vport_enable_egress_acl(struct mlx5_eswitch *esw,
 
 	MLX5_SET(create_flow_group_in, flow_group_in, match_criteria_enable, MLX5_MATCH_OUTER_HEADERS);
 	match_criteria = MLX5_ADDR_OF(create_flow_group_in, flow_group_in, match_criteria);
+
+	/* Create flow group for allowed untagged flow rule */
 	MLX5_SET_TO_ONES(fte_match_param, match_criteria, outer_headers.cvlan_tag);
-	MLX5_SET_TO_ONES(fte_match_param, match_criteria, outer_headers.first_vid);
 	MLX5_SET(create_flow_group_in, flow_group_in, start_flow_index, 0);
 	MLX5_SET(create_flow_group_in, flow_group_in, end_flow_index, 0);
 
+	untagged_grp = mlx5_create_flow_group(acl, flow_group_in);
+	if (IS_ERR(untagged_grp)) {
+		err = PTR_ERR(untagged_grp);
+		esw_warn(dev, "Failed to create E-Switch vport[%d] egress untagged flow group, err(%d)\n",
+			 vport->vport, err);
+		goto out;
+	}
+
+	/* Create flow group for allowed tagged flow rules */
+	MLX5_SET_TO_ONES(fte_match_param, match_criteria, outer_headers.first_vid);
+	MLX5_SET(create_flow_group_in, flow_group_in, start_flow_index, 1);
+	MLX5_SET(create_flow_group_in, flow_group_in, end_flow_index, VLAN_N_VID);
+
 	vlan_grp = mlx5_create_flow_group(acl, flow_group_in);
 	if (IS_ERR(vlan_grp)) {
 		err = PTR_ERR(vlan_grp);
@@ -1007,9 +1120,10 @@ int esw_vport_enable_egress_acl(struct mlx5_eswitch *esw,
 		goto out;
 	}
 
+	/* Create flow group for drop rule */
 	memset(flow_group_in, 0, inlen);
-	MLX5_SET(create_flow_group_in, flow_group_in, start_flow_index, 1);
-	MLX5_SET(create_flow_group_in, flow_group_in, end_flow_index, 1);
+	MLX5_SET(create_flow_group_in, flow_group_in, start_flow_index, VLAN_N_VID + 1);
+	MLX5_SET(create_flow_group_in, flow_group_in, end_flow_index, VLAN_N_VID + 1);
 	drop_grp = mlx5_create_flow_group(acl, flow_group_in);
 	if (IS_ERR(drop_grp)) {
 		err = PTR_ERR(drop_grp);
@@ -1021,27 +1135,45 @@ int esw_vport_enable_egress_acl(struct mlx5_eswitch *esw,
 	vport->egress.acl = acl;
 	vport->egress.drop_grp = drop_grp;
 	vport->egress.allowed_vlans_grp = vlan_grp;
+	vport->egress.allow_untagged_grp = untagged_grp;
+
 out:
+	if (err) {
+		if (!IS_ERR_OR_NULL(vlan_grp))
+			mlx5_destroy_flow_group(vlan_grp);
+		if (!IS_ERR_OR_NULL(untagged_grp))
+			mlx5_destroy_flow_group(untagged_grp);
+		if (!IS_ERR_OR_NULL(acl))
+			mlx5_destroy_flow_table(acl);
+	}
 	kvfree(flow_group_in);
-	if (err && !IS_ERR_OR_NULL(vlan_grp))
-		mlx5_destroy_flow_group(vlan_grp);
-	if (err && !IS_ERR_OR_NULL(acl))
-		mlx5_destroy_flow_table(acl);
 	return err;
 }
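The flow-index arithmetic in the reworked egress ACL is easy to get wrong, so here is the layout the hunk above assumes, written out as a small sketch (the range helpers are illustrative, not part of the driver):

```c
#include <assert.h>

#define VLAN_N_VID 4096

/* Flow-index layout of the egress ACL after this patch:
 * index 0 holds the single allow-untagged entry, indices
 * 1..VLAN_N_VID hold one allow-tagged entry per possible VLAN ID,
 * and index VLAN_N_VID + 1 holds the catch-all drop entry. */
struct grp_range { int start; int end; };

static struct grp_range egress_untagged_grp(void)
{
	return (struct grp_range){ .start = 0, .end = 0 };
}

static struct grp_range egress_tagged_grp(void)
{
	return (struct grp_range){ .start = 1, .end = VLAN_N_VID };
}

static struct grp_range egress_drop_grp(void)
{
	return (struct grp_range){ .start = VLAN_N_VID + 1, .end = VLAN_N_VID + 1 };
}

/* Matches "int table_size = VLAN_N_VID + 2" in the patch. */
static int egress_table_size(void)
{
	return VLAN_N_VID + 2;
}
```

The three ranges are contiguous and exactly fill the table, which is what keeps the start/end_flow_index values in the hunk consistent with table_size.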
 
 void esw_vport_cleanup_egress_rules(struct mlx5_eswitch *esw,
 				    struct mlx5_vport *vport)
 {
-	if (!IS_ERR_OR_NULL(vport->egress.allowed_vlan)) {
-		mlx5_del_flow_rules(vport->egress.allowed_vlan);
-		vport->egress.allowed_vlan = NULL;
+	struct mlx5_acl_vlan *trunk_vlan_rule, *tmp;
+
+	if (!IS_ERR_OR_NULL(vport->egress.allowed_vst_vlan)) {
+		mlx5_del_flow_rules(vport->egress.allowed_vst_vlan);
+		vport->egress.allowed_vst_vlan = NULL;
+	}
+
+	list_for_each_entry_safe(trunk_vlan_rule, tmp,
+				 &vport->egress.legacy.allowed_vlans_rules, list) {
+		mlx5_del_flow_rules(trunk_vlan_rule->acl_vlan_rule);
+		list_del(&trunk_vlan_rule->list);
+		kfree(trunk_vlan_rule);
 	}
 
 	if (!IS_ERR_OR_NULL(vport->egress.legacy.drop_rule)) {
 		mlx5_del_flow_rules(vport->egress.legacy.drop_rule);
 		vport->egress.legacy.drop_rule = NULL;
 	}
+
+	if (!IS_ERR_OR_NULL(vport->egress.legacy.allow_untagged_rule)) {
+		mlx5_del_flow_rules(vport->egress.legacy.allow_untagged_rule);
+		vport->egress.legacy.allow_untagged_rule = NULL;
+	}
 }
 
 void esw_vport_disable_egress_acl(struct mlx5_eswitch *esw,
@@ -1053,9 +1185,11 @@ void esw_vport_disable_egress_acl(struct mlx5_eswitch *esw,
 	esw_debug(esw->dev, "Destroy vport[%d] E-Switch egress ACL\n", vport->vport);
 
 	esw_vport_cleanup_egress_rules(esw, vport);
+	mlx5_destroy_flow_group(vport->egress.allow_untagged_grp);
 	mlx5_destroy_flow_group(vport->egress.allowed_vlans_grp);
 	mlx5_destroy_flow_group(vport->egress.drop_grp);
 	mlx5_destroy_flow_table(vport->egress.acl);
+	vport->egress.allow_untagged_grp = NULL;
 	vport->egress.allowed_vlans_grp = NULL;
 	vport->egress.drop_grp = NULL;
 	vport->egress.acl = NULL;
@@ -1065,12 +1199,21 @@ void esw_vport_disable_egress_acl(struct mlx5_eswitch *esw,
 esw_vport_create_legacy_ingress_acl_groups(struct mlx5_eswitch *esw,
 					   struct mlx5_vport *vport)
 {
+	bool need_vlan_filter = !!bitmap_weight(vport->info.vlan_trunk_8021q_bitmap,
+						VLAN_N_VID);
 	int inlen = MLX5_ST_SZ_BYTES(create_flow_group_in);
+	struct mlx5_flow_group *untagged_spoof_grp = NULL;
+	struct mlx5_flow_table *acl = vport->ingress.acl;
+	struct mlx5_flow_group *tagged_spoof_grp = NULL;
+	struct mlx5_flow_group *drop_grp = NULL;
 	struct mlx5_core_dev *dev = esw->dev;
-	struct mlx5_flow_group *g;
 	void *match_criteria;
 	u32 *flow_group_in;
-	int err;
+	int allow_grp_sz = 0;
+	int err = 0;
+
+	if (!acl)
+		return -EINVAL;
 
 	flow_group_in = kvzalloc(inlen, GFP_KERNEL);
 	if (!flow_group_in)
@@ -1079,83 +1222,68 @@ void esw_vport_disable_egress_acl(struct mlx5_eswitch *esw,
 	match_criteria = MLX5_ADDR_OF(create_flow_group_in, flow_group_in, match_criteria);
 
 	MLX5_SET(create_flow_group_in, flow_group_in, match_criteria_enable, MLX5_MATCH_OUTER_HEADERS);
-	MLX5_SET_TO_ONES(fte_match_param, match_criteria, outer_headers.cvlan_tag);
-	MLX5_SET_TO_ONES(fte_match_param, match_criteria, outer_headers.smac_47_16);
-	MLX5_SET_TO_ONES(fte_match_param, match_criteria, outer_headers.smac_15_0);
+
+	if (vport->info.vlan || vport->info.qos || need_vlan_filter)
+		MLX5_SET_TO_ONES(fte_match_param, match_criteria, outer_headers.cvlan_tag);
+
+	if (vport->info.spoofchk) {
+		MLX5_SET_TO_ONES(fte_match_param, match_criteria, outer_headers.smac_47_16);
+		MLX5_SET_TO_ONES(fte_match_param, match_criteria, outer_headers.smac_15_0);
+	}
+
 	MLX5_SET(create_flow_group_in, flow_group_in, start_flow_index, 0);
 	MLX5_SET(create_flow_group_in, flow_group_in, end_flow_index, 0);
 
-	g = mlx5_create_flow_group(vport->ingress.acl, flow_group_in);
-	if (IS_ERR(g)) {
-		err = PTR_ERR(g);
+	untagged_spoof_grp = mlx5_create_flow_group(acl, flow_group_in);
+	if (IS_ERR(untagged_spoof_grp)) {
+		err = PTR_ERR(untagged_spoof_grp);
 		esw_warn(dev, "vport[%d] ingress create untagged spoofchk flow group, err(%d)\n",
 			 vport->vport, err);
-		goto spoof_err;
+		goto out;
 	}
-	vport->ingress.legacy.allow_untagged_spoofchk_grp = g;
+	allow_grp_sz += 1;
 
-	memset(flow_group_in, 0, inlen);
-	MLX5_SET(create_flow_group_in, flow_group_in, match_criteria_enable, MLX5_MATCH_OUTER_HEADERS);
-	MLX5_SET_TO_ONES(fte_match_param, match_criteria, outer_headers.cvlan_tag);
+	if (!need_vlan_filter)
+		goto drop_grp;
+
+	MLX5_SET_TO_ONES(fte_match_param, match_criteria, outer_headers.first_vid);
 	MLX5_SET(create_flow_group_in, flow_group_in, start_flow_index, 1);
-	MLX5_SET(create_flow_group_in, flow_group_in, end_flow_index, 1);
+	MLX5_SET(create_flow_group_in, flow_group_in, end_flow_index, VLAN_N_VID);
 
-	g = mlx5_create_flow_group(vport->ingress.acl, flow_group_in);
-	if (IS_ERR(g)) {
-		err = PTR_ERR(g);
-		esw_warn(dev, "vport[%d] ingress create untagged flow group, err(%d)\n",
+	tagged_spoof_grp = mlx5_create_flow_group(acl, flow_group_in);
+	if (IS_ERR(tagged_spoof_grp)) {
+		err = PTR_ERR(tagged_spoof_grp);
+		esw_warn(dev, "Failed to create E-Switch vport[%d] ingress tagged spoofchk flow group, err(%d)\n",
 			 vport->vport, err);
-		goto untagged_err;
+		goto out;
 	}
-	vport->ingress.legacy.allow_untagged_only_grp = g;
+	allow_grp_sz += VLAN_N_VID;
 
+drop_grp:
 	memset(flow_group_in, 0, inlen);
-	MLX5_SET(create_flow_group_in, flow_group_in, match_criteria_enable, MLX5_MATCH_OUTER_HEADERS);
-	MLX5_SET_TO_ONES(fte_match_param, match_criteria, outer_headers.smac_47_16);
-	MLX5_SET_TO_ONES(fte_match_param, match_criteria, outer_headers.smac_15_0);
-	MLX5_SET(create_flow_group_in, flow_group_in, start_flow_index, 2);
-	MLX5_SET(create_flow_group_in, flow_group_in, end_flow_index, 2);
+	MLX5_SET(create_flow_group_in, flow_group_in, start_flow_index, allow_grp_sz);
+	MLX5_SET(create_flow_group_in, flow_group_in, end_flow_index, allow_grp_sz);
 
-	g = mlx5_create_flow_group(vport->ingress.acl, flow_group_in);
-	if (IS_ERR(g)) {
-		err = PTR_ERR(g);
-		esw_warn(dev, "vport[%d] ingress create spoofchk flow group, err(%d)\n",
+	drop_grp = mlx5_create_flow_group(vport->ingress.acl, flow_group_in);
+	if (IS_ERR(drop_grp)) {
+		err = PTR_ERR(drop_grp);
+		esw_warn(dev, "vport[%d] ingress create drop flow group, err(%d)\n",
 			 vport->vport, err);
-		goto allow_spoof_err;
+		goto out;
 	}
-	vport->ingress.legacy.allow_spoofchk_only_grp = g;
 
-	memset(flow_group_in, 0, inlen);
-	MLX5_SET(create_flow_group_in, flow_group_in, start_flow_index, 3);
-	MLX5_SET(create_flow_group_in, flow_group_in, end_flow_index, 3);
+	vport->ingress.legacy.allow_untagged_spoofchk_grp = untagged_spoof_grp;
+	vport->ingress.legacy.allow_tagged_spoofchk_grp = tagged_spoof_grp;
+	vport->ingress.legacy.drop_grp = drop_grp;
 
-	g = mlx5_create_flow_group(vport->ingress.acl, flow_group_in);
-	if (IS_ERR(g)) {
-		err = PTR_ERR(g);
-		esw_warn(dev, "vport[%d] ingress create drop flow group, err(%d)\n",
-			 vport->vport, err);
-		goto drop_err;
+out:
+	if (err) {
+		if (!IS_ERR_OR_NULL(tagged_spoof_grp))
+			mlx5_destroy_flow_group(tagged_spoof_grp);
+		if (!IS_ERR_OR_NULL(untagged_spoof_grp))
+			mlx5_destroy_flow_group(untagged_spoof_grp);
 	}
-	vport->ingress.legacy.drop_grp = g;
-	kvfree(flow_group_in);
-	return 0;
 
-drop_err:
-	if (!IS_ERR_OR_NULL(vport->ingress.legacy.allow_spoofchk_only_grp)) {
-		mlx5_destroy_flow_group(vport->ingress.legacy.allow_spoofchk_only_grp);
-		vport->ingress.legacy.allow_spoofchk_only_grp = NULL;
-	}
-allow_spoof_err:
-	if (!IS_ERR_OR_NULL(vport->ingress.legacy.allow_untagged_only_grp)) {
-		mlx5_destroy_flow_group(vport->ingress.legacy.allow_untagged_only_grp);
-		vport->ingress.legacy.allow_untagged_only_grp = NULL;
-	}
-untagged_err:
-	if (!IS_ERR_OR_NULL(vport->ingress.legacy.allow_untagged_spoofchk_grp)) {
-		mlx5_destroy_flow_group(vport->ingress.legacy.allow_untagged_spoofchk_grp);
-		vport->ingress.legacy.allow_untagged_spoofchk_grp = NULL;
-	}
-spoof_err:
 	kvfree(flow_group_in);
 	return err;
 }
@@ -1207,14 +1335,23 @@ void esw_vport_destroy_ingress_acl_table(struct mlx5_vport *vport)
 void esw_vport_cleanup_ingress_rules(struct mlx5_eswitch *esw,
 				     struct mlx5_vport *vport)
 {
+	struct mlx5_acl_vlan *trunk_vlan_rule, *tmp;
+
 	if (vport->ingress.legacy.drop_rule) {
 		mlx5_del_flow_rules(vport->ingress.legacy.drop_rule);
 		vport->ingress.legacy.drop_rule = NULL;
 	}
 
-	if (vport->ingress.allow_rule) {
-		mlx5_del_flow_rules(vport->ingress.allow_rule);
-		vport->ingress.allow_rule = NULL;
+	list_for_each_entry_safe(trunk_vlan_rule, tmp,
+				 &vport->ingress.legacy.allowed_vlans_rules, list) {
+		mlx5_del_flow_rules(trunk_vlan_rule->acl_vlan_rule);
+		list_del(&trunk_vlan_rule->list);
+		kfree(trunk_vlan_rule);
+	}
+
+	if (vport->ingress.allow_untagged_rule) {
+		mlx5_del_flow_rules(vport->ingress.allow_untagged_rule);
+		vport->ingress.allow_untagged_rule = NULL;
 	}
 }
 
@@ -1227,18 +1364,14 @@ static void esw_vport_disable_legacy_ingress_acl(struct mlx5_eswitch *esw,
 	esw_debug(esw->dev, "Destroy vport[%d] E-Switch ingress ACL\n", vport->vport);
 
 	esw_vport_cleanup_ingress_rules(esw, vport);
-	if (vport->ingress.legacy.allow_spoofchk_only_grp) {
-		mlx5_destroy_flow_group(vport->ingress.legacy.allow_spoofchk_only_grp);
-		vport->ingress.legacy.allow_spoofchk_only_grp = NULL;
-	}
-	if (vport->ingress.legacy.allow_untagged_only_grp) {
-		mlx5_destroy_flow_group(vport->ingress.legacy.allow_untagged_only_grp);
-		vport->ingress.legacy.allow_untagged_only_grp = NULL;
-	}
 	if (vport->ingress.legacy.allow_untagged_spoofchk_grp) {
 		mlx5_destroy_flow_group(vport->ingress.legacy.allow_untagged_spoofchk_grp);
 		vport->ingress.legacy.allow_untagged_spoofchk_grp = NULL;
 	}
+	if (vport->ingress.legacy.allow_tagged_spoofchk_grp) {
+		mlx5_destroy_flow_group(vport->ingress.legacy.allow_tagged_spoofchk_grp);
+		vport->ingress.legacy.allow_tagged_spoofchk_grp = NULL;
+	}
 	if (vport->ingress.legacy.drop_grp) {
 		mlx5_destroy_flow_group(vport->ingress.legacy.drop_grp);
 		vport->ingress.legacy.drop_grp = NULL;
@@ -1249,33 +1382,47 @@ static void esw_vport_disable_legacy_ingress_acl(struct mlx5_eswitch *esw,
 static int esw_vport_ingress_config(struct mlx5_eswitch *esw,
 				    struct mlx5_vport *vport)
 {
+	bool need_vlan_filter = !!bitmap_weight(vport->info.vlan_trunk_8021q_bitmap,
+						VLAN_N_VID);
 	struct mlx5_fc *counter = vport->ingress.legacy.drop_counter;
 	struct mlx5_flow_destination drop_ctr_dst = {0};
 	struct mlx5_flow_destination *dst = NULL;
+	struct mlx5_acl_vlan *trunk_vlan_rule;
 	struct mlx5_flow_act flow_act = {0};
 	struct mlx5_flow_spec *spec;
+	bool need_acl_table;
 	int dest_num = 0;
+	u16 vlan_id = 0;
 	int err = 0;
 	u8 *smac_v;
 
-	/* The ingress acl table contains 4 groups
+	/* The ingress acl table contains 3 groups
 	 * (2 active rules at the same time -
-	 *      1 allow rule from one of the first 3 groups.
-	 *      1 drop rule from the last group):
-	 * 1)Allow untagged traffic with smac=original mac.
-	 * 2)Allow untagged traffic.
-	 * 3)Allow traffic with smac=original mac.
-	 * 4)Drop all other traffic.
+	 *	1 allow rule from one of the first 2 groups.
+	 *	1 drop rule from the last group):
+	 * 1)Allow untagged traffic with/without smac=original mac.
+	 * 2)Allow tagged (VLAN trunk list) traffic with/without smac=original mac.
+	 * 3)Drop all other traffic.
 	 */
-	int table_size = 4;
+	int table_size = need_vlan_filter ? 8192 : 4;
 
 	esw_vport_cleanup_ingress_rules(esw, vport);
 
-	if (!vport->info.vlan && !vport->info.qos && !vport->info.spoofchk) {
+	need_acl_table = vport->info.vlan || vport->info.qos ||
+			 vport->info.spoofchk || need_vlan_filter;
+
+	if (!need_acl_table) {
 		esw_vport_disable_legacy_ingress_acl(esw, vport);
 		return 0;
 	}
 
+	if ((vport->info.vlan || vport->info.qos) && need_vlan_filter) {
+		mlx5_core_warn(esw->dev,
+			       "vport[%d] configure ingress rules failed, cannot enable both VGT+ and VST\n",
+			       vport->vport);
+		return -EPERM;
+	}
+
 	if (!vport->ingress.acl) {
 		err = esw_vport_create_ingress_acl_table(esw, vport, table_size);
 		if (err) {
@@ -1300,7 +1447,10 @@ static int esw_vport_ingress_config(struct mlx5_eswitch *esw,
 		goto out;
 	}
 
-	if (vport->info.vlan || vport->info.qos)
+	spec->match_criteria_enable = MLX5_MATCH_OUTER_HEADERS;
+	flow_act.action = MLX5_FLOW_CONTEXT_ACTION_ALLOW;
+
+	if (vport->info.vlan || vport->info.qos || need_vlan_filter)
 		MLX5_SET_TO_ONES(fte_match_param, spec->match_criteria, outer_headers.cvlan_tag);
 
 	if (vport->info.spoofchk) {
@@ -1312,20 +1462,52 @@ static int esw_vport_ingress_config(struct mlx5_eswitch *esw,
 		ether_addr_copy(smac_v, vport->info.mac);
 	}
 
-	spec->match_criteria_enable = MLX5_MATCH_OUTER_HEADERS;
-	flow_act.action = MLX5_FLOW_CONTEXT_ACTION_ALLOW;
-	vport->ingress.allow_rule =
-		mlx5_add_flow_rules(vport->ingress.acl, spec,
-				    &flow_act, NULL, 0);
-	if (IS_ERR(vport->ingress.allow_rule)) {
-		err = PTR_ERR(vport->ingress.allow_rule);
-		esw_warn(esw->dev,
-			 "vport[%d] configure ingress allow rule, err(%d)\n",
-			 vport->vport, err);
-		vport->ingress.allow_rule = NULL;
-		goto out;
+	/* Allow untagged */
+	if (!need_vlan_filter ||
+	    test_bit(0, vport->info.vlan_trunk_8021q_bitmap)) {
+		vport->ingress.allow_untagged_rule =
+			mlx5_add_flow_rules(vport->ingress.acl, spec,
+					    &flow_act, NULL, 0);
+		if (IS_ERR(vport->ingress.allow_untagged_rule)) {
+			err = PTR_ERR(vport->ingress.allow_untagged_rule);
+			esw_warn(esw->dev,
+				 "vport[%d] configure ingress allow rule, err(%d)\n",
+				 vport->vport, err);
+			vport->ingress.allow_untagged_rule = NULL;
+			goto out;
+		}
 	}
 
+	if (!need_vlan_filter)
+		goto drop_rule;
+
+	/* Allow tagged (VLAN trunk list) */
+	MLX5_SET_TO_ONES(fte_match_param, spec->match_value, outer_headers.cvlan_tag);
+	MLX5_SET_TO_ONES(fte_match_param, spec->match_criteria, outer_headers.first_vid);
+
+	for_each_set_bit(vlan_id, vport->acl_vlan_8021q_bitmap, VLAN_N_VID) {
+		trunk_vlan_rule = kzalloc(sizeof(*trunk_vlan_rule), GFP_KERNEL);
+		if (!trunk_vlan_rule) {
+			err = -ENOMEM;
+			goto out;
+		}
+
+		MLX5_SET(fte_match_param, spec->match_value, outer_headers.first_vid,
+			 vlan_id);
+		trunk_vlan_rule->acl_vlan_rule =
+			mlx5_add_flow_rules(vport->ingress.acl, spec, &flow_act, NULL, 0);
+		if (IS_ERR(trunk_vlan_rule->acl_vlan_rule)) {
+			err = PTR_ERR(trunk_vlan_rule->acl_vlan_rule);
+			esw_warn(esw->dev,
+				 "vport[%d] configure ingress allowed vlan rule failed, err(%d)\n",
+				 vport->vport, err);
+			kfree(trunk_vlan_rule);
+			continue;
+		}
+		list_add(&trunk_vlan_rule->list, &vport->ingress.legacy.allowed_vlans_rules);
+	}
+
+drop_rule:
 	memset(spec, 0, sizeof(*spec));
 	flow_act.action = MLX5_FLOW_CONTEXT_ACTION_DROP;
 
@@ -1348,11 +1530,11 @@ static int esw_vport_ingress_config(struct mlx5_eswitch *esw,
 		vport->ingress.legacy.drop_rule = NULL;
 		goto out;
 	}
-	kvfree(spec);
-	return 0;
 
 out:
-	esw_vport_disable_legacy_ingress_acl(esw, vport);
+	if (err)
+		esw_vport_disable_legacy_ingress_acl(esw, vport);
+
 	kvfree(spec);
 	return err;
 }
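The ingress path above rejects mixing the two VLAN modes: a vport may carry either a single VST vlan/qos setting or a VGT+ trunk list, never both. A stand-alone sketch of that exclusivity rule (the function name and signature are illustrative, not driver API):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Sketch of the VST/VGT+ exclusivity check performed in
 * esw_vport_ingress_config(): configuring a VST vlan or qos while a
 * VGT+ trunk list is active (or vice versa) is refused with -EPERM. */
static int check_vlan_mode(unsigned short vst_vlan, unsigned char qos,
			   bool has_trunk_list)
{
	if ((vst_vlan || qos) && has_trunk_list)
		return -EPERM;
	return 0;
}
```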
@@ -1365,7 +1547,7 @@ int mlx5_esw_create_vport_egress_acl_vlan(struct mlx5_eswitch *esw,
 	struct mlx5_flow_spec *spec;
 	int err = 0;
 
-	if (vport->egress.allowed_vlan)
+	if (vport->egress.allowed_vst_vlan)
 		return -EEXIST;
 
 	spec = kvzalloc(sizeof(*spec), GFP_KERNEL);
@@ -1379,15 +1561,15 @@ int mlx5_esw_create_vport_egress_acl_vlan(struct mlx5_eswitch *esw,
 
 	spec->match_criteria_enable = MLX5_MATCH_OUTER_HEADERS;
 	flow_act.action = flow_action;
-	vport->egress.allowed_vlan =
+	vport->egress.allowed_vst_vlan =
 		mlx5_add_flow_rules(vport->egress.acl, spec,
 				    &flow_act, NULL, 0);
-	if (IS_ERR(vport->egress.allowed_vlan)) {
-		err = PTR_ERR(vport->egress.allowed_vlan);
+	if (IS_ERR(vport->egress.allowed_vst_vlan)) {
+		err = PTR_ERR(vport->egress.allowed_vst_vlan);
 		esw_warn(esw->dev,
 			 "vport[%d] configure egress vlan rule failed, err(%d)\n",
 			 vport->vport, err);
-		vport->egress.allowed_vlan = NULL;
+		vport->egress.allowed_vst_vlan = NULL;
 	}
 
 	kvfree(spec);
@@ -1397,17 +1579,22 @@ int mlx5_esw_create_vport_egress_acl_vlan(struct mlx5_eswitch *esw,
 static int esw_vport_egress_config(struct mlx5_eswitch *esw,
 				   struct mlx5_vport *vport)
 {
+	bool need_vlan_filter = !!bitmap_weight(vport->info.vlan_trunk_8021q_bitmap,
+						VLAN_N_VID);
+	bool need_acl_table = vport->info.vlan || vport->info.qos || need_vlan_filter;
+	struct mlx5_acl_vlan *trunk_vlan_rule;
 	struct mlx5_fc *counter = vport->egress.legacy.drop_counter;
 	struct mlx5_flow_destination drop_ctr_dst = {0};
 	struct mlx5_flow_destination *dst = NULL;
 	struct mlx5_flow_act flow_act = {0};
 	struct mlx5_flow_spec *spec;
 	int dest_num = 0;
+	u16 vlan_id = 0;
 	int err = 0;
 
 	esw_vport_cleanup_egress_rules(esw, vport);
 
-	if (!vport->info.vlan && !vport->info.qos) {
+	if (!need_acl_table) {
 		esw_vport_disable_egress_acl(esw, vport);
 		return 0;
 	}
@@ -1424,17 +1611,67 @@ static int esw_vport_egress_config(struct mlx5_eswitch *esw,
 		  "vport[%d] configure egress rules, vlan(%d) qos(%d)\n",
 		  vport->vport, vport->info.vlan, vport->info.qos);
 
-	/* Allowed vlan rule */
-	err = mlx5_esw_create_vport_egress_acl_vlan(esw, vport, vport->info.vlan,
-						    MLX5_FLOW_CONTEXT_ACTION_ALLOW);
-	if (err)
-		return err;
-
-	/* Drop others rule (star rule) */
 	spec = kvzalloc(sizeof(*spec), GFP_KERNEL);
-	if (!spec)
+	if (!spec) {
+		err = -ENOMEM;
 		goto out;
+	}
+
+	MLX5_SET_TO_ONES(fte_match_param, spec->match_criteria, outer_headers.cvlan_tag);
+	spec->match_criteria_enable = MLX5_MATCH_OUTER_HEADERS;
+	flow_act.action = MLX5_FLOW_CONTEXT_ACTION_ALLOW;
+
+	/* Allow untagged */
+	if (need_vlan_filter && test_bit(0, vport->info.vlan_trunk_8021q_bitmap)) {
+		vport->egress.legacy.allow_untagged_rule =
+			mlx5_add_flow_rules(vport->egress.acl, spec,
+					    &flow_act, NULL, 0);
+		if (IS_ERR(vport->egress.legacy.allow_untagged_rule)) {
+			err = PTR_ERR(vport->egress.legacy.allow_untagged_rule);
+			esw_warn(esw->dev,
+				 "vport[%d] configure egress allow rule, err(%d)\n",
+				 vport->vport, err);
+			vport->egress.legacy.allow_untagged_rule = NULL;
+		}
+	}
+
+	/* VST rule */
+	if (vport->info.vlan || vport->info.qos) {
+		err = mlx5_esw_create_vport_egress_acl_vlan(esw, vport, vport->info.vlan,
+							    MLX5_FLOW_CONTEXT_ACTION_ALLOW);
+		if (err)
+			goto out;
+	}
+
+	/* Allowed trunk vlans rules (VGT+) */
+	MLX5_SET_TO_ONES(fte_match_param, spec->match_value, outer_headers.cvlan_tag);
+	MLX5_SET_TO_ONES(fte_match_param, spec->match_criteria, outer_headers.first_vid);
 
+	for_each_set_bit(vlan_id, vport->acl_vlan_8021q_bitmap, VLAN_N_VID) {
+		trunk_vlan_rule = kzalloc(sizeof(*trunk_vlan_rule), GFP_KERNEL);
+		if (!trunk_vlan_rule) {
+			err = -ENOMEM;
+			goto out;
+		}
+
+		MLX5_SET(fte_match_param, spec->match_value, outer_headers.first_vid,
+			 vlan_id);
+		trunk_vlan_rule->acl_vlan_rule =
+			mlx5_add_flow_rules(vport->egress.acl, spec, &flow_act, NULL, 0);
+		if (IS_ERR(trunk_vlan_rule->acl_vlan_rule)) {
+			err = PTR_ERR(trunk_vlan_rule->acl_vlan_rule);
+			esw_warn(esw->dev,
+				 "vport[%d] configure egress allowed vlan rule failed, err(%d)\n",
+				 vport->vport, err);
+			kfree(trunk_vlan_rule);
+			continue;
+		}
+		list_add(&trunk_vlan_rule->list, &vport->egress.legacy.allowed_vlans_rules);
+	}
+
+	/* Drop others rule (star rule) */
+	memset(spec, 0, sizeof(*spec));
 	flow_act.action = MLX5_FLOW_CONTEXT_ACTION_DROP;
 
 	/* Attach egress drop flow counter */
@@ -1455,7 +1692,11 @@ static int esw_vport_egress_config(struct mlx5_eswitch *esw,
 			 vport->vport, err);
 		vport->egress.legacy.drop_rule = NULL;
 	}
+
 out:
+	if (err)
+		esw_vport_cleanup_egress_rules(esw, vport);
+
 	kvfree(spec);
 	return err;
 }
@@ -1787,6 +2028,12 @@ static int esw_enable_vport(struct mlx5_eswitch *esw, struct mlx5_vport *vport,
 
 	esw_debug(esw->dev, "Enabling VPORT(%d)\n", vport_num);
 
+	bitmap_zero(vport->req_vlan_bitmap, VLAN_N_VID);
+	bitmap_zero(vport->acl_vlan_8021q_bitmap, VLAN_N_VID);
+	bitmap_zero(vport->info.vlan_trunk_8021q_bitmap, VLAN_N_VID);
+	INIT_LIST_HEAD(&vport->egress.legacy.allowed_vlans_rules);
+	INIT_LIST_HEAD(&vport->ingress.legacy.allowed_vlans_rules);
+
 	/* Restore old vport configuration */
 	esw_apply_vport_conf(esw, vport);
 
@@ -2268,6 +2515,17 @@ int mlx5_eswitch_get_vport_config(struct mlx5_eswitch *esw,
 	ivi->trusted = evport->info.trusted;
 	ivi->min_tx_rate = evport->info.min_rate;
 	ivi->max_tx_rate = evport->info.max_rate;
+
+	if (bitmap_weight(evport->info.vlan_trunk_8021q_bitmap, VLAN_N_VID)) {
+		bitmap_copy((unsigned long *)ivi->trunk_8021q,
+			    evport->info.vlan_trunk_8021q_bitmap, VLAN_N_VID);
+		ivi->vlan_mode = IFLA_VF_VLAN_MODE_TRUNK;
+	} else if (ivi->vlan) {
+		ivi->vlan_mode = IFLA_VF_VLAN_MODE_VST;
+	} else {
+		ivi->vlan_mode = IFLA_VF_VLAN_MODE_VGT;
+	}
+
 	mutex_unlock(&esw->state_lock);
 
 	return 0;
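The mode reported back to userspace in the hunk above follows a simple precedence: a non-empty trunk bitmap means VGT+ (trunk mode), otherwise a configured VST vlan means VST, otherwise plain VGT. Sketched out (the enum values stand in for the IFLA_VF_VLAN_MODE_* uapi constants added by the series):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins for the IFLA_VF_VLAN_MODE_* constants. */
enum vf_vlan_mode { VF_VLAN_MODE_VGT, VF_VLAN_MODE_VST, VF_VLAN_MODE_TRUNK };

/* Precedence used by mlx5_eswitch_get_vport_config(): trunk list
 * wins over VST, and a vport with neither is plain VGT. */
static enum vf_vlan_mode vf_vlan_mode(bool trunk_nonempty,
				      unsigned short vst_vlan)
{
	if (trunk_nonempty)
		return VF_VLAN_MODE_TRUNK;
	if (vst_vlan)
		return VF_VLAN_MODE_VST;
	return VF_VLAN_MODE_VGT;
}
```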
@@ -2286,6 +2544,14 @@ int __mlx5_eswitch_set_vport_vlan(struct mlx5_eswitch *esw,
 	if (vlan > 4095 || qos > 7)
 		return -EINVAL;
 
+	if (bitmap_weight(evport->info.vlan_trunk_8021q_bitmap, VLAN_N_VID)) {
+		mlx5_core_warn(esw->dev,
+			       "VST is not allowed when operating in VGT+ mode vport(%d)\n",
+			       vport);
+		return -EPERM;
+	}
+
 	err = modify_esw_vport_cvlan(esw->dev, vport, vlan, qos, set_flags);
 	if (err)
 		return err;
@@ -2628,6 +2894,92 @@ static int mlx5_eswitch_query_vport_drop_stats(struct mlx5_core_dev *dev,
 	return 0;
 }
 
+static int mlx5_eswitch_update_vport_trunk(struct mlx5_eswitch *esw,
+					   struct mlx5_vport *evport,
+					   unsigned long *old_trunk)
+{
+	DECLARE_BITMAP(diff_vlan_bm, VLAN_N_VID);
+	int err = 0;
+
+	bitmap_xor(diff_vlan_bm, old_trunk,
+		   evport->info.vlan_trunk_8021q_bitmap, VLAN_N_VID);
+	if (!bitmap_weight(diff_vlan_bm, VLAN_N_VID))
+		return err;
+
+	esw_update_acl_trunk_bitmap(esw, evport->vport);
+	if (evport->enabled && esw->mode == MLX5_ESWITCH_LEGACY) {
+		err = esw_vport_egress_config(esw, evport);
+		if (!err)
+			err = esw_vport_ingress_config(esw, evport);
+	}
+
+	if (err) {
+		bitmap_copy(evport->info.vlan_trunk_8021q_bitmap, old_trunk, VLAN_N_VID);
+		esw_update_acl_trunk_bitmap(esw, evport->vport);
+		esw_vport_egress_config(esw, evport);
+		esw_vport_ingress_config(esw, evport);
+	}
+
+	return err;
+}
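mlx5_eswitch_update_vport_trunk() above uses a snapshot-and-rollback idiom: it keeps the old trunk bitmap, applies the new one, and on failure restores the snapshot and reprograms the ACLs so the cached configuration never diverges from the hardware. A minimal userspace sketch of the same pattern (the config struct and the apply callback are test doubles, not driver API):

```c
#include <assert.h>
#include <string.h>

#define TRUNK_WORDS 64

struct vport_cfg {
	unsigned long trunk[TRUNK_WORDS];
};

/* Test double standing in for the hardware programming step. */
static int fail_next_apply;
static int fake_apply(const struct vport_cfg *cfg)
{
	(void)cfg;
	if (fail_next_apply) {
		fail_next_apply = 0;
		return -1;
	}
	return 0;
}

/* Snapshot the old trunk, install the new one, and on failure roll
 * back to the snapshot and re-apply it, mirroring the error path of
 * mlx5_eswitch_update_vport_trunk(). */
static int update_trunk(struct vport_cfg *cfg, const unsigned long *new_trunk,
			int (*apply)(const struct vport_cfg *))
{
	unsigned long old[TRUNK_WORDS];
	int err;

	memcpy(old, cfg->trunk, sizeof(old));
	memcpy(cfg->trunk, new_trunk, sizeof(cfg->trunk));

	err = apply(cfg);
	if (err) {
		memcpy(cfg->trunk, old, sizeof(cfg->trunk));
		apply(cfg);	/* reinstate the previous state */
	}
	return err;
}
```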
+
+int mlx5_eswitch_add_vport_trunk_range(struct mlx5_eswitch *esw,
+				       u16 vport, u16 start_vlan, u16 end_vlan)
+{
+	struct mlx5_vport *evport = mlx5_eswitch_get_vport(esw, vport);
+	DECLARE_BITMAP(prev_vport_bitmap, VLAN_N_VID);
+	int err = 0;
+
+	if (!ESW_ALLOWED(esw))
+		return -EPERM;
+
+	if (IS_ERR(evport) || end_vlan >= VLAN_N_VID || start_vlan > end_vlan)
+		return -EINVAL;
+
+	mutex_lock(&esw->state_lock);
+
+	if (evport->info.vlan || evport->info.qos) {
+		err = -EPERM;
+		mlx5_core_warn(esw->dev,
+			       "VGT+ is not allowed when operating in VST mode vport(%d)\n",
+			       vport);
+		goto unlock;
+	}
+
+	bitmap_copy(prev_vport_bitmap, evport->info.vlan_trunk_8021q_bitmap, VLAN_N_VID);
+	bitmap_set(evport->info.vlan_trunk_8021q_bitmap, start_vlan,
+		   end_vlan - start_vlan + 1);
+	err = mlx5_eswitch_update_vport_trunk(esw, evport, prev_vport_bitmap);
+
+unlock:
+	mutex_unlock(&esw->state_lock);
+
+	return err;
+}
+
+int mlx5_eswitch_del_vport_trunk_range(struct mlx5_eswitch *esw,
+				       u16 vport, u16 start_vlan, u16 end_vlan)
+{
+	struct mlx5_vport *evport = mlx5_eswitch_get_vport(esw, vport);
+	DECLARE_BITMAP(prev_vport_bitmap, VLAN_N_VID);
+	int err = 0;
+
+	if (!ESW_ALLOWED(esw))
+		return -EPERM;
+
+	if (IS_ERR(evport) || end_vlan >= VLAN_N_VID || start_vlan > end_vlan)
+		return -EINVAL;
+
+	mutex_lock(&esw->state_lock);
+	bitmap_copy(prev_vport_bitmap, evport->info.vlan_trunk_8021q_bitmap, VLAN_N_VID);
+	bitmap_clear(evport->info.vlan_trunk_8021q_bitmap, start_vlan,
+		     end_vlan - start_vlan + 1);
+	err = mlx5_eswitch_update_vport_trunk(esw, evport, prev_vport_bitmap);
+	mutex_unlock(&esw->state_lock);
+
+	return err;
+}
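The add/del helpers above validate the requested range and then set or clear the inclusive [start_vlan, end_vlan] span in the trunk bitmap. The same logic can be sketched with a plain bool array standing in for the kernel bitmap (a minimal sketch assuming the strict 0..4095 VLAN ID space; names are illustrative):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define VLAN_N_VID 4096

/* A trunk range request is valid when start <= end and both IDs fall
 * inside the 12-bit VLAN ID space 0..4095. */
static int trunk_range_valid(unsigned int start, unsigned int end)
{
	if (end >= VLAN_N_VID || start > end)
		return -EINVAL;
	return 0;
}

/* Mirrors bitmap_set()/bitmap_clear() over the inclusive range:
 * allowed=true models an add, allowed=false models a delete. */
static void trunk_range_assign(bool *trunk, unsigned int start,
			       unsigned int end, bool allowed)
{
	for (unsigned int v = start; v <= end; v++)
		trunk[v] = allowed;
}
```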
+
 int mlx5_eswitch_get_vport_stats(struct mlx5_eswitch *esw,
 				 u16 vport_num,
 				 struct ifla_vf_stats *vf_stats)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
index 920d8f5..1ba7aa3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
@@ -35,6 +35,8 @@
 
 #include <linux/if_ether.h>
 #include <linux/if_link.h>
+#include <linux/if_vlan.h>
+#include <linux/bitmap.h>
 #include <linux/atomic.h>
 #include <net/devlink.h>
 #include <linux/mlx5/device.h>
@@ -51,6 +53,9 @@
 #define MLX5_MAX_MC_PER_VPORT(dev) \
 	(1 << MLX5_CAP_GEN(dev, log_max_current_mc_list))
 
+#define MLX5_MAX_VLAN_PER_VPORT(dev) \
+	(1 << MLX5_CAP_GEN(dev, log_max_vlan_list))
+
 #define MLX5_MIN_BW_SHARE 1
 
 #define MLX5_RATE_TO_BW_SHARE(rate, divider, limit) \
@@ -65,14 +70,14 @@
 
 struct vport_ingress {
 	struct mlx5_flow_table *acl;
-	struct mlx5_flow_handle *allow_rule;
+	struct mlx5_flow_handle *allow_untagged_rule;
 	struct {
-		struct mlx5_flow_group *allow_spoofchk_only_grp;
 		struct mlx5_flow_group *allow_untagged_spoofchk_grp;
-		struct mlx5_flow_group *allow_untagged_only_grp;
+		struct mlx5_flow_group *allow_tagged_spoofchk_grp;
 		struct mlx5_flow_group *drop_grp;
 		struct mlx5_flow_handle *drop_rule;
 		struct mlx5_fc *drop_counter;
+		struct list_head allowed_vlans_rules;
 	} legacy;
 	struct {
 		struct mlx5_flow_group *metadata_grp;
@@ -83,11 +88,14 @@ struct vport_ingress {
 
 struct vport_egress {
 	struct mlx5_flow_table *acl;
+	struct mlx5_flow_group *allow_untagged_grp;
 	struct mlx5_flow_group *allowed_vlans_grp;
 	struct mlx5_flow_group *drop_grp;
-	struct mlx5_flow_handle  *allowed_vlan;
+	struct mlx5_flow_handle  *allowed_vst_vlan;
 	struct {
 		struct mlx5_flow_handle *drop_rule;
+		struct list_head allowed_vlans_rules;
+		struct mlx5_flow_handle *allow_untagged_rule;
 		struct mlx5_fc *drop_counter;
 	} legacy;
 };
@@ -107,12 +115,15 @@ struct mlx5_vport_info {
 	u32                     max_rate;
 	bool                    spoofchk;
 	bool                    trusted;
+	/* the admin approved vlan list */
+	DECLARE_BITMAP(vlan_trunk_8021q_bitmap, VLAN_N_VID);
 };
 
 /* Vport context events */
 enum mlx5_eswitch_vport_event {
 	MLX5_VPORT_UC_ADDR_CHANGE = BIT(0),
 	MLX5_VPORT_MC_ADDR_CHANGE = BIT(1),
+	MLX5_VPORT_VLAN_CHANGE = BIT(2),
 	MLX5_VPORT_PROMISC_CHANGE = BIT(3),
 };
 
@@ -121,6 +132,10 @@ struct mlx5_vport {
 	int                     vport;
 	struct hlist_head       uc_list[MLX5_L2_ADDR_HASH_SIZE];
 	struct hlist_head       mc_list[MLX5_L2_ADDR_HASH_SIZE];
+	/* The requested vlan list from the vport side */
+	DECLARE_BITMAP(req_vlan_bitmap, VLAN_N_VID);
+	/* Actual accepted vlans on the acl tables */
+	DECLARE_BITMAP(acl_vlan_8021q_bitmap, VLAN_N_VID);
 	struct mlx5_flow_handle *promisc_rule;
 	struct mlx5_flow_handle *allmulti_rule;
 	struct work_struct      vport_change_handler;
@@ -292,6 +307,10 @@ int mlx5_eswitch_set_vport_rate(struct mlx5_eswitch *esw, u16 vport,
 int mlx5_eswitch_get_vepa(struct mlx5_eswitch *esw, u8 *setting);
 int mlx5_eswitch_get_vport_config(struct mlx5_eswitch *esw,
 				  u16 vport, struct ifla_vf_info *ivi);
+int mlx5_eswitch_add_vport_trunk_range(struct mlx5_eswitch *esw,
+				       u16 vport, u16 start_vlan, u16 end_vlan);
+int mlx5_eswitch_del_vport_trunk_range(struct mlx5_eswitch *esw,
+				       u16 vport, u16 start_vlan, u16 end_vlan);
 int mlx5_eswitch_get_vport_stats(struct mlx5_eswitch *esw,
 				 u16 vport,
 				 struct ifla_vf_stats *vf_stats);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index 9924f06..2db872a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -1783,15 +1783,15 @@ static int esw_vport_ingress_prio_tag_config(struct mlx5_eswitch *esw,
 		flow_act.modify_hdr = vport->ingress.offloads.modify_metadata;
 	}
 
-	vport->ingress.allow_rule =
+	vport->ingress.allow_untagged_rule =
 		mlx5_add_flow_rules(vport->ingress.acl, spec,
 				    &flow_act, NULL, 0);
-	if (IS_ERR(vport->ingress.allow_rule)) {
-		err = PTR_ERR(vport->ingress.allow_rule);
+	if (IS_ERR(vport->ingress.allow_untagged_rule)) {
+		err = PTR_ERR(vport->ingress.allow_untagged_rule);
 		esw_warn(esw->dev,
 			 "vport[%d] configure ingress untagged allow rule, err(%d)\n",
 			 vport->vport, err);
-		vport->ingress.allow_rule = NULL;
+		vport->ingress.allow_untagged_rule = NULL;
 		goto out;
 	}
 
-- 
1.8.3.1



* Re: [PATCH net-next v2 0/3] VGT+ support
  2019-10-31 19:47 [PATCH net-next v2 0/3] VGT+ support Ariel Levkovich
                   ` (2 preceding siblings ...)
  2019-10-31 19:47 ` [PATCH net-next v2 3/3] net/mlx5: " Ariel Levkovich
@ 2019-10-31 20:31 ` David Miller
  2019-10-31 22:20   ` Ariel Levkovich
  2019-11-01  0:23 ` Jakub Kicinski
  4 siblings, 1 reply; 24+ messages in thread
From: David Miller @ 2019-10-31 20:31 UTC (permalink / raw)
  To: lariel; +Cc: netdev, saeedm, sd, sbrivio, nikolay, jiri, dsahern, stephen


The previous posted version was also v2, what are you doing?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next v2 0/3] VGT+ support
  2019-10-31 20:31 ` [PATCH net-next v2 0/3] " David Miller
@ 2019-10-31 22:20   ` Ariel Levkovich
  2019-10-31 22:58     ` David Miller
  0 siblings, 1 reply; 24+ messages in thread
From: Ariel Levkovich @ 2019-10-31 22:20 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, Saeed Mahameed, sd, sbrivio, nikolay, Jiri Pirko,
	dsahern, stephen



On 10/31/19, 4:31 PM, "David Miller" <davem@davemloft.net> wrote:

>    The previous posted version was also v2, what are you doing?

I restarted this series since my first submission had a mistake in the
subject prefix. This is the 2nd version of that new submission, while the
previous one has a different subject and can be ignored.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next v2 0/3] VGT+ support
  2019-10-31 22:20   ` Ariel Levkovich
@ 2019-10-31 22:58     ` David Miller
  2019-11-01 14:55       ` Ariel Levkovich
  0 siblings, 1 reply; 24+ messages in thread
From: David Miller @ 2019-10-31 22:58 UTC (permalink / raw)
  To: lariel; +Cc: netdev, saeedm, sd, sbrivio, nikolay, jiri, dsahern, stephen

From: Ariel Levkovich <lariel@mellanox.com>
Date: Thu, 31 Oct 2019 22:20:55 +0000

> 
> 
> On 10/31/19, 4:31 PM, "David Miller" <davem@davemloft.net> wrote:
> 
>     
>>    The previous posted version was also v2, what are you doing?
> 
> I restarted this series since my first submission had a mistake in the subject prefix.
> This is the 2nd version of that new submission while previous has a different subject
> and can be ignored.  

Always increment the version number when you post a patch series anew.

Otherwise it is ambiguous to me which one is the latest.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next v2 0/3] VGT+ support
  2019-10-31 19:47 [PATCH net-next v2 0/3] VGT+ support Ariel Levkovich
                   ` (3 preceding siblings ...)
  2019-10-31 20:31 ` [PATCH net-next v2 0/3] " David Miller
@ 2019-11-01  0:23 ` Jakub Kicinski
       [not found]   ` <8d7db56c-376a-d809-4a65-bfc2baf3254f@mellanox.com>
  4 siblings, 1 reply; 24+ messages in thread
From: Jakub Kicinski @ 2019-11-01  0:23 UTC (permalink / raw)
  To: Ariel Levkovich
  Cc: netdev, Saeed Mahameed, sd, sbrivio, nikolay, Jiri Pirko,
	dsahern, stephen

On Thu, 31 Oct 2019 19:47:25 +0000, Ariel Levkovich wrote:
> The following series introduces VGT+ support for SRIOV vf
> devices.
> 
> VGT+ is an extention of the VGT (Virtual Guest Tagging)
> where the guest is in charge of vlan tagging the packets
> only with VGT+ the admin can limit the allowed vlan ids
> the guest can use to a specific vlan trunk list.
> 
> The patches introduce the API for admin users to set and
> query these vlan trunk lists on the vfs using netlink
> commands.
> 
> changes from v1 to v2:
> - Fixed indentation of RTEXT_FILTER_SKIP_STATS.
> - Changed vf_ext param to bool.
> - Check if VF num exceeds the opened VFs range and return without
> adding the vfinfo.

If you repost something without addressing feedback you received you
_have_ _to_ describe what that feedback was and why it was ignored in 
the cover letter of the new posting, please and thank you.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next v2 0/3] VGT+ support
  2019-10-31 22:58     ` David Miller
@ 2019-11-01 14:55       ` Ariel Levkovich
  0 siblings, 0 replies; 24+ messages in thread
From: Ariel Levkovich @ 2019-11-01 14:55 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, Saeed Mahameed, sd, sbrivio, nikolay, Jiri Pirko,
	dsahern, stephen

On 10/31/19 6:58 PM, "David Miller" <davem@davemloft.net> wrote:
> From: Ariel Levkovich <lariel@mellanox.com>
> Date: Thu, 31 Oct 2019 22:20:55 +0000
>
>>
>> On 10/31/19, 4:31 PM, "David Miller" <davem@davemloft.net> wrote:
>>
>>      
>>>     The previous posted version was also v2, what are you doing?
>> I restarted this series since my first submission had a mistake in the subject prefix.
>> This is the 2nd version of that new submission while previous has a different subject
>> and can be ignored.
> Always increment the version number when you post a patch series anew.
>
> Otherwise it is ambiguous to me which one is the latest.


Understood. Will make sure of that.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next v2 0/3] VGT+ support
       [not found]   ` <8d7db56c-376a-d809-4a65-bfc2baf3254f@mellanox.com>
@ 2019-11-01 21:28     ` Saeed Mahameed
  2019-11-02  0:21       ` Jakub Kicinski
  0 siblings, 1 reply; 24+ messages in thread
From: Saeed Mahameed @ 2019-11-01 21:28 UTC (permalink / raw)
  To: Ariel Levkovich, jakub.kicinski
  Cc: dsahern, sbrivio, nikolay, stephen, netdev, sd, Jiri Pirko

On Fri, 2019-11-01 at 19:44 +0000, Ariel Levkovich wrote:
> On 10/31/19 8:23 PM, Jakub Kicinski <jakub.kicinski@netronome.com>
> wrote:
> > On Thu, 31 Oct 2019 19:47:25 +0000, Ariel Levkovich wrote:
> > > The following series introduces VGT+ support for SRIOV vf
> > > devices.
> > > 
> > > VGT+ is an extention of the VGT (Virtual Guest Tagging)
> > > where the guest is in charge of vlan tagging the packets
> > > only with VGT+ the admin can limit the allowed vlan ids
> > > the guest can use to a specific vlan trunk list.
> > > 
> > > The patches introduce the API for admin users to set and
> > > query these vlan trunk lists on the vfs using netlink
> > > commands.
> > > 
> > > changes from v1 to v2:
> > > - Fixed indentation of RTEXT_FILTER_SKIP_STATS.
> > > - Changed vf_ext param to bool.
> > > - Check if VF num exceeds the opened VFs range and return without
> > > adding the vfinfo.
> > 
> > If you repost something without addressing feedback you received
> > you
> > _have_ _to_ describe what that feedback was and why it was ignored
> > in 
> > the cover letter of the new posting, please and thank you.
> 
> Right, I must've missed that.
> 
> I reposted the patches to address the code-related feedback from Stephen
> while we continue the discussion on the concept of legacy mode features.
> 
> On 10/30/19 you wrote:
> 
>     "
> 
>     The "we don't want any more legacy VF ndos" policy which I think we
>     wanted to follow is much easier to stick to than "we don't want any
>     more legacy VF ndos, unless..".
>
>     There's nothing here that can't be done in switchdev mode (perhaps
>     bridge offload would actually be more suitable than just flower),
>     and the uAPI extension is not an insignificant one.
>
>     I don't think we should be growing both legacy and switchdev APIs,
>     at some point we got to pick one. The switchdev extension to set
>     hwaddr for which patches were posted recently had been implemented
>     through legacy API a while ago (by Chelsio IIRC) so it's not that
>     we're looking towards switchdev where legacy API is impossible to
>     extend. It's purely a policy decision to pick one and deprecate the
>     other.
>
>     Only if we freeze the legacy API completely will the orchestration
>     layers have an incentive to support switchdev. And we can save the
>     few hundred lines of code per feature in every driver..
>     "
> 
> 
> Unfortunately, like Saeed and yourself mentioned, we still face customers
> that are refusing to move to switchdev while requiring features and
> capabilities in virtual environments.
> 
> We have several configurations available today in legacy mode that
> include vlan settings for the VF.
> I know that we decided to refrain from adding new legacy ndos, but I also
> think we left a gap in the VLAN control for legacy mode, and this feature
> fills that gap.
> Today a user can either set a single VLAN id on a VF or none; there's
> nothing in between.
> This feature completes the missing part of the puzzle and we don't see
> any further development in this area after this.
> 
> We have tried to submit and close this gap before
> (https://patchwork.ozlabs.org/patch/806213)
> but messed up there, and at the time we decided not to proceed with the
> effort for several reasons, so we are trying to close this now.

Jakub, since Ariel is still working on his upstream mailing list skills
:), I would like to emphasize and summarize his point in text style ;-)
the way we like it.

Bottom line, we tried to push this feature a couple of years ago, but
due to some internal issues the submission was ignored for a while. Now
the legacy SRIOV customers are moving towards upstream, which to me is
great progress, so I think this feature is worth the shot; also, as
Ariel pointed out, the VF vlan filter is really a gap that should be
closed.

For all other features it is true that the user must consider moving to
switchdev mode or find another community for support.

Our policy is still strong regarding obsoleting legacy mode and pushing
all new features to switchdev mode, but looking at the facts here I do
think there is a point and ROI in closing this gap in legacy mode.

I hope this all makes sense.

Thanks,
Saeed.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next v2 0/3] VGT+ support
  2019-11-01 21:28     ` Saeed Mahameed
@ 2019-11-02  0:21       ` Jakub Kicinski
  2019-11-05  1:38         ` Saeed Mahameed
  0 siblings, 1 reply; 24+ messages in thread
From: Jakub Kicinski @ 2019-11-02  0:21 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Ariel Levkovich, dsahern, sbrivio, nikolay, stephen, netdev, sd,
	Jiri Pirko

On Fri, 1 Nov 2019 21:28:22 +0000, Saeed Mahameed wrote:
> Jakub, since Ariel is still working on his upstream mailing list skills
> :), i would like to emphasis and summarize his point in text style ;-)
> the way we like it.

Thanks :)

> Bottom line, we tried to push this feature a couple of years ago, and
> due to some internal issues this submission ignored for a while, now as
> the legacy sriov customers are moving towards upstream, which is for me
> a great progress I think this feature worth the shot, also as Ariel
> pointed out, VF vlan filter is really a gap that should be closed.
> 
> For all other features it is true that the user must consider moving to
> witchdev mode or find a another community for support.
> 
> Our policy is still strong regarding obsoleting legacy mode and pushing
> all new feature to switchdev mode, but looking at the facts here  I do
> think there is a point here and ROI to close this gap in legacy mode.
> 
> I hope this all make sense. 

I understand and sympathize, you know full well the benefits of working
upstream-first...

I won't reiterate the entire response from my previous email, but the
bottom line for me is that we haven't added a single legacy VF NDO
since 2016, I was hoping we never will add more and I was trying to
stop anyone who tried.

Muxing the VF info through linkinfo is getting pretty ugly with the
tricks we have to play to make it fit the message size.

And you can't really say you considered your options carefully here and
reviewed the code, since the patch adds this nugget to a uAPI header:

diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 797e214..35ab210 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -180,6 +180,8 @@ enum {
 #ifndef __KERNEL__
 #define IFLA_RTA(r)  ((struct rtattr*)(((char*)(r)) + NLMSG_ALIGN(sizeof(struct ifinfomsg))))
 #define IFLA_PAYLOAD(n) NLMSG_PAYLOAD(n,sizeof(struct ifinfomsg))
+#define BITS_PER_BYTE 8
+#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))
 #endif
 
 enum {

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next v2 0/3] VGT+ support
  2019-11-02  0:21       ` Jakub Kicinski
@ 2019-11-05  1:38         ` Saeed Mahameed
  2019-11-05  1:47           ` David Ahern
  0 siblings, 1 reply; 24+ messages in thread
From: Saeed Mahameed @ 2019-11-05  1:38 UTC (permalink / raw)
  To: jakub.kicinski
  Cc: sbrivio, nikolay, dsahern, sd, Jiri Pirko, netdev, stephen,
	Ariel Levkovich

On Fri, 2019-11-01 at 17:21 -0700, Jakub Kicinski wrote:
> On Fri, 1 Nov 2019 21:28:22 +0000, Saeed Mahameed wrote:
> > Jakub, since Ariel is still working on his upstream mailing list
> > skills
> > :), i would like to emphasis and summarize his point in text style
> > ;-)
> > the way we like it.
> 
> Thanks :)
> 
> > Bottom line, we tried to push this feature a couple of years ago,
> > and
> > due to some internal issues this submission ignored for a while,
> > now as
> > the legacy sriov customers are moving towards upstream, which is
> > for me
> > a great progress I think this feature worth the shot, also as Ariel
> > pointed out, VF vlan filter is really a gap that should be closed.
> > 
> > For all other features it is true that the user must consider
> > moving to
> > witchdev mode or find a another community for support.
> > 
> > Our policy is still strong regarding obsoleting legacy mode and
> > pushing
> > all new feature to switchdev mode, but looking at the facts here  I
> > do
> > think there is a point here and ROI to close this gap in legacy
> > mode.
> > 
> > I hope this all make sense. 
> 
> I understand and sympathize, you know full well the benefits of
> working
> upstream-first...
> 
> I won't reiterate the entire response from my previous email, but the
> bottom line for me is that we haven't added a single legacy VF NDO
> since 2016, I was hoping we never will add more and I was trying to
> stop anyone who tried.
> 

The NDO is not the problem here, we can perfectly extend the current
set_vf_vlan_ndo to achieve the same goal with minimal or even no kernel
changes, but first you have to look at this from my angle. I have been
doing lots of research and there are many points for why this should be
added to legacy mode:

1) Switchdev mode can't replace legacy mode with a press of a button,
many pieces are missing.

2) Upstream legacy SRIOV is incomplete, since it goes together with
flexible VF vlan configuration. Most mlx5 legacy SRIOV users are
using customized kernels and external drivers because upstream is
lacking this one basic vlan filtering feature, and many users decline
switching to the upstream kernel due to this missing feature.

3) Many other vendors have this feature in customized drivers/kernels,
and many vendors/drivers don't even support switchdev mode (mlx4 for
example); we can't just tell the users of such devices that we are not
supporting basic SRIOV legacy mode features in the upstream kernel.

4) The motivation for this is to slowly move SRIOV users to upstream
and eventually to switchdev mode.

This has been a large gap and disadvantage in upstream, and this
feature has been a long time coming; we have been laying the grounds
for it since 2016 (IFLA_VF_VLAN_LIST). It is not reasonable that we
have APIs to set all kinds of VF vlan attributes but not vlan filters.
I really believe this feature should be in.

I still believe in the switchdev-only policy, but the vlan filter is a
gray area; legacy SRIOV is all about VF vlans.

My opinion is: the VF vlan filter is a very basic requirement for
SRIOV. Not all HW supports switchdev mode, and many customers and use
cases are meant to use legacy mode only. It is unfair to block this
feature, and this is a great loss of users and use cases to the
upstream kernel.

Now if the only remaining problem is the uAPI, we can minimize kernel
impact or even make no kernel changes at all, only iproute2 and
drivers, by reusing the current set_vf_vlan_ndo.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next v2 0/3] VGT+ support
  2019-11-05  1:38         ` Saeed Mahameed
@ 2019-11-05  1:47           ` David Ahern
  2019-11-05  2:35             ` Jakub Kicinski
  0 siblings, 1 reply; 24+ messages in thread
From: David Ahern @ 2019-11-05  1:47 UTC (permalink / raw)
  To: Saeed Mahameed, jakub.kicinski
  Cc: sbrivio, nikolay, sd, Jiri Pirko, netdev, stephen, Ariel Levkovich

On 11/4/19 6:38 PM, Saeed Mahameed wrote:
> On Fri, 2019-11-01 at 17:21 -0700, Jakub Kicinski wrote:
>> On Fri, 1 Nov 2019 21:28:22 +0000, Saeed Mahameed wrote:
>>> Jakub, since Ariel is still working on his upstream mailing list
>>> skills
>>> :), i would like to emphasis and summarize his point in text style
>>> ;-)
>>> the way we like it.
>>
>> Thanks :)
>>
>>> Bottom line, we tried to push this feature a couple of years ago,
>>> and
>>> due to some internal issues this submission ignored for a while,
>>> now as
>>> the legacy sriov customers are moving towards upstream, which is
>>> for me
>>> a great progress I think this feature worth the shot, also as Ariel
>>> pointed out, VF vlan filter is really a gap that should be closed.
>>>
>>> For all other features it is true that the user must consider
>>> moving to
>>> witchdev mode or find a another community for support.
>>>
>>> Our policy is still strong regarding obsoleting legacy mode and
>>> pushing
>>> all new feature to switchdev mode, but looking at the facts here  I
>>> do
>>> think there is a point here and ROI to close this gap in legacy
>>> mode.
>>>
>>> I hope this all make sense. 
>>
>> I understand and sympathize, you know full well the benefits of
>> working
>> upstream-first...
>>
>> I won't reiterate the entire response from my previous email, but the
>> bottom line for me is that we haven't added a single legacy VF NDO
>> since 2016, I was hoping we never will add more and I was trying to
>> stop anyone who tried.
>>
> 
> The NDO is not the problem here, we can perfectly extend the current
> set_vf_vlan_ndo to achieve the same goal with minimal or even NO kernel
> changes, but first you have to look at this from my angel, i have been
> doing lots of research and there are many points for why this should be
> added to legacy mode:
> 
> 1) Switchdev mode can't replace legacy mode with a press of a button,
> many missing pieces.
> 
> 2) Upstream Legacy SRIOV is incomplete since it goes together with
> flexible vf vlan configuration, most of mlx5 legacy sriov users are
> using customized kernels and external drivers, since upstream is
> lacking this one basic vlan filtering feature, and many users decline
> switching to upstream kernel due to this missing features.
> 
> 3) Many other vendors have this feature in customized drivers/kernels,
> and many vendors/drivers don't even support switchdev mode (mlx4 for
> example), we can't just tell the users of such device we are not
> supporting basic sriov legacy mode features in upstream kernel.
> 
> 4) the motivation for this is to slowly move sriov users to upstream
> and eventually to switchdev mode.

If the legacy freeze started in 2016 and we are at the end of 2019, what
is the migration path?


> 
> Now if the only remaining problem is the uAPI, we can minimize kernel
> impact or even make no kernel changes at all, only ip route2 and
> drivers, by reusing the current set_vf_vlan_ndo.

And this caught my eye as well -- iproute2 does not need the baggage either.

Is there any reason this continued support for legacy sriov can not be
done out of tree?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next v2 0/3] VGT+ support
  2019-11-05  1:47           ` David Ahern
@ 2019-11-05  2:35             ` Jakub Kicinski
  2019-11-05 20:10               ` Saeed Mahameed
  0 siblings, 1 reply; 24+ messages in thread
From: Jakub Kicinski @ 2019-11-05  2:35 UTC (permalink / raw)
  To: David Ahern
  Cc: Saeed Mahameed, sbrivio, nikolay, sd, Jiri Pirko, netdev,
	stephen, Ariel Levkovich

On Mon, 4 Nov 2019 18:47:32 -0700, David Ahern wrote:
> On 11/4/19 6:38 PM, Saeed Mahameed wrote:
> > On Fri, 2019-11-01 at 17:21 -0700, Jakub Kicinski wrote:  
> >> On Fri, 1 Nov 2019 21:28:22 +0000, Saeed Mahameed wrote:  
> >>> Bottom line, we tried to push this feature a couple of years ago,
> >>> and
> >>> due to some internal issues this submission ignored for a while,
> >>> now as
> >>> the legacy sriov customers are moving towards upstream, which is
> >>> for me
> >>> a great progress I think this feature worth the shot, also as Ariel
> >>> pointed out, VF vlan filter is really a gap that should be closed.
> >>>
> >>> For all other features it is true that the user must consider
> >>> moving to
> >>> witchdev mode or find a another community for support.
> >>>
> >>> Our policy is still strong regarding obsoleting legacy mode and
> >>> pushing
> >>> all new feature to switchdev mode, but looking at the facts here  I
> >>> do
> >>> think there is a point here and ROI to close this gap in legacy
> >>> mode.
> >>>
> >>> I hope this all make sense.   
> >>
> >> I understand and sympathize, you know full well the benefits of
> >> working
> >> upstream-first...
> >>
> >> I won't reiterate the entire response from my previous email, but the
> >> bottom line for me is that we haven't added a single legacy VF NDO
> >> since 2016, I was hoping we never will add more and I was trying to
> >> stop anyone who tried.
> >>  
> > 
> > The NDO is not the problem here, we can perfectly extend the current
> > set_vf_vlan_ndo to achieve the same goal with minimal or even NO kernel
> > changes, but first you have to look at this from my angel, i have been
> > doing lots of research and there are many points for why this should be
> > added to legacy mode:
> > 
> > 1) Switchdev mode can't replace legacy mode with a press of a button,
> > many missing pieces.
> > 
> > 2) Upstream Legacy SRIOV is incomplete since it goes together with
> > flexible vf vlan configuration, most of mlx5 legacy sriov users are
> > using customized kernels and external drivers, since upstream is
> > lacking this one basic vlan filtering feature, and many users decline
> > switching to upstream kernel due to this missing features.
> > 
> > 3) Many other vendors have this feature in customized drivers/kernels,
> > and many vendors/drivers don't even support switchdev mode (mlx4 for
> > example), we can't just tell the users of such device we are not
> > supporting basic sriov legacy mode features in upstream kernel.
> > 
> > 4) the motivation for this is to slowly move sriov users to upstream
> > and eventually to switchdev mode.  
> 
> If the legacy freeze started in 2016 and we are at the end of 2019, what
> is the migration path?

The migration path is to implement basic bridge offload which can take
care of this trivially.

Problem is people equate switchdev with OvS offload today, so bridge
offload is not actually implemented. It's really hard to convince
product marketing that it's work worth taking on.

Adding incremental features to legacy API is always going to be cheaper
than migrating controllers to switchdev.

IDK if you remember the net_failover argument about in-kernel VF to
virtio bonding. The controllers are doing the bare minimum and it's 
hard for HW vendors to justify the expense of moving forward all parts 
of the stack.

Which means SR-IOV is either stuck in pure-L2 middle ages or requires
all the SDN complexity and overhead. Switchdev bridge offload can be
trivially extended to support eVPN and simplify so many deployments,
sigh..

> > Now if the only remaining problem is the uAPI, we can minimize kernel
> > impact or even make no kernel changes at all, only ip route2 and
> > drivers, by reusing the current set_vf_vlan_ndo.  
> 
> And this caught my eye as well -- iproute2 does not need the baggage either.
> 
> Is there any reason this continued support for legacy sriov can not be
> done out of tree?

Exactly. Moving to upstream is only valuable if it doesn't require
bringing all the out-of-tree baggage.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next v2 0/3] VGT+ support
  2019-11-05  2:35             ` Jakub Kicinski
@ 2019-11-05 20:10               ` Saeed Mahameed
  2019-11-05 21:55                 ` Jakub Kicinski
  0 siblings, 1 reply; 24+ messages in thread
From: Saeed Mahameed @ 2019-11-05 20:10 UTC (permalink / raw)
  To: dsahern, jakub.kicinski
  Cc: stephen, sbrivio, nikolay, Jiri Pirko, netdev, sd, Ariel Levkovich

On Mon, 2019-11-04 at 18:35 -0800, Jakub Kicinski wrote:
> On Mon, 4 Nov 2019 18:47:32 -0700, David Ahern wrote:
> > On 11/4/19 6:38 PM, Saeed Mahameed wrote:
> > > On Fri, 2019-11-01 at 17:21 -0700, Jakub Kicinski wrote:  
> > > > On Fri, 1 Nov 2019 21:28:22 +0000, Saeed Mahameed wrote:  
> > > > > Bottom line, we tried to push this feature a couple of years
> > > > > ago,
> > > > > and
> > > > > due to some internal issues this submission ignored for a
> > > > > while,
> > > > > now as
> > > > > the legacy sriov customers are moving towards upstream, which
> > > > > is
> > > > > for me
> > > > > a great progress I think this feature worth the shot, also as
> > > > > Ariel
> > > > > pointed out, VF vlan filter is really a gap that should be
> > > > > closed.
> > > > > 
> > > > > For all other features it is true that the user must consider
> > > > > moving to
> > > > > witchdev mode or find a another community for support.
> > > > > 
> > > > > Our policy is still strong regarding obsoleting legacy mode
> > > > > and
> > > > > pushing
> > > > > all new feature to switchdev mode, but looking at the facts
> > > > > here  I
> > > > > do
> > > > > think there is a point here and ROI to close this gap in
> > > > > legacy
> > > > > mode.
> > > > > 
> > > > > I hope this all make sense.   
> > > > 
> > > > I understand and sympathize, you know full well the benefits of
> > > > working
> > > > upstream-first...
> > > > 
> > > > I won't reiterate the entire response from my previous email,
> > > > but the
> > > > bottom line for me is that we haven't added a single legacy VF
> > > > NDO
> > > > since 2016, I was hoping we never will add more and I was
> > > > trying to
> > > > stop anyone who tried.
> > > >  
> > > 
> > > The NDO is not the problem here, we can perfectly extend the
> > > current
> > > set_vf_vlan_ndo to achieve the same goal with minimal or even NO
> > > kernel
> > > changes, but first you have to look at this from my angel, i have
> > > been
> > > doing lots of research and there are many points for why this
> > > should be
> > > added to legacy mode:
> > > 
> > > 1) Switchdev mode can't replace legacy mode with a press of a
> > > button,
> > > many missing pieces.
> > > 
> > > 2) Upstream Legacy SRIOV is incomplete since it goes together
> > > with
> > > flexible vf vlan configuration, most of mlx5 legacy sriov users
> > > are
> > > using customized kernels and external drivers, since upstream is
> > > lacking this one basic vlan filtering feature, and many users
> > > decline
> > > switching to upstream kernel due to this missing features.
> > > 
> > > 3) Many other vendors have this feature in customized
> > > drivers/kernels,
> > > and many vendors/drivers don't even support switchdev mode (mlx4
> > > for
> > > example), we can't just tell the users of such device we are not
> > > supporting basic sriov legacy mode features in upstream kernel.
> > > 
> > > 4) the motivation for this is to slowly move sriov users to
> > > upstream
> > > and eventually to switchdev mode.  
> > 
> > If the legacy freeze started in 2016 and we are at the end of 2019,
> > what
> > is the migration path?
> 
> The migration path is to implement basic bridge offload which can
> take
> care of this trivially.
> 

It is not only about implementation, it is also the user ecosystem that
needs to be migrated.
So this needs to happen in baby steps: you can't just ask someone
to move to something that is not even cooked yet and will require
months or years of validation and deployment.

> Problem is people equate switchdev with OvS offload today, so bridge
> offload is not actually implemented. It's really hard to convince
> product marketing that it's work worth taking on.
> 
> Adding incremental features to legacy API is always going to be
> cheaper
> than migrating controllers to switchdev.
> 

Yes, this feature is 1000 times cheaper than implementing the full
offload, and yet very trivial and necessary; the ROI is so big that it
is worth doing: more exposure to upstream.

> IDK if you remember the net_failover argument about in-kernel VF to
> virtio bonding. The controllers are doing the bare minimum and it's 
> hard for HW vendors to justify the expense of moving forward all
> parts 
> of the stack.
> 
> Which means SR-IOV is either stuck in pure-L2 middle ages or requires
> all the SDN complexity and overhead. Switchdev bridge offload can be
> trivially extended to support eVPN and simplify so many deployments,
> sigh..
> 
> > > Now if the only remaining problem is the uAPI, we can minimize
> > > kernel
> > > impact or even make no kernel changes at all, only ip route2 and
> > > drivers, by reusing the current set_vf_vlan_ndo.  
> > 
> > And this caught my eye as well -- iproute2 does not need the
> > baggage either.
> > 
> > Is there any reason this continued support for legacy sriov can not
> > be
> > done out of tree?
> 
> Exactly. Moving to upstream is only valuable if it doesn't require
> brining all the out-of-tree baggage.

This baggage is an essential part of eth SRIOV; it is a missing
feature in both switchdev mode (bridge offloads) and legacy.

Guys, I need to know my options here and make some effort assessment.

1) implement bridge offloads: months of development, years for
deployment and migration
2) Close this gap in legacy mode: days.

I am all IN for bridge offloads, but you have to understand why I pick
2: not because it is cheaper, but because it is more realistic for my
current users. Saying no to this just because switchdev mode is the de
facto standard isn't fair; there should be an active, clear
transition plan, with something available to work with ... not just
ideas.

Your claims are valid only when we are truly ready for migration. We
are simply not, and no one has a clear plan on the horizon, so I don't
get this total-freeze attitude towards legacy mode. It should be just
like ethtool: we want to replace it but we know we are not there yet,
so we carefully add only necessary things with lots of auditing; the
same should go here.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next v2 0/3] VGT+ support
  2019-11-05 20:10               ` Saeed Mahameed
@ 2019-11-05 21:55                 ` Jakub Kicinski
  2019-11-05 22:52                   ` Saeed Mahameed
  0 siblings, 1 reply; 24+ messages in thread
From: Jakub Kicinski @ 2019-11-05 21:55 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: dsahern, stephen, sbrivio, nikolay, Jiri Pirko, netdev, sd,
	Ariel Levkovich

On Tue, 5 Nov 2019 20:10:02 +0000, Saeed Mahameed wrote:
> > > > Now if the only remaining problem is the uAPI, we can minimize
> > > > kernel impact or even make no kernel changes at all, only ip
> > > > route2 and drivers, by reusing the current set_vf_vlan_ndo.    
> > > 
> > > And this caught my eye as well -- iproute2 does not need the
> > > baggage either.
> > > 
> > > Is there any reason this continued support for legacy sriov can
> > > not be done out of tree?  
> > 
> > Exactly. Moving to upstream is only valuable if it doesn't require
> > brining all the out-of-tree baggage.  
> 
> this baggage is a very essential part for eth sriov, it is a missing
> feature in both switchdev mode (bridge offloads) and legacy.

AFAIK from a uAPI perspective nothing is missing in switchdev mode.

> Guys, I need to know my options here and make some effort assessment.
> 
> 1) implement bridge offloads: months of development, years for
> deployment and migration
> 2) Close this gap in legacy mode: days.
> 
> I am all IN for bridge offloads, but you have to understand why i pick
> 2, not because it is cheaper, but because it is more realistic for my
> current users. Saying no to this just because switchdev mode is the de
> facto standard isn't fair and there should be an active clear
> transition plan, with something available to work with ... not just
> ideas.

I understand your perspective. It is cheaper for you.

> Your claims are valid only when we are truly ready for migration. we
> are simply not and no one has a clear plan in the horizon, so i don't
> get this total freeze attitude of legacy mode, 

There will never be any L2 plan unless we say no to legacy extensions.

> it should be just like ethtool we want to replace it but we know
> we are not there yet, so we carefully add only necessary things with
> lots of auditing, same should go here.

Worked out amazingly for ethtool, right?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next v2 0/3] VGT+ support
  2019-11-05 21:55                 ` Jakub Kicinski
@ 2019-11-05 22:52                   ` Saeed Mahameed
  2019-11-05 23:10                     ` Jakub Kicinski
  0 siblings, 1 reply; 24+ messages in thread
From: Saeed Mahameed @ 2019-11-05 22:52 UTC (permalink / raw)
  To: jakub.kicinski
  Cc: sbrivio, nikolay, dsahern, sd, Jiri Pirko, netdev, stephen,
	Ariel Levkovich

On Tue, 2019-11-05 at 13:55 -0800, Jakub Kicinski wrote:
> On Tue, 5 Nov 2019 20:10:02 +0000, Saeed Mahameed wrote:
> > > > > Now if the only remaining problem is the uAPI, we can
> > > > > minimize
> > > > > kernel impact or even make no kernel changes at all, only ip
> > > > > route2 and drivers, by reusing the current
> > > > > set_vf_vlan_ndo.    
> > > > 
> > > > And this caught my eye as well -- iproute2 does not need the
> > > > baggage either.
> > > > 
> > > > Is there any reason this continued support for legacy sriov can
> > > > not be done out of tree?  
> > > 
> > > Exactly. Moving to upstream is only valuable if it doesn't
> > > require
> > > bringing all the out-of-tree baggage.  
> > 
> > this baggage is a very essential part for eth sriov, it is a
> > missing
> > feature in both switchdev mode (bridge offloads) and legacy.
> 
> AFAIK from uAPI perspective nothing is missing in switchdev mode.
> 

Bridge offloads is not on the road map, same as the netlink ethtool
adapter, which has been WIP since 2017.

> > Guys, I need to know my options here and make some effort
> > assessment.
> > 
> > 1) implement bridge offloads: months of development, years for
> > deployment and migration
> > 2) Close this gap in legacy mode: days.
> > 
> > I am all IN for bridge offloads, but you have to understand why i
> > pick
> > 2, not because it is cheaper, but because it is more realistic for
> > my
> > current users. Saying no to this just because switchdev mode is the
> > de
> > facto standard isn't fair and there should be an active clear
> > transition plan, with something available to work with ... not just
> > ideas.
> 
> I understand your perspective. It is cheaper for you.
> 

Again, not because it is cheaper. Considering the circumstances and how
basic this feature is, this is the only right way to go for this
particular case.

> > Your claims are valid only when we are truly ready for migration.
> > we
> > are simply not and no one has a clear plan in the horizon, so i
> > don't
> > get this total freeze attitude of legacy mode, 
> 
> There will never be any L2 plan unless we say no to legacy
> extensions.
> 

And picking on basic/simple extensions will get you there? I doubt it.

Being strict is one thing, and I totally agree with the motivation, but
I strongly disagree with this total shutdown policy with no
alternatives and no real grounds other than the hope that someone would
eventually listen and implement the migration path, while ALL users
MUST suffer regardless of how trivial or basic their request is.

> > it should be just like ethtool we want to replace it but we know
> > we are not there yet, so we carefully add only necessary things
> > with
> > lots of auditing, same should go here.
> 
> Worked out amazingly for ethtool, right?

Obviously, no one would have agreed to such a total shutdown for
ethtool; we eventually decided not to block ethtool until we have the
netlink adapter working .. legacy mode should get the same treatment.

Bottom line: for the same reason we decided that ethtool is not totally
dead until the ethtool netlink interface is complete, we should still
support selective and basic SR-IOV legacy mode extensions until bridge
offloads is complete.






^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next v2 0/3] VGT+ support
  2019-11-05 22:52                   ` Saeed Mahameed
@ 2019-11-05 23:10                     ` Jakub Kicinski
  2019-11-05 23:48                       ` Saeed Mahameed
  0 siblings, 1 reply; 24+ messages in thread
From: Jakub Kicinski @ 2019-11-05 23:10 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: sbrivio, nikolay, dsahern, sd, Jiri Pirko, netdev, stephen,
	Ariel Levkovich

On Tue, 5 Nov 2019 22:52:18 +0000, Saeed Mahameed wrote:
> > > it should be just like ethtool we want to replace it but we know
> > > we are not there yet, so we carefully add only necessary things
> > > with
> > > lots of auditing, same should go here.  
> > 
> > Worked out amazingly for ethtool, right?  
> 
> Obviously, no one would have agreed to such total shutdown for ethtool,
> we eventually decided not to block ethtool unless we have the netlink
> adapter working .. legacy mode should get the same treatment.
> 
> Bottom line for the same reason we decided that ethtool is not totally
> dead until ethtool netlink interface is complete, we should still
> support selective and basic sriov legacy mode extensions until bridge
> offloads is complete. 

But switchdev _is_ _here_. _Today_. From a uAPI perspective it's done
and ready. We're missing the driver and user space parts, but no core
or uAPI extensions. It's just L2 switching, and there are quite a few
switch drivers upstream, as I'm sure you know :/

ethtool is still not ready from a uAPI perspective.

It'd be more accurate to compare to legacy IOCTLs, like arguing that 
we need a small tweak to IOCTLs when Netlink is already there..

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next v2 0/3] VGT+ support
  2019-11-05 23:10                     ` Jakub Kicinski
@ 2019-11-05 23:48                       ` Saeed Mahameed
  2019-11-06  1:38                         ` Jakub Kicinski
  0 siblings, 1 reply; 24+ messages in thread
From: Saeed Mahameed @ 2019-11-05 23:48 UTC (permalink / raw)
  To: jakub.kicinski
  Cc: sbrivio, nikolay, dsahern, sd, Jiri Pirko, netdev, stephen,
	Ariel Levkovich

On Tue, 2019-11-05 at 15:10 -0800, Jakub Kicinski wrote:
> On Tue, 5 Nov 2019 22:52:18 +0000, Saeed Mahameed wrote:
> > > > it should be just like ethtool we want to replace it but we
> > > > know
> > > > we are not there yet, so we carefully add only necessary things
> > > > with
> > > > lots of auditing, same should go here.  
> > > 
> > > Worked out amazingly for ethtool, right?  
> > 
> > Obviously, no one would have agreed to such total shutdown for
> > ethtool,
> > we eventually decided not to block ethtool unless we have the
> > netlink
> > adapter working .. legacy mode should get the same treatment.
> > 
> > Bottom line for the same reason we decided that ethtool is not
> > totally
> > dead until ethtool netlink interface is complete, we should still
> > support selective and basic sriov legacy mode extensions until
> > bridge
> > offloads is complete. 
> 
> But switchdev _is_ _here_. _Today_. From uAPI perspective it's done,
> and ready. We're missing the driver and user space parts, but no core
> and uAPI extensions. It's just L2 switching and there's quite a few
> switch drivers upstream, as I'm sure you know :/ 
> 

I can say the same about netlink: it was also there; the missing parts
were the netlink-ethtool connection and the userspace parts .. 

Just because the switchdev uAPI is powerful enough to do anything
doesn't mean we are ready. You said it yourself: user space and drivers
are not ready, and frankly it is not on the road map, and we all know
that it could take years before we can sit back and relax knowing we
got our L2 switching .. Just like what is happening now with ethtool,
it has been years, you know.. 

> ethtool is not ready from uAPI perspective, still.
> 

ethtool is by definition a uAPI-only tool; you can't compare the
technical details and challenges of both issues. Each one has different
challenges; we need to compare complexity and the impact of the
deprecation policy on its consumers.

> It'd be more accurate to compare to legacy IOCTLs, like arguing that 
> we need a small tweak to IOCTLs when Netlink is already there..

Well, no. The right analogy here would be:

netlink == switchdev mode
ethtool == legacy mode
netlink-ethtool-interface == L2 bridge offloads 

you do the math.

It is not about uAPI. The uAPI could be ready for switchdev mode,
maybe; it's just that no one has even given this a real, serious
thought, so I can't accept that bridge offloads is just around the
corner just because the switchdev uAPI feels ready to do anything. It
is seriously not.




^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next v2 0/3] VGT+ support
  2019-11-05 23:48                       ` Saeed Mahameed
@ 2019-11-06  1:38                         ` Jakub Kicinski
  2019-11-06 22:21                           ` Saeed Mahameed
  2019-11-13 22:55                           ` Keller, Jacob E
  0 siblings, 2 replies; 24+ messages in thread
From: Jakub Kicinski @ 2019-11-06  1:38 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: sbrivio, nikolay, dsahern, sd, Jiri Pirko, netdev, stephen,
	Ariel Levkovich

On Tue, 5 Nov 2019 23:48:15 +0000, Saeed Mahameed wrote:
> On Tue, 2019-11-05 at 15:10 -0800, Jakub Kicinski wrote:
> > But switchdev _is_ _here_. _Today_. From uAPI perspective it's done,
> > and ready. We're missing the driver and user space parts, but no core
> > and uAPI extensions. It's just L2 switching and there's quite a few
> > switch drivers upstream, as I'm sure you know :/ 
> 
> I can say the same about netlink, it also was there, the missing part
> was the netlink ethtool connection and userspace parts .. 

uAPI is the part that matters. No driver implements all the APIs. 
I'm telling you that the API for what you're trying to configure
already exists, and your driver should use it. Driver's technical 
debt is not my concern.

> Just because switchdev uAPI is powerful enough to do anything it
> doesn't mean we are ready, you said it, user space and drivers are not
> ready, and frankly it is not on the road map, 

I bet it's not on the road map. Product marketing sees only legacy
SR-IOV (table stakes) and OvS offload == switchdev (value add). 
L2 switchdev will never be implemented with that mindset.

In the upstream community, however, we care about the technical aspects.

> and we all know that it could take years before we can sit back and
> relax that we got our L2 switching .. 

Let's not be dramatic. It shouldn't take years to implement basic L2
switching offload.

> Just like what is happening now with ethtool, it has been years you know..

Exactly my point!!! Nobody is going to lift a finger unless there is a
loud and resounding "no".

> > ethtool is not ready from uAPI perspective, still.
> 
> ethool is by definition a uAPI only tool, you can't compare the
> technical details and challenges of both issue, each one has different
> challenges, we need to compare complexity and deprecation policy impact
> on consumers.

Well, you started comparing the two.

API deprecation is always painful. The further you go with the legacy
API the more painful it will be to deprecate.

> > It'd be more accurate to compare to legacy IOCTLs, like arguing that 
> > we need a small tweak to IOCTLs when Netlink is already there..  
> 
> Well, no, The right analog here would be:
> 
> netlink == switchdev mode
> ethtool == legacy mode
> netlink-ethtool-interface == L2 bridge offloads 
> 
> you do the math.

My reading of the situation is that your maintenance cost goes down if
you manage to push some out of tree uAPI upstream, while the upstream
has no need for that uAPI. In fact it'd be far better for the project
to encourage people to work on SR-IOV offloads via normal kernel
constructs and not either some special case legacy VF ops or franken tc
OvS.

> It is not about uAPI, uAPI could be ready for switchdev mode, maybe, it
> just no one even gave this a real serious thought, so i can't accept
> that bridge offloads is just around the corner just because switchdev
> uAPI feels like ready to do anything, it is seriously not.

I had given switchdev L2 some thought. IDK what you'd call serious, 
I don't have code. We are doing some ridiculously complex OvS offloads
while most customers just need L2 plus maybe VXLAN encap and basic
ACLs. Which switchdev can do very nicely thanks to Cumulus folks.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next v2 0/3] VGT+ support
  2019-11-06  1:38                         ` Jakub Kicinski
@ 2019-11-06 22:21                           ` Saeed Mahameed
  2019-11-07 10:24                             ` Jiri Pirko
  2019-11-13 22:55                           ` Keller, Jacob E
  1 sibling, 1 reply; 24+ messages in thread
From: Saeed Mahameed @ 2019-11-06 22:21 UTC (permalink / raw)
  To: jakub.kicinski
  Cc: sbrivio, nikolay, dsahern, sd, Jiri Pirko, netdev, stephen,
	Ariel Levkovich

On Tue, 2019-11-05 at 17:38 -0800, Jakub Kicinski wrote:
> On Tue, 5 Nov 2019 23:48:15 +0000, Saeed Mahameed wrote:
> > On Tue, 2019-11-05 at 15:10 -0800, Jakub Kicinski wrote:
> > > But switchdev _is_ _here_. _Today_. From uAPI perspective it's
> > > done,
> > > and ready. We're missing the driver and user space parts, but no
> > > core
> > > and uAPI extensions. It's just L2 switching and there's quite a
> > > few
> > > switch drivers upstream, as I'm sure you know :/ 
> > 
> > I can say the same about netlink, it also was there, the missing
> > part
> > was the netlink ethtool connection and userspace parts .. 
> 
> uAPI is the part that matters. No driver implements all the APIs. 
> I'm telling you that the API for what you're trying to configure
> already exists, and your driver should use it. Driver's technical 
> debt is not my concern.
> 
> > Just because switchdev uAPI is powerful enough to do anything it
> > doesn't mean we are ready, you said it, user space and drivers are
> > not
> > ready, and frankly it is not on the road map, 
> 
> I bet it's not on the road map. Product marketing sees only legacy
> SR-IOV (table stakes) and OvS offload == switchdev (value add). 
> L2 switchdev will never be implemented with that mind set.
> 
> In the upstream community, however, we care about the technical
> aspects.
> 
> > and we all know that it could take years before we can sit back and
> > relax that we got our L2 switching .. 
> 
> Let's not be dramatic. It shouldn't take years to implement basic L2
> switching offload.
> 
> > Just like what is happening now with ethtool, it has been years
> > you know..
> 
> Exactly my point!!! Nobody is going to lift a finger unless there is
> a
> loud and resounding "no".
> 

Ok then, "no" new uAPI, although I still think some special cases
should be allowed, but ... your call.

In the meanwhile I will figure out something to be driver only as an
intermediate solution until we have full L2 offload; then I can ask
everyone to move to full switchdev mode with the press of a button.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next v2 0/3] VGT+ support
  2019-11-06 22:21                           ` Saeed Mahameed
@ 2019-11-07 10:24                             ` Jiri Pirko
  0 siblings, 0 replies; 24+ messages in thread
From: Jiri Pirko @ 2019-11-07 10:24 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: jakub.kicinski, sbrivio, nikolay, dsahern, sd, Jiri Pirko,
	netdev, stephen, Ariel Levkovich

Wed, Nov 06, 2019 at 11:21:37PM CET, saeedm@mellanox.com wrote:
>On Tue, 2019-11-05 at 17:38 -0800, Jakub Kicinski wrote:
>> On Tue, 5 Nov 2019 23:48:15 +0000, Saeed Mahameed wrote:
>> > On Tue, 2019-11-05 at 15:10 -0800, Jakub Kicinski wrote:
>> > > But switchdev _is_ _here_. _Today_. From uAPI perspective it's
>> > > done,
>> > > and ready. We're missing the driver and user space parts, but no
>> > > core
>> > > and uAPI extensions. It's just L2 switching and there's quite a
>> > > few
>> > > switch drivers upstream, as I'm sure you know :/ 
>> > 
>> > I can say the same about netlink, it also was there, the missing
>> > part
>> > was the netlink ethtool connection and userspace parts .. 
>> 
>> uAPI is the part that matters. No driver implements all the APIs. 
>> I'm telling you that the API for what you're trying to configure
>> already exists, and your driver should use it. Driver's technical 
>> debt is not my concern.
>> 
>> > Just because switchdev uAPI is powerful enough to do anything it
>> > doesn't mean we are ready, you said it, user space and drivers are
>> > not
>> > ready, and frankly it is not on the road map, 
>> 
>> I bet it's not on the road map. Product marketing sees only legacy
>> SR-IOV (table stakes) and OvS offload == switchdev (value add). 
>> L2 switchdev will never be implemented with that mind set.
>> 
>> In the upstream community, however, we care about the technical
>> aspects.
>> 
>> > and we all know that it could take years before we can sit back and
>> > relax that we got our L2 switching .. 
>> 
>> Let's not be dramatic. It shouldn't take years to implement basic L2
>> switching offload.
>> 
> > > Just like what is happening now with ethtool, it has been years
> > > you know..
>> 
>> Exactly my point!!! Nobody is going to lift a finger unless there is
>> a
>> loud and resounding "no".
>> 
>
>Ok then, "no" new uAPI, although i still think there should be some
>special cases to be allowed, but ... your call.
>
>In the meanwhile i will figure out something to be driver only as

"something to be driver only". I'm curious what you mean by that...


>intermediate solution until we have full l2 offload, then i can ask
>every one to move to full switchdev mode with a press of a button.
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next v2 0/3] VGT+ support
  2019-11-06  1:38                         ` Jakub Kicinski
  2019-11-06 22:21                           ` Saeed Mahameed
@ 2019-11-13 22:55                           ` Keller, Jacob E
  2019-11-14  2:25                             ` Jakub Kicinski
  1 sibling, 1 reply; 24+ messages in thread
From: Keller, Jacob E @ 2019-11-13 22:55 UTC (permalink / raw)
  To: saeedm, jakub.kicinski
  Cc: sbrivio, nikolay, dsahern, sd, jiri, netdev, stephen, lariel

Hi Jakub,

On Tue, 2019-11-05 at 17:38 -0800, Jakub Kicinski wrote:
> In the upstream community, however, we care about the technical
> aspects.
> 
> > and we all know that it could take years before we can sit back and
> > relax that we got our L2 switching .. 
> 
> Let's not be dramatic. It shouldn't take years to implement basic L2
> switching offload.

I had meant to send something earlier in this thread, but never got
around to it. I wanted to ask your opinion and get some feedback.

We (Intel) have recently been investigating use of port representors
for enabling introspection and control of VFs in the host system after
they've been assigned to a virtual machine.

I had originally been thinking of adding these port representor netdevs
before we fully implement switchdev with the e-switch offloads. The
idea was to migrate to using port representors in either case.

However, from what it looks like on this thread, you'd rather propose
that we implement switchdev with basic L2 offload?

I'm not too familiar with switchdev (I'm trying to read and learn about
it so that we can begin actually implementing it in our network
drivers).

Based on your comments and feedback in this thread, it sounds like our
original idea to have a "legacy with port representors" mode is not
really something you'd like, because it would continue to encourage
avoiding migration from the legacy stack to the switchdev model.

But, instead of trying to go fully towards implementing switchdev with
complicated OvS offloads, we could take a simpler approach that only
supports L2 offloads initially, and from these comments it seems this
is the direction you'd rather upstream pursue?

> I had given switchdev L2 some thought. IDK what you'd call serious, 
> I don't have code. We are doing some ridiculously complex OvS
> offloads
> while most customers just need L2 plus maybe VXLAN encap and basic
> ACLs. Which switchdev can do very nicely thanks to Cumulus folks.

Based on this, it sounds like the switchdev API can do this L2
offloading and drivers simply need to enable it. If I understand
correctly, it requires the system administrator to place the VF devices
into a bridge, rather than simply having the bridging hidden inside the
device.
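
That administrator flow can be sketched with iproute2; the VF
representor names (pf0vf0, pf0vf1) and uplink name (eth0) below are
hypothetical placeholders, not names from this thread:

```shell
# Create a bridge and enslave the uplink plus the VF representors;
# the driver can then offload the resulting L2 forwarding to the NIC.
ip link add name br0 type bridge
ip link set dev eth0 master br0
ip link set dev pf0vf0 master br0
ip link set dev pf0vf1 master br0
ip link set dev br0 up
```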

Thanks,
Jake

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next v2 0/3] VGT+ support
  2019-11-13 22:55                           ` Keller, Jacob E
@ 2019-11-14  2:25                             ` Jakub Kicinski
  0 siblings, 0 replies; 24+ messages in thread
From: Jakub Kicinski @ 2019-11-14  2:25 UTC (permalink / raw)
  To: Keller, Jacob E
  Cc: saeedm, sbrivio, nikolay, dsahern, sd, jiri, netdev, stephen, lariel

On Wed, 13 Nov 2019 22:55:16 +0000, Keller, Jacob E wrote:
> Hi Jakub,
> 
> On Tue, 2019-11-05 at 17:38 -0800, Jakub Kicinski wrote:
> > In the upstream community, however, we care about the technical
> > aspects.
> >   
> > > and we all know that it could take years before we can sit back and
> > > relax that we got our L2 switching ..   
> > 
> > Let's not be dramatic. It shouldn't take years to implement basic L2
> > switching offload.  
> 
> I had meant to send something earlier in this thread, but never got
> around to it. I wanted to ask your opinion and get some feedback.
> 
> We (Intel) have recently been investigating use of port representors
> for enabling introspection and control of VFs in the host system after
> they've been assigned to a virtual machine.

Cool!

> I had originally been thinking of adding these port representor netdevs
> before we fully implement switchdev with the e-switch offloads. The
> idea was to migrate to using port representors in either case.
> 
> However, from what it looks like on this thread, you'd rather propose
> that we implement switchdev with basic L2 offload?
> 
> I'm not too familiar with switchdev, (trying to read and learn about so
> that we can begin actually implementing it in our network drivers).

So switchdev mode for SR-IOV NICs basically means that all ports are
represented by a netdev and no implicit switching happens in HW: if a
packet is received on a port, be it VF or uplink, it's sent up to the
representor. That's pretty much it. Then you can install rules to
forward internally in the device.
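
As a concrete sketch of that mode switch and rule installation (the PCI
address, representor name eth0_0, and uplink name eth0 are hypothetical
placeholders for whatever your system exposes):

```shell
# Switch the NIC's e-switch from legacy SR-IOV to switchdev mode.
devlink dev eswitch set pci/0000:03:00.0 mode switchdev

# Each VF now has a representor netdev (here assumed to be eth0_0 for
# VF 0). Nothing is forwarded in HW until rules are installed; for
# example, redirect all VF 0 traffic to the uplink with tc:
tc qdisc add dev eth0_0 ingress
tc filter add dev eth0_0 ingress matchall \
    action mirred egress redirect dev eth0
```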

Perhaps an obvious suggestion, but did you consider converting
Documentation/networking/switchdev.txt to ReST and updating it as you
explore the code? ;)

> Based on your comments and feedback in this thread, it sounds like our
> original idea to have a "legacy with port representors" mode is not
> really something you'd like, because it would continue to encourage
> avoiding migrating from legacy stack to switchdev model.

Not at this point, no. 

I think this was all discussed before, with Alex still strongly in the
netdev loops at Intel. I was initially accommodating to some partial
implementations like what you mention, since it would have been good to
have Intel come on board with switchdev, and Fortville reportedly
couldn't disable leaking the packets which missed filters to the uplink.

IIRC at that point Or Gerlitz strongly drew the line at preserving
switchdev behaviour as described above - default to everything falls
back to host.

Today since many months have passed I don't think we should walk back
on that decision.

> But, instead of trying to go fully towards implementing switchdev with
> complicated OvS offloads, we could do a simpler approach that only
> supports L2 offloads initially, and from these comments it seems this
> is the direction you'd rather upstream persue?

Yes, I think simple L2 offload that supports common cases would be a
pretty cool starting point for a new switchdev implementation.

> > I had given switchdev L2 some thought. IDK what you'd call serious, 
> > I don't have code. We are doing some ridiculously complex OvS
> > offloads
> > while most customers just need L2 plus maybe VXLAN encap and basic
> > ACLs. Which switchdev can do very nicely thanks to Cumulus folks.  
> 
> Based on this, it sounds like the switchdev API can do this L2
> offloading and drivers simply need to enable it. If I understand
> correctly, it  requires the system administrator to place the VF devies
> into a bridge, rather than simply having the bridging hidden inside the
> device.

Yup. You may want to support only offloading certain configurations of
the bridge to simplify your life, e.g. disable learning and flood only
to the uplink..
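
Those per-port restrictions can be expressed with the bridge(8) tool;
the representor and uplink names below are hypothetical:

```shell
# On each VF representor bridge port: no address learning and no
# flooding of unknown traffic, so the offloaded FDB stays small.
bridge link set dev pf0vf0 learning off flood off
bridge link set dev pf0vf1 learning off flood off
# Keep flooding enabled on the uplink port so unknown-destination
# traffic still leaves through the wire.
bridge link set dev eth0 flood on
```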

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2019-11-14  2:25 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-31 19:47 [PATCH net-next v2 0/3] VGT+ support Ariel Levkovich
2019-10-31 19:47 ` [PATCH net-next v2 1/3] net: Support querying specific VF properties Ariel Levkovich
2019-10-31 19:47 ` [PATCH net-next v2 2/3] net: Add SRIOV VGT+ support Ariel Levkovich
2019-10-31 19:47 ` [PATCH net-next v2 3/3] net/mlx5: " Ariel Levkovich
2019-10-31 20:31 ` [PATCH net-next v2 0/3] " David Miller
2019-10-31 22:20   ` Ariel Levkovich
2019-10-31 22:58     ` David Miller
2019-11-01 14:55       ` Ariel Levkovich
2019-11-01  0:23 ` Jakub Kicinski
     [not found]   ` <8d7db56c-376a-d809-4a65-bfc2baf3254f@mellanox.com>
2019-11-01 21:28     ` Saeed Mahameed
2019-11-02  0:21       ` Jakub Kicinski
2019-11-05  1:38         ` Saeed Mahameed
2019-11-05  1:47           ` David Ahern
2019-11-05  2:35             ` Jakub Kicinski
2019-11-05 20:10               ` Saeed Mahameed
2019-11-05 21:55                 ` Jakub Kicinski
2019-11-05 22:52                   ` Saeed Mahameed
2019-11-05 23:10                     ` Jakub Kicinski
2019-11-05 23:48                       ` Saeed Mahameed
2019-11-06  1:38                         ` Jakub Kicinski
2019-11-06 22:21                           ` Saeed Mahameed
2019-11-07 10:24                             ` Jiri Pirko
2019-11-13 22:55                           ` Keller, Jacob E
2019-11-14  2:25                             ` Jakub Kicinski
