BPF Archive on lore.kernel.org
From: Jason Wang <jasowang@redhat.com>
To: Yuri Benditovich <yuri.benditovich@daynix.com>
Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>,
	"David S. Miller" <davem@davemloft.net>,
	Jakub Kicinski <kuba@kernel.org>,
	"Michael S . Tsirkin" <mst@redhat.com>,
	Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Andrii Nakryiko <andrii@kernel.org>,
	Martin KaFai Lau <kafai@fb.com>, Song Liu <songliubraving@fb.com>,
	Yonghong Song <yhs@fb.com>,
	John Fastabend <john.fastabend@gmail.com>,
	KP Singh <kpsingh@kernel.org>,
	Randy Dunlap <rdunlap@infradead.org>,
	"Gustavo A . R . Silva" <gustavoars@kernel.org>,
	Herbert Xu <herbert@gondor.apana.org.au>,
	Steffen Klassert <steffen.klassert@secunet.com>,
	Pablo Neira Ayuso <pablo@netfilter.org>,
	decui@microsoft.com, cai@lca.pw,
	Jakub Sitnicki <jakub@cloudflare.com>,
	Marco Elver <elver@google.com>, Paolo Abeni <pabeni@redhat.com>,
	Network Development <netdev@vger.kernel.org>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	kvm@vger.kernel.org, virtualization@lists.linux-foundation.org,
	bpf <bpf@vger.kernel.org>, Yan Vugenfirer <yan@daynix.com>
Subject: Re: [RFC PATCH 0/7] Support for virtio-net hash reporting
Date: Mon, 18 Jan 2021 10:46:34 +0800
Message-ID: <65fe1a40-abc0-77ed-56df-3f0a70615016@redhat.com>
In-Reply-To: <CAOEp5OfyHz2rXHmOeojNNE2wvrHMn_z1egr5aGQborEq829TLw@mail.gmail.com>


On 2021/1/17 3:57 PM, Yuri Benditovich wrote:
> On Thu, Jan 14, 2021 at 5:39 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/1/13 10:33 PM, Willem de Bruijn wrote:
>>> On Tue, Jan 12, 2021 at 11:11 PM Jason Wang <jasowang@redhat.com> wrote:
>>>> On 2021/1/13 7:47 AM, Willem de Bruijn wrote:
>>>>> On Tue, Jan 12, 2021 at 3:29 PM Yuri Benditovich
>>>>> <yuri.benditovich@daynix.com> wrote:
>>>>>> On Tue, Jan 12, 2021 at 9:49 PM Yuri Benditovich
>>>>>> <yuri.benditovich@daynix.com> wrote:
>>>>>>> On Tue, Jan 12, 2021 at 9:41 PM Yuri Benditovich
>>>>>>> <yuri.benditovich@daynix.com> wrote:
>>>>>>>> The existing TUN module is able to use a provided "steering eBPF"
>>>>>>>> program to calculate a per-packet hash and derive the destination
>>>>>>>> queue to place the packet in. The eBPF program uses mapped
>>>>>>>> configuration data containing a key for hash calculation and an
>>>>>>>> indirection table with an array of queue indices.
>>>>>>>>
>>>>>>>> This series of patches adds support for virtio-net hash reporting
>>>>>>>> feature as defined in virtio specification. It extends the TUN module
>>>>>>>> and the "steering eBPF" as follows:
>>>>>>>>
>>>>>>>> The extended steering eBPF program calculates the hash value and hash
>>>>>>>> type, keeps the hash value in skb->hash and returns the index of the
>>>>>>>> destination virtqueue and the type of the hash. The TUN module keeps
>>>>>>>> the returned hash type in a (currently unused) field of the skb.
>>>>>>>> skb->__unused is renamed to 'hash_report_type'.
>>>>>>>>
>>>>>>>> When the TUN module is later called to allocate and fill the
>>>>>>>> virtio-net header and push it to the destination virtqueue, it
>>>>>>>> populates the hash and the hash type into the virtio-net header.
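The extended header referred to here is virtio's hash-report layout: the classic v1 header followed by the hash value and its report type. A userspace mirror of that layout (field names follow the virtio spec; this is a sketch, not the kernel's `linux/virtio_net.h`, and endianness handling is omitted):

```c
#include <stdint.h>

/* Classic virtio-net v1 header (12 bytes). */
struct vnet_hdr_v1 {
    uint8_t  flags;
    uint8_t  gso_type;
    uint16_t hdr_len;
    uint16_t gso_size;
    uint16_t csum_start;
    uint16_t csum_offset;
    uint16_t num_buffers;
};

/* Hash-report extension negotiated via the hash-report feature bit:
 * the device appends the computed hash value and its type. */
struct vnet_hdr_v1_hash {
    struct vnet_hdr_v1 hdr;
    uint32_t hash_value;   /* e.g. the Toeplitz hash */
    uint16_t hash_report;  /* e.g. the TCPv4 report type */
    uint16_t padding;
};
```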
>>>>>>>>
>>>>>>>> The VHOST driver is made aware of the respective virtio-net feature
>>>>>>>> that extends the virtio-net header to report the hash value and hash
>>>>>>>> report type.
>>>>>>> Comment from Willem de Bruijn:
>>>>>>>
>>>>>>> Skbuff fields are in short supply. I don't think we need to add one
>>>>>>> just for this narrow path entirely internal to the tun device.
>>>>>>>
>>>>>> We understand that and try to minimize the impact by using an
>>>>>> already existing unused field of the skb.
>>>>> Not anymore. It was repurposed as a flags field very recently.
>>>>>
>>>>> This use case is also very narrow in scope. And a very short path from
>>>>> data producer to consumer. So I don't think it needs to claim scarce
>>>>> bits in the skb.
>>>>>
>>>>> tun_ebpf_select_queue stores the field, tun_put_user reads it and
>>>>> converts it to the virtio_net_hdr in the descriptor.
>>>>>
>>>>> tun_ebpf_select_queue is called from .ndo_select_queue.  Storing the
>>>>> field in skb->cb is fragile, as in theory some code could overwrite
>>>>> that field between ndo_select_queue and
>>>>> ndo_start_xmit/tun_net_xmit, from which point it is fully under tun
>>>>> control again. But in practice, I don't believe anything does.
>>>>>
>>>>> Alternatively an existing skb field that is used only on disjoint
>>>>> datapaths, such as ingress-only, could be viable.
>>>> A question here. We had metadata support in XDP for cooperation between
>>>> eBPF programs. Do we have something similar in the skb?
>>>>
>>>> E.g in the RSS, if we want to pass some metadata information between
>>>> eBPF program and the logic that generates the vnet header (either hard
>>>> logic in the kernel or another eBPF program). Is there any way that can
>>>> avoid the possible conflicts of qdiscs?
>>> Not that I am aware of. The closest thing is cb[].
>>>
>>> It'll have to alias a field like that, one that is known to be unused for the given path.
>>
>> Right, we need to make sure cb is not used by others. I'm not sure
>> how hard that is to achieve, considering Qemu installs the eBPF program
>> but doesn't deal with networking configurations.
>>
>>
>>> One other approach that has been used within linear call stacks is out
>>> of band. Like percpu variables softnet_data.xmit.more and
>>> mirred_rec_level. But that is perhaps a bit overwrought for this use
>>> case.
>>
>> Yes, and if we go that way then eBPF turns out to be a burden since we
>> need to invent helpers to access those auxiliary data structures. It
>> would then be better to hard-code the RSS in the kernel.
>>
>>
>>>>>>> Instead, you could just run the flow_dissector in tun_put_user if the
>>>>>>> feature is negotiated. Indeed, the flow dissector seems more apt to me
>>>>>>> than BPF here. Note that the flow dissector internally can be
>>>>>>> overridden by a BPF program if the admin so chooses.
>>>>>>>
>>>>>> While this set of patches is related to hash delivery in the
>>>>>> virtio-net packet in general, it was prepared in the context of the
>>>>>> RSS feature implementation as defined in the virtio spec [1].
>>>>>> In the case of RSS it is not enough to run the flow_dissector in
>>>>>> tun_put_user: in tun_ebpf_select_queue the TUN module calls the eBPF
>>>>>> program to calculate the hash, hash type and queue index according
>>>>>> to the (mapped) parameters (key, hash types, indirection table)
>>>>>> received from the guest.
>>>>> TUNSETSTEERINGEBPF was added to support more diverse queue selection
>>>>> than the default in case of multiqueue tun. Not sure what the exact
>>>>> use cases are.
>>>>>
>>>>> But RSS is exactly the purpose of the flow dissector. It is used for
>>>>> that purpose in the software variant RPS. The flow dissector
>>>>> implements a superset of the RSS spec, and certainly computes a
>>>>> four-tuple for TCP/IPv6. In the case of RPS, it is skipped if the NIC
>>>>> has already computed a 4-tuple hash.
>>>>>
>>>>> What it does not give is a type indication, such as
>>>>> VIRTIO_NET_HASH_TYPE_TCPv6. I don't understand how this would be used.
>>>>> In datapaths where the NIC has already computed the four-tuple hash
>>>>> and stored it in skb->hash --the common case for servers--, that type
>>>>> field is the only reason to have to compute it again.
>>>> The problem is there's no guarantee that the packet comes from the NIC,
>>>> it could be a simple VM2VM or host2VM packet.
>>>>
>>>> And even if the packet is coming from a NIC that calculates the hash,
>>>> there's no guarantee that it's the hash that the guest wants (the
>>>> guest may use different RSS keys).
>>> Ah yes, of course.
>>>
>>> I would still revisit the need to store a detailed hash_type along with
>>> the hash, as far as I can tell that conveys no actionable information
>>> to the guest.
>>
>> Yes, we need to figure out its usage. According to [1], it only mentions
>> that storing the hash type is the responsibility of the driver. Maybe
>> Yuri can answer this.
>>
> For the case of a Windows VM we can't know how exactly the network stack
> uses the provided hash data (including the hash type). But: different
> releases of Windows enable different hash types (for example the UDP
> hash is enabled only on Server 2016 and up).
>
> Indeed, Windows requires a little more from the network adapter/driver
> than Linux does.
>
> The addition of RSS support to the virtio specification takes into
> account the widest set of requirements (i.e. the Windows one); our
> initial impression is that this should be enough also for Linux.
>
> The RSS part of the NDIS specification is _mandatory_ and there are
> certification tests that check that the driver provides the hash data
> as expected. All the high-performance network adapters have such RSS
> functionality in hardware.
> With pre-RSS QEMU (i.e. where the virtio-net device does not indicate
> RSS support) the virtio-net driver for Windows does all the job related
> to RSS:
> - hash calculation
> - hash/hash_type delivery
> - reporting each packet on the correct CPU according to the RSS settings
>
> With RSS support in QEMU all the packets always arrive on the proper CPU
> and the driver never needs to reschedule them. The driver still needs to
> calculate the hash and report it to Windows. In this case we do the same
> job twice: the device (QEMU or eBPF) calculates the hash and gets the
> proper queue/CPU to deliver the packet. But the hash is not delivered by
> the device, so the driver needs to recalculate it and report it to
> Windows.
>
> If we add HASH_REPORT support (the current set of patches) and the
> device indicates this feature, we can avoid the hash recalculation in
> the driver, assuming we receive the correct hash value and hash type.
> Otherwise the driver can't know which hash exactly the device has
> calculated.
>
> Please let me know if I did not answer the question.


I think I get you. The hash type is also a kind of classification (e.g.
TCP or UDP). Any possibility that it can be deduced by the driver? (Or
would it be too expensive to do that?)
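For reference, the report types the virtio spec defines could in principle be deduced from the packet's L3/L4 protocols. A hypothetical userspace sketch of such a mapping (the enum values follow the virtio spec's hash report constants; the helper name and its inputs are invented here):

```c
#include <stdint.h>
#include <stdbool.h>

/* Hash report types as numbered in the virtio specification. */
enum {
    HASH_REPORT_NONE  = 0,
    HASH_REPORT_IPv4  = 1,
    HASH_REPORT_TCPv4 = 2,
    HASH_REPORT_UDPv4 = 3,
    HASH_REPORT_IPv6  = 4,
    HASH_REPORT_TCPv6 = 5,
    HASH_REPORT_UDPv6 = 6,
};

/* Deduce the report type from the parsed protocols, as a driver might.
 * is_v6 and l4_proto would come from parsing the packet headers. */
static uint16_t deduce_hash_report(bool is_v6, uint8_t l4_proto)
{
    if (l4_proto == 6)  /* IPPROTO_TCP */
        return is_v6 ? HASH_REPORT_TCPv6 : HASH_REPORT_TCPv4;
    if (l4_proto == 17) /* IPPROTO_UDP */
        return is_v6 ? HASH_REPORT_UDPv6 : HASH_REPORT_UDPv4;
    return is_v6 ? HASH_REPORT_IPv6 : HASH_REPORT_IPv4;
}
```

Such a deduction still cannot tell which tuple the device actually hashed over (e.g. whether UDP hashing was enabled), which is the gap the explicit report type closes.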

Thanks


>
>> Thanks
>>
>> [1]
>> https://docs.microsoft.com/en-us/windows-hardware/drivers/network/indicating-rss-receive-data
>>
>>


Thread overview: 28+ messages
2021-01-12 19:41 Yuri Benditovich
2021-01-12 19:41 ` [RFC PATCH 1/7] skbuff: define field for hash report type Yuri Benditovich
2021-01-12 19:41 ` [RFC PATCH 2/7] vhost: support for hash report virtio-net feature Yuri Benditovich
2021-01-12 19:41 ` [RFC PATCH 3/7] tun: allow use of BPF_PROG_TYPE_SCHED_CLS program type Yuri Benditovich
2021-01-12 19:46   ` Alexei Starovoitov
2021-01-12 20:33     ` Yuri Benditovich
2021-01-12 20:40   ` Yuri Benditovich
2021-01-12 20:55     ` Yuri Benditovich
2021-01-18  9:16       ` Yuri Benditovich
2021-01-20 18:44       ` Alexei Starovoitov
2021-01-24 11:52         ` Yuri Benditovich
2021-01-12 19:41 ` [RFC PATCH 4/7] tun: free bpf_program by bpf_prog_put instead of bpf_prog_destroy Yuri Benditovich
2021-01-12 19:41 ` [RFC PATCH 5/7] tun: add ioctl code TUNSETHASHPOPULATION Yuri Benditovich
2021-01-12 19:41 ` [RFC PATCH 6/7] tun: populate hash in virtio-net header when needed Yuri Benditovich
2021-01-12 19:41 ` [RFC PATCH 7/7] tun: report new tun feature IFF_HASH Yuri Benditovich
2021-01-12 19:49 ` [RFC PATCH 0/7] Support for virtio-net hash reporting Yuri Benditovich
2021-01-12 20:28   ` Yuri Benditovich
2021-01-12 23:47     ` Willem de Bruijn
2021-01-13  4:05       ` Jason Wang
2021-01-13 14:33         ` Willem de Bruijn
2021-01-14  3:38           ` Jason Wang
2021-01-17  7:57             ` Yuri Benditovich
2021-01-18  2:46               ` Jason Wang [this message]
2021-01-18  9:09                 ` Yuri Benditovich
2021-01-18 15:19                   ` Willem de Bruijn
  -- strict thread matches above, loose matches on Subject: below --
2021-01-05 12:24 Yuri Benditovich
2021-01-05 17:21 ` Willem de Bruijn
2021-01-12 19:36   ` Yuri Benditovich

